
  • Creating a variable with a known correlation with existing variables


    My colleagues and I are working on a simulation problem in which we need to generate a new variable with a known correlation to given variables.

    Imagine we have two variables, x1 and x2, which have a given correlation. I want to generate a normally distributed variable y which will have a specified correlation with each of x1 and x2. For example, let’s say that x1 and x2 are correlated at .25 and I want y to be correlated with x1 at .2 and with x2 at .5. x1 and x2 could have a number of different distributions in different scenarios.

    Our simulation requires that we keep x1 and x2 and generate a new variable y. Note that corr2data, as I understand it, can only generate new variables which follow a normal distribution. That is, corr2data cannot take an existing variable and generate new variables correlated with that variable.

    We can generate a y with a chosen correlation with either x1 or x2. But we haven’t figured out how to generate a y that has specified correlations with both x1 and x2.

    Any help or guidance is much appreciated.

  • #2
    So it's a little complicated, but doable. First, it is simpler to work with standardized variables, so calculate z1 and z2 as standardized (mean 0, variance 1) versions of x1 and x2. Calculate the correlation between z1 and z2: call it r12. Let me denote the correlations you are targeting with x1 and x2 as r1 and r2, respectively. Then the variable y you seek can be obtained as follows:

    Solve this system of linear equations for a1 and a2. (Either solve it by hand and use the results directly if this is a one-time task, or set up some matrix algebra if you will need to do this repeatedly.)
    a1 + r12*a2 = r1
    r12*a1 + a2 = r2
    and store the results a1 and a2 as scalars
    scalar a3 = sqrt(1-a1^2-a2^2)
    gen e = rnormal(0, 1)
    gen y = scalar(a1)*z1 + scalar(a2)*z2 + scalar(a3)*e
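    Then y will be a standardized variable having the desired correlations with z1 and z2 (and hence also with x1 and x2), up to sampling variation in e. Note that y is exactly normal only when the combination a1*z1 + a2*z2 is itself normal, for example when x1 and x2 are jointly normal. If you need to re-center or re-scale y, that is, of course, an easy task at the end.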
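    For readers wondering where the two linear equations come from, here is a short derivation sketch. It assumes that e is independent of z1 and z2 and that a3 is chosen so that y has unit variance, in which case the covariances below are also the correlations. The closed-form solution and the numbers use the example values from the original question (r12 = .25, r1 = .2, r2 = .5).

```latex
\begin{align*}
\operatorname{Cov}(y, z_1) &= a_1 \operatorname{Var}(z_1) + a_2 \operatorname{Cov}(z_2, z_1) + a_3 \operatorname{Cov}(e, z_1)
  = a_1 + r_{12}\, a_2 = r_1 \\
\operatorname{Cov}(y, z_2) &= a_1 \operatorname{Cov}(z_1, z_2) + a_2 \operatorname{Var}(z_2) + a_3 \operatorname{Cov}(e, z_2)
  = r_{12}\, a_1 + a_2 = r_2 \\[4pt]
a_1 &= \frac{r_1 - r_{12}\, r_2}{1 - r_{12}^2}
  = \frac{0.2 - 0.25 \cdot 0.5}{1 - 0.25^2} = \frac{0.075}{0.9375} = 0.08 \\
a_2 &= \frac{r_2 - r_{12}\, r_1}{1 - r_{12}^2}
  = \frac{0.5 - 0.25 \cdot 0.2}{1 - 0.25^2} = \frac{0.45}{0.9375} = 0.48
\end{align*}
```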

    Note that this is not possible for all values of r1, r2, and r12. (Intuitively, if r12 is very close to 1, then there is some upper limit to how different r1 and r2 can be.) That is, it may be that after solving for a1 and a2, you get a1^2 + a2^2 > 1. In that case, there is no real value for a3, and the problem has no solution. (Meaning that the actual problem you seek to solve has no solution, not just that this approach fails.)


    • #3
      After several recent exchanges with Clyde offline, in which he generously took the time to write a derivation of his procedure for me, I discovered a small error in what he presented above, and I'm writing now to correct it for anyone who cares to use this.

      What he wrote as:
      scalar a3 = sqrt(1-a1^2-a2^2)
      should be
      scalar a3 = sqrt(1-a1^2-a2^2-2*a1*a2*r12)
      This correction comes from the fact that a3 is being obtained from the variance of a sum, and Clyde's initial version omitted the covariance of two correlated components of the sum.
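      Putting the steps together with the corrected a3, the whole procedure might look something like the do-file sketch below. It is only an illustration under stated assumptions: x1 and x2 are simulated stand-ins rather than real data, the seed, sample size, and the scalar name v3 are arbitrary choices of mine, and the 2x2 system is solved in closed form rather than by matrix algebra.

```stata
* Hypothetical stand-ins for the existing variables x1 and x2.
clear
set seed 20240101
set obs 10000
generate x1 = rnormal()
generate x2 = 0.25*x1 + sqrt(1 - 0.25^2)*rnormal()

* Standardize the existing variables (mean 0, SD 1).
egen z1 = std(x1)
egen z2 = std(x2)

* Observed correlation between z1 and z2, plus the two target correlations.
quietly correlate z1 z2
scalar r12 = r(rho)
scalar r1  = 0.2
scalar r2  = 0.5

* Closed-form solution of the two linear equations for a1 and a2.
scalar a1 = (scalar(r1) - scalar(r12)*scalar(r2)) / (1 - scalar(r12)^2)
scalar a2 = (scalar(r2) - scalar(r12)*scalar(r1)) / (1 - scalar(r12)^2)

* Variance left for the noise term, including the covariance correction.
scalar v3 = 1 - scalar(a1)^2 - scalar(a2)^2 - 2*scalar(a1)*scalar(a2)*scalar(r12)
if scalar(v3) < 0 {
    display as error "no solution exists for these correlations"
    exit 198
}
scalar a3 = sqrt(scalar(v3))

* Build y from the standardized variables and independent normal noise.
generate e = rnormal(0, 1)
generate y = scalar(a1)*z1 + scalar(a2)*z2 + scalar(a3)*e

* The sample correlations should be close to r1 and r2, up to sampling noise in e.
correlate y x1 x2
```

      The sample correlations will not hit the targets exactly, because e is only uncorrelated with z1 and z2 in expectation; a larger number of observations narrows the gap.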

      Regards, Mike


      • #4
        Thanks for finding and fixing that error!


        • #5
          I need to do something similar to this, but I need to do it repeatedly (using different values of r1 and r2) and am unsure of how to set up the matrix algebra to solve for a1 and a2.



          • #6
            I need to do something similar to this, but I need to do it repeatedly (using different values of r1 and r2) and am unsure of how to set up the matrix algebra to solve for a1 and a2.
            matrix RR = (1, r12 \ r12, 1)
            matrix R = (r1 \ r2)
            matrix A = invsym(RR)*R
            scalar a1 = A[1, 1]
            scalar a2 = A[2, 1]
            (Replace r12, r1, and r2 by the corresponding correlations)
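            For the repeated case, one possible way to wrap this matrix algebra in a loop is sketched below. The target pairs, the fixed value of r12, and the matrix names targets and results are hypothetical choices for illustration; the a3 line uses the corrected formula from #3, and rows with no solution are left as missing.

```stata
* Correlation between z1 and z2 (fixed here for illustration).
scalar r12 = 0.25
matrix RR = (1, r12 \ r12, 1)

* Hypothetical target pairs (r1, r2), one pair per row.
matrix targets = (0.2, 0.5 \ 0.9, -0.9 \ 0.4, 0.6)
local n = rowsof(targets)

* One row of results per target pair: a1, a2, a3 (missing if infeasible).
matrix results = J(`n', 3, .)
matrix colnames results = a1 a2 a3

forvalues i = 1/`n' {
    scalar r1 = targets[`i', 1]
    scalar r2 = targets[`i', 2]
    matrix R = (r1 \ r2)
    matrix A = invsym(RR)*R
    scalar a1 = A[1, 1]
    scalar a2 = A[2, 1]
    * Variance left for the noise term; negative means no solution exists.
    scalar v3 = 1 - scalar(a1)^2 - scalar(a2)^2 - 2*scalar(a1)*scalar(a2)*scalar(r12)
    if scalar(v3) < 0 {
        display "row `i': no solution for these correlations"
        continue
    }
    matrix results[`i', 1] = scalar(a1)
    matrix results[`i', 2] = scalar(a2)
    matrix results[`i', 3] = sqrt(scalar(v3))
}

matrix list results
```

            Inside the loop, each feasible row could instead feed straight into the gen e / gen y step, using a fresh variable name for each y.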


            • #7

              What would the code look like if I have only one existing variable but want to generate more than one (e.g., three or four) additional variables? As in the previous answers, I want to vary the correlations.