  • Creating a variable with a known correlation with existing variables

    Greetings,

    My colleagues and I are working on a simulation problem in which we are trying to generate a new variable with a known correlation to given variables.

    Imagine we have two variables, x1 and x2, which have a given correlation. I want to generate a normally distributed variable y which will have a specified correlation with each of x1 and x2. For example, let’s say that x1 and x2 are correlated .25 and I want y to be correlated with x1 at .2 and x2 at .5. x1 and x2 could have a number of different distributions in different scenarios.

    Our simulation requires that we keep x1 and x2 and generate a new variable y. Note that corr2data, as I understand it, can only generate new variables which follow a normal distribution. That is, corr2data cannot take an existing variable and generate new variables correlated with that variable.

    We can generate a y with a chosen correlation with either x1 or x2. But we haven’t figured out how to generate a y that has specified correlations with both x1 and x2.

    Any help or guidance is much appreciated.

  • #2
    So it's a little complicated, but doable. First, it is simpler to work with standardized variables. So calculate z1 and z2 as standardized (mean 0, variance 1) versions of x1 and x2. Calculate the correlation between z1 and z2: call it r12. Let me denote the correlations you are targeting with x1 and x2 as r1 and r2, respectively. Then the variable y you seek can be obtained as follows:

    Code:
    * Solve this system of linear equations for a1 and a2, and store the results as scalars a1 and a2.
    * (Either by hand, using the results directly, if this is a one-time task, or with some matrix algebra if you will need to do this repeatedly.)
    *     a1 + r12*a2 = r1
    *     r12*a1 + a2 = r2
    
    * Note: #3 below corrects this line by adding the covariance term -2*a1*a2*r12 inside sqrt()
    scalar a3 = sqrt(1-a1^2-a2^2)
    
    gen e = rnormal(0, 1)
    
    gen y = scalar(a1)*z1 + scalar(a2)*z2 + scalar(a3)*e
    Then y will be a standard normal variable having the desired correlations with z1 and z2 (and hence also with x1 and x2). If you need to re-center or re-scale y, that is, of course, an easy task at the end.

    Note that this is not possible for all values of r1, r2, and r12. (Intuitively, if r12 is very close to 1, then there is some upper limit to how different r1 and r2 can be.) That is, it may be that after solving for a1 and a2, the quantity under the square root in a3 is negative. In that case, there is no real value for a3, and the problem has no solution. (Meaning that the actual problem you seek to solve has no solution, not just that this approach fails.)



    • #3
      After several recent exchanges with Clyde offline, in which he generously took the time to write a derivation of his procedure for me, I discovered a small error in what he presented above, and I'm writing now to correct it for anyone who cares to use this.

      What he wrote as:
      Code:
      scalar a3 = sqrt(1-a1^2-a2^2)
      should be
      Code:
      scalar a3 = sqrt(1-a1^2-a2^2 -2 * a1 * a2 * r12)
      This correction comes from the fact that a3 is being obtained from the variance of a sum, and Clyde's initial version omitted the covariance of two correlated components of the sum.
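For anyone who wants to try the corrected procedure end to end, here is a minimal sketch. The data are made up purely for illustration (x1, x2, the seed, and the sample size are my assumptions, not anything from the original question), and the closed-form expressions for a1 and a2 are just the hand solution of the two equations in #2. One caveat: y built this way has mean 0 and variance 1, but it is exactly normal only if z1 and z2 are themselves jointly normal.

```stata
* Illustrative data only: x1 and x2 are made-up, non-normal variables
* constructed to have a correlation of roughly 0.25
clear
set seed 20200925
set obs 100000
generate double x1 = rchi2(3)
generate double x2 = 0.1*x1 + 3.3*runiform()

* Standardize and record the observed correlation between the two
egen double z1 = std(x1)
egen double z2 = std(x2)
quietly correlate z1 z2
scalar r12 = r(rho)

* Target correlations (the values from the original question)
scalar r1 = 0.2
scalar r2 = 0.5

* Hand solution of the two linear equations in #2
scalar a1 = (scalar(r1) - scalar(r12)*scalar(r2)) / (1 - scalar(r12)^2)
scalar a2 = (scalar(r2) - scalar(r12)*scalar(r1)) / (1 - scalar(r12)^2)

* Corrected quantity under the square root; negative means the targets are infeasible
scalar v3 = 1 - scalar(a1)^2 - scalar(a2)^2 - 2*scalar(a1)*scalar(a2)*scalar(r12)
if scalar(v3) < 0 {
    display as error "no y exists with these correlations"
    exit 459
}
scalar a3 = sqrt(scalar(v3))

generate double e = rnormal(0, 1)
generate double y = scalar(a1)*z1 + scalar(a2)*z2 + scalar(a3)*e

* Sample correlations should be close to 0.2 and 0.5, not exactly equal
correlate y x1 x2
```

The sample correlations will only approximate the targets, because e is uncorrelated with z1 and z2 in expectation but not exactly so in any one sample. The scalar() wrapper is used throughout so that a scalar is never confused with a variable of the same (or an abbreviated) name.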

      Regards, Mike



      • #4
        Thanks for finding and fixing that error!



        • #5
          I need to do something similar to this, but I need to do it repeatedly (using different values of r1 and r2) and am unsure of how to set up the matrix algebra to solve for a1 and a2.

          IYH



          • #6
            I need to do something similar to this, but I need to do it repeatedly (using different values of r1 and r2) and am unsure of how to set up the matrix algebra to solve for a1 and a2.
            Code:
            matrix RR = (1, r12 \ r12, 1)
            matrix R = (r1 \ r2)
            matrix A = syminv(RR)*R
            scalar a1 = A[1, 1]
            scalar a2 = A[2, 1]
            (Replace r12, r1, and r2 by the corresponding correlations)
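Since the question was about doing this repeatedly, here is one way the matrix step might be wrapped in a loop. This is only a sketch: the value of r12 and the three target pairs are made-up illustrations, the matrix name targets is my own, and I use invsym(), which I understand to be the current name for syminv(). The last pair is deliberately infeasible to show what a negative value under the square root looks like.

```stata
* Hypothetical inputs: r12 and the target pairs below are illustrative values
local r12 = 0.25
matrix RR = (1, `r12' \ `r12', 1)
matrix RRinv = invsym(RR)

* Each row is one (r1, r2) pair of target correlations
matrix targets = (0.2, 0.5 \ -0.4, 0.6 \ 0.9, -0.9)

forvalues i = 1/`=rowsof(targets)' {
    * Pull out row i as a 1 x 2 matrix, then solve RR * A = R'
    matrix R = targets[`i', 1..2]
    matrix A = RRinv * R'
    scalar a1 = A[1, 1]
    scalar a2 = A[2, 1]

    * Quantity under the square root for a3 (corrected formula from #3)
    scalar v3 = 1 - scalar(a1)^2 - scalar(a2)^2 - 2*scalar(a1)*scalar(a2)*`r12'

    display "r1 = " targets[`i', 1] "  r2 = " targets[`i', 2] ///
        "  a1 = " %6.4f scalar(a1) "  a2 = " %6.4f scalar(a2) ///
        "  a3^2 = " %6.4f scalar(v3)
}
```

Inside the loop you would then generate e and y exactly as in #2 (with the corrected a3), skipping any pair for which a3^2 comes out negative.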



            • #7
              Hi,

              What would the code look like if I have only one existing variable but want to generate more than one (e.g. three or four) additional variables? As in the previous answers, I want to vary the correlations.

              Regards,
              Steffen



              • #8
                Hi,

                I encountered a similar problem. I have followed Clyde's and Mike's suggestions and successfully generated y. But I am not very clear about why this procedure works. Can anyone here explain the theory behind it, please? Many thanks in advance!

                Best wishes,

                Sheng
                Last edited by Sheng Dai; 25 Sep 2020, 07:26.



                • #9
                  Well, it would be very difficult to write out all the algebra here. Let me try to describe it in sentences.

                  The goal is to find a variable y that has specified correlations r1 and r2 with given variables x1 and x2, which, themselves have correlation r12. Since correlation coefficients are invariant under changes of scale and location, we can replace x1 and x2 with their standardized equivalents z1 and z2, and that will simplify things.

                  Let's see if we can find a variable y that is a linear combination of z1 and z2 plus some noise that fulfills our requirements. Let's call the coefficients a1 and a2. So y = a1*z1 + a2*z2 + a3*e, where e is a standard normal variable independent of z1 and z2. The task now is to find values of a1 and a2 that will produce the desired correlations. We will also take y to be standardized (because, again, changing the scale won't change any of the correlations).

                  Now, what is the correlation of y with z1? With standardized variables, the correlation coefficient is the expected value of their product. So the correlation of y with z1 will be E(z1*(a1*z1 + a2*z2 + a3*e)) = a1*E(z1^2) + a2*E(z1*z2) + a3*E(z1*e). Now E(z1^2) = Var(z1) (because E(z1) = 0), and since z1 is standardized, this is 1. And E(z1*z2) is the correlation of z1 and z2, or r12. Finally, E(z1*e) is the correlation of z1 with e, which is zero. So the correlation of y with z1 will be a1 + a2*r12.

                  So for a1 and a2 to give the desired results, we must have a1 + a2*r12 = r1, which was the first equation in #2. A similar derivation of the correlation of y with z2 gives the second equation. We now have two linear equations in two unknowns (a1 and a2), which we can solve for a1 and a2.

                  All that remains is to find a3. The value of a3 is constrained by the requirement that y be standardized. That is, we must have Var(y) = 1. Var(y) = Var(a1*z1 + a2*z2 + a3*e). Now the variance of the sum of variables is the sum of the variances plus twice the sum of the covariances. And the variance of a constant times a variable is the square of the constant times the variance of the variable. So, for example, the variance of a1*z1 is a1^2*Var(z1) = a1^2*1 = a1^2. The covariance of z1 with z2 is r12, so the covariance of a1*z1 with a2*z2 is a1*a2*r12, and the covariance of anything in this expression with e is zero (because e was chosen to be independent of z1 and z2). So when you put all of the terms together, set the sum to 1, and solve for a3, you get the result Mike Lacy shows in #3 (where he corrected my error in #2, where I overlooked the covariance terms).
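For readers who prefer symbols, the argument above can be written out as follows. This is only a restatement of the same steps. The closed-form expressions for a1 and a2 and the final feasibility condition are my own algebra from the two linear equations, not something stated earlier in the thread.

```latex
\begin{align*}
\operatorname{Corr}(y, z_1) &= E\bigl[z_1 (a_1 z_1 + a_2 z_2 + a_3 e)\bigr]
                             = a_1 + a_2 r_{12} = r_1 \\
\operatorname{Corr}(y, z_2) &= E\bigl[z_2 (a_1 z_1 + a_2 z_2 + a_3 e)\bigr]
                             = a_1 r_{12} + a_2 = r_2 \\[4pt]
a_1 &= \frac{r_1 - r_{12} r_2}{1 - r_{12}^2}, \qquad
a_2 = \frac{r_2 - r_{12} r_1}{1 - r_{12}^2} \\[4pt]
\operatorname{Var}(y) &= a_1^2 + a_2^2 + a_3^2 + 2 a_1 a_2 r_{12} = 1 \\
a_3 &= \sqrt{1 - a_1^2 - a_2^2 - 2 a_1 a_2 r_{12}} \\[4pt]
a_3 \text{ is real} &\iff r_1^2 + r_2^2 - 2 r_1 r_2 r_{12} \le 1 - r_{12}^2
\end{align*}
```

With the numbers from the original question (r12 = 0.25, r1 = 0.2, r2 = 0.5) this gives a1 = 0.08, a2 = 0.48, and a3 = sqrt(0.744), or about 0.863.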

                  Last edited by Clyde Schechter; 25 Sep 2020, 11:33.



                  • #10
                    Thank you so much for your further detailed explanations. I get it, and it is so clear to me now.

