Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Simulating bivariate normal distributed data using conditional normal distribution

    Dear all,

    Although I am quite aware of Stata's powerful drawnorm function to generate bivariate normal distributed data, I tried using a property of bivariate normal distributions concerning the conditional distribution of X2 given X1 = x1.

    If X1 and X2 have a bivariate normal distribution with means m1 and m2, variances s12 and s22 and correlation r, then the conditional distribution of X2 given X1 = x1 is itself normal distributed with mean = m2 + r(s2/s1)(x1 - m1) and variance = (1 - r2)s22 (see e.g. Bickel and Doksum, 1977, page 26). In the case that the marginal distributions are standard normal distributions (i.e. m1 = m2 = 0 and s1 = s2 = 1) this implies that the conditional distribution of X2 given X1 = x1 is normal distributed with mean = rx1 and variance = 1 - r2.

    However, when using the following code:
    Code:
    clear
    set seed 123
    set obs 1000000
    local rho=0.7
    gen double x1 = rnormal(0,1)
    gen double x2 = rnormal(`rho' * x1, 1-`rho'^2)
    summ x*
    pwcorr x1 x2
    I end up with X2 having standard deviation of 0.87 (instead of 1) and X1 and X2 having a correlation of 0.81 (instead of 0.7).
    Am I missing something? Can anyone explain what is wrong here?

    Kind regards, Adriaan Hoogendoorn


    REFERENCE: Bickel, Peter J & Doksum, Kjell A (1977) Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day Inc., Oakland, California.

  • #2
    The second parameter of the -rnormal()- function is the SD, not the variance.

    Comment


    • #3
      Thank you, Clyde!

      The code:
      Code:
      clear
      set seed 123
      set obs 1000000
      local rho=0.7
      gen double x1 = rnormal(0,1)
      gen double x2 = rnormal(`rho' * x1, sqrt(1-`rho'^2))
      summ x*
      pwcorr x1 x2
      works just fine!

      Comment


      • #4
        Is there a simple extension for the multivariate case? For example, I have x1 and want to generate x2, x3 and x4 with a specific correlation matrix.

        Comment


        • #5
          Yes, you can use the Cholesky decomposition R = CC' where C is lower-diagonal. Here is an example with three variables
          Code:
           clear
          
          . set seed 6201
          
          . matrix define R = (1, 0.7, 0.5 \ 0.7, 1, 0.4 \ 0.5, 0.4, 1)
          
          . matrix C = cholesky(R)
          
          . matrix list C
          
          C[3,3]
                     c1         c2         c3
          r1          1          0          0
          r2         .7  .71414284          0
          r3         .5    .070014  .86319062
          
          . set obs 10000
          number of observations (_N) was 0, now 10,000
          
          . gen x1 = C[1,1] * rnormal(0,1)
          
          . gen x2 = C[2,1] * x1 + C[2,2] * rnormal(0,1)
          
          . gen x3 = C[3,1] * x1 + C[3,2] * x2  + C[3,3] * rnormal(0,1)
          
          . corr x1 x2 x3
          (obs=10,000)
          
                       |       x1       x2       x3
          -------------+---------------------------
                    x1 |   1.0000
                    x2 |   0.7009   1.0000
                    x3 |   0.5320   0.4132   1.0000
          Or you can do the same thing in Mata:
          Code:
          . mata:
          ------------------------------------------------- mata (type end to exit) ------
          : R = (1, 0.7, 0.5 \ 0.7, 1, 0.4 \ 0.5,  0.4, 1)
          
          : C = cholesky(R)
          
          : rseed(3415)
          
          : Z = rnormal(10000, 3, 0, 1)
          
          : corr(variance(Z*C'))
          [symmetric]
                           1             2             3
              +-------------------------------------------+
            1 |            1                              |
            2 |  .7025435966             1                |
            3 |  .5024883249   .3999212716             1  |
              +-------------------------------------------+
          
          : end
          --------------------------------------------------------------------------------
          Added: when generating x1 the coefficient C[1,1] is 1 for correlation matrices, but I am leaving the general case for clarity.
          Last edited by German Rodriguez; 26 Oct 2017, 16:22.

          Comment


          • #6
            A correction to the above post #5: the Stata code should be multiplying standard normals by C, just like the Mata code does. Here's a revised version
            Code:
            . set seed 3415
            
            . set obs 10000
            number of observations (_N) was 0, now 10,000
            
            . matrix define R = (1, 0.7, 0.5 \ 0.7, 1, 0.4 \ 0.5, 0.4, 1)
            
            . matrix C = cholesky(R)
            
            . matrix list C
            
            C[3,3]
                       c1         c2         c3
            r1          1          0          0
            r2         .7  .71414284          0
            r3         .5    .070014  .86319062
            
            . set obs 10000
            number of observations (_N) was 10,000, now 10,000
            
            . gen z1 = rnormal(0,1)
            
            . gen z2 = rnormal(0,1)
            
            . gen z3 = rnormal(0,1)
            
            . gen x1 = C[1,1] * z1
            
            . gen x2 = C[2,1] * z1 + C[2,2] * z2
            
            . gen x3 = C[3,1] * z1 + C[3,2] * z2  + C[3,3] * z3
            
            . corr x1 x2 x3
            (obs=10,000)
            
                        |       x1       x2       x3
            -------------+---------------------------
                      x1 |   1.0000
                      x2 |   0.7079   1.0000
                      x3 |   0.5027   0.3982   1.0000
            I was also wondering why the Stata and Mata simulations were not identical.I think Mata fills a matrix of random numbers by rows, because generating a 3 by 10000 matrix and then transposing (instead of generating a 10000 by 3 matrix) yields exactly the same result as Stata:

            Code:
            . mata:
            ------------------------------------------------- mata (type end to exit) ------
            : rseed(3415)
            
            : R = (1, 0.7, 0.5 \ 0.7, 1, 0.4 \ 0.5, 0.4, 1)
            
            : C = cholesky(R)
            
            : Z = rnormal(3, 10000, 0, 1)'
            
            : corr(variance(Z*C'))
            [symmetric]
                             1             2             3
                +-------------------------------------------+
              1 |            1                              |
              2 |  .7079454626             1                |
              3 |  .5026927404   .3982160631             1  |
                +-------------------------------------------+
            
            : end
            --------------------------------------------------------------------------------
            I think all is well now.

            Last edited by German Rodriguez; 26 Oct 2017, 17:46.

            Comment


            • #7
              Dear German,

              Thank you so much for your tip. That will be extremely useful in my current work.

              All the best,

              Tiago

              Comment


              • #8
                Glad you find it useful, Tiago. I went the Cholesky way because it is the easiest, but it is also possible to work with conditional distributions y1, y2|y1 and y3|y1,y2 (and so on). I put a note on this alternative approach on my website at http://data.princeton.edu/stata/conditionals.

                Comment

                Working...
                X