
  • Simulated Regression Data

    Hi all --

    I've been trying to simulate power analyses for regression and have had good luck using corr2data and drawnorm to generate data. This time I wanted to create the data by constructing an equation from random variables rather than using corr2data or drawnorm. I create three random, standardized (mean = 0, sd = 1) variables: z1, z2, and ze. I then create my regression equation as

    generate y = .30*z1 + .40*z2 + ze

    I expected the correlation between y and z1 to be .30 and between y and z2 to be .40. However, when I correlate these data, the correlations are always, on average, lower than what I expect: r = .26 (instead of .30) and r = .35 (instead of .40).

    I set replications anywhere between 500 and 5,000 and still get correlations of the magnitude noted. I changed the seed every time, with similar results.

    No doubt I am overlooking something that will be obvious to others, but I'm not seeing the problem.

    Any help that can be offered would be welcomed.

    Below is my simulation code. I am using Stata version 12.1.

    Bryan


    ***********************************
    clear all
    set more off
    set seed 1681
    local alpha = .05
    local samplesize = 92
    local reps = 1500

    postfile coefsv1 n_size beta1 se1 t1 pv1 beta2 se2 t2 pv2 bcons Model_R2 power_pv1 power_pv2 cor_z1z2 cor_yz1 cor_yz2 fz_z1z2 fz_yz1 fz_yz2 using version1, replace every(1)
    tempname betamatv1
    local rep = 0

    quietly forvalues i = 1/`reps' {
        drop _all
        set obs `samplesize'

        generate x1 = invnorm(uniform())
        egen z1 = std(x1)
        generate x2 = invnorm(uniform())
        egen z2 = std(x2)
        generate e = invnorm(uniform())
        egen ze = std(e)
        generate y = .30*z1 + .40*z2 + ze

        /* Get correlations of this magnitude consistently */
        * r y z1 = .26
        * r y z2 = .35

        reg y z1 z2
        matrix betamatv1 = e(b)
        matrix semat = e(V)
        local beta1 = betamatv1[1,1]
        local se1 = semat[1,1]^.5
        local t1 = `beta1'/`se1'
        local pv1 = ttail(e(df_r),`t1')*2
        local beta2 = betamatv1[1,2]
        local se2 = semat[2,2]^.5
        local t2 = `beta2'/`se2'
        local pv2 = ttail(e(df_r),`t2')*2
        local bcons = betamatv1[1,3]
        local Model_R2 = e(r2)
        local n_size = e(N)

        if `pv1' < `alpha' {
            local power_pv1 = 1
        }
        else {
            local power_pv1 = 0
        }

        if `pv2' < `alpha' {
            local power_pv2 = 1
        }
        else {
            local power_pv2 = 0
        }

        corr y z1 z2 ze
        matrix C = r(C)
        matrix list r(C)
        local cor_z1z2 = C[2,3]
        local cor_yz1 = C[1,2]
        local cor_yz2 = C[1,3]
        local fz_z1z2 = atanh(C[2,3])
        local fz_yz1 = atanh(C[1,2])
        local fz_yz2 = atanh(C[1,3])

        noisily if `rep' == 50 {
            display "Replications completed: " `i'
            local rep = 1
        }
        noisily else if `rep' == 0 {
            display " "
            display " "
            display " "
            display "Replications commencing..."
            local rep = 2
        }
        else {
            local rep = `rep' + 1
        }

        post coefsv1 (`n_size') (`beta1') (`se1') (`t1') (`pv1') (`beta2') (`se2') (`t2') (`pv2') (`bcons') (`Model_R2') (`power_pv1') (`power_pv2') (`cor_z1z2') (`cor_yz1') (`cor_yz2') (`fz_z1z2') (`fz_yz1') (`fz_yz2')
    }

    postclose coefsv1
    use version1, clear
    summ

  • #2
    The problem is not with your code (though I have not looked at it closely) but with what you expect of the correlation.

    To simplify the notation, let's just set z3 = 0.4*z2 + ze. Then y = 0.3*z1 + z3. z3 has mean 0, but its variance is 1 + 0.4^2 = 1.16. The variance of y is then 1.16 + 0.3^2 = 1.25. Cov(y, z1) = 0.3, to be sure. But to go from covariance to correlation you have to divide by sqrt(var(y)*var(z1)) = sqrt(1.25). So r = 0.3/sqrt(1.25) = 0.268. The simulated value of 0.26 you got is quite in accord with that. Similar reasoning applies to z2: 0.4/sqrt(1.25) = 0.358, matching your observed 0.35.
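    This algebra is easy to check numerically outside Stata as well. A minimal sketch in Python (my own illustration, not part of the thread's Stata code; numpy assumed):

    ```python
    import numpy as np

    rng = np.random.default_rng(1681)
    n = 1_000_000

    # Three independent standard-normal variables, as in the original setup
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    ze = rng.standard_normal(n)
    y = 0.30 * z1 + 0.40 * z2 + ze

    # var(y) = 0.3^2 + 0.4^2 + 1 = 1.25, so the correlations shrink:
    # corr(y, z1) = 0.3 / sqrt(1.25) ~ 0.268
    # corr(y, z2) = 0.4 / sqrt(1.25) ~ 0.358
    print(np.var(y))                  # close to 1.25
    print(np.corrcoef(y, z1)[0, 1])   # close to 0.268
    print(np.corrcoef(y, z2)[0, 1])   # close to 0.358
    ```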



    • #3
      There are existing user-written routines for power by simulation; -search- for powersim and simsam.



      • #4
        Clyde, thank you for your response. I am not sure I agree, however, and here's why. Suppose three standardized variables correlate as follows:

        . corr y z1 z2
        (obs=60000)

        | y z1 z2
        -------------+---------------------------
        y | 1.0000
        z1 | 0.3000 1.0000
        z2 | 0.4000 0.0000 1.0000

        Then both the unstandardized and standardized regression coefficients should be .3 and .4 for z1 and z2, respectively. For example:

        . reg y z1 z2

        Source | SS df MS Number of obs = 60000
        -------------+------------------------------ F( 2, 59997) = 9999.50
        Model | 14999.75 2 7499.87502 Prob > F = 0.0000
        Residual | 44999.25 59997 .750025001 R-squared = 0.2500
        -------------+------------------------------ Adj R-squared = 0.2500
        Total | 59999 59999 1 Root MSE = .86604

        ------------------------------------------------------------------------------
        y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        z1 | .3 .0035356 84.85 0.000 .2930702 .3069298
        z2 | .4 .0035356 113.13 0.000 .3930702 .4069298
        _cons | 9.08e-09 .0035356 0.00 1.000 -.0069298 .0069298
        ------------------------------------------------------------------------------

        and

        . reg y z1 z2, beta

        Source | SS df MS Number of obs = 60000
        -------------+------------------------------ F( 2, 59997) = 9999.50
        Model | 14999.75 2 7499.87502 Prob > F = 0.0000
        Residual | 44999.25 59997 .750025001 R-squared = 0.2500
        -------------+------------------------------ Adj R-squared = 0.2500
        Total | 59999 59999 1 Root MSE = .86604

        ------------------------------------------------------------------------------
        y | Coef. Std. Err. t P>|t| Beta
        -------------+----------------------------------------------------------------
        z1 | .3 .0035356 84.85 0.000 .3
        z2 | .4 .0035356 113.13 0.000 .4
        _cons | 9.08e-09 .0035356 0.00 1.000 .
        ------------------------------------------------------------------------------

        As this example shows, if z1 and z2 have no correlation, then the standardized coefficient is the same as the zero-order correlation with y.

        One potential problem with my simulation file is that I create y using this command:

        generate y = .30*z1 + .40*z2 + ze

        Instead, I should use the standardized value of y, like this

        generate y = .30*z1 + .40*z2 + ze
        egen zy = std(y)

        When I run my simulation now with zy rather than y, I get the following (2,500 reps, sample size per rep = 92):

        Regression
        B1 = .263 (expected .30)
        B2 = .354 (expected .40)

        Corr
        zy z1 = .263 (expected .30)
        zy z2 = .354 (expected .40)
        z1 z2 = .003 (expected .000)

        So as these results show, the regression coefficients equal the correlation values, but both sets of values are below what I expected given the equation used to construct y and zy.

        I thought maybe the lowered correlation values resulted from partial effects, since y is constructed from both z1 and z2. But since there is no correlation between z1 and z2, there should be no adjustment, or the adjustment should increase the correlation, i.e.,

        correlation between zy and z1 = .30, partial correlation = .327, semipartial correlation = .30.

        So, I still don't understand why this equation is producing lower regression coefficients and correlations than expected, especially considering that a regression with variables correlated at .3 and .4, as demonstrated above, produces coefficients of .3 and .4. No doubt it is something basic. I thought it might be due to partialling (but it is not); now I am thinking it must be related to how y and zy are computed.

        Bryan



        • #5
          It is something basic. It is exactly what I wrote in my earlier post: the basic algebra of the calculation of correlations. You are correct that because z1 and z2 are uncorrelated, there are no partial effects at play here. That is not the issue.

          If you regress your y = 0.3*z1 + 0.4*z2 + ze on z1, you will get an unstandardized coefficient of 0.3. But that unstandardized coefficient is not equal to the correlation between y and z1, for the reasons I outlined earlier. If you replace y by a standardized version of y, the unstandardized regression coefficient will change, because y itself is scaled down by its standard deviation, sqrt(1.25). But the standardized coefficient and the correlation coefficient are not affected by that transformation, and for those you will still get 0.268.
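          Both halves of this can be seen at once in a small Python illustration (again my own sketch, not the thread's Stata code; numpy assumed), comparing regressions of raw y and standardized y on the same predictors:

          ```python
          import numpy as np

          rng = np.random.default_rng(42)
          n = 500_000
          z1 = rng.standard_normal(n)
          z2 = rng.standard_normal(n)
          ze = rng.standard_normal(n)
          y = 0.30 * z1 + 0.40 * z2 + ze
          zy = (y - y.mean()) / y.std()   # standardized y; sd(y) ~ sqrt(1.25)

          # Least-squares fit with an intercept
          X = np.column_stack([np.ones(n), z1, z2])
          b_y = np.linalg.lstsq(X, y, rcond=None)[0]
          b_zy = np.linalg.lstsq(X, zy, rcond=None)[0]

          print(b_y[1])                    # ~0.30: raw y recovers the constructed slope
          print(b_zy[1])                   # ~0.268: slope shrinks by 1/sqrt(1.25)
          print(np.corrcoef(y, z1)[0, 1])  # ~0.268: the correlation, unaffected by scaling y
          ```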



          • #6
            Clyde is right. If you want the slope coefficients to be the same as the correlations, then you have to construct y so that its variance is also one. Since z1 and z2 contribute 0.3^2 + 0.4^2 = 0.25, that means scaling ze by sqrt(.75):

            Code:
            . corr2data z1 z2 ze, n(5000) clear
            (obs 5000)
            
            . gen y = .3*z1 + .4*z2 + sqrt(.75) * ze
            
            . corr, means
            (obs=5000)
            
                Variable |         Mean    Std. Dev.          Min          Max
            -------------+----------------------------------------------------
                      z1 |    -5.63e-09            1    -3.569643     3.354228
                      z2 |     7.04e-09            1    -3.379169     3.344513
                      ze |    -9.34e-09            1     -3.46508     3.640815
                       y |    -6.69e-09            1    -3.863025      3.69272
            
            
                         |       z1       z2       ze        y
            -------------+------------------------------------
                      z1 |   1.0000
                      z2 |  -0.0000   1.0000
                      ze |  -0.0000   0.0000   1.0000
                       y |   0.3000   0.4000   0.8660   1.0000
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam
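            The same construction translates directly outside Stata. A hypothetical Python version of the fix above (numpy assumed; uses independent draws rather than corr2data, so the results are close to, not exactly, the targets):

            ```python
            import numpy as np

            rng = np.random.default_rng(0)
            n = 1_000_000
            z1, z2, ze = rng.standard_normal((3, n))

            # Scale the error so var(y) = 0.3^2 + 0.4^2 + 0.75 = 1
            y = 0.3 * z1 + 0.4 * z2 + np.sqrt(0.75) * ze

            print(np.var(y))                  # close to 1
            print(np.corrcoef(y, z1)[0, 1])   # close to 0.30
            print(np.corrcoef(y, z2)[0, 1])   # close to 0.40
            print(np.corrcoef(y, ze)[0, 1])   # close to sqrt(0.75) ~ 0.866
            ```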



            • #7
              Thanks Clyde and Richard for the clarification.

              Bryan
