
  • Simulated Regression Data

    Hi all --

    I've been trying to simulate power analyses for regression and have had good luck using corr2data and drawnorm to generate data. This time I wanted to create the data by constructing an equation from random variables rather than using corr2data or drawnorm. I create three random, standardized (mean = 0, sd = 1) variables: z1, z2, and ze. I then create my regression equation as

    generate y = .30*z1 + .40*z2 + ze

    I expected the correlation between y and z1 to be .30 and between y and z2 to be .40. However, when I correlate these data, the correlations are always, on average, lower than what I expect: r = .26 (instead of .30) and r = .35 (instead of .40).

    I set replications anywhere between 500 and 5,000 and still get correlations of the magnitude noted. I changed the seed every time, with similar results.

    No doubt I am overlooking something that will be obvious to others, but I'm not seeing the problem.

    Any help that can be offered would be welcomed.

    Below is my simulation code. I am using Stata version 12.1.

    Bryan


    ***********************************
    clear all
    set more off
    set seed 1681
    local alpha = .05
    local samplesize = 92
    local reps = 1500

    postfile coefsv1 n_size beta1 se1 t1 pv1 beta2 se2 t2 pv2 bcons Model_R2 power_pv1 power_pv2 cor_z1z2 cor_yz1 cor_yz2 fz_z1z2 fz_yz1 fz_yz2 using version1, replace every(1)
    tempname betamatv1
    local rep = 0

    quietly forvalues i = 1/`reps' {
        drop _all
        set obs `samplesize'

        generate x1 = invnorm(uniform())
        egen z1 = std(x1)
        generate x2 = invnorm(uniform())
        egen z2 = std(x2)
        generate e = invnorm(uniform())
        egen ze = std(e)
        generate y = .30*z1 + .40*z2 + ze

        /* Get correlations of this magnitude consistently */
        * r y z1 = .26
        * r y z2 = .35

        reg y z1 z2
        matrix betamatv1 = e(b)
        matrix semat = e(V)
        local beta1 = betamatv1[1,1]
        local se1 = semat[1,1]^.5
        local t1 = `beta1'/`se1'
        local pv1 = ttail(e(df_r),`t1')*2
        local beta2 = betamatv1[1,2]
        local se2 = semat[2,2]^.5
        local t2 = `beta2'/`se2'
        local pv2 = ttail(e(df_r),`t2')*2
        local bcons = betamatv1[1,3]
        local Model_R2 = e(r2)
        local n_size = e(N)

        if `pv1' < `alpha' {
            local power_pv1 = 1
        }
        else {
            local power_pv1 = 0
        }

        if `pv2' < `alpha' {
            local power_pv2 = 1
        }
        else {
            local power_pv2 = 0
        }

        corr y z1 z2 ze
        matrix C = r(C)
        matrix list r(C)
        local cor_z1z2 = C[2,3]
        local cor_yz1 = C[1,2]
        local cor_yz2 = C[1,3]
        local fz_z1z2 = atanh(C[2,3])
        local fz_yz1 = atanh(C[1,2])
        local fz_yz2 = atanh(C[1,3])

        noisily if `rep' == 50 {
            display "Replications completed: " `i'
            local rep = 1
        }
        noisily else if `rep' == 0 {
            display " "
            display " "
            display " "
            display "Replications commencing..."
            local rep = 2
        }
        else {
            local rep = `rep' + 1
        }

        post coefsv1 (`n_size') (`beta1') (`se1') (`t1') (`pv1') (`beta2') (`se2') (`t2') (`pv2') (`bcons') (`Model_R2') (`power_pv1') (`power_pv2') (`cor_z1z2') (`cor_yz1') (`cor_yz2') (`fz_z1z2') (`fz_yz1') (`fz_yz2')
    }

    postclose coefsv1
    use version1, clear
    summ

  • #2
    The problem is not with your code (though I have not looked at it closely) but with what you expect of the correlation.

    To simplify the notation, let's just set z3 = 0.4*z2 + ze. Then y = 0.3*z1 + z3. z3 has mean 0, but its variance is 1 + 0.4^2 = 1.16. The variance of y is then 1.16 + 0.3^2 = 1.25. Cov(y, z1) = 0.3, to be sure. But to go from covariance to correlation you have to divide by sqrt(var(y)*var(z1)) = sqrt(1.25). So r = 0.3/sqrt(1.25) = 0.268. The simulated value of 0.26 you got is quite in accord with that. Similar reasoning applies to z2: 0.4/sqrt(1.25) = 0.358, matching your observed 0.35.
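    This algebra is easy to check numerically outside Stata as well. A minimal sketch in Python (my own illustration, not part of the thread's Stata code; numpy assumed):

    ```python
    import numpy as np

    rng = np.random.default_rng(1681)
    n = 1_000_000

    # Three independent standard-normal variables, as in the original setup
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    ze = rng.standard_normal(n)
    y = 0.30 * z1 + 0.40 * z2 + ze

    # var(y) = 0.3^2 + 0.4^2 + 1 = 1.25, so the correlations shrink:
    # corr(y, z1) = 0.3 / sqrt(1.25) ~ 0.268
    # corr(y, z2) = 0.4 / sqrt(1.25) ~ 0.358
    print(np.var(y))                  # close to 1.25
    print(np.corrcoef(y, z1)[0, 1])   # close to 0.268
    print(np.corrcoef(y, z2)[0, 1])   # close to 0.358
    ```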



    • #3
      There are existing user-written routines for power by simulation; -search- for powersim and simsam.



      • #4
        Clyde, thank you for your response. I am not sure I agree, however, and here's why. Suppose three standardized variables correlate as follows:

        . corr y z1 z2
        (obs=60000)

        | y z1 z2
        -------------+---------------------------
        y | 1.0000
        z1 | 0.3000 1.0000
        z2 | 0.4000 0.0000 1.0000

        Then both the unstandardized and standardized regression coefficients should be .3 and .4 for z1 and z2, respectively. For example:

        . reg y z1 z2

        Source | SS df MS Number of obs = 60000
        -------------+------------------------------ F( 2, 59997) = 9999.50
        Model | 14999.75 2 7499.87502 Prob > F = 0.0000
        Residual | 44999.25 59997 .750025001 R-squared = 0.2500
        -------------+------------------------------ Adj R-squared = 0.2500
        Total | 59999 59999 1 Root MSE = .86604

        ------------------------------------------------------------------------------
        y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        z1 | .3 .0035356 84.85 0.000 .2930702 .3069298
        z2 | .4 .0035356 113.13 0.000 .3930702 .4069298
        _cons | 9.08e-09 .0035356 0.00 1.000 -.0069298 .0069298
        ------------------------------------------------------------------------------

        and

        . reg y z1 z2, beta

        Source | SS df MS Number of obs = 60000
        -------------+------------------------------ F( 2, 59997) = 9999.50
        Model | 14999.75 2 7499.87502 Prob > F = 0.0000
        Residual | 44999.25 59997 .750025001 R-squared = 0.2500
        -------------+------------------------------ Adj R-squared = 0.2500
        Total | 59999 59999 1 Root MSE = .86604

        ------------------------------------------------------------------------------
        y | Coef. Std. Err. t P>|t| Beta
        -------------+----------------------------------------------------------------
        z1 | .3 .0035356 84.85 0.000 .3
        z2 | .4 .0035356 113.13 0.000 .4
        _cons | 9.08e-09 .0035356 0.00 1.000 .
        ------------------------------------------------------------------------------

        As this example shows, if z1 and z2 have no correlation, then the standardized coefficient is the same as the zero-order correlation with y.

        One potential problem with my simulation file is that I create y using this command:

        generate y = .30*z1 + .40*z2 + ze

        Instead, I should use the standardized value of y, like this

        generate y = .30*z1 + .40*z2 + ze
        egen zy = std(y)

        When I run my simulation now with zy rather than y, I get the following (2,500 reps, sample size per rep = 92):

        Regression
        B1 = .263 (expected .30)
        B2 = .354 (expected .40)

        Corr
        zy z1 = .263 (expected .30)
        zy z2 = .354 (expected .40)
        z1 z2 = .003 (expected .000)

        So as these results show, the regression coefficients equal the correlation values, but both sets of values are below what I expected given the equation used to construct y and zy.

        I thought maybe the lowered correlation values resulted from partial effects, since y is constructed from both z1 and z2. But since there is no correlation between z1 and z2, there should be no adjustment, or the adjustment should increase the correlation, i.e.,

        correlation between zy and z1 = .30, partial correlation = .327, semipartial correlation = .30.

        So, I still don't understand why this equation is producing lower regression coefficients and correlations than expected, especially considering that a regression with variables correlated at .3 and .4, as demonstrated above, produces coefficients of .3 and .4. No doubt it is something basic. I thought it might be due to partialling (but it is not); now I am thinking it must be related to how y and zy are computed.

        Bryan



        • #5
          It is something basic. It is exactly what I wrote in my earlier post: the basic algebra of the calculation of correlations. You are correct that because z1 and z2 are uncorrelated, there are no partial effects at play here. That is not the issue.

          If you regress your y = 0.3*z1 + 0.4*z2 + ze on z1, you will get an unstandardized coefficient of 0.3. But that unstandardized coefficient is not equal to the correlation between y and z1, for the reasons I outlined earlier. If you replace y by a standardized version of y, the unstandardized regression coefficient will change, because y itself is scaled down by its standard deviation, sqrt(1.25). But the standardized coefficient and the correlation coefficient are not affected by that transformation, and for those you will still get 0.268.
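          Both halves of this can be seen at once in a small Python illustration (again my own sketch, not the thread's Stata code; numpy assumed), comparing regressions of raw y and standardized y on the same predictors:

          ```python
          import numpy as np

          rng = np.random.default_rng(42)
          n = 500_000
          z1 = rng.standard_normal(n)
          z2 = rng.standard_normal(n)
          ze = rng.standard_normal(n)
          y = 0.30 * z1 + 0.40 * z2 + ze
          zy = (y - y.mean()) / y.std()   # standardized y; sd(y) ~ sqrt(1.25)

          # Least-squares fit with an intercept
          X = np.column_stack([np.ones(n), z1, z2])
          b_y = np.linalg.lstsq(X, y, rcond=None)[0]
          b_zy = np.linalg.lstsq(X, zy, rcond=None)[0]

          print(b_y[1])                    # ~0.30: raw y recovers the constructed slope
          print(b_zy[1])                   # ~0.268: slope shrinks by 1/sqrt(1.25)
          print(np.corrcoef(y, z1)[0, 1])  # ~0.268: the correlation, unaffected by scaling y
          ```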



          • #6
            Clyde is right. If you want the slope coefficients to be the same as the correlations, then you have to construct y so that its variance is also one. Since z1 and z2 contribute 0.3^2 + 0.4^2 = 0.25, that means scaling ze by sqrt(.75):

            Code:
            . corr2data z1 z2 ze, n(5000) clear
            (obs 5000)
            
            . gen y = .3*z1 + .4*z2 + sqrt(.75) * ze
            
            . corr, means
            (obs=5000)
            
                Variable |         Mean    Std. Dev.          Min          Max
            -------------+----------------------------------------------------
                      z1 |    -5.63e-09            1    -3.569643     3.354228
                      z2 |     7.04e-09            1    -3.379169     3.344513
                      ze |    -9.34e-09            1     -3.46508     3.640815
                       y |    -6.69e-09            1    -3.863025      3.69272
            
            
                         |       z1       z2       ze        y
            -------------+------------------------------------
                      z1 |   1.0000
                      z2 |  -0.0000   1.0000
                      ze |  -0.0000   0.0000   1.0000
                       y |   0.3000   0.4000   0.8660   1.0000
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            StataNow Version: 19.5 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam
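            The same construction translates directly outside Stata. A hypothetical Python version of the fix above (numpy assumed; uses independent draws rather than corr2data, so the results are close to, not exactly, the targets):

            ```python
            import numpy as np

            rng = np.random.default_rng(0)
            n = 1_000_000
            z1, z2, ze = rng.standard_normal((3, n))

            # Scale the error so var(y) = 0.3^2 + 0.4^2 + 0.75 = 1
            y = 0.3 * z1 + 0.4 * z2 + np.sqrt(0.75) * ze

            print(np.var(y))                  # close to 1
            print(np.corrcoef(y, z1)[0, 1])   # close to 0.30
            print(np.corrcoef(y, z2)[0, 1])   # close to 0.40
            print(np.corrcoef(y, ze)[0, 1])   # close to sqrt(0.75) ~ 0.866
            ```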



            • #7
              Thanks Clyde and Richard for the clarification.

              Bryan
