
  • Manual SE adjustment for 2SLS with clustering

    Hello,

    I'd like to manually estimate a two-stage least squares regression: first run the first stage, then run the second stage using the predicted X. Done this way, the second-stage standard errors need to be adjusted to account for the predicted X being a generated (rather than observed) regressor.

    I did so using code from a Stata.com post and a Kit Baum Statalist post. With clustering, however, the degrees of freedom adjustment isn't quite right and I can't figure out how to do it. I can get "close" but my manual SE's still don't match those from ivregress (although they do when I don't cluster).

    Can someone correct my SE adjustment so that the manual two-stage can recover the SE's from ivregress?

    Below is code that (1) generates an illustrative dataset, (2) produces the accurate correction without clustering (very similar to the code in the links above, but slightly more automated), (3) produces an "almost accurate" correction with clustering, and (4) repeats the whole exercise with reghdfe in addition to reg. The reghdfe version has to be done slightly differently because of differences in how predict works; I'm including it because I thought it might be useful to others.

    Thank you,
    Mitch

    Code:
    set seed 30819
    
    ******************
    /*    DATA SETUP    */
    ******************
    
    clear
    
    * Generate an unbalanced panel of 1000 observations
    set obs 1000
    gen n = _n
    gen i = ceil(100*runiform())
    sort i n
    by i: gen t = _n
    *tab i, sort
    *tab t
    
    * Create need for i fixed effects
    gen temp1 = rnormal()
    egen a = mean(temp1), by(i)
    drop temp1
    
    * Create need for t fixed effects
    gen temp2 = t + rnormal()
    egen d = mean(temp2), by(t)
    drop temp2
    
    gen u1 = rnormal()
    gen u2 = rnormal()*2
    gen v = rnormal()
    gen z = rnormal()
    
    gen x = a + d + z/3 + u1 + v
    
    gen y = a + d + x/2 + u1 + u2
    
    **********************************
    /*    REG VERSION WITH DUMMIES    */
    /*        NO CLUSTERING            */
    **********************************
    
    * First stage
    qui: reg x z i.i i.t
    predict xfit, xb
    
    * Second stage
    qui: reg y xfit i.i i.t
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    di `e(df_r)'
    local dfr = `e(df_r)'
    di `dfr'
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    replace xfit = x
    predict cst2e, resid
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    di `csse'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivregress 2sls y i.i i.t (x = z)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
    * Actual IV standard errors with small sample correction
    qui: ivregress 2sls y i.i i.t (x = z), small
    di _se[x]
    
    drop xfit cst2e cst2e2
    
    
    **********************************
    /*    REG VERSION WITH DUMMIES    */
    /*        WITH CLUSTERING            */
    **********************************
    
    * Note: Clustering matters
    qui: reg x z i.i i.t
    di _se[z]
    qui: reg x z i.i i.t, cluster(i)
    di _se[z]
    
    * First stage
    qui: reg x z i.i i.t, cluster(i)
    predict xfit, xb
    
    * Second stage
    qui: reg y xfit i.i i.t, cluster(i)
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    di `e(df_r)'
    local dfr = (`e(N)' - `e(df_m)' - `e(df_r)')
    di `dfr'
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    replace xfit = x
    predict cst2e, resid
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    di `csse'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivregress 2sls y i.i i.t (x = z), cluster(i)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
    * Actual IV standard errors with small sample correction
    qui: ivregress 2sls y i.i i.t (x = z), cluster(i) small
    di _se[x]
    
    drop xfit cst2e cst2e2
    
    **********************
    /*    REGHDFE VERSION    */
    /*    NO CLUSTERING    */
    **********************
    
    * First stage
    qui: reghdfe x z, absorb(i t, savefe) resid
    predict xfit, xbd
    
    * Second stage
    qui: reghdfe y xfit, absorb(i t, savefe) resid
    predict st2e, resid
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    local dfr = `e(df_r)'
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    * Construct corrected residuals directly, because reghdfe's predict does not recompute after the X values are changed
    gen cst2e = st2e + (xfit - x)*_b[xfit]
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivreghdfe y (x = z), absorb(i t)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
    
    drop xfit st2e cst2e cst2e2
    
    **********************
    /*    REGHDFE VERSION    */
    /*    WITH CLUSTERING    */
    **********************
    
    * First stage
    qui: reghdfe x z, absorb(i t, savefe) resid cluster(i)
    predict xfit, xbd
    
    * Second stage
    qui: reghdfe y xfit, absorb(i t, savefe) resid cluster(i)
    predict st2e, resid
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    local dfr = (`e(N)' - `e(df_m)' - `e(df_r)')
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    * Construct corrected residuals directly, because reghdfe's predict does not recompute after the X values are changed
    gen cst2e = st2e + (xfit - x)*_b[xfit]
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivreghdfe y (x = z), absorb(i t) cluster(i)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
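    As a cross-check outside Stata, the algebra behind the correction can be sketched in NumPy. Everything below (the DGP, variable names, sample size) is illustrative rather than taken from the code above; it shows why rescaling the naive second-stage SE by sqrt(corrected SSE/df)/RMSE recovers the small-sample 2SLS SE exactly in the unclustered case.

```python
import numpy as np

# Illustrative DGP (names are mine, not from the Stata code above)
rng = np.random.default_rng(30819)
n = 500
z = rng.normal(size=n)                 # instrument
u = rng.normal(size=n)                 # confounder: enters both x and y
x = z / 3 + u + rng.normal(size=n)     # endogenous regressor
y = x / 2 + u + rng.normal(size=n)

Z = np.column_stack([z, np.ones(n)])   # instrument + constant
X = np.column_stack([x, np.ones(n)])   # actual regressors

# First stage: fitted values of x
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
Xhat = np.column_stack([xhat, np.ones(n)])

# Second stage: OLS of y on the fitted x
XtXinv = np.linalg.inv(Xhat.T @ Xhat)
b = XtXinv @ Xhat.T @ y

df = n - Xhat.shape[1]
rmse_naive = np.sqrt(np.sum((y - Xhat @ b) ** 2) / df)  # residuals use xhat
sse_corr = np.sum((y - X @ b) ** 2)                     # residuals use actual x
se_naive = rmse_naive * np.sqrt(XtXinv[0, 0])

# The thread's correction: scale the naive SE by sqrt(SSE_corrected/df)/RMSE
se_corrected = se_naive * (np.sqrt(sse_corr / df) / rmse_naive)

# Small-sample 2SLS SE computed directly from the corrected residual variance
se_2sls = np.sqrt(sse_corr / df) * np.sqrt(XtXinv[0, 0])
```

    The last two quantities agree to machine precision: the rescaling simply swaps the naive residual variance for the corrected one inside the usual OLS variance formula, which is why the unclustered block above recovers ivregress with the small option.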
    Mitch Downey, Grad student, UCSD Economics

  • #2
    You didn't get a quick answer. You're asking for a lot of debugging, especially when you don't provide data on which to run the model. Why you would need to do this is not clear - we don't knowingly help with homework.



    • #3
      Hi Phil,

      I'll be honest, your reply seems totally unfair to me.

      Regarding
      You're asking for a lot of debugging
      I actually don't think that's true. I'm asking the same question that's been asked multiple times, but the existing code doesn't work when you cluster. So ultimately, the question is about Stata formulas and degrees of freedom adjustments with clustered standard errors, which seems well within the bounds of what people normally ask on Statalist.

      Regarding
      especially when you don't provide data on which to run the model
      As stated in the original post, the code simulates the data for the regression and thus constitutes a minimal working example. I suppose I could have gone with webuse income instead (as is more standard), but that hardly seems like a first-class offense.

      Regarding
      Why you would need to do this is not clear
      I'm always happy to explain my research projects on Statalist but try to avoid including too much irrelevant information. I want to run 16 IV regressions with 16 different dependent variables but the same first stage. Since it's a lot of data and the regressions are slow, instead of running 16 IV regressions (in which Stata is implicitly running 16 first stages + 16 second stages = 32 regressions), I'd rather run 1 first stage + 16 second stages (with appropriate DOF adjustments) = 17 regressions. Improving code efficiency with big data again seems well within the purview of Statalist. It's worth reiterating that this question has gotten its own Stata.com post and has been asked multiple times in the past (with different variations like non-linear first stages), but that those solutions don't work with clustered standard errors. Thus, while it may not be clear to you, it certainly seems like lots of people need to do this.

      Regarding
      we don't knowingly help with homework
      Be honest: This is just rude, isn't it? Don't you think this is a pretty dismissive and condescending thing to say to someone on Statalist? I spent a lot of time writing a very clear post that I felt was both precise and concise. I made sure to annotate my code to make it easy for others to follow. I made sure that code showed exactly what the problem was. And I made sure to link to directly relevant information. Doesn't this feel like a fairly disrespectful conclusion to jump to? Do you really see this post as that amateurish?

      I understand that Statalist replies often focus primarily on best practices of posting and identifying violations that were in the original post. I understand that there is value to doing that. But looking at what I originally wrote (which it seems you didn't read very carefully) and at your reply, this seems different to me. I worked very hard to follow best practices, and your core objections to my post seem to ignore what I actually wrote.

      Thank you for your contributions to Statalist. I genuinely do appreciate what you do and what you've added to the forum/list in the 10 years I've been using Stata.

      Sincerely,
      Mitch
      Mitch Downey, Grad student, UCSD Economics



      • #4
        Four people have now separately reached out to me by email and asked if I found a solution to this problem. Fortunately, one of them was kind enough to solve the problem and share his code with me. He did not post it to Statalist, presumably because Statalist has become a toxic environment where no one actually solves anyone's problems and instead just criticizes people for having usernames that don't reflect their actual names (important exception: Sergio Correia is great and everyone knows that). It's a shame what Statalist has become over the last 10 years.

        Here's the code (which I did not write). Hope it works for you if you decide you have the same "homework assignment" that I did three years ago.

        Code:
        * First stage
        qui: reg x z $individual_controls, cluster(id)
        cap drop xfit
        predict xfit, xb
        
        * Second stage
        reg y xfit $individual_controls, cluster(id)
        local st2se = _se[xfit]
        * Degrees of freedom
        local dof = (`e(N)'-`e(df_m)')
        di `dof'
        
        tempvar esave
        qui estsave, gen(`esave')
        qui estsave, from(`esave')
        
        tempname V
        matrix `V' = e(V)
        matrix b = e(b)
        
        * Compute residuals by hand
        cap drop eps2_2 y_fit
        predict y_fit
        gen double eps2_2 = (y-y_fit)^2
        qui: sum eps2_2
        local csse2 = `r(sum)'
        * Get the root mean squared error
        local st2rmse = sqrt(`csse2'/(`dof'))
        di `st2rmse'
        local st2r = (`csse2'/(`dof'))
        
        * Getting "corrected" residuals as true X's and IV-estimated coefficients
        cap drop eps2 y_fit2
        replace xfit = x
        predict y_fit2
        gen double eps2 = (y-y_fit2)^2
        
        * Getting "corrected" sum of squared errors
        qui: sum eps2
        local csse = `r(sum)'
        
        * Original SE's with no correction for simulated variables
        di `st2se'
        * Manually calculated/adjusted SE's
        di `st2se'*(sqrt(`csse'/(`dof'))/`st2rmse')
        
        * Correct variance-covariance matrix
        matrix `V' = `V'*((`csse'/(`dof'))/`st2r')
        erepost V = `V'
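        For reference, the clustered variance can also be built from scratch rather than by rescaling: recompute the full cluster-robust sandwich using the corrected residuals. The NumPy sketch below is illustrative only (my own DGP and names, and the Stata-style small-sample factor G/(G-1) * (N-1)/(N-k) is an assumption about the clustered small-sample correction):

```python
import numpy as np

# Illustrative clustered DGP (names are mine, not from the code above)
rng = np.random.default_rng(0)
G, m = 50, 10                          # 50 clusters of 10 observations
n = G * m
g = np.repeat(np.arange(G), m)         # cluster id
a = np.repeat(rng.normal(size=G), m)   # cluster-level effect
z = rng.normal(size=n)                 # instrument
u = a + rng.normal(size=n)             # clustered confounder
x = z + u + rng.normal(size=n)         # endogenous regressor
y = x / 2 + u + rng.normal(size=n)

Z = np.column_stack([z, np.ones(n)])   # instrument + constant
X = np.column_stack([x, np.ones(n)])   # actual regressors

# First stage, then second stage on the fitted x
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
Xhat = np.column_stack([xhat, np.ones(n)])
A = np.linalg.inv(Xhat.T @ Xhat)       # "bread" of the sandwich
b = A @ Xhat.T @ y

# Corrected residuals: actual x, two-step coefficients
e = y - X @ b

# Cluster-robust "meat": sum over clusters of score outer products
k = Xhat.shape[1]
meat = np.zeros((k, k))
for j in range(G):
    s = Xhat[g == j].T @ e[g == j]     # within-cluster score sum
    meat += np.outer(s, s)

# Assumed Stata-style small-sample factor: G/(G-1) * (N-1)/(N-k)
c = (G / (G - 1)) * ((n - 1) / (n - k))
V = c * (A @ meat @ A)
se_x = np.sqrt(V[0, 0])
```

        Because Xhat'Xhat equals Xhat'X when the exogenous columns are among the instruments, the two-step point estimates coincide with the direct 2SLS ones, so this V is the cluster-robust sandwich evaluated at the 2SLS coefficients and corrected residuals.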
        Mitch Downey, Grad student, UCSD Economics



        • #5
          Thank you so much, Mitch, for posting all this!
