
  • Manual SE adjustment for 2SLS with clustering

    Hello,

    I'd like to manually estimate a two-stage least squares regression: first run the first stage, then run the second stage using the predicted X. Done this way, the second-stage standard errors need to be adjusted to account for the predicted X being a generated (rather than observed) regressor.

    I did so using code from a Stata.com post and a Kit Baum Statalist post. With clustering, however, the degrees of freedom adjustment isn't quite right and I can't figure out how to do it. I can get "close" but my manual SE's still don't match those from ivregress (although they do when I don't cluster).

    Can someone correct my SE adjustment so that the manual two-stage can recover the SE's from ivregress?

    Below is code that (1) generates an illustrative dataset, (2) produces the accurate correction without clustering (very similar to the code in the links above, but slightly more automated), (3) produces an "almost accurate" correction with clustering, and (4) repeats the whole exercise with reghdfe in addition to reg. The reghdfe version has to be done slightly differently because of differences in how predict works; I'm including it because I thought it might be useful to others.

    Thank you,
    Mitch

    Code:
    set seed 30819
    
    ******************
    /*    DATA SETUP    */
    ******************
    
    clear
    
    * Generate an unbalanced panel of 1000 observations
    set obs 1000
    gen n = _n
    gen i = ceil(100*runiform())
    sort i n
    by i: gen t = _n
    *tab i, sort
    *tab t
    
    * Create need for i fixed effects
    gen temp1 = rnormal()
    egen a = mean(temp1), by(i)
    drop temp1
    
    * Create need for t fixed effects
    gen temp2 = t + rnormal()
    egen d = mean(temp2), by(t)
    drop temp2
    
    gen u1 = rnormal()
    gen u2 = rnormal()*2
    gen v = rnormal()
    gen z = rnormal()
    
    gen x = a + d + z/3 + u1 + v
    
    gen y = a + d + x/2 + u1 + u2
    
    **********************************
    /*    REG VERSION WITH DUMMIES    */
    /*        NO CLUSTERING            */
    **********************************
    
    * First stage
    qui: reg x z i.i i.t
    predict xfit, xb
    
    * Second stage
    qui: reg y xfit i.i i.t
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    di `e(df_r)'
    local dfr = `e(df_r)'
    di `dfr'
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    replace xfit = x
    predict cst2e, resid
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    di `csse'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivregress 2sls y i.i i.t (x = z)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
    * Actual IV standard errors with small sample correction
    qui: ivregress 2sls y i.i i.t (x = z), small
    di _se[x]
    
    drop xfit cst2e cst2e2
    
    
    **********************************
    /*    REG VERSION WITH DUMMIES    */
    /*        WITH CLUSTERING            */
    **********************************
    
    * Note: Clustering matters
    qui: reg x z i.i i.t
    di _se[z]
    qui: reg x z i.i i.t, cluster(i)
    di _se[z]
    
    * First stage
    qui: reg x z i.i i.t, cluster(i)
    predict xfit, xb
    
    * Second stage
    qui: reg y xfit i.i i.t, cluster(i)
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    di `e(df_r)'
    local dfr = (`e(N)' - `e(df_m)' - `e(df_r)')
    di `dfr'
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    replace xfit = x
    predict cst2e, resid
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    di `csse'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivregress 2sls y i.i i.t (x = z), cluster(i)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
    * Actual IV standard errors with small sample correction
    qui: ivregress 2sls y i.i i.t (x = z), cluster(i) small
    di _se[x]
    
    drop xfit cst2e cst2e2
    
    **********************
    /*    REGHDFE VERSION    */
    /*    NO CLUSTERING    */
    **********************
    
    * First stage
    qui: reghdfe x z, absorb(i t, savefe) resid
    predict xfit, xbd
    
    * Second stage
    qui: reghdfe y xfit, absorb(i t, savefe) resid
    predict st2e, resid
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    local dfr = `e(df_r)'
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    * Construct corrected residuals directly, because reghdfe's predict does not recompute after the X values are changed
    gen cst2e = st2e + (xfit - x)*_b[xfit]
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivreghdfe y (x = z), absorb(i t)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
    
    drop xfit st2e cst2e cst2e2
    
    **********************
    /*    REGHDFE VERSION    */
    /*    WITH CLUSTERING    */
    **********************
    
    * First stage
    qui: reghdfe x z, absorb(i t, savefe) resid cluster(i)
    predict xfit, xbd
    
    * Second stage
    qui: reghdfe y xfit, absorb(i t, savefe) resid cluster(i)
    predict st2e, resid
    local st2se = _se[xfit]
    local st2rmse = `e(rmse)'
    local dfr = (`e(N)' - `e(df_m)' - `e(df_r)')
    
    * Getting "corrected" residuals as true X's and IV-estimated coefficients
    * Construct corrected residuals directly, because reghdfe's predict does not recompute after the X values are changed
    gen cst2e = st2e + (xfit - x)*_b[xfit]
    
    * Getting "corrected" sum of squared errors
    gen cst2e2 = cst2e^2
    qui: sum cst2e2
    local csse = `r(sum)'
    
    * Original SE's with no correction for simulated variables
    di `st2se'
    * Actual IV standard errors
    qui: ivreghdfe y (x = z), absorb(i t) cluster(i)
    di _se[x]
    * Manually calculated/adjusted SE's
    di `st2se'*(sqrt(`csse'/`dfr')/`st2rmse')
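    As a cross-check outside Stata, the algebra behind the correction can be sketched in NumPy. Everything below (the DGP, variable names, sample size) is illustrative rather than taken from the code above; it shows why rescaling the naive second-stage SE by sqrt(corrected SSE/df)/RMSE recovers the small-sample 2SLS SE exactly in the unclustered case.

```python
import numpy as np

# Illustrative DGP (names are mine, not from the Stata code above)
rng = np.random.default_rng(30819)
n = 500
z = rng.normal(size=n)                 # instrument
u = rng.normal(size=n)                 # confounder: enters both x and y
x = z / 3 + u + rng.normal(size=n)     # endogenous regressor
y = x / 2 + u + rng.normal(size=n)

Z = np.column_stack([z, np.ones(n)])   # instrument + constant
X = np.column_stack([x, np.ones(n)])   # actual regressors

# First stage: fitted values of x
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
Xhat = np.column_stack([xhat, np.ones(n)])

# Second stage: OLS of y on the fitted x
XtXinv = np.linalg.inv(Xhat.T @ Xhat)
b = XtXinv @ Xhat.T @ y

df = n - Xhat.shape[1]
rmse_naive = np.sqrt(np.sum((y - Xhat @ b) ** 2) / df)  # residuals use xhat
sse_corr = np.sum((y - X @ b) ** 2)                     # residuals use actual x
se_naive = rmse_naive * np.sqrt(XtXinv[0, 0])

# The thread's correction: scale the naive SE by sqrt(SSE_corrected/df)/RMSE
se_corrected = se_naive * (np.sqrt(sse_corr / df) / rmse_naive)

# Small-sample 2SLS SE computed directly from the corrected residual variance
se_2sls = np.sqrt(sse_corr / df) * np.sqrt(XtXinv[0, 0])
```

    The last two quantities agree to machine precision: the rescaling simply swaps the naive residual variance for the corrected one inside the usual OLS variance formula, which is why the unclustered block above recovers ivregress with the small option.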
    Mitch Downey, Grad student, UCSD Economics

  • #2
    You didn't get a quick answer. You're asking for a lot of debugging, especially when you don't provide data on which to run the model. Why you would need to do this is not clear - we don't knowingly help with homework.



    • #3
      Hi Phil,

      I'll be honest, your reply seems totally unfair to me.

      Regarding
      You're asking for a lot of debugging
      I actually don't think that's true. I'm asking the same question that's been asked multiple times, but the existing code doesn't work when you cluster. So ultimately, the question is about Stata formulas and degrees of freedom adjustments with clustered standard errors, which seems well within the bounds of what people normally ask on Statalist.

      Regarding
      especially when you don't provide data on which to run the model
      As stated in the original post, the code simulates the data for the regression and thus constitutes a minimal working example. I suppose I could have gone with webuse income instead (as is more standard), but that hardly seems like a first-class offense.

      Regarding
      Why you would need to do this is not clear
      I'm always happy to explain my research projects on Statalist but try to avoid including too much irrelevant information. I want to run 16 IV regressions with 16 different dependent variables but the same first stage. Since it's a lot of data and the regressions are slow, instead of running 16 IV regressions (in which Stata is implicitly running 16 first stages + 16 second stages = 32 regressions), I'd rather run 1 first stage + 16 second stages (with appropriate DOF adjustments) = 17 regressions. Improving code efficiency with big data again seems well within the purview of Statalist. It's worth reiterating that this question has gotten its own Stata.com post and has been asked multiple times in the past (with different variations like non-linear first stages), but that those solutions don't work with clustered standard errors. Thus, while it may not be clear to you, it certainly seems like lots of people need to do this.

      Regarding
      we don't knowingly help with homework
      Be honest: This is just rude, isn't it? Don't you think this is a pretty dismissive and condescending thing to say to someone on Statalist? I spent a lot of time writing a very clear post that I felt was both precise and concise. I made sure to annotate my code to make it easy for others to follow. I made sure that code showed exactly what the problem was. And I made sure to link to directly relevant information. Doesn't this feel like a fairly disrespectful conclusion to jump to? Do you really see this post as that amateurish?

      I understand that Statalist replies often focus primarily on best practices of posting and identifying violations that were in the original post. I understand that there is value to doing that. But looking at what I originally wrote (which it seems you didn't read very carefully) and at your reply, this seems different to me. I worked very hard to follow best practices, and your core objections to my post seem to ignore what I actually wrote.

      Thank you for your contributions to Statalist. I genuinely do appreciate what you do and what you've added to the forum/list in the 10 years I've been using Stata.

      Sincerely,
      Mitch
      Mitch Downey, Grad student, UCSD Economics



      • #4
        Four people have now separately reached out to me by email and asked if I found a solution to this problem. Fortunately, one of them was kind enough to solve the problem and share his code with me. He did not post it to Statalist, presumably because Statalist has become a toxic environment where no one actually solves anyone's problems and instead just criticizes people for having usernames that don't reflect their actual names (important exception: Sergio Correia is great and everyone knows that). It's a shame what Statalist has become over the last 10 years.

        Here's the code (which I did not write). Hope it works for you if you decide you have the same "homework assignment" that I did three years ago.

        Code:
        * First stage
        qui: reg x z $individual_controls, cluster(id)
        cap drop xfit
        predict xfit, xb
        
        * Second stage
        reg y xfit $individual_controls, cluster(id)
        local st2se = _se[xfit]
        * Degrees of freedom
        local dof = (`e(N)'-`e(df_m)')
        di `dof'
        
        tempvar esave
        qui estsave, gen(`esave')
        qui estsave, from(`esave')
        
        tempname V
        matrix `V' = e(V)
        matrix b = e(b)
        
        * Compute residuals by hand
        cap drop eps2_2 y_fit
        predict y_fit
        gen double eps2_2 = (y-y_fit)^2
        qui: sum eps2_2
        local csse2 = `r(sum)'
        * Get the root mean squared error
        local st2rmse = sqrt(`csse2'/(`dof'))
        di `st2rmse'
        local st2r = (`csse2'/(`dof'))
        
        * Getting "corrected" residuals as true X's and IV-estimated coefficients
        cap drop eps2 y_fit2
        replace xfit = x
        predict y_fit2
        gen double eps2 = (y-y_fit2)^2
        
        * Getting "corrected" sum of squared errors
        qui: sum eps2
        local csse = `r(sum)'
        
        * Original SE's with no correction for simulated variables
        di `st2se'
        * Manually calculated/adjusted SE's
        di `st2se'*(sqrt(`csse'/(`dof'))/`st2rmse')
        
        * Correct variance-covariance matrix
        matrix `V' = `V'*((`csse'/(`dof'))/`st2r')
        erepost V = `V'
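        For reference, the clustered variance can also be built from scratch rather than by rescaling: recompute the full cluster-robust sandwich using the corrected residuals. The NumPy sketch below is illustrative only (my own DGP and names, and the Stata-style small-sample factor G/(G-1) * (N-1)/(N-k) is an assumption about the clustered small-sample correction):

```python
import numpy as np

# Illustrative clustered DGP (names are mine, not from the code above)
rng = np.random.default_rng(0)
G, m = 50, 10                          # 50 clusters of 10 observations
n = G * m
g = np.repeat(np.arange(G), m)         # cluster id
a = np.repeat(rng.normal(size=G), m)   # cluster-level effect
z = rng.normal(size=n)                 # instrument
u = a + rng.normal(size=n)             # clustered confounder
x = z + u + rng.normal(size=n)         # endogenous regressor
y = x / 2 + u + rng.normal(size=n)

Z = np.column_stack([z, np.ones(n)])   # instrument + constant
X = np.column_stack([x, np.ones(n)])   # actual regressors

# First stage, then second stage on the fitted x
xhat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
Xhat = np.column_stack([xhat, np.ones(n)])
A = np.linalg.inv(Xhat.T @ Xhat)       # "bread" of the sandwich
b = A @ Xhat.T @ y

# Corrected residuals: actual x, two-step coefficients
e = y - X @ b

# Cluster-robust "meat": sum over clusters of score outer products
k = Xhat.shape[1]
meat = np.zeros((k, k))
for j in range(G):
    s = Xhat[g == j].T @ e[g == j]     # within-cluster score sum
    meat += np.outer(s, s)

# Assumed Stata-style small-sample factor: G/(G-1) * (N-1)/(N-k)
c = (G / (G - 1)) * ((n - 1) / (n - k))
V = c * (A @ meat @ A)
se_x = np.sqrt(V[0, 0])
```

        Because Xhat'Xhat equals Xhat'X when the exogenous columns are among the instruments, the two-step point estimates coincide with the direct 2SLS ones, so this V is the cluster-robust sandwich evaluated at the 2SLS coefficients and corrected residuals.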
        Mitch Downey, Grad student, UCSD Economics



        • #5
          Thank you so much, Mitch, for posting all this!
