
  • Computing correct standard errors for within estimator

    I need to estimate an individual fixed-effects regression, but due to the computational burden I cannot use -reghdfe- or include indicator/dummy variables for the fixed effects. I have managed to get the correct point estimates by applying the within transformation and running OLS on the demeaned variables. However, my standard errors are off. How can I get the correct standard errors, i.e., the ones reported by -reghdfe-? Below is a MWE.


    clear all
    set obs 10000

    gen x = runiform()
    gen y = (2+runiform()) * x
    gen indiv_id = round(runiform(1,1000))

    * demean variables at indiv_id level
    foreach v in x y {
        egen meantemp = mean(`v'), by(indiv_id)
        gen `v'_demeaned = `v' - meantemp
        drop meantemp
    }

    reghdfe y x, absorb(indiv_id)
    reg y_demeaned x_demeaned

  • #2
    Use -xtreg, fe-. Since you are only trying to absorb a single id variable, you can make use of it. It works by demeaning, but produces standard errors that match those of -reghdfe-.

    Code:
    . reghdfe y x, absorb(indiv_id)
    (MWFE estimator converged in 1 iterations)
    
    HDFE Linear regression                            Number of obs   =     10,000
    Absorbing 1 HDFE group                            F(   1,   8999) =  171600.86
                                                      Prob > F        =     0.0000
                                                      R-squared       =     0.9549
                                                      Adj R-squared   =     0.9499
                                                      Within R-sq.    =     0.9502
                                                      Root MSE        =     0.1666
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   2.501669   .0060391   414.25   0.000     2.489831    2.513507
           _cons |   .0008676   .0034529     0.25   0.802    -.0059009    .0076361
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
        indiv_id |      1000           0        1000     |
    -----------------------------------------------------+
    
    . reg y_demeaned x_demeaned
    
          Source |       SS           df       MS      Number of obs   =    10,000
    -------------+----------------------------------   F(1, 9998)      >  99999.00
           Model |  4763.05735         1  4763.05735   Prob > F        =    0.0000
        Residual |  249.781693     9,998  .024983166   R-squared       =    0.9502
    -------------+----------------------------------   Adj R-squared   =    0.9502
           Total |  5012.83904     9,999  .501334038   Root MSE        =    .15806
    
    ------------------------------------------------------------------------------
      y_demeaned | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
      x_demeaned |   2.501669   .0057294   436.64   0.000     2.490438      2.5129
           _cons |   4.29e-10   .0015806     0.00   1.000    -.0030983    .0030983
    ------------------------------------------------------------------------------
    
    . xtset indiv_id
    
    Panel variable: indiv_id (unbalanced)
    
    . xtreg y x, fe
    
    Fixed-effects (within) regression               Number of obs     =     10,000
    Group variable: indiv_id                        Number of groups  =      1,000
    
    R-squared:                                      Obs per group:
         Within  = 0.9502                                         min =          2
         Between = 0.9439                                         avg =       10.0
         Overall = 0.9498                                         max =         22
    
                                                    F(1,8999)         =  171600.86
    corr(u_i, Xb) = 0.0071                          Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   2.501669   .0060391   414.25   0.000     2.489831    2.513507
           _cons |   .0008676   .0034529     0.25   0.802    -.0059009    .0076361
    -------------+----------------------------------------------------------------
         sigma_u |  .05671267
         sigma_e |  .16660314
             rho |  .10384316   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(999, 8999) = 1.01                   Prob > F = 0.3991
    
    .



    • #3
      When you use the traditional (nonrobust) standard errors, pooled OLS estimation on the within transformed variables does not properly estimate the error variance: it assumes too many degrees of freedom. In a balanced panel, the proper DF is N(T - 1) - K rather than NT - K.

      Having said that, you should be using vce(cluster id) [equivalent to vce(robust)] in almost all applications. Then, the within standard errors are just fine, and should be very close to reghdfe with the vce(cluster id) option.
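
      Using the MWE from #1 (a sketch; the demeaned variables are the ones constructed there), this can be checked directly:

      Code:
      reghdfe y x, absorb(indiv_id) vce(cluster indiv_id)
      reg y_demeaned x_demeaned, vce(cluster indiv_id)

      The point estimates are identical, and the clustered standard errors should agree closely.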



      • #4
        Thanks. I tried your suggestions using fake data with 10 million observations and 100K individual ids. The panel is unbalanced.

        -reghdfe- is faster than both of the alternatives you mention i.e., -xtreg, fe- and -reg- on the transformed vars with vce(cluster id).

        -reg- on the transformed variables without clustering is by far the fastest: almost 10 times as fast as -reghdfe-. I wonder if there is a way to run -reg- and then correct the SEs it outputs, perhaps by adjusting the DF and recomputing the SEs manually.



        • #5
          The variables need to be demeaned every time I call -xtreg, fe-. This is very expensive because in addition to the individual FEs I have roughly 300 additional variables which include fixed effects for year, age, cities and numerous continuous controls. Ideally I would demean everything once and save the demeaned variables in a dataset. I could then just run regressions on this transformed dataset in the future without having to transform it every time I want to run a regression.


          Originally posted by Clyde Schechter View Post
          Use -xtreg, fe-. Since you are only trying to absorb a single id variable, you can make use of it. It works by demeaning, but produces standard errors that match those of -reghdfe-.
          Last edited by Sharad Kumar; 05 Oct 2022, 04:14.



          • #6
            I don't understand your point in #4. In #1 you said that you were unable to use -reghdfe- with your data set. If you want -reghdfe-'s calculations and you can use it, then just use it. Yes, it is faster than the equivalent -xtreg, fe-. I was proposing -xtreg, fe- only because you led me to believe that -reghdfe- is not a possibility for you because of the number of absorbed variables. So I don't understand what you were asking for in the first place.

            Re #5:
            Ideally I would demean everything once and save the demeaned variables in a dataset.
            Yes, there is a Stata command to do precisely that. See -help xtdata-. It will create the demeaned data set for you. You will, of course, have to also write a command to save the result. Then you can go ahead and use that for repeated analyses using -regress-.
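
            A minimal sketch of that workflow, using the variable names from the MWE in #1 (the file name is illustrative):

            Code:
            xtset indiv_id
            xtdata y x, fe clear          // replaces data in memory with within-transformed variables
            save demeaned_data, replace   // demean once, save for reuse
            use demeaned_data, clear
            regress y x, vce(cluster indiv_id)

            The panel variable is kept in the transformed data set, so you can still cluster on it.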

            But again, -reghdfe- is faster, so why not just use it?



            • #7
              Sorry for the confusion. I was trying to strip down the problem and only present the relevant items in the original post to avoid confusion, but it appears I have created even more confusion.

              My issue is that I have limited processing power, but I need to estimate many specifications/versions of a model with ~10 million observations, 100K individual fixed effects, and some 300 other control variables (some of which are fixed effects in other dimensions, like age). The fully specified -reghdfe- takes approximately 5 hours to converge, and I would like to reduce this time considerably if possible. To that end, I tried demeaning everything at the individual level (there are 100K individuals); this gave me the correct point estimates, but the SEs are off as discussed above. The entire process of demeaning and running the regression on the demeaned variables took no more than 10 minutes, which is a considerable improvement. I am puzzled why -reghdfe- and the various -xt- commands take hours longer than the simple demean-and-regress process, and, more importantly, whether it is possible to bring -reghdfe-'s run time down to something closer to that of regressing on the demeaned variables.


              Originally posted by Clyde Schechter View Post
              I don't understand your point in #4. In #1 you said that you were unable to use -reghdfe- with your data set. If you want -reghdfe-'s calculations and you can use it, then just use it. Yes, it is faster than the equivalent -xtreg, fe-. I was proposing -xtreg, fe- only because you led me to believe that -reghdfe- is not a possibility for you because of the number of absorbed variables. So I don't understand what you were asking for in the first place.

              Re #5:
              Yes, there is a Stata command to do precisely that. See -help xtdata-. It will create the demeaned data set for you. You will, of course, have to also write a command to save the result. Then you can go ahead and use that for repeated analyses using -regress-.

              But again, -reghdfe- is faster, so why not just use it?



              • #8
                OK. As Jeff Wooldridge pointed out in #3, you should probably be clustering your standard errors for these models anyway. If you do that, -reghdfe-, -xtreg, fe-, and -regress- applied to demeaned data all produce identical results:
                Code:
                . sysuse auto, clear
                (1978 automobile data)
                
                .
                . reghdfe price mpg, absorb(rep78) vce(cluster rep78)
                (MWFE estimator converged in 1 iterations)
                
                HDFE Linear regression                            Number of obs   =         69
                Absorbing 1 HDFE group                            F(   1,      4) =       8.55
                Statistics robust to heteroskedasticity           Prob > F        =     0.0431
                                                                  R-squared       =     0.2584
                                                                  Adj R-squared   =     0.1995
                                                                  Within R-sq.    =     0.2475
                Number of clusters (rep78)   =          5         Root MSE        =  2605.7822
                
                                                  (Std. err. adjusted for 5 clusters in rep78)
                ------------------------------------------------------------------------------
                             |               Robust
                       price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                         mpg |  -280.2615   95.87178    -2.92   0.043    -546.4442   -14.07875
                       _cons |   12112.77   2041.096     5.93   0.004     6445.778    17779.76
                ------------------------------------------------------------------------------
                
                Absorbed degrees of freedom:
                -----------------------------------------------------+
                 Absorbed FE | Categories  - Redundant  = Num. Coefs |
                -------------+---------------------------------------|
                       rep78 |         5           5           0    *|
                -----------------------------------------------------+
                * = FE nested within cluster; treated as redundant for DoF computation
                
                .
                . xtset rep78
                
                Panel variable: rep78 (unbalanced)
                
                . xtreg price mpg, fe vce(cluster rep78)
                
                Fixed-effects (within) regression               Number of obs     =         69
                Group variable: rep78                           Number of groups  =          5
                
                R-squared:                                      Obs per group:
                     Within  = 0.2475                                         min =          2
                     Between = 0.0014                                         avg =       13.8
                     Overall = 0.2079                                         max =         30
                
                                                                F(1,4)            =       8.55
                corr(u_i, Xb) = -0.4351                         Prob > F          =     0.0431
                
                                                  (Std. err. adjusted for 5 clusters in rep78)
                ------------------------------------------------------------------------------
                             |               Robust
                       price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                         mpg |  -280.2615   95.87178    -2.92   0.043    -546.4442   -14.07875
                       _cons |   12112.77   2041.096     5.93   0.004     6445.778    17779.76
                -------------+----------------------------------------------------------------
                     sigma_u |  1152.8545
                     sigma_e |  2605.7822
                         rho |  .16369566   (fraction of variance due to u_i)
                ------------------------------------------------------------------------------
                
                .
                . xtdata price mpg, fe clear
                
                . regress price mpg, vce(cluster rep78)
                
                Linear regression                               Number of obs     =         69
                                                                F(0, 4)           =          .
                                                                Prob > F          =          .
                                                                R-squared         =     0.2475
                                                                Root MSE          =     2526.8
                
                                                  (Std. err. adjusted for 5 clusters in rep78)
                ------------------------------------------------------------------------------
                             |               Robust
                       price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                         mpg |  -280.2615   95.87178    -2.92   0.043    -546.4442   -14.07875
                       _cons |   12112.77   2041.096     5.93   0.004     6445.778    17779.76
                ------------------------------------------------------------------------------
                So, demean your data once with -xtdata-, and save it. Then do all your runs using -regress, vce(cluster indiv_id)- and you will get the standard errors you need and the fastest possible execution time.



                • #9
                  Hi Sharad,
                  I have another suggestion. Some time ago I wrote a command, available on SSC, called -regxfe-. It works much the same as -reghdfe-; it is slower the first time, as you may find out, but it allows you to break the problem down into smaller steps:
                  demeaning
                  calculating degrees of freedom
                  estimating the model
                  Look into the help file and the paper in the Stata Journal (contact me if you can't access it).
                  Best wishes,
                  Fernando



                  • #10
                    As Jeff said, -regress- on the demeaned data gets the standard errors wrong due to incorrect degrees of freedom. In balanced panels, this can be fixed by rescaling, i.e. multiplying the reported standard errors by the following factor:

                    \( \sqrt{\frac{NT - K}{N(T-1) - K}} \)

                    Better of course: Use (cluster-)robust standard errors.
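
                    A sketch of the manual correction asked about in #4 (illustrative values; N, T, and K must match your balanced panel and model):

                    Code:
                    * after: reg y_demeaned x_demeaned
                    local N = 1000          // number of panel units
                    local T = 10            // periods per unit (balanced panel assumed)
                    local K = 1             // number of slope coefficients
                    local factor = sqrt((`N'*`T' - `K') / (`N'*(`T'-1) - `K'))
                    display "corrected SE for x: " _se[x_demeaned]*`factor'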
                    Last edited by Sebastian Kripfganz; 05 Oct 2022, 15:39.
                    https://www.kripfganz.de/stata/

