
  • Computing correct standard errors for within estimator

    I need to estimate an individual fixed-effects regression, but due to the computational burden I cannot use -reghdfe- or include indicator/dummy variables for the fixed effects. I have managed to get the correct point estimates by applying the within transformation and running OLS on the demeaned variables. However, my standard errors are off. How can I get the correct standard errors, i.e., the ones reported by -reghdfe-? Below is a MWE.


    clear all
    set obs 10000

    gen x = runiform()
    gen y = (2+runiform()) * x
    gen indiv_id = round(runiform(1,1000))

    * demean variables at indiv_id level
    foreach v in x y {
        egen meantemp = mean(`v'), by(indiv_id)
        gen `v'_demeaned = `v' - meantemp
        drop meantemp
    }

    reghdfe y x, absorb(indiv_id)
    reg y_demeaned x_demeaned

  • #2
    Use -xtreg, fe-. Since you are only trying to absorb a single id variable, you can make use of it. It works by demeaning, but produces standard errors that match those of -reghdfe-.

    Code:
    . reghdfe y x, absorb(indiv_id)
    (MWFE estimator converged in 1 iterations)
    
    HDFE Linear regression                            Number of obs   =     10,000
    Absorbing 1 HDFE group                            F(   1,   8999) =  171600.86
                                                      Prob > F        =     0.0000
                                                      R-squared       =     0.9549
                                                      Adj R-squared   =     0.9499
                                                      Within R-sq.    =     0.9502
                                                      Root MSE        =     0.1666
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   2.501669   .0060391   414.25   0.000     2.489831    2.513507
           _cons |   .0008676   .0034529     0.25   0.802    -.0059009    .0076361
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
        indiv_id |      1000           0        1000     |
    -----------------------------------------------------+
    
    . reg y_demeaned x_demeaned
    
          Source |       SS           df       MS      Number of obs   =    10,000
    -------------+----------------------------------   F(1, 9998)      >  99999.00
           Model |  4763.05735         1  4763.05735   Prob > F        =    0.0000
        Residual |  249.781693     9,998  .024983166   R-squared       =    0.9502
    -------------+----------------------------------   Adj R-squared   =    0.9502
           Total |  5012.83904     9,999  .501334038   Root MSE        =    .15806
    
    ------------------------------------------------------------------------------
      y_demeaned | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
      x_demeaned |   2.501669   .0057294   436.64   0.000     2.490438      2.5129
           _cons |   4.29e-10   .0015806     0.00   1.000    -.0030983    .0030983
    ------------------------------------------------------------------------------
    
    . xtset indiv_id
    
    Panel variable: indiv_id (unbalanced)
    
    . xtreg y x, fe
    
    Fixed-effects (within) regression               Number of obs     =     10,000
    Group variable: indiv_id                        Number of groups  =      1,000
    
    R-squared:                                      Obs per group:
         Within  = 0.9502                                         min =          2
         Between = 0.9439                                         avg =       10.0
         Overall = 0.9498                                         max =         22
    
                                                    F(1,8999)         =  171600.86
    corr(u_i, Xb) = 0.0071                          Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
               x |   2.501669   .0060391   414.25   0.000     2.489831    2.513507
           _cons |   .0008676   .0034529     0.25   0.802    -.0059009    .0076361
    -------------+----------------------------------------------------------------
         sigma_u |  .05671267
         sigma_e |  .16660314
             rho |  .10384316   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(999, 8999) = 1.01                   Prob > F = 0.3991
    
    .



    • #3
      When you use the traditional (nonrobust) standard errors, pooled OLS estimation on the within transformed variables does not properly estimate the error variance: it assumes too many degrees of freedom. In a balanced panel, the proper DF is N(T - 1) - K rather than NT - K.

      Having said that, you should be using vce(cluster id) [equivalent to vce(robust)] in almost all applications. Then, the within standard errors are just fine, and should be very close to reghdfe with the vce(cluster id) option.
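
      Using the MWE from #1 (a sketch; the demeaned variables are the ones constructed there), this can be checked directly:

      Code:
      reghdfe y x, absorb(indiv_id) vce(cluster indiv_id)
      reg y_demeaned x_demeaned, vce(cluster indiv_id)

      The point estimates are identical, and the clustered standard errors should agree closely.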



      • #4
        Thanks. I tried your suggestions using fake data with 10 million observations and 100K individual ids. The panel is unbalanced.

        -reghdfe- is faster than both of the alternatives you mention i.e., -xtreg, fe- and -reg- on the transformed vars with vce(cluster id).

        -reg- on the transformed variables without clustering is by far the fastest: almost 10 times as fast as -reghdfe-. I wonder if there is a way to run -reg- and then correct the SEs it outputs, perhaps by adjusting the DF and recomputing the SEs manually.



        • #5
          The variables need to be demeaned every time I call -xtreg, fe-. This is very expensive because in addition to the individual FEs I have roughly 300 additional variables which include fixed effects for year, age, cities and numerous continuous controls. Ideally I would demean everything once and save the demeaned variables in a dataset. I could then just run regressions on this transformed dataset in the future without having to transform it every time I want to run a regression.


          Originally posted by Clyde Schechter View Post
          Use -xtreg, fe-. Since you are only trying to absorb a single id variable, you can make use of it. It works by demeaning, but produces standard errors that match those of -reghdfe-.
          Last edited by Sharad Kumar; 05 Oct 2022, 04:14.



          • #6
            I don't understand your point in #4. In #1 you said that you were unable to use -reghdfe- with your data set. If you want -reghdfe-'s calculations and you can use it, then just use it. Yes, it is faster than the equivalent -xtreg, fe-. I was proposing -xtreg, fe- only because you led me to believe that -reghdfe- is not a possibility for you because of the number of absorbed variables. So I don't understand what you were asking for in the first place.

            Re #5:
            Ideally I would demean everything once and save the demeaned variables in a dataset.
            Yes, there is a Stata command to do precisely that. See -help xtdata-. It will create the demeaned data set for you. You will, of course, have to also write a command to save the result. Then you can go ahead and use that for repeated analyses using -regress-.
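
            A minimal sketch of that workflow, using the variable names from the MWE in #1 (the file name is illustrative):

            Code:
            xtset indiv_id
            xtdata y x, fe clear          // replaces data in memory with within-transformed variables
            save demeaned_data, replace   // demean once, save for reuse
            use demeaned_data, clear
            regress y x, vce(cluster indiv_id)

            The panel variable is kept in the transformed data set, so you can still cluster on it.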

            But again, -reghdfe- is faster, so why not just use it?



            • #7
              Sorry for the confusion. I was trying to strip down the problem and only present the relevant items in the original post to avoid confusion, but it appears I have created even more confusion.

              My issue is that I have limited processing power, but I need to estimate many specifications/versions of a model with ~10 million observations, 100K individual fixed effects, and some 300 other control variables (some of which are fixed effects in other dimensions, like age). The fully specified -reghdfe- takes approximately 5 hours to converge, and I would like to reduce this time considerably if possible. To that end, I tried demeaning everything at the individual level (there are 100K individuals); this gave me the correct point estimates, but the SEs are off as discussed above. The entire process of demeaning and running the regression on the demeaned variables took no more than 10 minutes, which is a considerable improvement. I am puzzled why -reghdfe- and the various -xt- commands take hours longer than the simple demean-and-regress process, and, more importantly, whether it is possible to bring -reghdfe-'s run time down to something closer to that of regressing on the demeaned variables.


              Originally posted by Clyde Schechter View Post
              I don't understand your point in #4. In #1 you said that you were unable to use -reghdfe- with your data set. If you want -reghdfe-'s calculations and you can use it, then just use it. Yes, it is faster than the equivalent -xtreg, fe-. I was proposing -xtreg, fe- only because you led me to believe that -reghdfe- is not a possibility for you because of the number of absorbed variables. So I don't understand what you were asking for in the first place.

              Re #5:
              Yes, there is a Stata command to do precisely that. See -help xtdata-. It will create the demeaned data set for you. You will, of course, have to also write a command to save the result. Then you can go ahead and use that for repeated analyses using -regress-.

              But again, -reghdfe- is faster, so why not just use it?



              • #8
                OK. As Jeff Wooldridge pointed out in #3, you should probably be clustering your standard errors for these models anyway. If you do that, -reghdfe-, -xtreg, fe-, and -regress- applied to demeaned data all produce identical results:
                Code:
                . sysuse auto, clear
                (1978 automobile data)
                
                .
                . reghdfe price mpg, absorb(rep78) vce(cluster rep78)
                (MWFE estimator converged in 1 iterations)
                
                HDFE Linear regression                            Number of obs   =         69
                Absorbing 1 HDFE group                            F(   1,      4) =       8.55
                Statistics robust to heteroskedasticity           Prob > F        =     0.0431
                                                                  R-squared       =     0.2584
                                                                  Adj R-squared   =     0.1995
                                                                  Within R-sq.    =     0.2475
                Number of clusters (rep78)   =          5         Root MSE        =  2605.7822
                
                                                  (Std. err. adjusted for 5 clusters in rep78)
                ------------------------------------------------------------------------------
                             |               Robust
                       price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                         mpg |  -280.2615   95.87178    -2.92   0.043    -546.4442   -14.07875
                       _cons |   12112.77   2041.096     5.93   0.004     6445.778    17779.76
                ------------------------------------------------------------------------------
                
                Absorbed degrees of freedom:
                -----------------------------------------------------+
                 Absorbed FE | Categories  - Redundant  = Num. Coefs |
                -------------+---------------------------------------|
                       rep78 |         5           5           0    *|
                -----------------------------------------------------+
                * = FE nested within cluster; treated as redundant for DoF computation
                
                .
                . xtset rep78
                
                Panel variable: rep78 (unbalanced)
                
                . xtreg price mpg, fe vce(cluster rep78)
                
                Fixed-effects (within) regression               Number of obs     =         69
                Group variable: rep78                           Number of groups  =          5
                
                R-squared:                                      Obs per group:
                     Within  = 0.2475                                         min =          2
                     Between = 0.0014                                         avg =       13.8
                     Overall = 0.2079                                         max =         30
                
                                                                F(1,4)            =       8.55
                corr(u_i, Xb) = -0.4351                         Prob > F          =     0.0431
                
                                                  (Std. err. adjusted for 5 clusters in rep78)
                ------------------------------------------------------------------------------
                             |               Robust
                       price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                         mpg |  -280.2615   95.87178    -2.92   0.043    -546.4442   -14.07875
                       _cons |   12112.77   2041.096     5.93   0.004     6445.778    17779.76
                -------------+----------------------------------------------------------------
                     sigma_u |  1152.8545
                     sigma_e |  2605.7822
                         rho |  .16369566   (fraction of variance due to u_i)
                ------------------------------------------------------------------------------
                
                .
                . xtdata price mpg, fe clear
                
                . regress price mpg, vce(cluster rep78)
                
                Linear regression                               Number of obs     =         69
                                                                F(0, 4)           =          .
                                                                Prob > F          =          .
                                                                R-squared         =     0.2475
                                                                Root MSE          =     2526.8
                
                                                  (Std. err. adjusted for 5 clusters in rep78)
                ------------------------------------------------------------------------------
                             |               Robust
                       price | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                -------------+----------------------------------------------------------------
                         mpg |  -280.2615   95.87178    -2.92   0.043    -546.4442   -14.07875
                       _cons |   12112.77   2041.096     5.93   0.004     6445.778    17779.76
                ------------------------------------------------------------------------------
                So, demean your data once with -xtdata-, and save it. Then do all your runs using -regress, vce(cluster indiv_id)- and you will get the standard errors you need and the fastest possible execution time.



                • #9
                  Hi Sharad,
                  I have another suggestion. Some time ago I wrote a command, available on SSC, called -regxfe-. It works much the same as -reghdfe-; it is slower the first time, as you may find out, but it allows you to break the problem down into smaller steps:
                  demeaning
                  calculating degrees of freedom
                  estimating the model
                  Look into the help file and the paper in the Stata Journal (contact me if you can't access it).
                  Best wishes,
                  Fernando



                  • #10
                    As Jeff said, -regress- on the demeaned data gets the standard errors wrong due to incorrect degrees of freedom. In balanced panels, this can be fixed by rescaling, i.e. multiplying the reported standard errors by the following factor:

                    \( \sqrt{\frac{NT - K}{N(T-1) - K}} \)

                    Better of course: Use (cluster-)robust standard errors.
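
                    A sketch of the manual correction asked about in #4 (illustrative values; N, T, and K must match your balanced panel and model):

                    Code:
                    * after: reg y_demeaned x_demeaned
                    local N = 1000          // number of panel units
                    local T = 10            // periods per unit (balanced panel assumed)
                    local K = 1             // number of slope coefficients
                    local factor = sqrt((`N'*`T' - `K') / (`N'*(`T'-1) - `K'))
                    display "corrected SE for x: " _se[x_demeaned]*`factor'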
                    Last edited by Sebastian Kripfganz; 05 Oct 2022, 15:39.
                    https://www.kripfganz.de/stata/

