Number of observations dropped after including country industry and year fixed effects, using reghdfe

Yongda Liu

Join Date: Jun 2024
Posts: 5

17 Feb 2025, 06:02

Hi All,

I am facing a tricky problem while using reghdfe. The number of observations and the number of firms (clusters) differ slightly depending on whether I include fixed effects (please see the output below). However, I do not know which observations are dropped, why they are dropped, or how to find them.

I would appreciate any help you can provide.

Many thanks

Code:

. reghdfe dependant independent if dev == 0, noabsorb vce (cluster firmid)
(MWFE estimator converged in 1 iterations)

HDFE Linear regression                            Number of obs   =     35,826
Absorbing 1 HDFE group                            F(   1,   7540) =      17.05
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.0014
                                                  Adj R-squared   =     0.0014
                                                  Within R-sq.    =     0.0014
Number of clusters (firmid)  =      7,541         Root MSE        =     3.9658

                             (Std. err. adjusted for 7,541 clusters in firmid)
------------------------------------------------------------------------------
             |               Robust
   dependant | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 independent |  -.1597525   .0386879    -4.13   0.000    -.2355915   -.0839135
       _cons |   5.797309   .0543657   106.64   0.000     5.690737    5.903881
------------------------------------------------------------------------------

. 
. reghdfe dependant independent if dev == 0, absorb(country industry year) vce (cluster firmid)
(MWFE estimator converged in 7 iterations)

HDFE Linear regression                            Number of obs   =     35,824
Absorbing 3 HDFE groups                           F(   1,   7539) =      50.53
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.2034
                                                  Adj R-squared   =     0.2019
                                                  Within R-sq.    =     0.0035
Number of clusters (firmid)  =      7,540         Root MSE        =     3.5456

                             (Std. err. adjusted for 7,540 clusters in firmid)
------------------------------------------------------------------------------
             |               Robust
   dependant | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 independent |  -.2340908   .0329301    -7.11   0.000     -.298643   -.1695386
       _cons |   5.873444   .0449965   130.53   0.000     5.785238    5.961649
------------------------------------------------------------------------------


Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

17 Feb 2025, 06:26

after estimation, one of the saved results is "e(sample)", which tells you which observations were included in the analysis; so, after each estimation, you want to save a new variable that is equal to "e(sample)" for that estimation and then compare the two variables; observations that have a code of "0" for one estimate and "1" for the other were excluded from the first but included in the second - you can then examine those cases; e.g.,

Code:

qui reghdfe dependant independent if dev == 0, noabsorb vce (cluster firmid)
gen newvar1=e(sample)
qui reghdfe dependant independent if dev == 0, absorb(country industry year) vce (cluster firmid)
gen newvar2=e(sample)

you can edit/browse those observations that differ on newvar1 and newvar2
obviously, you should choose names that are more meaningful to you than "newvar1" and "newvar2"
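
One way to do that comparison is sketched below. It assumes newvar1 and newvar2 already exist as created above and that the identifying variables are the ones named in the original post; treat it as an illustration rather than the only way to inspect the two samples.

```stata
* Assumes newvar1 and newvar2 were created from e(sample) as shown above.
* Cross-tabulate the two indicators; off-diagonal cells are the discrepancies.
tabulate newvar1 newvar2

* Count the observations used without fixed effects but dropped with them.
count if newvar1 == 1 & newvar2 == 0

* List those observations together with the grouping variables.
list firmid country industry year dependant independent if newvar1 == 1 & newvar2 == 0
```

With the output in the original post, the count should be small, since the two regressions differ by only two observations.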

Andrew Musau

Join Date: Oct 2014
Posts: 10198

17 Feb 2025, 06:38

reghdfe is from https://github.com/sergiocorreia/reghdfe (FAQ Advice #12).

Judging from the difference in the number of clusters (7,541 versus 7,540), one firm has a single observation (referred to as a singleton), and the second singleton observation may come from either the industry or the year. By default, reghdfe drops these observations. For a detailed explanation of why singletons should be excluded, see this paper.

Compare with:

Code:

reghdfe dependant independent if dev == 0, keepsingletons absorb(country industry year) vce (cluster firmid)
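
To look for candidate singletons by hand, something like the following sketch may help. The names touse, n_country, n_industry and n_year are made up for illustration, and the check is only a first pass: reghdfe removes singletons iteratively, so dropping one observation can create another singleton that this single pass would not flag.

```stata
* Sketch: mark the observations eligible for the fixed-effects regression.
* touse and the n_* variables are illustrative names, not from the thread.
generate byte touse = dev == 0 & !missing(dependant, independent, firmid, country, industry, year)

* For each absorbed variable, count eligible observations per group.
foreach v in country industry year {
    bysort `v': egen n_`v' = total(touse)
}

* Eligible observations that are alone in their country, industry, or year.
list firmid country industry year if touse & (n_country == 1 | n_industry == 1 | n_year == 1)
```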

Last edited by Andrew Musau; 17 Feb 2025, 06:44.


Yongda Liu

Join Date: Jun 2024

Posts: 5
#4

17 Feb 2025, 07:30

Thank you so much, Rich! I found the reason for the difference by following your suggestion.
Yongda Liu

Join Date: Jun 2024

Posts: 5
#5

17 Feb 2025, 07:32

Thank you, Andrew. I think it is not driven by singleton observations. I have found the reason using Rich's method. Thank you anyway!
Andrew Musau

Join Date: Oct 2014

Posts: 10198
#6

17 Feb 2025, 10:14

Originally posted by Yongda Liu

I think it is not driven by singleton observations.

If not that, then what? Assuming that the variables are the same across both regressions and the only difference is that you are absorbing the fixed effects, it is difficult to imagine what else would cause the discrepancy. However, in this case, you don't have to guess. I gave you a command that will tell you whether singleton observations are the issue.

Code:

reghdfe dependant independent if dev == 0, keepsingletons absorb(country industry year) vce (cluster firmid)
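
The check can be made explicit by comparing the stored sample sizes, as in the sketch below. The scalar names N_noabsorb and N_keep are illustrative. The final count rests on the assumption that observations with a missing value in an absorbed variable are excluded from the fixed-effects regression, which would be a separate explanation from singletons.

```stata
* Sketch: if keepsingletons restores the noabsorb sample size, singletons explain the gap.
quietly reghdfe dependant independent if dev == 0, noabsorb vce(cluster firmid)
scalar N_noabsorb = e(N)

quietly reghdfe dependant independent if dev == 0, keepsingletons absorb(country industry year) vce(cluster firmid)
scalar N_keep = e(N)

display "noabsorb: " N_noabsorb "   absorb with keepsingletons: " N_keep

* Otherwise, check for missing values in the absorbed variables (assumed cause, not confirmed in the thread).
count if dev == 0 & !missing(dependant, independent, firmid) & missing(country, industry, year)
```

If the two displayed numbers match, the dropped observations are singletons; if they still differ, the count on the last line shows how many observations lack a country, industry, or year.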