  • Number of observations dropped after including country, industry, and year fixed effects using reghdfe

    Hi All,

    I am facing a tricky problem with reghdfe. The number of observations and the number of firms (clusters) differ slightly depending on whether I include fixed effects (please see the output below). However, I do not know which observations are dropped, why they are dropped, or how to find them.

    I would appreciate any help you can provide.

    Many thanks


    Code:
    . reghdfe dependant independent if dev == 0, noabsorb vce (cluster firmid)
    (MWFE estimator converged in 1 iterations)
    
    HDFE Linear regression                            Number of obs   =     35,826
    Absorbing 1 HDFE group                            F(   1,   7540) =      17.05
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.0014
                                                      Adj R-squared   =     0.0014
                                                      Within R-sq.    =     0.0014
    Number of clusters (firmid)  =      7,541         Root MSE        =     3.9658
    
                                 (Std. err. adjusted for 7,541 clusters in firmid)
    ------------------------------------------------------------------------------
                 |               Robust
       dependant | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
     independent |  -.1597525   .0386879    -4.13   0.000    -.2355915   -.0839135
           _cons |   5.797309   .0543657   106.64   0.000     5.690737    5.903881
    ------------------------------------------------------------------------------
    
    . 
    . reghdfe dependant independent if dev == 0, absorb(country industry year) vce (cluster firmid)
    (MWFE estimator converged in 7 iterations)
    
    HDFE Linear regression                            Number of obs   =     35,824
    Absorbing 3 HDFE groups                           F(   1,   7539) =      50.53
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.2034
                                                      Adj R-squared   =     0.2019
                                                      Within R-sq.    =     0.0035
    Number of clusters (firmid)  =      7,540         Root MSE        =     3.5456
    
                                 (Std. err. adjusted for 7,540 clusters in firmid)
    ------------------------------------------------------------------------------
                 |               Robust
       dependant | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
     independent |  -.2340908   .0329301    -7.11   0.000     -.298643   -.1695386
           _cons |   5.873444   .0449965   130.53   0.000     5.785238    5.961649
    ------------------------------------------------------------------------------

  • #2
    After estimation, one of the saved results is e(sample), which marks the observations that were included in the analysis. After each estimation, save a new variable equal to e(sample) for that estimation, then compare the two variables: observations coded 0 for one estimation and 1 for the other were excluded from the first but included in the second. You can then examine those cases, e.g.,
    Code:
    qui reghdfe dependant independent if dev == 0, noabsorb vce (cluster firmid)
    gen newvar1=e(sample)
    qui reghdfe dependant independent if dev == 0, absorb(country industry year) vce (cluster firmid)
    gen newvar2=e(sample)
    You can then edit/browse the observations that differ on newvar1 and newvar2.
    Obviously, you should choose names that are more meaningful to you than "newvar1" and "newvar2".
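
    A minimal sketch of the comparison step, assuming the variable names used in this thread. The indicator names in_noabs and in_abs are hypothetical. The last line checks one possible cause besides singletons: as far as I know, reghdfe also excludes observations with missing values in the absorbed variables, which noabsorb never looks at.

    ```stata
    * Sketch only: in_noabs and in_abs are hypothetical names
    quietly reghdfe dependant independent if dev == 0, noabsorb vce(cluster firmid)
    generate byte in_noabs = e(sample)
    quietly reghdfe dependant independent if dev == 0, absorb(country industry year) vce(cluster firmid)
    generate byte in_abs = e(sample)

    * Cross-tabulate the two estimation samples
    tabulate in_noabs in_abs

    * List observations used without fixed effects but dropped once they are absorbed
    list firmid country industry year if in_noabs == 1 & in_abs == 0

    * Count how many of those have a missing absorbed variable
    count if in_noabs == 1 & in_abs == 0 & missing(country, industry, year)
    ```

    If the count is zero, missing values in the absorbed variables are not the explanation, and the listed observations can be inspected for other causes.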

    • #3
      reghdfe is from https://github.com/sergiocorreia/reghdfe (FAQ Advice #12).


      The number of clusters falls from 7,541 to 7,540, which suggests that one firm has a single observation that is dropped as a singleton. The second dropped observation may be a singleton in either the industry or the year dimension. By default, reghdfe drops these observations. For a detailed explanation of why singletons should be excluded, see this paper.

      Compare with:

      Code:
      reghdfe dependant independent if dev == 0, keepsingletons absorb(country industry year) vce (cluster firmid)
      Last edited by Andrew Musau; 17 Feb 2025, 06:44.
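
      To locate singleton observations directly rather than infer them from the counts, one rough check is to count the observations in each fixed-effect group. This is a sketch under assumptions: touse, n_country, n_industry, and n_year are hypothetical names, and the sample definition only approximates what reghdfe does internally. It is a single pass, whereas reghdfe, as I understand it, drops singletons iteratively, so it may not flag every case.

      ```stata
      * Sketch only: touse, n_country, n_industry and n_year are hypothetical names
      generate byte touse = dev == 0 & !missing(dependant, independent, firmid, country, industry, year)

      * Count the observations in each fixed-effect group within the candidate sample
      bysort country:  egen long n_country  = total(touse)
      bysort industry: egen long n_industry = total(touse)
      bysort year:     egen long n_year     = total(touse)

      * Observations that are alone in at least one group are singletons
      list firmid country industry year if touse & (n_country == 1 | n_industry == 1 | n_year == 1)
      ```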



      • #4
        Originally posted by Rich Goldstein
        after each estimation, you want to save a new variable that is equal to the "e(sample)" for that estimation and then compare the variables
        Thank you so much, Rich! I found the reason for the difference by following your suggestion.



        • #5
          Thank you, Andrew. I think it is not driven by singleton observations. I have found the reason using Rich's method. Thank you anyway!



          • #6
            Originally posted by Yongda Liu
            I think it is not driven by singleton observations.
            If not that, then what? Assuming that the variables are the same across both regressions and the only difference is that you are absorbing the fixed effects, it is difficult to imagine what else would cause the discrepancy. However, in this case, you don't have to guess. I gave you a command that will tell you whether singleton observations are the issue.

            Code:
            reghdfe dependant independent if dev == 0, keepsingletons absorb(country industry year) vce (cluster firmid)
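
            A short sketch of how the comparison could be read off without scanning the output, assuming the same variable names as above. The scalar names N_default and N_keep are hypothetical. It relies only on e(N), which estimation commands store.

            ```stata
            * Sketch only: N_default and N_keep are hypothetical scalar names
            quietly reghdfe dependant independent if dev == 0, absorb(country industry year) vce(cluster firmid)
            scalar N_default = e(N)
            quietly reghdfe dependant independent if dev == 0, keepsingletons absorb(country industry year) vce(cluster firmid)
            scalar N_keep = e(N)

            * Difference in sample size attributable to dropping singletons
            local dropped = scalar(N_keep) - scalar(N_default)
            display "Observations dropped as singletons: `dropped'"
            ```

            If N_keep matches the 35,826 observations of the noabsorb regression, singletons account for the whole gap. If it stays below that, something else, such as missing values in the absorbed variables, is also removing observations.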
