
  • Variable omitted when using absorb() in reghdfe but not when using indicators

    Hi Everyone,

    I am running regressions using individual and state times year fixed effects. The data is at the individual (person) - year level. The dependent variable is a binary variable that varies by individual - year and the variable of interest is continuous and varies annually. When using the absorb option with reghdfe, the variable of interest gets omitted when adding the state times year fixed effects. However, when I use indicators for state times year combinations, the variable is not omitted. Could someone help me understand what the difference is in these two approaches, specifically why the variable of interest is omitted in the first case and not the second? The code and results are posted below:

    Code:
    egen state_time_fe=group(state year)
    
    //METHOD 1
    reghdfe hstock ffr inflation age educ inc_win_sc wealth_exstock_sc disp_inc_g_xtile [pw=wgt], ///
    a(id state_time_fe) cluster(sample_stratum)
    
    HDFE Linear regression                            Number of obs   =    340,008
    Absorbing 2 HDFE groups                           F(   4,     62) =      37.50
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.5542
                                                      Adj R-squared   =     0.4960
                                                      Within R-sq.    =     0.0125
    Number of clusters (sample_stratum) =         63  Root MSE        =     0.2826
    
                                 (Std. err. adjusted for 63 clusters in sample_stratum)
    -----------------------------------------------------------------------------------
                      |               Robust
               hstock | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ------------------+----------------------------------------------------------------
                  ffr |          0  (omitted)
            inflation |          0  (omitted)
                  age |   .0020165   .0002516     8.01   0.000     .0015135    .0025196
                 educ |   .0076792   .0013636     5.63   0.000     .0049535    .0104049
           inc_win_sc |   .6274015   .0896416     7.00   0.000     .4482105    .8065924
    wealth_exstock_sc |   .0183634   .0060017     3.06   0.003     .0063661    .0303606
     disp_inc_g_xtile |          0  (omitted)
                _cons |  -.0274445   .0241652    -1.14   0.260    -.0757501    .0208611
    -----------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -------------------------------------------------------+
       Absorbed FE | Categories  - Redundant  = Num. Coefs |
    ---------------+---------------------------------------|
                id |     38673       38673           0    *|
     state_time_fe |       573           0         573     |
    -------------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    
    //METHOD 2
    reghdfe hstock ffr inflation age educ inc_win_sc wealth_exstock_sc disp_inc_g_xtile i.state_time_fe [pw=wgt], ///
    a(id) cluster(sample_stratum)
    
    HDFE Linear regression                            Number of obs   =    340,008
    Absorbing 1 HDFE group                            F( 576,     62) =          .
    Statistics robust to heteroskedasticity           Prob > F        =          .
                                                      R-squared       =     0.5542
                                                      Adj R-squared   =     0.4960
                                                      Within R-sq.    =     0.0488
    Number of clusters (sample_stratum) =         63  Root MSE        =     0.2826
    
                                 (Std. err. adjusted for 63 clusters in sample_stratum)
    -----------------------------------------------------------------------------------
                      |               Robust
               hstock | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    ------------------+----------------------------------------------------------------
                  ffr |    .033729   .0197572     1.71   0.093     -.005765     .073223
            inflation |   11.40619    23.6248     0.48   0.631     -35.8191    58.63148
                  age |   .0020165   .0002516     8.01   0.000     .0015135    .0025196
                 educ |   .0076792   .0013636     5.63   0.000     .0049535    .0104049
           inc_win_sc |   .6274015   .0896416     7.00   0.000     .4482105    .8065924
    wealth_exstock_sc |   .0183634   .0060017     3.06   0.003     .0063661    .0303606
     disp_inc_g_xtile |  -.3981373    1.93887    -0.21   0.838    -4.273882    3.477608
                      |
        state_time_fe |
                   2  |  -.0113447   .1488725    -0.08   0.940    -.3089365    .2862471
                   3  |   .1940478   .1602658     1.21   0.231    -.1263189    .5144145
                   4  |  -.1372389   .2069215    -0.66   0.510     -.550869    .2763912
                   5  |  -.2478486   .1370385    -1.81   0.075    -.5217848    .0260875
                   6  |   .2692664   .5907969     0.46   0.650    -.9117197    1.450252
                   7  |  -.1412744   .1361649    -1.04   0.304    -.4134641    .1309153
                   8  |   .0990564   .2175213     0.46   0.650    -.3357626    .5338753
                   9  |   .2491288    .609815     0.41   0.684     -.969874    1.468132
    ... 
    NOTE: 573 values of state_time_fe (3 omitted because of collinearity)

  • #2
    What you have are two different parameterizations of the same model. If you were to use -predict- after either of them, the results would be the same (up to tiny rounding errors). Note also that R2, adjusted R2, and RMSE are identical in the two models as well.
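
    One way to check the claim that both parameterizations give the same fit is to save the residuals from each run and compare them. This is only a sketch: it assumes the data from post #1 are in memory and that the installed -reghdfe- supports the residuals() option; res_absorb, res_dummy, and res_gap are hypothetical variable names.

    ```stata
    * Sketch only: variables and weights are those used in post #1
    reghdfe hstock ffr inflation age educ inc_win_sc wealth_exstock_sc disp_inc_g_xtile ///
        [pw=wgt], absorb(id state_time_fe) cluster(sample_stratum) residuals(res_absorb)

    reghdfe hstock ffr inflation age educ inc_win_sc wealth_exstock_sc disp_inc_g_xtile ///
        i.state_time_fe [pw=wgt], absorb(id) cluster(sample_stratum) residuals(res_dummy)

    * Absolute difference between the two sets of residuals
    generate double res_gap = abs(res_absorb - res_dummy)
    summarize res_gap
    ```

    If the two models really are the same, the maximum of res_gap should be zero up to rounding error.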

    Now, looking at your first output, it is apparent that ffr, inflation, and disp_inc_g_xtile are all colinear with the state#time fixed effects (which is to say that they are constant within any combination of state and year). To break that colinearity, the first -reghdfe- command removes them from the model (equivalently, you could say it constrains their coefficients to be 0). Removing those 3 variables from the colinear set, which consists of ffr, inflation, disp_inc_g_xtile, and the state#time fixed effects, breaks the colinearity and leaves an identified model, which then gets estimated.
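
    Whether a variable really is constant within each state-year cell can be checked directly. The loop below is a minimal sketch, assuming state_time_fe from post #1 already exists; sd_within is a hypothetical variable name.

    ```stata
    * Sketch only: within-cell standard deviation of each suspect variable
    foreach v of varlist ffr inflation disp_inc_g_xtile {
        egen double sd_within = sd(`v'), by(state_time_fe)
        quietly summarize sd_within
        * A largest within-cell SD of 0 means the variable is constant in every cell
        display "`v': largest within-cell SD = " r(max)
        drop sd_within
    }
    ```

    Cells containing a single observation have a missing standard deviation, and -summarize- ignores them.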

    In the second version, pay attention to the message "NOTE: 573 values of state_time_fe (3 omitted because of collinearity)", and in particular to the part in parentheses. Overall, this -reghdfe- is dealing with exactly the same set of variables as the first, and encounters exactly the same problem of a set of 576 colinear variables, of which three, any three, must be eliminated in order to break the colinearity and identify the model. This time it chooses to retain ffr, inflation, and disp_inc_g_xtile and omit three of the state#time fixed effects instead. Why did it do it a different way? Apparently -reghdfe- prefers to retain absorbed effects and omit other variables when faced with this problem. When you took those fixed effects out of -absorb()- and put them in as explicit regressors, -reghdfe- had a wider range of choices of which explicit regressors to remove, and settled on three of the fixed effects. In many Stata commands, when it is necessary to omit one or more explicit regressors to break a colinearity, those that appear last in the list of regressors are usually selected. -reghdfe- seems to follow that pattern.

    As I mentioned in my first paragraph, these are two different ways to identify an unidentifiable model. For overall model statistics, the results are the same either way. The coefficients of the variables involved in the colinearity are different. But that is also unimportant because the coefficients of variables that are involved in a colinearity are arbitrary and meaningless, precisely because they are artifacts of the particular way in which the colinearity is broken. (Mathematically it can be proved that you can pick any arbitrary values for these coefficients and attain them by some choice of omissions/constraints.)
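
    The arbitrariness can be made explicit with a short derivation. The notation is introduced here for illustration only: $x$ is a regressor that takes the single value $c_g$ in state-year cell $g$, $D_g$ is the indicator for cell $g$, and $\delta$ is any real number.

    ```latex
    % x is constant within cells, so it is a linear combination of the indicators:
    x = \sum_g c_g D_g
    % Hence, for any \delta, the same fitted values are obtained from
    \beta x + \sum_g \gamma_g D_g
      = (\beta - \delta)\, x + \sum_g (\gamma_g + \delta c_g)\, D_g
    ```

    Since $\delta$ is free, the coefficient on $x$ can be moved to any value by shifting the cell effects to compensate, while the predictions stay the same.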

    The important takeaway here is that it is not possible to estimate the effects of state-year constants like ffr, inflation, and disp_inc_g_xtile in a model that contains state-year fixed effects. This kind of question comes up fairly often on Statalist, and I usually end my remarks by saying that it also doesn't matter because the variables involved are usually just included to adjust ("control") for their confounding effects. But you have said in your post that one of these is the variable of interest in your model. So that's a serious problem. You simply cannot use a model with state-year fixed effects when you need to estimate the effect of a variable that is constant within state-year combinations.

    You have to use a different model. I have no expertise in econometrics, so I am a bit reluctant to advise you about which other model to choose, but I suspect that if instead of using state#year fixed effects you used state and year separately, i.e. -absorb(id state year)- that these variables might no longer be colinear with the absorbed effects and would resolve this problem. But somebody who understands the econometrics would need to judge whether that is an adequate specification of the model in other respects.
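
    As a sketch of that suggestion, using the variables from post #1, the specification with separate state and year effects could be run as follows. Whether ffr and inflation survive is something to read off the output: any regressor that takes a single value per year would still be absorbed by the year effects and reported as omitted.

    ```stata
    * Sketch only: separate state and year effects instead of state#year
    * Look for "(omitted)" next to ffr, inflation, and disp_inc_g_xtile in the output
    reghdfe hstock ffr inflation age educ inc_win_sc wealth_exstock_sc disp_inc_g_xtile ///
        [pw=wgt], absorb(id state year) cluster(sample_stratum)
    ```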



    • #3
      Hi Clyde,

      Thank you so much for this very informative response! It is interesting to learn how Stata treats these cases differently. My original inclination was not to use state x year fixed effects because of the apparent collinearity between my variables of interest and those fixed effects. Based on your advice, I will use a model that includes id and state fixed effects (i.e. absorb(id state)) with additional controls that vary at the state-year level. Including year fixed effects would also give an unidentified model (I think), since ffr, inflation and disp_inc_g_xtile are constant across individuals within year.

      Thank you again for providing some much-needed clarity. Cheers!
