Category omitted from fixed effect when running regression

Maya Ward

Join Date: Aug 2021

Posts: 9
#1


11 Mar 2023, 18:24

I am having a confusing problem -- when I ran my code last week, everything was normal. Now, not so much.

I have a cross-sectional data set (DHS for Colombia), and am running a diff-in-diff specification. I have three years of data (2005, 2010, 2015), and tabbing my "year" variable confirms this. My "post" variable is equal to 1 if year == 2015. I have a "policy" variable which is equal to one for certain geographic departments in the treated group, and my difference-in-differences variable is did = post*policy.

When I run a regression, I include department and year fixed effects. Because I also have to cluster at the department level, the department fixed effects no longer count toward the degrees of freedom, which should leave three, one for each year. Now there are only two, and I'm not sure why. This first appeared when I re-ran all of my code from the start of my data cleaning file through to the regressions, but I don't know what could have changed. When I ran it last week, there were three df, which I noticed when I plotted an event study (all three years showed up; now only two do).

Why would this happen? Please let me know if more explanation or code is needed. I am using reghdfe; the following is the basic format: reghdfe y did $controls [pweight = wtvar], absorb(dept year) cluster(dept)

I actually have seven different outcomes, so I am representing them just by "y". Thank you all!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17699
#2

12 Mar 2023, 03:22

Maya:
as per FAQ, please post:
1) what you typed and what Stata gave you back;
2) an example/excerpt of your dataset via -dataex-.
Thanks.

Kind regards,
Carlo
(Stata 19.0)
Comment
Maya Ward

Join Date: Aug 2021

Posts: 9
#3

12 Mar 2023, 20:10

Note: I am

Last edited by Maya Ward; 12 Mar 2023, 20:13.
Comment

Maya Ward

Join Date: Aug 2021
Posts: 9

12 Mar 2023, 20:12

Note: I am using Stata version 15.1

Here is a sample using dataex (it wouldn't let me include all of my control variables, as several are categorical):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(fsay_hcR fsay_lgpchR fsay_hhR fsay_famR fsay_cookR fsay_ownwageR pc1R did post policy) byte dept float year str15 caseid
1 0 1 1 1 1  1.4320685 0 0 0 44 2005 "    00010102 02"
1 0 0 0 0 1  -1.200748 0 0 0 44 2005 "    00010201 04"
1 1 1 1 0 1   1.934896 0 0 0 44 2005 "    00010201 05"
1 0 0 0 0 1  -1.200748 0 0 0 44 2005 "    00010301 02"
1 1 1 1 1 1   2.597703 0 0 0 44 2005 "    00010301 06"
1 1 1 1 1 1   2.597703 0 0 0 44 2005 "    00010501 02"
1 1 1 1 0 1   1.934896 0 0 0 44 2005 "    00010601 01"
1 0 1 1 1 1  1.4320685 0 0 0 44 2005 "    00010901 04"
1 1 1 1 0 1   1.934896 0 0 0 44 2005 "    00011101 02"
0 0 0 0 0 1 -1.9910043 0 0 0 44 2005 "    00020301 03"
1 1 0 0 0 1 -.03511305 0 0 0 44 2005 "    00020401 02"
1 1 0 0 0 1 -.03511305 0 0 0 44 2005 "    00020601 03"
1 1 1 1 1 1   2.597703 0 0 0 44 2005 "    00020701 01"
1 0 0 0 0 1  -1.200748 0 0 0 44 2005 "    00020701 03"
1 0 0 1 0 1  -.3033985 0 0 0 44 2005 "    00020701 05"
end
label values dept dept_names
label def dept_names 44 "La Guajira", modify

What I typed:

Code:

. tab year

       year |      Freq.     Percent        Cum.
------------+-----------------------------------
       2005 |     29,849       43.18       43.18
       2010 |     22,526       32.59       75.77
       2015 |     16,753       24.23      100.00
------------+-----------------------------------
      Total |     69,128      100.00

#delimit ;

g post = 0;
replace post = 1 if year == 2015 ;

g policy = 0;
// replace policy = 1 if policytype == 1;
replace policy = 1 if
    dept == 11 | // Bogotá
    dept == 5  | // Antioquia
    dept == 54 | // Norte de Santander, AECID only
    dept == 68 | // Santander
    dept == 50 | // Meta
    dept == 41 | // Huila
    dept == 88 | // San Andrés
    //dept == 47 | // Magdalena
    //dept == 18 | // Caquetá
    dept == 76      // Valle
    ;

// difference-in-differences term, as described in #1
g did = post*policy;

#delimit cr

reghdfe y did $controls [pweight = wtvar], absorb(dept year) cluster(dept)

The following is the output for when y = fsay_hcR (Respondent has final say on their own healthcare):

Code:

. reghdfe fsay_hcR did $controls [pweight = wtvar], absorb(dept year) cluster(dept) allbaselevels
note: current_union is probably collinear with the fixed effects (all partialled-out values are close to zero;
>  tol = 1.0e-09)
(MWFE estimator converged in 3 iterations)
note: current_union omitted because of collinearity

HDFE Linear regression                            Number of obs   =     39,279
Absorbing 2 HDFE groups                           F(  15,     32) =     190.89
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.0470
                                                  Adj R-squared   =     0.0458
                                                  Within R-sq.    =     0.0290
Number of clusters (dept)    =         33         Root MSE        =     0.4117

                                   (Std. Err. adjusted for 33 clusters in dept)
-------------------------------------------------------------------------------
              |               Robust
     fsay_hcR |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
          did |  -.0712436   .0222167    -3.21   0.003    -.1164976   -.0259895
          age |    .011764   .0028608     4.11   0.000     .0059367    .0175912
         age2 |  -.0001951   .0000411    -4.74   0.000    -.0002789   -.0001113
              |
    wealthinx |
     poorest  |          0  (base)
      poorer  |   .0428757   .0083682     5.12   0.000     .0258302    .0599211
      middle  |   .0465615   .0137567     3.38   0.002     .0185401     .074583
      richer  |   .0496003   .0161687     3.07   0.004     .0166658    .0825349
     richest  |   .0748075   .0226024     3.31   0.002     .0287679    .1208472
              |
        urban |    .032728   .0124933     2.62   0.013     .0072799     .058176
       eduyrs |   .0105277   .0021442     4.91   0.000       .00616    .0148954
              |
       edulvl |
no education  |          0  (base)
     primary  |   .0222386   .0196484     1.13   0.266    -.0177838     .062261
   secondary  |   .0299738   .0231699     1.29   0.205    -.0172217    .0771692
      higher  |   .0243469   .0298719     0.82   0.421    -.0365002     .085194
              |
      numkids |   .0069555   .0033087     2.10   0.043      .000216     .013695
       jobnow |   .0297152   .0059258     5.01   0.000     .0176447    .0417857
current_union |          0  (omitted)
    ethnicity |   .0089527   .0035276     2.54   0.016     .0017673    .0161381
        _cons |   .3461327   .0507949     6.81   0.000      .242667    .4495985
-------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        dept |        33          33           0    *|
        year |         2           0           2     |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

Where I first noticed the issue was in the regression for the event study, code also provided below:

Code:


g year_policy = policy*year

fvset base 2010 year
fvset base 0 policy
fvset base 2010 year_policy

// fvset base 1 con_groups
// fvset base 1 treat_groups

#delimit ;
label define coef_treat
    0 "Control"
    2005 "2005"
    2010 "2010"
    2015 "2015" ;
label values year_policy coef_treat ;
label var year_policy "Treatment" ;



. reghdfe pc1R i.year_policy $controls [pweight = wtvar], absorb(dept year)
>         cluster(dept) baselevels;
note: 0bn.year_policy is probably collinear with the fixed effects (all partialled-out values are close to zer
> o; tol = 1.0e-09)
note: current_union is probably collinear with the fixed effects (all partialled-out values are close to zero;
>  tol = 1.0e-09)
(MWFE estimator converged in 3 iterations)
note: 0.year_policy omitted because of collinearity
note: current_union omitted because of collinearity

HDFE Linear regression                            Number of obs   =     39,279
Absorbing 2 HDFE groups                           F(  15,     32) =     170.01
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.0358
                                                  Adj R-squared   =     0.0346
                                                  Within R-sq.    =     0.0183
Number of clusters (dept)    =         33         Root MSE        =     1.3957

                                   (Std. Err. adjusted for 33 clusters in dept)
-------------------------------------------------------------------------------
              |               Robust
         pc1R |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
  year_policy |
     Control  |          0  (omitted)
        2010  |          0  (base)
        2015  |   .0520408    .040778     1.28   0.211    -.0310212    .1351028
              |
          age |   .0451996   .0077022     5.87   0.000     .0295106    .0608885
         age2 |  -.0004721   .0001196    -3.95   0.000    -.0007157   -.0002286
              |
    wealthinx |
     poorest  |          0  (base)
      poorer  |   .0744988    .022523     3.31   0.002     .0286209    .1203767
      middle  |    .115128   .0621139     1.85   0.073    -.0113938    .2416498
      richer  |   .0355656   .0702949     0.51   0.616    -.1076205    .1787517
     richest  |   .0043075   .0710721     0.06   0.952    -.1404616    .1490765
              |
        urban |   .1901572     .03164     6.01   0.000     .1257087    .2546058
       eduyrs |   .0170864   .0090113     1.90   0.067     -.001269    .0354418
              |
       edulvl |
no education  |          0  (base)
     primary  |   .0050659   .0747045     0.07   0.946    -.1471021    .1572339
   secondary  |   .0546609   .1041734     0.52   0.603    -.1575333    .2668552
      higher  |  -.0525286   .1211991    -0.43   0.668    -.2994031    .1943459
              |
      numkids |   .0427943   .0071832     5.96   0.000     .0281626    .0574261
       jobnow |   .0484199   .0186195     2.60   0.014     .0104933    .0863465
current_union |          0  (omitted)
    ethnicity |   .0218879   .0120949     1.81   0.080    -.0027485    .0465244
        _cons |  -1.676471   .1570277   -10.68   0.000    -1.996326   -1.356616
-------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        dept |        33          33           0    *|
        year |         2           0           2     |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

Thanks

Comment

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17699
#5

13 Mar 2023, 01:19

Maya:
the -reghdfe- note tells you exactly the reason for the omission (collinearity with the -fe-).
In addition, your data example covers only one year, which does not allow interested listers to delve into the issue.
That said, I'd be more concerned about the low -Within R-sq- that both regressions report.

Kind regards,
Carlo
(Stata 19.0)
Comment
Maya Ward

Join Date: Aug 2021

Posts: 9
#6

19 Mar 2023, 15:51

Mr. Lazzaro: I do see

0bn.year_policy is probably collinear with the fixed effects (all partialled-out values are close to zero; tol = 1.0e-09)

which tells me why the "control" category of year_policy is omitted (it is also listed as such in the regression output), but it does not tell me why the "2005" category is not even listed in the regression. My question was really about the second part. Why would a category of a variable not even be listed or shown?
Comment
Maya Ward

Join Date: Aug 2021

Posts: 9
#7

19 Mar 2023, 16:48

Update, trying to figure out my problem: the regression is not even reading in the observations from 2005, but when I do "tab year" or "tab [any variable] year", the data shows up. The observations listed in the regression output (39,279) exactly correspond to the number of observations for years 2010 and 2015. How do I get the regression to also include year = 2005 in the sample?? Did I somehow make it ignore this? How do I reverse it?
Comment
Maya Ward

Join Date: Aug 2021

Posts: 9
#8

19 Mar 2023, 17:15

I just solved the problem! One of my control variables didn't exist for 2005, and was throwing everything off. I need to recode that. Thank you for the comment about the R-squared; when I re-run everything I will be sure to note if that is still an issue.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30061
#9

19 Mar 2023, 17:31

Did I somehow make it ignore this?

Almost certainly. -reghdfe- has been in widespread use for a long time. If there is a bug that would affect something like this, it would almost certainly have become known (and been fixed) early on. Your problem is almost guaranteed to be due to a problem in your data.

How do I reverse it?

I would say that the most likely reason for the omission of all year 2005 observations is that there is some other variable whose value is always missing when year == 2005. Or perhaps it just happens that for each year 2005 observation there is some model variable with a missing value. Remember that in any regression, any observation with missing value for any regression variable is automatically excluded. Your example data does not exhibit any missing values, but it also does not include all of the variables in your regression. By the way, "model variable" here means every variable mentioned in the regression command, including the pweight, the fixed effects, and the outcome, as well as all of the explanatory variables. I would look into this possibility first.
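
A minimal sketch of that check on made-up data (y, x1 and x2 are hypothetical stand-ins for the outcome and controls, not variables from this thread). It shows how a control that is missing for every 2005 observation silently removes that year from the estimation sample, and one way to find the variable responsible:

```stata
* Toy data: three survey years, with control x2 missing for all of 2005
clear
set seed 12345
set obs 300
generate year = 2005 + 5*mod(_n, 3)      // 100 observations per year
generate x1 = rnormal()
generate x2 = rnormal()
replace x2 = . if year == 2005           // mimic a control not collected in 2005
generate y = 1 + x1 + rnormal()

* Any observation with a missing model variable is dropped from estimation
regress y x1 x2 i.year

* Compare the estimation sample with the full data: 2005 is gone
tabulate year
tabulate year if e(sample)

* Count missing values by variable for the year that disappeared
foreach v of varlist y x1 x2 {
    quietly count if missing(`v') & year == 2005
    display "`v': " r(N) " missing in 2005"
}

* Number of missing model variables per observation, by year
egen byte nmiss = rowmiss(y x1 x2)
tabulate year nmiss
```

On the real data, the same idea would be to run -tabulate year if e(sample)- right after the -reghdfe- command and to loop over every variable named in it, including the weight, the absorbed variables, and the cluster variable.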

If that doesn't turn up the problem, then I would post back with a more complete -dataex- output that includes all of the regression variables, along with observations from each of the three years.

Another issue that may be related is your variable year_policy, which looks mis-specified. You have calculated it as year*policy. This could be an appropriate way to set up an interaction between a dichotomous variable (policy) and a continuous variable (year). But it would then be inappropriate to enter it into the regression as i.year_policy, treating it as a discrete variable. If, on the one hand, your intent is to have an interaction between dichotomous policy and continuous year, don't calculate a new variable. Just enter i.policy##c.year into the model. If, on the other hand, your intent is to treat year as discrete, then, again, don't calculate a new variable; enter i.policy##i.year into the model. (year will be omitted by -reghdfe- as it is also present as an absorbed effect--that's not a problem.)
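
As a rough illustration of the discrete-year version of that syntax, here is a sketch on simulated data. The number of departments, the treated group, and the effect size are invented for the example, and -reghdfe- must be installed (for instance from SSC, together with its dependency -ftools-). Writing ib2010.year makes 2010 the reference year, which is what the -fvset base 2010 year- line in the original code was aiming for:

```stata
* Simulated data: 30 hypothetical departments, each observed in 2005, 2010, 2015
clear
set seed 2023
set obs 3000
generate dept = ceil(_n/100)             // 100 observations per department
generate year = 2005 + 5*mod(_n, 3)      // years cycle within each department
generate policy = dept <= 8              // hypothetical treated departments
generate y = rnormal() + 0.5*policy*(year == 2015)

* Event-study style specification with factor-variable notation.
* The main effects of policy and year are collinear with the absorbed
* dept and year effects, so only the interaction terms are estimated.
reghdfe y i.policy##ib2010.year, absorb(dept year) vce(cluster dept)
```

The output should list interaction coefficients for 2005 and 2015 relative to 2010, so a year that is missing from the coefficient table is a sign that it is missing from the estimation sample.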

Added: Crossed with #8, which confirms that the problem is what I suspected.

Last edited by Clyde Schechter; 19 Mar 2023, 17:34.
Comment
