  • Panel data quality with REGHDFE

    Hi all, I am relatively new to Stata and to panel data. I have a dataset covering 26 states, distributed across five socioeconomic macro-regions, from 2004 to 2015. The panel is strongly balanced and, since I have multilevel time-variant fixed effects, I ran the reghdfe command. My goal is to confirm the association between the dependent variable (imrr) and my independent variables. I clustered on the interaction of a factor variable (idh_f), macro-region (mr_id), and year. My results seem fine, but I am a little concerned about the loss of degrees of freedom (119) and about the quality of the estimated parameters and of the model overall. I believe the Root MSE is acceptable and the R-squared and adjusted R-squared look fine as well. As far as I know I used a fair number of clusters (120), but I am not sure whether this loss of degrees of freedom affects the quality of the model.
    Could anyone help me evaluate my model?


    . reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id) vce(cluster idh_f#mr_id#year)
    (MWFE estimator converged in 3 iterations)

    HDFE Linear regression Number of obs = 312
    Absorbing 2 HDFE groups F( 8, 119) = 58.12
    Statistics robust to heteroskedasticity Prob > F = 0.0000
    R-squared = 0.8063
    Adj R-squared = 0.7979
    Within R-sq. = 0.6538
    Number of clusters (idh_f#mr_id#year) = 120
    Root MSE = 1.6719

    (Std. Err. adjusted for 120 clusters in idh_f#mr_id#year)
    ------------------------------------------------------------------------------
    | Robust
    imrr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    occ_1 | -.213045 .0614771 -3.47 0.001 -.3347758 -.0913143
    pib | -.0000948 .0000188 -5.04 0.000 -.000132 -.0000576
    pbf | -.0844561 .0278709 -3.03 0.003 -.1396433 -.029269
    gi | -.1085079 .0285647 -3.80 0.000 -.1650688 -.0519469
    ta | .2166834 .1631563 1.33 0.187 -.1063824 .5397491
    tf | 4.887433 .4666264 10.47 0.000 3.963466 5.8114
    prenat | -1.33e-06 5.71e-07 -2.33 0.021 -2.46e-06 -2.00e-07
    int_sinv | .0000319 7.36e-06 4.34 0.000 .0000174 .0000465
    _cons | 38.29141 6.988698 5.48 0.000 24.45309 52.12973
    ------------------------------------------------------------------------------

    Absorbed degrees of freedom:
    -----------------------------------------------------+
    Absorbed FE | Categories - Redundant = Num. Coefs |
    -------------+---------------------------------------|
    idh_f | 2 0 2 |
    mr_id | 5 1 4 |
    -----------------------------------------------------+


    Thanks all.

    Alexandre Bugelli

  • #2
    The loss of degrees of freedom comes from the use of cluster robust standard errors. If you remove the clustering, the degrees of freedom will return to the number of observations minus the number of predictors minus 1. But then your standard errors are based on assuming homoscedasticity and independence within clusters--which may be dubious assumptions. You could re-run the model omitting the clustering and see what happens. The coefficient estimates will be identical. The standard errors will change. But if they are almost the same, you could then use the unclustered results, if, for some reason, you feel more comfortable with a larger number of degrees of freedom. To be honest, I can't think of any real reason to care about the number of degrees of freedom here, but apparently you do.
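
    For concreteness, a minimal sketch of that comparison, assuming the variable names from post #1:
    Code:
    reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, ///
        absorb(idh_f mr_id) vce(cluster idh_f#mr_id#year)
    estimates store clustered

    reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id)
    estimates store unclustered

    * Compare coefficients and standard errors side by side
    estimates table clustered unclustered, b(%9.4f) se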

    As for your R2, it's beyond fine. It's fantastic for socio-economic variables. If anything, it's so good that some people may wonder if you faked the data! What were you expecting? If you are analyzing data from physics experiments you have a right to sneer at R2 = 0.95 even. But socio-economic variables are very noisy, and R2 = 0.65 is amazingly good, bordering on too good to be true.

    The RMSE cannot be judged without the context of the variance of the outcome variable. But the judgment with that context is nothing more or less than R2, which, as already noted, is excellent in this context.
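
    As a small illustration of that point, assuming the model from post #1 has just been run (reghdfe stores the root MSE in e(rmse)):
    Code:
    * Compare the RMSE with the spread of the outcome variable
    summarize imrr
    display "RMSE as a fraction of SD(imrr) = " %5.3f e(rmse)/r(sd)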



    • #3
      Thanks Clyde for your answer.
      Perhaps I am a little concerned because someone told me that 312 observations may be too small a sample for so many independent variables, especially, as you said, with socioeconomic variables. I chose reghdfe mainly because of the large socioeconomic disparities among states and regions. Indeed, I spent a long time searching for good-quality variables: I ran many -xtreg- FE models and the results were very fuzzy, with inverted signs, high p-values, and so on. I then changed the conception of some critical variables according to my proposal; for example, I had some variables expressed as rates over the same population base and replaced them with absolute values, since I figured that what matters most is how the variables vary with respect to the dependent variable. Fortunately the results improved and the variables now reach the quality level my research needs.
      I ran the model without clustering, as you suggested (great idea, thank you), and here are the results. As you mentioned, we should view "too good to be true" results with some suspicion, but hopefully hard work ends well. I suppose the loss of degrees of freedom is not critical in my case; like you, I was just suspicious about the statistics. I think the unclustered model is more "fitted" in statistical terms, but the clustered one is more in line with the theoretical framework of my study.

      Here are the results.
      Any suggestions for testing the model, besides -test-?
      Thank you once again for your reply.

      . reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id)
      (dropped 2 singleton observations)
      (MWFE estimator converged in 3 iterations)

      HDFE Linear regression Number of obs = 307
      Absorbing 2 HDFE groups F( 8, 293) = 69.86
      Prob > F = 0.0000
      R-squared = 0.8061
      Adj R-squared = 0.7975
      Within R-sq. = 0.6560
      Root MSE = 1.6749

      ------------------------------------------------------------------------------
      imrr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      occ_1 | -.219474 .0608284 -3.61 0.000 -.33919 -.099758
      pib | -.0000875 .0000245 -3.57 0.000 -.0001358 -.0000392
      pbf | -.0826775 .0208863 -3.96 0.000 -.1237836 -.0415714
      gi | -.0996322 .0275957 -3.61 0.000 -.1539432 -.0453213
      ta | .2385579 .1342139 1.78 0.077 -.0255875 .5027033
      tf | 4.996489 .4951117 10.09 0.000 4.022063 5.970915
      prenat | -1.36e-06 6.49e-07 -2.09 0.037 -2.64e-06 -8.13e-08
      int_sinv | .0000321 6.50e-06 4.94 0.000 .0000193 .0000449
      _cons | 37.70095 6.935863 5.44 0.000 24.05053 51.35138
      ------------------------------------------------------------------------------

      Absorbed degrees of freedom:
      -----------------------------------------------------+
      Absorbed FE | Categories - Redundant = Num. Coefs |
      -------------+---------------------------------------|
      idh_f | 2 0 2 |
      mr_id | 5 1 4 |
      -----------------------------------------------------+



      • #4
        Well, I agree that 312 observations is cutting it a bit close. At 8 predictors that's a bit under 40 observations per predictor. Not optimal, and perhaps a bit skimpy, but not at a level where serious overfitting of the noise looms large.



        • #5
          Sorry, just to correct my last post: here is the model run without clustering, this time with heteroskedasticity-robust standard errors (vce(robust)).

          Thank you again.

          reghdfe imrr occ_1 pib pbf gi ta tf prenat int_sinv, absorb(idh_f mr_id) vce(robust) summ
          (dropped 2 singleton observations)
          (MWFE estimator converged in 3 iterations)

          HDFE Linear regression Number of obs = 307
          Absorbing 2 HDFE groups F( 8, 293) = 56.84
          Prob > F = 0.0000
          R-squared = 0.8061
          Adj R-squared = 0.7975
          Within R-sq. = 0.6560
          Root MSE = 1.6749

          ------------------------------------------------------------------------------
          | Robust
          imrr | Coef. Std. Err. t P>|t| [95% Conf. Interval]
          -------------+----------------------------------------------------------------
          occ_1 | -.219474 .0700025 -3.14 0.002 -.3572455 -.0817026
          pib | -.0000875 .0000213 -4.10 0.000 -.0001295 -.0000455
          pbf | -.0826775 .0258893 -3.19 0.002 -.1336301 -.0317249
          gi | -.0996322 .0273527 -3.64 0.000 -.1534649 -.0457996
          ta | .2385579 .1435229 1.66 0.098 -.0439085 .5210243
          tf | 4.996489 .5192523 9.62 0.000 3.974552 6.018426
          prenat | -1.36e-06 5.97e-07 -2.28 0.024 -2.53e-06 -1.84e-07
          int_sinv | .0000321 6.67e-06 4.82 0.000 .000019 .0000453
          _cons | 37.70095 7.749275 4.87 0.000 22.44966 52.95225
          ------------------------------------------------------------------------------

          Absorbed degrees of freedom:
          -----------------------------------------------------+
          Absorbed FE | Categories - Redundant = Num. Coefs |
          -------------+---------------------------------------|
          idh_f | 2 0 2 |
          mr_id | 5 1 4 |
          -----------------------------------------------------+

          Regression Summary Statistics:
          -----------------------------------------------
          Variable | mean min max
          -------------+---------------------------------
          imrr | 18.54111 11.355 28.84
          occ_1 | 92.4486 84.571 96.992
          pib | 14306.01 2933.35 40608.7
          pbf | 23.99055 2.08 48.25
          gi | 77.31922 66 91
          ta | 4.79355 1.36 11.18
          tf | 2.088046 1.55 3.33
          prenat | 335640.2 11519 1300000
          int_sinv | 29521.43 798 125395
          -----------------------------------------------



          • #6
            Hi Clyde,

            Sorry to come back to this subject. I revised my dataset and variables and ran into a problem when modeling new code for new variables.
            Just as a reminder: imrr is a health outcome, and the regressors are "occ_1" (unemployment rate lagged one year), "pib" (GDP), "policy_cover" (a social policy coverage), "tf" (fertility rate) and "resc" (an educational indicator), all socioeconomic variables; plus "med" and "enf" (numbers of health professionals) and "adms" (hospital admissions), which are health services indicators. I have 26 states (panels), nested in 5 macro-socioeconomic regions, over 12 years, giving 312 observations. I get statistically significant estimates for almost all parameters except "pib" (GDP), "med" and "enf", all calculated per thousand inhabitants (using the population of each state/panel). I dropped IDH (Human Development Index) from the model, since it is already nested in the socioeconomic macro-regions ("mr_id"). (The mr_id's are distributed as follows: 1 = North = 7 states; 2 = North-east = 9 states; 3 = South-east = 4 states; 4 = South = 3 states; 5 = Center-west = 3 states, so the clusters are unequally distributed.) I keep the model nested at the state (id) and year level, with -xtset id year-, and cluster on the interaction of macro-region and year.

            I have two questions, maybe you could waste a little of your time helping me.

            I suspect there is some collinearity among the variables. That is not a big deal, since reghdfe identifies and drops all collinear variables. However, I predicted the residuals after running the code below:

            CODE: reghdfe imrr occ_1 pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)

            I found that the residuals are not normally distributed, so I applied a log transformation to the health indicator, the dependent variable "imrr", as in the code below:
            CODE: reghdfe limrr occ_1 pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)

            1. Is it correct to apply a log transformation with reghdfe, given that this command already accounts for heteroskedasticity and correlation?

            2. As unemployment (or employment) rates have a cumulative effect over time, is it correct to use the "L." lag operator with occ_1 (unemployment rate) to estimate the effect of structural unemployment (more than one period of unemployment), as in the code below:

            CODE: reghdfe limrr L1.c.occ#L2.c.occ#L3.c.occ pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)

            Or maybe there is another option, since I already have lagged unemployment variables (occ_1, occ_2 and occ_3: lagged 1, 2 and 3 years)?

            Thanks in advance

            Alexandre Bugelli



            • #7
              I have two questions, maybe you could waste a little of your time helping me.
              I won't enumerate here the many benefits I derive from my Statalist activities, but suffice it to say that when I see a post that responding to would be a waste of my time, I just pass it by.

              1. Is it correct to apply a log transformation with reghdfe, given that this command already accounts for heteroskedasticity and correlation?
              There are two different issues raised by your question in the context of your post. The first is whether it is appropriate to log-transform your outcome variable in response to the non-normality of residuals. To that my answer is: no. Normality of residuals is a sufficient, but not necessary condition for valid t-, z-, and F- statistic based inferences in regression. In a sufficiently large sample (and I would judge yours to be large enough for this purpose) the central limit theorem kicks in and makes the various components of those test statistics that ought to be normal actually be (asymptotically) normal. So the log transformation is unnecessary for this purpose. There remains, however, the modeling issue: does the regression model provide a better fit to the data if the outcome variable is log transformed? To decide that you have to explore the results of both models in your data. I would explore plots of predicted vs observed outcomes with both models and see which looks better. If the logged outcome is better fit, then go with it. Otherwise not.
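
              A minimal sketch of that comparison, using the variable names from post #6 (limrr is assumed to be ln(imrr), and the generated variable names are hypothetical); reghdfe's residuals() option stores observed minus fitted, so fitted values can be recovered directly:
              Code:
              gen limrr = ln(imrr)

              * Level model: residuals() saves imrr minus its fitted value
              reghdfe imrr occ_1 pib policy_cover tf resc med enf adms, ///
                  absorb(mr_id) vce(cluster mr_id#year) residuals(res_level)
              gen fit_level = imrr - res_level

              * Log model
              reghdfe limrr occ_1 pib policy_cover tf resc med enf adms, ///
                  absorb(mr_id) vce(cluster mr_id#year) residuals(res_log)
              gen fit_log = limrr - res_log

              * Predicted vs observed plots, one per model
              scatter imrr fit_level, name(g_level, replace)
              scatter limrr fit_log, name(g_log, replace)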

              2. As unemployment (or employment) rates have a cumulative effect over time, is it correct to use the "L." lag operator with occ_1 (unemployment rate) to estimate the effect of structural unemployment (more than one period of unemployment), as in the code below:

              CODE: reghdfe limrr L1.c.occ#L2.c.occ#L3.c.occ pib policy_cover tf resc med enf adms, absorb(mr_id) vce(cluster mr_id#year)
              I would not hesitate to say that it is fine to add a single lag (i.e. any one of L1.occ, L2.occ, or L3.occ) to the model. But when you start involving multiple lags of the same variable, you may introduce serial correlation into the error structure, and I am not certain that using cluster-robust standard errors deals effectively with that. This kind of thing really doesn't come up in my line of work, so I have never delved into the issue in depth. It is common in finance and econometrics, however, and I hope that somebody from one of those disciplines will respond to this question, as I am not confident of my answer here.
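
              For reference, a sketch of the single-lag version, assuming the panel has been declared with -xtset id year- as in post #6 and using the poster's variable names:
              Code:
              xtset id year
              * A single lag of unemployment via the lag operator
              reghdfe limrr L1.occ pib policy_cover tf resc med enf adms, ///
                  absorb(mr_id) vce(cluster mr_id#year)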



              • #8
                Thank you once more, Clyde, for your attention and your comments.
                I understand your points on both questions. Especially for the second one, I think you are right about the risk of introducing serial correlation into the error structure. My purpose is simply to infer and analyze the possible association between the health outcome and the socioeconomic variables.
                Thanks, again.



                • #9
                  Hi Clyde,
                  I hope you and yours are doing well through this pandemic crisis.



                  • #10
                    I hope everyone here is doing well during this pandemic crisis.
                    I wonder if someone could help me understand when to use the -areg- vs -reghdfe- commands, as they produce exactly the same results?



                    • #11
                      Alexandre:
                      the main difference is that -areg-, unlike the community-contributed programme -reghdfe-, does not support absorbing more than one variable:
                      Code:
                      . use "https://www.stata-press.com/data/r16/nlswork.dta"
                      (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
                      
                      . xtset idcode year
                             panel variable:  idcode (unbalanced)
                              time variable:  year, 68 to 88, but with gaps
                                      delta:  1 unit
                      
                      . reghdfe ln_wage wks_ue , abs(idcode year)
                      (dropped 716 singleton observations)
                      (converged in 9 iterations)
                      
                      HDFE Linear regression                            Number of obs   =     22,114
                      Absorbing 2 HDFE groups                           F(   1,  18170) =      18.72
                                                                        Prob > F        =     0.0000
                                                                        R-squared       =     0.6380
                                                                        Adj R-squared   =     0.5594
                                                                        Within R-sq.    =     0.0010
                                                                        Root MSE        =     0.3083
                      
                      ------------------------------------------------------------------------------
                           ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                            wks_ue |  -.0014705   .0003399    -4.33   0.000    -.0021368   -.0008043
                      -------------+----------------------------------------------------------------
                          Absorbed |    F(3942, 18170) =      7.944   0.000             (Joint test)
                      ------------------------------------------------------------------------------
                      
                      Absorbed degrees of freedom:
                      ---------------------------------------------------------------+
                       Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     |
                      -------------+-------------------------------------------------|
                            idcode |         3929            3929              0     |
                              year |           14              15              1     |
                      ---------------------------------------------------------------+
                      
                       . areg ln_wage wks_ue, abs(idcode year) vce(cluster idcode)
                      absorb():  too many variables specified
                      r(103);
                      
                      .
                      Moreover, -areg-, unlike -reghdfe-, was not specifically developed for panel data regression.
                      For more details, I would consider the -areg- entry in the Stata .pdf manual.
                      Kind regards,
                      Carlo
                      (Stata 19.0)



                      • #12
                        Hi Carlo, thank you for the explanation. As I understand it, reghdfe is better suited to multiway panel data. In my research I need to absorb only year and use vce(cluster) for macro-region interacted with year.
                        I have a model that is a variant of a basic model I ran with 4 levels, but I decided to drop one socioeconomic level and use stratified income instead.
                        Here is my code:


                        Attached Files



                        • #13
                          Sorry, I forgot to explain: mr is the socioeconomic macro-region, nmr is a health outcome, and the other variables are mainly socioeconomic and health factors.

                          Regards.

                          Alexandre



                          • #14
                            Alexandre:
                            the code you shared seems in line with what you're after.
                            In addition, 60 clusters are enough for invoking non-default standard errors (obviously, during your Stata session you typed -cluster- instead of -clsuter-).
                            As an aside, for the future, please use CODE delimiters to share what you typed and what Stata gave you back. Thanks.
                            Kind regards,
                            Carlo
                            (Stata 19.0)



                            • #15
                              Hi Carlo. Here is the right code: reghdfe nmr occ rgdp bfpcov fr eda lbpre twc tsw, absorb(year) vce(cluster mr#year). I just replaced mr_year with mr#year and got the same results.
                              Best regards, and thank you so much.
