I think I have a strong theoretical argument for not including time dummies in my random-effects regression. I would be interested in the opinions of users here and in suggestions for strengthening this argument, particularly any quantitative methods that I could apply in Stata.
I have panel data on local area unemployment and health outcomes for the same mothers, analysed at three waves, each five years apart: before, during and after a recession.
The data are clustered at the respondent's local area level (i.e. county; counties are similar to American states). There are 30 clusters, and the effect of local area unemployment (psum_unemployed_total_cont_y) on health is measured at this same local area level to account for endogeneity in the relationship between unemployment and health (i.e. is the person who is likely to be unemployed also likely to be unhealthy for some unobserved reason?).
In each model that includes year, I used testparm to test the significance of the year terms.
The results of my initial regression are as follows:
Code:
. * LPM:
.
. xtreg binbmi_obese_y psum_unemployed_total_cont_y i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y if has_y0_questionnaire==1 & has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (current_county_y1) re robust
Random-effects GLS regression Number of obs = 1,133
Group variable: id Number of groups = 556
R-sq: Obs per group:
within = 0.0750 min = 1
between = 0.0147 avg = 2.0
overall = 0.0302 max = 3
Wald chi2(22) = 218331.51
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
| Robust
binbmi_obese_y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
psum_unemployed_total_cont_y | .005794 .0014117 4.10 0.000 .0030272 .0085608
|
own_education_y |
No schooling | 0 (empty)
Primary school education | -.1315012 .0942464 -1.40 0.163 -.3162206 .0532183
Some secondary school | .0736315 .1101785 0.67 0.504 -.1423144 .2895775
Complete secondary education | .0279008 .0882128 0.32 0.752 -.1449931 .2007947
Some third level education at college, university, RTC | .0842196 .098333 0.86 0.392 -.1085096 .2769488
Complete third level education at college, university, RTC | -.0220746 .0883634 -0.25 0.803 -.1952636 .1511145
|
maritalstatus_y |
Cohabiting | -.0837401 .0382859 -2.19 0.029 -.1587792 -.0087011
Separated | .0225485 .0605217 0.37 0.709 -.0960717 .1411688
Divorced | .084211 .1269417 0.66 0.507 -.1645901 .3330121
Widowed | -.0079601 .1239793 -0.06 0.949 -.250955 .2350348
Single/Never married | -.0970986 .0382337 -2.54 0.011 -.1720353 -.022162
|
medical_card_y |
Yes | .0147679 .0384133 0.38 0.701 -.0605207 .0900565
|
employment_y |
Unemployed | .0231915 .0593355 0.39 0.696 -.093104 .1394869
Unable to work owing to permanent sickness or disability | .2963391 .1077156 2.75 0.006 .0852204 .5074577
At school/student | -.0237847 .0565382 -0.42 0.674 -.1345975 .0870282
Seeking work for the first time | -.1044752 .0674654 -1.55 0.121 -.2367049 .0277545
Employed | -.0413736 .0116618 -3.55 0.000 -.0642303 -.0185169
Self Employed | -.0094837 .0218855 -0.43 0.665 -.0523785 .0334111
|
ord_age_y |
20-23 | .1274583 .0904683 1.41 0.159 -.0498563 .304773
24-27 | .1046117 .0683596 1.53 0.126 -.0293708 .2385941
28-32 | .1036983 .0691316 1.50 0.134 -.0317971 .2391937
33 + | .084037 .0811597 1.04 0.300 -.0750332 .2431072
|
_cons | 0 (omitted)
------------------------------------------------------------+----------------------------------------------------------------
sigma_u | .26123467
sigma_e | .21894127
rho | .5874009 (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------
A colleague suggested I make use of year dummies, as is often done with panel data. Below is what the model looks like when I add the year variable:
Code:
. xtreg binbmi_obese_y psum_unemployed_total_cont_y i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y i.year if has_y0_questionnaire==1 & has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (current_county_y1) re robust
Random-effects GLS regression Number of obs = 1,133
Group variable: id Number of groups = 556
R-sq: Obs per group:
within = 0.0862 min = 1
between = 0.0249 avg = 2.0
overall = 0.0427 max = 3
Wald chi2(23) = .
corr(u_i, X) = 0 (assumed) Prob > chi2 = .
(Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
| Robust
binbmi_obese_y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
psum_unemployed_total_cont_y | -.0023701 .0052052 -0.46 0.649 -.0125721 .0078319
|
own_education_y |
No schooling | 0 (empty)
Primary school education | 0 (omitted)
Some secondary school | .1505257 .0497344 3.03 0.002 .0530481 .2480033
Complete secondary education | .1029592 .0175976 5.85 0.000 .0684685 .1374499
Some third level education at college, university, RTC | .1587382 .0249566 6.36 0.000 .1098241 .2076523
Complete third level education at college, university, RTC | .0640187 .0212939 3.01 0.003 .0222834 .105754
|
maritalstatus_y |
Cohabiting | -.0846086 .0379878 -2.23 0.026 -.1590634 -.0101538
Separated | -.0233706 .0737117 -0.32 0.751 -.1678429 .1211016
Divorced | .0769838 .1250147 0.62 0.538 -.1680406 .3220082
Widowed | .0261904 .1288942 0.20 0.839 -.2264376 .2788183
Single/Never married | -.0912056 .0396385 -2.30 0.021 -.1688957 -.0135155
|
medical_card_y |
Yes | .0034374 .0381739 0.09 0.928 -.0713821 .0782569
|
employment_y |
Unemployed | .0245004 .0608822 0.40 0.687 -.0948265 .1438273
Unable to work owing to permanent sickness or disability | .287834 .1075161 2.68 0.007 .0771064 .4985616
At school/student | -.0190166 .0589808 -0.32 0.747 -.1346169 .0965838
Seeking work for the first time | -.1182621 .0651866 -1.81 0.070 -.2460254 .0095012
Employed | -.0297897 .0100574 -2.96 0.003 -.0495018 -.0100776
Self Employed | -.0072221 .0215375 -0.34 0.737 -.0494349 .0349906
|
ord_age_y |
20-23 | .1020646 .0893248 1.14 0.253 -.0730088 .277138
24-27 | .0637783 .0701205 0.91 0.363 -.0736554 .201212
28-32 | .0529197 .0681972 0.78 0.438 -.0807443 .1865838
33 + | -.0025392 .0789943 -0.03 0.974 -.1573651 .1522868
|
year |
5 | .0637809 .0157891 4.04 0.000 .0328348 .094727
10 | .1349336 .0595785 2.26 0.024 .0181618 .2517053
|
_cons | .0148301 .0725344 0.20 0.838 -.1273346 .1569949
------------------------------------------------------------+----------------------------------------------------------------
sigma_u | .25911118
sigma_e | .21775861
rho | .58606947 (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------
. testparm i.year
( 1) 5.year = 0
( 2) 10.year = 0
chi2( 2) = 16.54
Prob > chi2 = 0.0003
Code:
. xtreg binbmi_obese_y psum_unemployed_total_cont_y i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y year if has_y0_questionnaire==1 & has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (current_county_y1) re robust
Random-effects GLS regression Number of obs = 1,133
Group variable: id Number of groups = 556
R-sq: Obs per group:
within = 0.0864 min = 1
between = 0.0247 avg = 2.0
overall = 0.0425 max = 3
Wald chi2(23) = 470684.48
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
(Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
| Robust
binbmi_obese_y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
psum_unemployed_total_cont_y | -.001825 .0021845 -0.84 0.403 -.0061066 .0024565
|
own_education_y |
No schooling | 0 (empty)
Primary school education | .0110054 .0830468 0.13 0.895 -.1517634 .1737742
Some secondary school | .1617676 .0901997 1.79 0.073 -.0150206 .3385557
Complete secondary education | .1141553 .076832 1.49 0.137 -.0364328 .2647433
Some third level education at college, university, RTC | .1698459 .0919531 1.85 0.065 -.010379 .3500707
Complete third level education at college, university, RTC | .0748105 .0766501 0.98 0.329 -.0754209 .2250419
|
maritalstatus_y |
Cohabiting | -.0847227 .0376777 -2.25 0.025 -.1585697 -.0108757
Separated | -.0233631 .073862 -0.32 0.752 -.1681298 .1214037
Divorced | .0767167 .1251534 0.61 0.540 -.1685795 .3220129
Widowed | .0251002 .1299681 0.19 0.847 -.2296326 .279833
Single/Never married | -.0914132 .0394799 -2.32 0.021 -.1687925 -.014034
|
medical_card_y |
Yes | .0034453 .0383149 0.09 0.928 -.0716506 .0785412
|
employment_y |
Unemployed | .0248206 .0607427 0.41 0.683 -.0942329 .1438742
Unable to work owing to permanent sickness or disability | .2890932 .1035665 2.79 0.005 .0861067 .4920798
At school/student | -.0190324 .059003 -0.32 0.747 -.1346761 .0966113
Seeking work for the first time | -.1181528 .0649302 -1.82 0.069 -.2454136 .009108
Employed | -.0295514 .0107356 -2.75 0.006 -.0505927 -.00851
Self Employed | -.0071762 .0215688 -0.33 0.739 -.0494503 .0350978
|
ord_age_y |
20-23 | .1015066 .0880151 1.15 0.249 -.0709999 .2740131
24-27 | .0630871 .0679689 0.93 0.353 -.0701296 .1963037
28-32 | .0521749 .0656988 0.79 0.427 -.0765924 .1809421
33 + | -.0034875 .0750595 -0.05 0.963 -.1506013 .1436264
|
year | .0129658 .0033547 3.87 0.000 .0063908 .0195409
_cons | 0 (omitted)
------------------------------------------------------------+----------------------------------------------------------------
sigma_u | .25910804
sigma_e | .21763809
rho | .58633216 (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------
. testparm year
( 1) year = 0
chi2( 1) = 14.94
Prob > chi2 = 0.0001
.
However, I wasn't too happy with this approach. To begin with, I was surprised to see Prob > chi2 = . when including year as i.year. I'm not quite sure what this means and would welcome an interpretation.
More importantly, I think there is a good theoretical basis for not including year in this analysis, as follows:
Here I investigate the impact of unemployment on health during the Great Recession. Impacts reflect county-level unemployment but also the overall impact of the recession experienced at the national level. When I hold year fixed, I feel I am leaving out the national trend in unemployment and, because of this, ignoring the importance of the recession as an employment effect.
Basically, I feel that what I am seeing is a combination of the national variation in the unemployment rate and the local area variation. I feel that if I were to add year dummies, there might not be much variation left to identify the employment effect from the local effect alone. Put another way, if I am framing this paper as a recessionary analysis (i.e. the effect of the Great Recession on health as mediated by unemployment), then there wouldn’t be much left to identify the effect of the recession on health by looking only at local area unemployment while holding the effects of national unemployment fixed.
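To put a rough number on how much variation would be left, one check I could run is sketched below. It regresses local unemployment on the wave dummies alone and reads off the R-squared. This is only a sketch: it uses all observations rather than the restricted estimation sample, and it assumes the variable names shown in the output above.

```stata
* Share of the variation in local unemployment that is explained by wave alone
regress psum_unemployed_total_cont_y i.year

* e(r2) is the R-squared stored by -regress-
display "Share explained by the year dummies:      " %5.3f e(r2)
display "Share left to identify the local effect:  " %5.3f (1 - e(r2))
```

If the first share is close to 1, then almost all of the movement in the unemployment variable is common to all counties within a wave, and little remains once year dummies are added.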
I thought the following might be a good explanation of this to place in text:
In random-effects models, time can be included in the fixed part as discrete time dummies in order to take into account effects that may influence all cases in a given year to the same extent. Here, the intention is to remove a potential cause of spuriousness that results from common trends in observed variables. Including time dummies in this model, however, may overfit it. Put simply, the eliminated trend is the national employment trend, i.e. the effect of the Great Recession. In other words, this analysis considers the effects of unemployment on health during the Great Recession. Impacts reflect county-level unemployment but also the overall impact of the recession experienced at the national level. When I hold year fixed, I feel I am leaving out the national trend in unemployment and, because of this, ignoring the importance of the recession as an employment effect.
Time-specific effects expected to affect the whole sample over time are included as controls in the random-effects model. These controls include age, marital status, state-support receipt and own employment. Discrete time dummies are not included, i.e. the effect of time itself is not modelled, because the recession is made up of local and national employment effects. By controlling for year, national employment effects are held fixed and thus only local effects on health may be examined. In other words, holding year constant means that I ignore year effects in my analysis. The problem with this approach is that there was a recession, and the year effects would include this recession.
In small samples such as this one, there is the related problem that time dummies use up degrees of freedom, which directly affects the precision of the parameter estimates. Thus, estimates may be unbiased but unreliable. A model that is too complex, or overspecified, may reduce the precision of coefficient estimates and predicted values. The implications of both bias and precision were thus considered when making this analysis decision. In an exploratory analysis where time dummies were included, the significance of the later years in this model, i.e. right when the financial crisis struck, would support this. Results are not robust to the inclusion of years; however, as the year terms become significant, I think this supports the argument for a national trend effect.
In text explanation ends.
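To make the robustness comparison easier to report, the unemployment coefficient from the two specifications could be placed side by side. The following is a minimal sketch, assuming the data are already xtset on id. The stored-estimate names no_year and with_year are made up for illustration, and the sample restriction from my original commands is left out for brevity.

```stata
* Covariates shared by both models (names as in the output above)
local controls i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y

* Model without year dummies
quietly xtreg binbmi_obese_y psum_unemployed_total_cont_y `controls', re vce(cluster current_county_y1)
estimates store no_year

* Model with year dummies
quietly xtreg binbmi_obese_y psum_unemployed_total_cont_y `controls' i.year, re vce(cluster current_county_y1)
estimates store with_year

* Coefficient and standard error on local unemployment only, side by side
estimates table no_year with_year, keep(psum_unemployed_total_cont_y) b(%9.4f) se(%9.4f)
```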
My question is: would this be a reasonable argument, or can I expect to face concerns over not including year in my analysis? Is there a firm quantitative argument that I can make against the inclusion of year? My concern is dealing with reviewer queries when submitting this article for publication.
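One quantitative approach I am considering, offered only as a sketch, is to split the unemployment variable into a wave-level mean and a county deviation from that mean, and enter both in place of the year dummies. The new variable names unemp_wave_mean and unemp_local_dev are hypothetical. The wave mean is computed across my sample, so it is a proxy for the national trend rather than the official national rate, and the sample restriction from my original commands is omitted for brevity.

```stata
* Wave-level mean of local unemployment across the sample (proxy for the national trend)
bysort year: egen unemp_wave_mean = mean(psum_unemployed_total_cont_y)

* County deviation from the wave mean (the local component)
generate unemp_local_dev = psum_unemployed_total_cont_y - unemp_wave_mean

* Random-effects model with the two components entered separately
xtreg binbmi_obese_y unemp_wave_mean unemp_local_dev i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y, re vce(cluster current_county_y1)
```

With only three waves, unemp_wave_mean takes just three values and is perfectly collinear with the two year dummies, so this specification is an alternative to i.year rather than something that can be estimated alongside it.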
I did notice that in the first model -sigma_u- exceeds -sigma_e-, such that a higher proportion of the variation in -depvar- is explained by the individual effect than by idiosyncratic error, but I'm not sure if this is worth mentioning.
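For reference, rho in the output is sigma_u^2 / (sigma_u^2 + sigma_e^2). A quick arithmetic check using the values copied from the first model's output (the scalar names sig_u and sig_e are just illustrative):

```stata
* Values copied from the first model's output above
scalar sig_u = .26123467
scalar sig_e = .21894127

* Fraction of total variance attributed to the individual (mother-level) effect
display "rho = " scalar(sig_u)^2 / (scalar(sig_u)^2 + scalar(sig_e)^2)
```

This should come out close to the reported rho of .587, i.e. a little under 60% of the total variance is attributed to the individual effect.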
I don't know if an argument could be made that the included years don't add anything informative to my results and hence should not be included among the predictors, i.e. I don't know if my data show any evidence that year has a statistically significant effect on my depvar. Even if there is a statistically significant effect, I would assume that this only supports the argument I make above, i.e. that there was a recession at this time which was affecting health, and that by holding this fixed I can no longer look at the effect of this recession on health?
Although my number of observations may appear large, my actual sample is only 614 mothers, so I don’t know if my argument about the dangers of over-fitting the model above will hold much water.
