I think I have a strong theoretical argument for not including time dummies in my random effects regression. I would be interested in the opinions of users here and in suggestions for strengthening this argument, particularly any quantitative methods that I could apply in Stata.
The data are clustered at the respondent's local area level (i.e. county, which is similar to an American state). There are 30 clusters, and the effect of local area unemployment (psum_unemployed_total_cont_y) on health is measured at this same local area level to account for endogeneity in the relationship between unemployment and health (i.e. is the same person who is likely to be unemployed also likely to be unhealthy for some unobserved reason?).
In each model that includes year, I used testparm to test the significance of the year terms.
I have panel data on local area unemployment and health outcomes for the same mothers, analysed at three waves, each five years apart: before, during and after a recession.
The results of my initial regression are as follows:
Code:
. * LPM:
.
. xtreg binbmi_obese_y psum_unemployed_total_cont_y i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y if has_y0_questionnaire==1 & has_y5_
> questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (current_
> county_y1) re robust

Random-effects GLS regression                   Number of obs     =      1,133
Group variable: id                              Number of groups  =        556

R-sq:                                           Obs per group:
     within  = 0.0750                                         min =          1
     between = 0.0147                                         avg =        2.0
     overall = 0.0302                                         max =          3

                                                Wald chi2(22)     =  218331.51
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

                                                  (Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
                                                            |               Robust
                                             binbmi_obese_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
                               psum_unemployed_total_cont_y |    .005794   .0014117     4.10   0.000     .0030272    .0085608
                                                            |
                                            own_education_y |
                                              No schooling  |          0  (empty)
                                  Primary school education  |  -.1315012   .0942464    -1.40   0.163    -.3162206    .0532183
                                     Some secondary school  |   .0736315   .1101785     0.67   0.504    -.1423144    .2895775
                              Complete secondary education  |   .0279008   .0882128     0.32   0.752    -.1449931    .2007947
    Some third level education at college, university, RTC  |   .0842196    .098333     0.86   0.392    -.1085096    .2769488
Complete third level education at college, university, RTC  |  -.0220746   .0883634    -0.25   0.803    -.1952636    .1511145
                                                            |
                                            maritalstatus_y |
                                                Cohabiting  |  -.0837401   .0382859    -2.19   0.029    -.1587792   -.0087011
                                                 Separated  |   .0225485   .0605217     0.37   0.709    -.0960717    .1411688
                                                  Divorced  |    .084211   .1269417     0.66   0.507    -.1645901    .3330121
                                                   Widowed  |  -.0079601   .1239793    -0.06   0.949     -.250955    .2350348
                                      Single/Never married  |  -.0970986   .0382337    -2.54   0.011    -.1720353    -.022162
                                                            |
                                             medical_card_y |
                                                       Yes  |   .0147679   .0384133     0.38   0.701    -.0605207    .0900565
                                                            |
                                               employment_y |
                                                Unemployed  |   .0231915   .0593355     0.39   0.696     -.093104    .1394869
  Unable to work owing to permanent sickness or disability  |   .2963391   .1077156     2.75   0.006     .0852204    .5074577
                                         At school/student  |  -.0237847   .0565382    -0.42   0.674    -.1345975    .0870282
                           Seeking work for the first time  |  -.1044752   .0674654    -1.55   0.121    -.2367049    .0277545
                                                  Employed  |  -.0413736   .0116618    -3.55   0.000    -.0642303   -.0185169
                                             Self Employed  |  -.0094837   .0218855    -0.43   0.665    -.0523785    .0334111
                                                            |
                                                  ord_age_y |
                                                     20-23  |   .1274583   .0904683     1.41   0.159    -.0498563     .304773
                                                     24-27  |   .1046117   .0683596     1.53   0.126    -.0293708    .2385941
                                                     28-32  |   .1036983   .0691316     1.50   0.134    -.0317971    .2391937
                                                      33 +  |    .084037   .0811597     1.04   0.300    -.0750332    .2431072
                                                            |
                                                      _cons |          0  (omitted)
------------------------------------------------------------+----------------------------------------------------------------
                                                    sigma_u |  .26123467
                                                    sigma_e |  .21894127
                                                        rho |   .5874009   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------
A colleague suggested that I use year dummies, as is often done with panel data. Below is what the model looks like when I add the year variable, first as i.year and then as a continuous year term:
Code:
. xtreg binbmi_obese_y psum_unemployed_total_cont_y i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y i.year if has_y0_questionnaire==1 &
> has_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (c
> urrent_county_y1) re robust

Random-effects GLS regression                   Number of obs     =      1,133
Group variable: id                              Number of groups  =        556

R-sq:                                           Obs per group:
     within  = 0.0862                                         min =          1
     between = 0.0249                                         avg =        2.0
     overall = 0.0427                                         max =          3

                                                Wald chi2(23)     =          .
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .

                                                  (Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
                                                            |               Robust
                                             binbmi_obese_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
                               psum_unemployed_total_cont_y |  -.0023701   .0052052    -0.46   0.649    -.0125721    .0078319
                                                            |
                                            own_education_y |
                                              No schooling  |          0  (empty)
                                  Primary school education  |          0  (omitted)
                                     Some secondary school  |   .1505257   .0497344     3.03   0.002     .0530481    .2480033
                              Complete secondary education  |   .1029592   .0175976     5.85   0.000     .0684685    .1374499
    Some third level education at college, university, RTC  |   .1587382   .0249566     6.36   0.000     .1098241    .2076523
Complete third level education at college, university, RTC  |   .0640187   .0212939     3.01   0.003     .0222834     .105754
                                                            |
                                            maritalstatus_y |
                                                Cohabiting  |  -.0846086   .0379878    -2.23   0.026    -.1590634   -.0101538
                                                 Separated  |  -.0233706   .0737117    -0.32   0.751    -.1678429    .1211016
                                                  Divorced  |   .0769838   .1250147     0.62   0.538    -.1680406    .3220082
                                                   Widowed  |   .0261904   .1288942     0.20   0.839    -.2264376    .2788183
                                      Single/Never married  |  -.0912056   .0396385    -2.30   0.021    -.1688957   -.0135155
                                                            |
                                             medical_card_y |
                                                       Yes  |   .0034374   .0381739     0.09   0.928    -.0713821    .0782569
                                                            |
                                               employment_y |
                                                Unemployed  |   .0245004   .0608822     0.40   0.687    -.0948265    .1438273
  Unable to work owing to permanent sickness or disability  |    .287834   .1075161     2.68   0.007     .0771064    .4985616
                                         At school/student  |  -.0190166   .0589808    -0.32   0.747    -.1346169    .0965838
                           Seeking work for the first time  |  -.1182621   .0651866    -1.81   0.070    -.2460254    .0095012
                                                  Employed  |  -.0297897   .0100574    -2.96   0.003    -.0495018   -.0100776
                                             Self Employed  |  -.0072221   .0215375    -0.34   0.737    -.0494349    .0349906
                                                            |
                                                  ord_age_y |
                                                     20-23  |   .1020646   .0893248     1.14   0.253    -.0730088     .277138
                                                     24-27  |   .0637783   .0701205     0.91   0.363    -.0736554     .201212
                                                     28-32  |   .0529197   .0681972     0.78   0.438    -.0807443    .1865838
                                                      33 +  |  -.0025392   .0789943    -0.03   0.974    -.1573651    .1522868
                                                            |
                                                       year |
                                                         5  |   .0637809   .0157891     4.04   0.000     .0328348     .094727
                                                        10  |   .1349336   .0595785     2.26   0.024     .0181618    .2517053
                                                            |
                                                      _cons |   .0148301   .0725344     0.20   0.838    -.1273346    .1569949
------------------------------------------------------------+----------------------------------------------------------------
                                                    sigma_u |  .25911118
                                                    sigma_e |  .21775861
                                                        rho |  .58606947   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------

. testparm i.year

 ( 1)  5.year = 0
 ( 2)  10.year = 0

           chi2(  2) =   16.54
         Prob > chi2 =    0.0003
Code:
. xtreg binbmi_obese_y psum_unemployed_total_cont_y i.own_education_y i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y year if has_y0_questionnaire==1 & ha
> s_y5_questionnaire==1 | has_y0_questionnaire==1 & has_y10_questionnaire==1 | has_y0_questionnaire==1 & has_y5_questionnaire==1 & has_y10_questionnaire==1, cluster (cur
> rent_county_y1) re robust

Random-effects GLS regression                   Number of obs     =      1,133
Group variable: id                              Number of groups  =        556

R-sq:                                           Obs per group:
     within  = 0.0864                                         min =          1
     between = 0.0247                                         avg =        2.0
     overall = 0.0425                                         max =          3

                                                Wald chi2(23)     =  470684.48
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

                                                  (Std. Err. adjusted for 28 clusters in current_county_y1)
-----------------------------------------------------------------------------------------------------------------------------
                                                            |               Robust
                                             binbmi_obese_y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------------------------------------------+----------------------------------------------------------------
                               psum_unemployed_total_cont_y |   -.001825   .0021845    -0.84   0.403    -.0061066    .0024565
                                                            |
                                            own_education_y |
                                              No schooling  |          0  (empty)
                                  Primary school education  |   .0110054   .0830468     0.13   0.895    -.1517634    .1737742
                                     Some secondary school  |   .1617676   .0901997     1.79   0.073    -.0150206    .3385557
                              Complete secondary education  |   .1141553    .076832     1.49   0.137    -.0364328    .2647433
    Some third level education at college, university, RTC  |   .1698459   .0919531     1.85   0.065     -.010379    .3500707
Complete third level education at college, university, RTC  |   .0748105   .0766501     0.98   0.329    -.0754209    .2250419
                                                            |
                                            maritalstatus_y |
                                                Cohabiting  |  -.0847227   .0376777    -2.25   0.025    -.1585697   -.0108757
                                                 Separated  |  -.0233631    .073862    -0.32   0.752    -.1681298    .1214037
                                                  Divorced  |   .0767167   .1251534     0.61   0.540    -.1685795    .3220129
                                                   Widowed  |   .0251002   .1299681     0.19   0.847    -.2296326     .279833
                                      Single/Never married  |  -.0914132   .0394799    -2.32   0.021    -.1687925    -.014034
                                                            |
                                             medical_card_y |
                                                       Yes  |   .0034453   .0383149     0.09   0.928    -.0716506    .0785412
                                                            |
                                               employment_y |
                                                Unemployed  |   .0248206   .0607427     0.41   0.683    -.0942329    .1438742
  Unable to work owing to permanent sickness or disability  |   .2890932   .1035665     2.79   0.005     .0861067    .4920798
                                         At school/student  |  -.0190324    .059003    -0.32   0.747    -.1346761    .0966113
                           Seeking work for the first time  |  -.1181528   .0649302    -1.82   0.069    -.2454136     .009108
                                                  Employed  |  -.0295514   .0107356    -2.75   0.006    -.0505927     -.00851
                                             Self Employed  |  -.0071762   .0215688    -0.33   0.739    -.0494503    .0350978
                                                            |
                                                  ord_age_y |
                                                     20-23  |   .1015066   .0880151     1.15   0.249    -.0709999    .2740131
                                                     24-27  |   .0630871   .0679689     0.93   0.353    -.0701296    .1963037
                                                     28-32  |   .0521749   .0656988     0.79   0.427    -.0765924    .1809421
                                                      33 +  |  -.0034875   .0750595    -0.05   0.963    -.1506013    .1436264
                                                            |
                                                       year |   .0129658   .0033547     3.87   0.000     .0063908    .0195409
                                                      _cons |          0  (omitted)
------------------------------------------------------------+----------------------------------------------------------------
                                                    sigma_u |  .25910804
                                                    sigma_e |  .21763809
                                                        rho |  .58633216   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------------------------------------------------

. testparm year

 ( 1)  year = 0

           chi2(  1) =   14.94
         Prob > chi2 =    0.0001
However, I wasn't too happy with this approach. To begin with, I was surprised to see Prob > chi2 = . when including year as i.year. I'm not quite sure what this means and would welcome an interpretation.
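One possible reading, offered as an assumption rather than a diagnosis: with cluster-robust standard errors, Stata reports a missing model Wald chi2 when the estimated variance matrix does not have enough rank to test all model coefficients jointly (Stata's own note on this is, as far as I know, under `help j_robustsingular`). Few clusters relative to the number of coefficients, or very sparse categories, are common causes. A minimal sketch of what could be inspected, assuming the i.year model has just been estimated so that e(sample) marks its estimation sample:

```stata
* Run immediately after the i.year xtreg model.
* Compare the number of clusters with the number of coefficients being tested.
display "Number of clusters:      " e(N_clust)
display "Model degrees of freedom: " e(df_m)

* Look for sparse or empty cells among the categorical regressors by wave.
* (tabulate does not overwrite e(sample), so both tables use the same sample.)
tabulate own_education_y year if e(sample)
tabulate employment_y year if e(sample)
```

A missing model test does not by itself invalidate the individual coefficients or the testparm result, but categories with only a handful of observations (note that one education category is dropped as omitted in this model) would be worth collapsing before interpreting the output.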
More importantly I think there is a good theoretical basis for not including year in this analysis, as follows:
Here I investigate the impact of unemployment on health during the Great Recession. Impacts are an effect of county-level unemployment but are also made up of the overall impact of the recession experienced at the country level. When I hold year fixed, I feel I am leaving out the national trend in unemployment and, because of this, ignoring the importance of the recession as an employment effect.
Basically, I feel that what I am seeing is a combination of the national variation in the unemployment rate and the local area variation. If I were to add year dummies, I might not have much variation left, as there can't be much to identify the employment effect from the local effect alone. Put another way, if I am framing this paper as a recessionary analysis (i.e. the effect of the Great Recession on health as mediated by unemployment), then there wouldn't be much to identify the effect of the recession on health by looking only at local area unemployment while holding the effects of national unemployment fixed.
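How much variation in local unemployment is left once wave effects are removed can be checked directly. A minimal sketch, assuming the first model above (without year) has just been estimated; `insample` and `unemp_resid` are hypothetical variable names introduced only for this illustration:

```stata
* Keep a marker for the estimation sample of the model that was just run,
* because the auxiliary regression below replaces e(sample).
generate byte insample = e(sample)

* Auxiliary regression: how much of county unemployment is explained by wave alone?
regress psum_unemployed_total_cont_y i.year if insample
display "Share of unemployment variation common to the wave: " %5.3f e(r2)

* The residual is the local deviation from the wave mean, i.e. the variation
* that would still identify the unemployment coefficient alongside year dummies.
predict double unemp_resid if insample, residuals
summarize psum_unemployed_total_cont_y unemp_resid if insample
```

If the R-squared of the auxiliary regression is close to 1, nearly all of the variation in the unemployment measure is common to the wave, which would put a number on the claim that little local variation remains once year dummies are included.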
I thought the following might be a good explanation of this to place in text:
In random effects models, time can be included in the fixed part as discrete time dummies in order to take into account effects that may influence all cases in a given year by the same amount. Here, the intention is to remove a potential cause of spuriousness that results from common trends in observed variables. Including time dummies in this model, however, may overfit it. Put simply, the eliminated trend is the national employment trend, i.e. the effect of the Great Recession. In other words, this analysis considers the effects of unemployment on health during the Great Recession. Impacts are an effect of county-level unemployment but are also made up of the overall impact of the recession experienced at the country level. When I hold year fixed, I feel I am leaving out the national trend in unemployment and, because of this, ignoring the importance of the recession as an employment effect.
Time-varying factors expected to affect the whole sample over time are included as controls in the random effects model. These controls include age, marital status, state-support receipt and own employment. Discrete time dummies are not included, i.e. the effect of time itself is not modelled, because the recession is made up of local and national employment effects. By controlling for year, national employment effects are held fixed and thus only local effects on health may be examined. In other words, holding year constant means that I ignore year effects in my analysis; the problem with this approach is that there was a recession, and the year effects would include this recession.
In small samples such as this one, there is the related problem that time dummies use up degrees of freedom, which has a direct effect on the precision of the parameter estimates. Thus, estimates may be unbiased but unreliable. A model that is too complex, or overspecified, may reduce the precision of coefficient estimates and predicted values. The implications of both bias and precision were thus considered when making this analysis decision. In an exploratory analysis where time dummies were included, the later years, i.e. right when the financial crisis struck, were significant, which would support this. Results are not robust to the inclusion of years; however, as the year terms become significant, I think this supports the argument for a national trend effect.
In-text explanation ends.
My question is, would this be a reasonable argument? Or can I expect to face concerns over not including year in my analysis? Is there a firm quantitative argument that I can make against the inclusion of year? My concern is in dealing with reviewer queries when sending this article for publication.
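One way the national-versus-local distinction could be made explicit, rather than argued only in text, is to split the unemployment measure into a wave-level mean and a local deviation from it. This is a sketch under stated assumptions, not an established recommendation: `unemp_wave_mean` and `unemp_local_dev` are hypothetical names, the wave mean is computed from the sample rather than taken from an official national rate, and the `if` condition from the models above is left out for brevity and would need to be added back.

```stata
* Wave-level (sample) mean of county unemployment, standing in for the national rate.
bysort year: egen double unemp_wave_mean = mean(psum_unemployed_total_cont_y)

* Local deviation of each county from that wave mean.
generate double unemp_local_dev = psum_unemployed_total_cont_y - unemp_wave_mean

* Same specification as the first model, with the two components entered separately.
* Year dummies cannot be added here: with three waves they are collinear with the wave mean.
xtreg binbmi_obese_y unemp_wave_mean unemp_local_dev i.own_education_y ///
    i.maritalstatus_y i.medical_card_y i.employment_y i.ord_age_y, ///
    re vce(cluster current_county_y1)

* Test whether the national and local components have the same coefficient.
test unemp_wave_mean = unemp_local_dev
```

With only three waves, the wave mean takes just three values, so its coefficient cannot be separated from any other factor that changed nationally between waves. The sketch shows where the identifying variation comes from; it does not remove the confounding that year dummies are meant to address.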
I did notice that in the first model -sigma_u- is larger than -sigma_e-, so that a higher proportion of the variation in -depvar- is attributed to the individual effect than to idiosyncratic error (rho is about 0.59), but I'm not sure if this is worth mentioning.
I don't know whether an argument could be made that the year terms don't add anything informative to my results and hence should not be included among the predictors, i.e. I don't know whether my data show any evidence that year has a statistically significant effect on my depvar. Even if there is a statistically significant effect, I would assume that this only supports the argument that I make above, i.e. that there was a recession at this time and that it was affecting health, and that by holding this fixed I can no longer look at the effect of this recession on health?
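For reference, the rho reported by xtreg is the share of the total error variance attributed to the individual effect, sigma_u^2 / (sigma_u^2 + sigma_e^2). A minimal sketch that recomputes it from the values printed for the first model:

```stata
* Recompute rho from the sigma_u and sigma_e reported for the first model.
display .26123467^2 / (.26123467^2 + .21894127^2)
```

This should reproduce the reported rho of about .587. It describes how the unexplained variance is split between mothers and occasions, and does not bear on whether year belongs in the model.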
Although my number of observations may appear large, my actual sample is only 614 mothers, so I don’t know if my argument on the dangers of over-fitting the model above will hold much water.