  • Robustness check for cross-sectional data by merging datasets and creating year dummy variable

    I am currently working on the effects of maternal education on child mortality with cross-sectional data. I have data sets for 2008, 2010, and 2014. I am considering a robustness check, and I received some advice that I am not sure is valid, nor am I sure how to implement it in Stata. I was told to merge the data sets (i.e. 2010 and 2014), create a year dummy variable (i.year), include that dummy, and then simply rerun the regressions I had for my project.

    Also, one thing I am concerned about: do the respondents have to be the same individuals? And what is the difference between running the regressions separately for 2010 and 2014 versus the method above? Thanks in advance!

  • #2
    If you run the regression separately for the two years, you force all coefficients to be the same across the two years.

    If you pool the data, and include a year dummy, you allow only the constant to be different across the two years.
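    In Stata, with hypothetical variable names (a sketch only, not the poster's actual specification), the pooled specification with a year dummy might look like:

    Code:
    * 2010 and 2014 observations stacked in one data set
    * pooled with a year dummy: common slopes, but a separate intercept per year
    regress child_mortality maternal_education covariate1 covariate2 i.year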



    • #3
      Thank you for your reply! As for the first line, I do not understand how the coefficients are forced to be the same. The data are entirely different, so I suppose they would produce different coefficients. As for the second line, which I plan to work with, it seems that allowing a different constant does not help with my robustness check. Could you suggest alternative methods?

      By the way, I actually constructed a few probit models, and the significance drops as I add more controls. Does this count as a robustness check?
      Last edited by Yuki Chan; 19 Apr 2021, 05:19.



      • #4
        Sorry, I misspoke. I meant

        1) "If you run the regression separately for the two years, you ALLOW all coefficients to be DIFFERENT across the two years."

        2) If you pool the two years, and you do not include any dummies, then you force the coefficients to be the same across the two years.

        3) And if you include only a year dummy, you allow only the constant to be different.

        If you pool your data, and you include a year dummy, and an interaction of this year dummy with all your other regressors, then this is the same as in 1), but you can test statistically whether the differences between the two years are significant. You can do a Chow test for stability of the parameters across the two years.
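        As a sketch with hypothetical variable names, the three specifications and the Chow-type test might look like this in Stata (the -testparm- line jointly tests all the year differences):

        Code:
        * 1) separate regressions: every coefficient free to differ across years
        regress child_mortality maternal_education covariate1 if year == 2010
        regress child_mortality maternal_education covariate1 if year == 2014

        * 2) pooled, no year terms: every coefficient forced equal
        regress child_mortality maternal_education covariate1

        * 3) pooled with a year dummy: only the constant differs
        regress child_mortality maternal_education covariate1 i.year

        * fully interacted model: same fits as in 1), but estimated in one model
        regress child_mortality i.year##(c.maternal_education c.covariate1)

        * Chow-type test: are all year differences jointly zero?
        testparm i.year i.year#c.maternal_education i.year#c.covariate1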

        Originally posted by Yuki Chan
        Thank you for your reply! As for the first line, I do not understand how the coefficients are forced to be the same. The data are entirely different, so I suppose they would produce different coefficients. As for the second line, which I plan to work with, it seems that allowing a different constant does not help with my robustness check. Could you suggest alternative methods?

        By the way, I actually constructed a few probit models, and the significance drops as I add more controls. Does this count as a robustness check?



        • #5
          Originally posted by Joro Kolev
          Sorry, I misspoke. I meant

          1) "If you run the regression separately for the two years, you ALLOW all coefficients to be DIFFERENT across the two years."

          2) If you pool the two years, and you do not include any dummies, then you force the coefficients to be the same across the two years.

          3) And if you include only a year dummy, you allow only the constant to be different.

          If you pool your data, and you include a year dummy, and an interaction of this year dummy with all your other regressors, then this is the same as in 1), but you can test statistically whether the differences between the two years are significant. You can do a Chow test for stability of the parameters across the two years.


          Thank you for the clarification. From your response, I feel like I am oversimplifying the robustness test. My initial plan is to see whether a similar/pooled data set that increases the sample size gives the same signs as the results I have now (with the 2014 data set). If they are the same/consistent, my results are robust; if not, they are not robust and I explain why. It should be something similar to number 3, but I am not exactly sure about the interpretation. Now it seems that doing number 1 is essentially what I need, because I am just looking at the magnitudes and signs of the coefficients on maternal education.

          Or should I just do the robustness check by looking at my current probit models and discussing how the signs stay the same even when I add controls? I also came up with the idea of using a similar model (a linear probability model), but I was aware that it wasn't the ideal thing to do. I am quite confused.
          Last edited by Yuki Chan; 19 Apr 2021, 06:15.



          • #6
            Hello Yuki! Did you find your answer? I have the same doubts; if you could kindly share the answers here, that would be great! Thanks in advance.



            • #7
              For concreteness, and staying close to the details set out in the early part of this thread, I will assume that there are three data sets: cross sections for years 2008, 2010, and 2014, with variables for maternal education (key explanatory variable) and child mortality (outcome variable). I will also assume that there are other variables that are potential or actual confounders of the maternal education:child mortality relationship and that these are included in the analysis. Originally only the 2014 data were available and an initial analysis was based on them. The question now is whether the findings can be reproduced in the 2008 and 2010 data sets.

              My approach would be to pool the three data sets (which, by the way, would be an -append-, not a -merge-) and then use an interaction with i.year. So
              Code:
              regression_command child_mortality c.maternal_education##ib2014.year covariates, options
              Basically, except for the ##ib2014.year factor (which sets 2014 as the base year), this should be the same as the original analysis carried out in the 2014 data. (If maternal_education is a discrete variable, change the c. to i.--everything else works the same either way.) The focus of interest then becomes the coefficients of the 2008.year#c.maternal_education and 2010.year#c.maternal_education terms in the output. These represent the differences between the estimated effects of maternal education on child mortality in 2008 and 2010, respectively, and that effect in 2014. A successful robustness test will have these coefficients small.
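              As a sketch with hypothetical file names, the -append- step might look like this (each cross section needs a year identifier before stacking):

              Code:
              * tag each cross section with its year, then stack them with -append-
              use survey2008, clear
              generate int year = 2008
              tempfile y2008
              save `y2008'

              use survey2010, clear
              generate int year = 2010
              tempfile y2010
              save `y2010'

              use survey2014, clear
              generate int year = 2014
              append using `y2008' `y2010'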

              Now, there is another issue, namely whether the covariates should also be interacted with ib2014.year. If you do not include covariate#ib2014.year interactions, then you are implicitly constraining the covariate effects on child mortality to be identical in the three years. If you do, then you allow them to vary freely across the years. This decision should, if possible, be based on substantive knowledge. If the effects of the covariates on child mortality are known, or generally understood based on theory and previous research, to be time invariant, then it is best to omit their interactions and allow them to be constrained to be equal. But if these effects are reasonably believed to vary over time, then the model's validity would be improved by allowing free variation. It may be that some of the covariates are known to have time invariant effects and others not--the model should reflect this. That is, there is no requirement to treat all of the covariates in the same way.
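              Schematically, if a covariate x1 is believed to have a time-invariant effect while x2's effect may vary over time (hypothetical names, using ib2014.year to set 2014 as the base year), only x2 gets the interaction:

              Code:
              * x1's effect constrained equal across years; x2's effect free to vary
              regression_command child_mortality c.maternal_education##ib2014.year ///
                  x1 c.x2##ib2014.year, options

              * a joint test of the x2 interactions can inform whether they are needed
              testparm i.year#c.x2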

              The reason, by the way, for preferring not to have ib2014.year#covariate interactions if they are not necessary is to reduce the number of model degrees of freedom. This avoids two potential problems that arise when model degrees of freedom become too large: overfitting of the noise in the data, and, specifically with interaction terms, the correlation of the interaction terms with the main terms reduces the precision of the estimates and results in unnecessarily large standard errors. In the worst case, these correlations can result in the estimates of the effects of interest becoming inconclusive. So, if in doubt, do not add year#covariate interactions--but be sure to include them wherever prior knowledge says that time-variation of covariate effects is appreciable.

