How to automate dummy variable adjustment for missing data?

Marco Kuehne

Join Date: Feb 2019

Posts: 32
#1

14 Aug 2020, 02:01

I run a fixed effects panel regression on survey data with missing values in my regressors (and DV). Since missing values make up about 35% of my personal data, it's time to deal with them. The first option I found is dummy variable adjustment (DVA). I am aware of some drawbacks of the method in general; in this post I am interested in the code implementation.

I followed the procedure from this site: https://ies.ed.gov/ncee/pubs/20090049/section_3a.asp

My setup is very similar to this MWE:

Code:

* load data
use http://www.stata-press.com/data/r13/nlswork
* set panel structure
xtset idcode year
* 28534 obs, missing data e.g. union 9296
mdesc
* fixed effects regression (automatically uses 13797 complete cases)
quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code union age, fe
margins, dydx(wks_ue)
* dummy variable adjustment to deal with missing data in regressors
gen D=0
replace D=1 if wks_ue==.
replace wks_ue=0 if wks_ue==.
* run FE again (now 19156 obs are used)
quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code##D union age, fe
margins, dydx(wks_ue)

First, my question is whether it is correct to use the dummy D only once in the interaction (since the regressor with missing data enters as a quadratic term). The follow-up question is whether there is a way to automate this procedure, since I have 15 variables in my model and would like to use DVA for 5 of them. Thank you
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#2

15 Aug 2020, 03:24

To me it looks like the purpose of the adjustment is to avoid losing an observation when a single variable out of 15 is missing, and replacing it with 1 is intended to achieve that goal. So using the dummy once, as opposed to twice, sounds appropriate.
gen D=0
replace D=1 if wks_ue==.
replace wks_ue=0 if wks_ue==.
This part of the code will replace D with 1 and immediately restore it back to 0. Unless I am missing something you need 1 to stay in its place until after you run your regression then restore back to 0. So, replace wks_ue=0 if wks_ue==. should come after the xtreg. If not, it will always be 0 when the xtreg runs.
To automate it for all 5 variables, I think all you need is a loop. Something like

foreach v in var1 var2 var3 var4 var5 {
    * create the dummy first, then flag the missing observations
    gen D`v'=0
    replace D`v'=1 if `v'==.
}

If you have only one of the DVA vars in your regression at a time, you can put the regression in the foreach loop, like c.wks_ue##c.wks_ue##i.occ_code##D`v'.
If you intend to use all the DVA vars in the xtreg at the same time, then you need to share what the model would look like, but for sure you cannot include the xtreg in the foreach loop. So, you are looking at something like

foreach v in var1 var2 var3 var4 var5 {
    gen D`v'=0
    replace D`v'=1 if `v'==.
    quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code##D`v' union age, fe
    margins, dydx(wks_ue)
    replace wks_ue=0 if wks_ue==.
}
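
For the case where all DVA variables enter one model, the loop idea could be extended roughly as follows. This is only a sketch under stated assumptions: `tenure` and `hours` stand in for the other DVA variables (they exist in nlswork but are not part of the original model), the `D_` and `Z_` prefixes are made-up naming conventions, and the dummies enter additively rather than being interacted as in the MWE.

```stata
* Sketch only: dummy variable adjustment for several regressors at once
use http://www.stata-press.com/data/r13/nlswork, clear
xtset idcode year

* hypothetical list of variables to adjust; replace with your own
local dvavars wks_ue tenure hours
* this local collects the names of the missing-value dummies
local dummies

foreach v of local dvavars {
    * D_`v' is 1 where `v' is missing, 0 otherwise
    gen byte D_`v' = missing(`v')
    * Z_`v' is a copy of `v' with missing values set to 0
    gen double Z_`v' = cond(missing(`v'), 0, `v')
    local dummies `dummies' D_`v'
}

* all dummies enter together, so the regression stays outside the loop
xtreg ln_wage c.Z_wks_ue##c.Z_wks_ue Z_tenure Z_hours `dummies' union age, fe
margins, dydx(Z_wks_ue)
```

Creating filled copies (`Z_`) instead of overwriting the originals keeps the raw variables intact, so the complete-case model can still be rerun for comparison.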

Last edited by Oscar Ozfidan; 15 Aug 2020, 03:28.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

17 Aug 2020, 15:41

You may already know this, but there is a lot of criticism of that particular approach to missing data. If you are not providing different dummies for each missing observation, you are assuming that all the firms, or all the observations, would have the same value for that variable, which is usually pretty far-fetched. Dealing with missing dependent variables is probably even trickier.

Currently, multiple imputation and maximum likelihood (which you can do through SEM or GSEM) are the most fashionable approaches to missing data. While many like these, when I have dabbled with them on simulated data they don't seem to improve the estimation.
Marco Kuehne

Join Date: Feb 2019

Posts: 32
#4

18 Aug 2020, 23:54

Originally posted by Phil Bromiley

You may already know this, but there is a lot of criticism of that particular approach to missing data. If you are not providing different dummies for each missing observation, you are assuming that all the firms, or all the observations, would have the same value for that variable, which is usually pretty far-fetched. Dealing with missing dependent variables is probably even trickier.

Currently, multiple imputation and maximum likelihood (which you can do through SEM or GSEM) are the most fashionable approaches to missing data. While many like these, when I have dabbled with them on simulated data they don't seem to improve the estimation.

Hi Phil, fortunately I don't have the problem with my DV at the moment. I am aware of the drawbacks. Can you provide some references for how and why to use one dummy for each missing observation? At first, it sounds like a computational challenge. Nonetheless, an interesting idea. Best regards, Marco
Marco Kuehne

Join Date: Feb 2019

Posts: 32
#5

19 Aug 2020, 00:01

Originally posted by Oscar Ozfidan

To me it looks like the purpose of the adjustment is to avoid losing an observation when a single variable out of 15 is missing, and replacing it with 1 is intended to achieve that goal. So using the dummy once, as opposed to twice, sounds appropriate.
gen D=0
replace D=1 if wks_ue==.
replace wks_ue=0 if wks_ue==.
This part of the code will replace D with 1 and immediately restore it back to 0. Unless I am missing something you need 1 to stay in its place until after you run your regression then restore back to 0. So, replace wks_ue=0 if wks_ue==. should come after the xtreg. If not, it will always be 0 when the xtreg runs.

Hi Oscar, I don't see this problem. Whether I generate a new modified regressor or modify the old one, results are identical.

Code:

******** ONE NEW VARIABLE
* dummy variable adjustment to deal with missing data in regressors
gen D=0
replace D=1 if wks_ue==.
replace wks_ue=0 if wks_ue==.
* run FE again (now 19156 obs are used)
quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code##D union age, fe
margins, dydx(wks_ue) // ME -.0082752

******** TWO NEW VARIABLES
* dummy variable adjustment to deal with missing data in regressors
gen D=0
replace D=1 if wks_ue==.
gen wks_ue_Z = wks_ue
replace wks_ue_Z=0 if wks_ue_Z==.
* run FE again (now 19156 obs are used)
quietly xtreg ln_wage c.wks_ue_Z##c.wks_ue_Z##i.occ_code##D union age, fe
margins, dydx(wks_ue_Z) // ME -.0082752