
  • How to automate dummy variable adjustment for missing data?

    I am running a fixed effects panel regression on survey data with missing values in my regressors (and DV). Since missing values make up about 35% of my own data, it's time to deal with them. The first option I found is dummy variable adjustment (DVA). I am aware of some drawbacks of the method in general; in this post I am interested in the code implementation.

    I followed the procedure from this site: https://ies.ed.gov/ncee/pubs/20090049/section_3a.asp

    My setup is very similar to this MWE:

    Code:
    * load data
    use http://www.stata-press.com/data/r13/nlswork
    
    * set panel structure
    xtset idcode year
    * 28534 obs, missing data e.g. union 9296
    * (mdesc is a user-written command; install it with: ssc install mdesc)
    mdesc
    
    * fixed effects regression (automatically uses 13797 complete cases)
    quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code union age, fe
    margins, dydx(wks_ue)
    
    * dummy variable adjustment to deal with missing data in regressors
    gen D=0
    replace D=1 if wks_ue==.
    replace wks_ue=0 if wks_ue==.
    
    * run FE again (now 19156 obs are used)
    quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code##D union age, fe
    margins, dydx(wks_ue)

    First, my question is whether it is correct to use the dummy D once in the interaction (since the regressor with missing data enters as a quadratic term). The follow-up question is whether there is a way to automate this procedure, since I have 15 variables in my model and would like to use DVA for 5 of them. Thank you.

  • #2
    To me it looks like the purpose of the adjustment is to avoid losing an observation when a single variable out of 15 is missing, and setting the dummy to 1 is intended to achieve that goal. So using it once, as opposed to twice, sounds appropriate.
    gen D=0
    replace D=1 if wks_ue==.
    replace wks_ue=0 if wks_ue==.
    This part of the code will replace D with 1 and immediately restore it back to 0. Unless I am missing something, you need the 1 to stay in place until after you run your regression, and only then restore it to 0. So the line replace wks_ue=0 if wks_ue==. should come after the xtreg; otherwise it will always be 0 when the xtreg runs.
    To automate it for all 5 variables, I think all you need is a loop. Something like

    foreach v in var1 var2 var3 var4 var5 {
    gen D`v'=0
    replace D`v'=1 if `v'==.
    }

    If you have only one of the DVA variables in your regression at a time, you can put the regression inside the foreach loop, as in c.wks_ue##c.wks_ue##i.occ_code##D`v'.
    If you intend to use all the DVA variables in the xtreg at the same time, then you need to share what the model would look like, but for sure you cannot include the xtreg in the foreach loop. So, you are looking at something like

    foreach v in var1 var2 var3 var4 var5 {
    gen D`v'=0
    replace D`v'=1 if `v'==.
    quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code##D`v' union age, fe
    margins, dydx(wks_ue)
    replace wks_ue=0 if wks_ue==.
    }
    Last edited by Oscar Ozfidan; 15 Aug 2020, 03:28.
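
    If all the DVA variables enter the model together, one possible arrangement is to let the loop only prepare the data and to run xtreg once afterwards. The following is a minimal sketch under assumptions, not the procedure from the linked page verbatim: the variable list (wks_ue, union, tenure, hours) and the names D_* and *_Z are chosen here purely for illustration, and the indicators enter additively rather than through an interaction.

    ```stata
    * sketch: dummy variable adjustment for several regressors at once
    * (the variable list is an assumption, picked from nlswork for illustration)
    use http://www.stata-press.com/data/r13/nlswork, clear
    xtset idcode year

    local dva_vars wks_ue union tenure hours
    local indicators

    foreach v of local dva_vars {
        * indicator: 1 where the regressor is missing, 0 otherwise
        gen byte D_`v' = missing(`v')
        * zero-filled copy; the original variable is left untouched
        gen `v'_Z = cond(missing(`v'), 0, `v')
        * collect the indicator names for the regression below
        local indicators `indicators' D_`v'
    }

    * one FE regression using the zero-filled copies plus their indicators
    xtreg ln_wage c.wks_ue_Z##c.wks_ue_Z union_Z tenure_Z hours_Z age `indicators', fe
    margins, dydx(wks_ue_Z)
    ```

    Because the originals are kept, the complete-case model can still be re-run on the same data for comparison.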



    • #3
      You may already know this, but there is a lot of criticism of that particular approach to missing data. If you are not providing a different dummy for each missing observation, you are assuming that all the firms (that is, all the observations with a missing value) would have the same value for that variable, which is usually pretty far-fetched. Dealing with missing dependent variables is probably even trickier.

      Currently, multiple imputation and maximum likelihood (which you can do through SEM or GSEM) are the most fashionable approaches to missing data. While many like these, when I have dabbled with them on simulated data they don't seem to improve the estimation.
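
      For readers who have not used it, multiple imputation in Stata goes through the mi suite. The lines below are only a rough sketch on the nlswork data from post #1, not a recommended specification: the imputation predictors (ln_wage, ttl_exp, race) are an assumption and are taken here to be complete, the linear imputation model ignores the panel structure and can produce negative weeks, and the number of imputations and the seed are arbitrary.

      ```stata
      * sketch only: multiple imputation of wks_ue, then a pooled FE regression
      use http://www.stata-press.com/data/r13/nlswork, clear

      mi set mlong
      mi xtset idcode year
      mi register imputed wks_ue

      * imputation model: predictors are an assumption, taken to be complete
      mi impute regress wks_ue ln_wage ttl_exp i.race, add(5) rseed(12345)

      * fit the FE model on each imputed data set and combine the results
      mi estimate: xtreg ln_wage wks_ue ttl_exp, fe
      ```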



      • #4
        Originally posted by Phil Bromiley
        You may already know this, but there is a lot of criticism of that particular approach to missing data. If you are not providing a different dummy for each missing observation, you are assuming that all the firms (that is, all the observations with a missing value) would have the same value for that variable, which is usually pretty far-fetched. Dealing with missing dependent variables is probably even trickier.

        Currently, multiple imputation and maximum likelihood (which you can do through SEM or GSEM) are the most fashionable approaches to missing data. While many like these, when I have dabbled with them on simulated data they don't seem to improve the estimation.
        Hi Phil, fortunately I don't have the problem with my DV at the moment, and I am aware of the drawbacks. Can you provide some references on how and why to use one dummy for each missing observation? At first glance it sounds like a computational challenge, but it is an interesting idea nonetheless. Best regards, Marco



        • #5
          Originally posted by Oscar Ozfidan
          To me it looks like the purpose of the adjustment is to avoid losing an observation when a single variable out of 15 is missing, and setting the dummy to 1 is intended to achieve that goal. So using it once, as opposed to twice, sounds appropriate.
          gen D=0
          replace D=1 if wks_ue==.
          replace wks_ue=0 if wks_ue==.
          This part of the code will replace D with 1 and immediately restore it back to 0. Unless I am missing something, you need the 1 to stay in place until after you run your regression, and only then restore it to 0. So the line replace wks_ue=0 if wks_ue==. should come after the xtreg; otherwise it will always be 0 when the xtreg runs.
          Hi Oscar, I don't see this problem. Whether I generate a new modified regressor or modify the old one, results are identical.


          Code:
          ******** ONE NEW VARIABLE
          * dummy variable adjustment to deal with missing data in regressors
          gen D=0
          replace D=1 if wks_ue==.
          replace wks_ue=0 if wks_ue==.
          
          * run FE again (now 19156 obs are used)
          quietly xtreg ln_wage c.wks_ue##c.wks_ue##i.occ_code##D union age, fe
          margins, dydx(wks_ue) // ME -.0082752
          
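          ******** TWO NEW VARIABLES
          * reload the data so wks_ue has its missing values again and D can be re-created
          use http://www.stata-press.com/data/r13/nlswork, clear
          xtset idcode year
          * dummy variable adjustment to deal with missing data in regressors
          gen D=0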
          replace D=1 if wks_ue==.
          gen wks_ue_Z = wks_ue
          replace wks_ue_Z=0 if wks_ue_Z==.
          
          * run FE again (now 19156 obs are used)
          quietly xtreg ln_wage c.wks_ue_Z##c.wks_ue_Z##i.occ_code##D union age, fe
          margins, dydx(wks_ue_Z) // ME -.0082752
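
          The point under discussion, whether the second replace touches D at all, can also be checked directly without running a regression. This is a small sketch on the same nlswork data; the local macro name n_before is just an illustrative choice.

          ```stata
          * check that zero-filling wks_ue leaves the indicator D unchanged
          use http://www.stata-press.com/data/r13/nlswork, clear

          gen D = 0
          replace D = 1 if wks_ue == .

          * number of flagged observations before the zero-fill
          count if D == 1
          local n_before = r(N)

          replace wks_ue = 0 if wks_ue == .

          * the same observations are still flagged afterwards
          count if D == 1
          assert r(N) == `n_before'

          * every flagged observation now carries the fill value 0
          assert wks_ue == 0 if D == 1
          ```

          If either assert fails, Stata stops with an error, so a silent run means D kept its 1s after wks_ue was set to 0.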
