  • Panel Data Missing Observations! Solution

    Dear Statalisters,

    I am currently running analysis on a 2-period panel data set from China. I use a Fixed Effects model for some regressands, an ordered logit model for others.

    I have non-responses for some of my regressors, which causes observations to be dropped. What I do for the FE model is the following:

    I create a dummy that equals 1 if the regressor is missing. I then change the missing values in the regressor to 0 and include both in the regression. For my FE model, this yields very similar results to those I get when I accept the deletion of observations, but I am now able to use all observations.

    This is a code example for my education variable

    Code:
    * First I would do:
    xtreg y edu, fe
    * --> uses 49,000 observations

    * Then I do (missing() also catches extended missing values such as .a):
    gen byte eduX = missing(edu)
    replace edu = 0 if missing(edu)

    * And now I run:
    xtreg y edu eduX, fe
    * --> now uses the full 72,000 observations, results are very similar
    However, when I do EXACTLY the same with an ordered logit model for a different set of variables, I get very different results from those I get without eduX:

    Code:
    xtologit y edu
    * --> uses 49,000 observations
    xtologit y edu eduX
    * --> uses 72,000 observations, but results are VERY different
    Now my question: Can you use this strategy with an ordered logit model? Or is there something fundamentally different about ordered logits that means it does not work? And if it does work, how do I interpret the fact that the results change a lot?

    Many thanks in advance!!

  • #2
    Andreas:
    1) introducing categorical variables to account for missing values is not recommended, as this approach produces biased results regardless of the mechanism underlying the missingness of your data (see: https://www.guilford.com/books/Missi...9781593853938: 169-170);
    2) in addition to the above, it is no wonder that you obtained different results with (such) different regression models.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Originally posted by Carlo Lazzaro
      Andreas:
      1) introducing categorical variables to account for missing values is not recommended, as this approach produces biased results regardless of the mechanism underlying the missingness of your data (see: https://www.guilford.com/books/Missi...9781593853938: 169-170);
      2) in addition to the above, it is no wonder that you obtained different results with (such) different regression models.
      Carlo, thanks for your answer, I will read through your 1) point when I get to the library later.

      Regarding your point 2): I am estimating different regressands with the two models. It is obvious to me that even if the regressand were the same in both models, they would produce different results.

      However, for the same regressand in the ordered logit, my results change remarkably when I include the missing-value dummies. I don't think this changes anything about your answer; I just wanted to make clear what I meant.

      • #4
        Andreas:
        I see your point 2).
        Posting what you typed (as you actually did) and the entire outcome that Stata gave you back (as you probably forgot to do) would have avoided misunderstandings. Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          Dear Carlo,

          I am deeply sorry for bringing this tedious question up again. It was actually my supervisor who brought up the idea of using dummy variable adjustment (DVA). I have further consulted the literature, and it all confirms your argument that this produces biased results. I think you can understand that this makes me feel really unsure about what to do (my supervisor is gone for a week and I need to make progress).

          At the moment, my strategy is the following:

          1. I run the regression as it is and get results with a sample of 15,000 observations
          Code:
          xtreg Regressand Regressor1 Regressor2 Regressor3 Missing1 Missing2, fe
          2. I apply DVA to the 2 controls that have the most missing data (they account for >75% of the missingness) and then run it again. Now I get 21,000 observations, and the results are remarkably similar to those from the first regression
          Code:
          xtreg Regressand Regressor1 Regressor2 Regressor3 Missing1 Missing2 Missing1X Missing2X, fe
          3. I now run the same regression and simply exclude the two controls. I again get 21,000 observations, with results similar to those in regression 1 and basically the same as in regression 2
          Code:
          xtreg Regressand Regressor1 Regressor2 Regressor3, fe
          I then argue that DVA is biased and that one should treat the results from regression 2 with caution. For regression 3, I argue that it is very reassuring that the results do not change much when I exclude the two controls. However, am I right in thinking that I cannot argue for excluding these 2 controls altogether, as they are significant when included?
          --> In general, I argue that regressions 2 and 3 show that there seems to be no (strong) attrition bias. The results do not seem to be driven by the sample. Is this line of arguing sound?
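
          The comparison in steps 1 to 3 changes the sample and the set of controls at the same time. A minimal sketch of one way to separate the two, assuming the variable names used above: flag the complete cases with e(sample), then fit the short specification once on those cases and once on all available observations. The names cc, full_cc, short_cc and short_all are hypothetical, and this is only an illustration of the check, not a test for attrition bias.

          ```stata
          * Regression 1: Stata uses complete cases only
          xtreg Regressand Regressor1 Regressor2 Regressor3 Missing1 Missing2, fe
          * cc = 1 for observations that entered regression 1
          generate byte cc = e(sample)
          estimates store full_cc

          * Regression 3 restricted to the same complete cases
          xtreg Regressand Regressor1 Regressor2 Regressor3 if cc, fe
          estimates store short_cc

          * Regression 3 on every available observation
          xtreg Regressand Regressor1 Regressor2 Regressor3, fe
          estimates store short_all

          * Side-by-side coefficients, standard errors and sample sizes
          estimates table full_cc short_cc short_all, b(%9.3f) se stats(N)
          ```

          If short_cc and short_all are close, the change in sample matters little for the short specification; if full_cc and short_cc differ, the difference comes from dropping the two controls rather than from the sample.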

          Now, as you might remember, I also run an ordered logit on a different regressand.

          What I do there is similar to 1. above, and I get results with a sample of 15,000 observations
          Code:
          xtologit Regressand Regressor1 Regressor2 Regressor3 Missing1 Missing2
          When I do 2. from above, I also get 21,000 observations, but now the results are remarkably different from before
          Code:
          xtologit Regressand Regressor1 Regressor2 Regressor3 Missing1 Missing2 Missing1X Missing2X
          When I exclude the variables entirely as in 3. above, I get 21,000 observations again, but again very different results
          Code:
          xtologit Regressand Regressor1 Regressor2 Regressor3
          What should I do now regarding this second part of the analysis, so that it is coherent with the FE regression? My supervisor was not able to help me with this, as he does not work with ordered logits much, and I struggle to find (simple) literature on the issue. Can I just proceed with the estimation on 15,000 observations, given that I have shown that the FE results do not seem to be driven by the sample? I just do not know how to apply the strategies from the first regression to the ordered logit....

          I know my questions are tedious, but since I have spoken to my supervisor I am just really confused and lost as to how to address this issue.

          I would be extremely grateful if you could help me with this issue.

          Many thanks in advance,
          Andreas

          • #6
            Andreas:
            as the augmented regression with dummies for missing values is in all likelihood biased, probably the most politically (but not methodologically) correct approach is to run your regressions on observed values only (this is something Stata does by default) and to highlight in the Discussion section of your dissertation that you did not deal with missing values (i.e., this is a limitation of your research).
            Kind regards,
            Carlo
            (Stata 19.0)
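
            When reporting missing values as a limitation, it can help to document how much is missing and where. A minimal sketch, assuming the variable names from post #5; nmiss and complete are hypothetical helper variables, and the summary by group is only descriptive.

            ```stata
            * Count missing values in each regressor
            misstable summarize Regressor1 Regressor2 Regressor3 Missing1 Missing2

            * Show which variables tend to be missing together
            misstable patterns Regressor1 Regressor2 Regressor3 Missing1 Missing2, frequency

            * Flag complete cases and describe the regressand in each group
            egen byte nmiss = rowmiss(Regressor1 Regressor2 Regressor3 Missing1 Missing2)
            generate byte complete = (nmiss == 0)
            tabstat Regressand, by(complete) statistics(mean sd n)
            ```

            The counts from misstable and the group comparison from tabstat can be reported alongside the complete-case estimates so that readers can judge the extent of the limitation.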
