  • How to not lose regression observations when there are missing values

    Dear Statalist Community,

    I am currently working with a panel dataset consisting of 20,082 observations over 19 years, focusing on youth aged 15-24. Each observation includes variables for both the mother and father, which I merged from the original raw panel data before restricting my sample to the youth. However, I encountered issues with missing data. If a respondent has a single parent (e.g., the father is absent), the corresponding "dad variables" (e.g., dad_age, dad_schooling) are marked as missing. Additionally, some respondents did not answer certain survey questions, resulting in missing variables like financial satisfaction.

    To address this, I want to include observations with missing values for some regressors in the regression analysis. Excluding these observations would restrict my sample to those with two parents, potentially introducing selection bias, especially since my dependent variables of interest are mental and physical health. Note, my regression is a fixed effects regression, with individual, local government area and time fixed effects.

    One approach I am considering is assigning an abstract value of -1 to the missing variables to keep these observations in the regression. However, I do not want these filled values to impact the estimated relationship between my dependent and independent variables.

    My plan is to create a missingness indicator for each variable, assigning a value of 1 if the variable is not missing and 0 if it is. After filling the missing values with -1, I would interact the missingness indicator with the original variable. This way, when an observation has a missing value, the indicator (0) multiplied by the filled value (-1) will ensure that it does not affect the dependent variable. Meanwhile, the observation will remain in the regression, allowing other non-missing variables to contribute to the analysis.
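    A minimal sketch of the plan described above, on simulated data. All variable names (dad_age, dad_obs, dad_age_f, dad_age_x) and the data-generating process are assumptions made for illustration, and fixed effects are left out. Note that the product of the indicator and the filled variable is 0 for every missing case whatever the fill value, so the -1 never enters the regression.

    ```stata
    * Toy data; every variable name here is hypothetical
    clear
    set seed 12345
    set obs 1000
    generate dad_age = 40 + round(10*runiform())
    replace dad_age = . if runiform() < 0.2          // father absent
    generate y = 0.05*cond(missing(dad_age), 0, dad_age) + rnormal()

    * Indicator as described in the post: 1 if observed, 0 if missing
    generate byte dad_obs = !missing(dad_age)

    * Fill the missing values with an arbitrary constant
    generate dad_age_f = cond(missing(dad_age), -1, dad_age)

    * Interaction: equals dad_age when observed and 0 when missing
    generate dad_age_x = dad_obs * dad_age_f

    * The indicator enters as a main effect so that the missing group
    * gets its own level shift in y
    regress y dad_age_x dad_obs
    ```

    Because dad_age_x is 0 whenever dad_age is missing, this is the same as filling with 0 and adding an indicator.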

    Does this make sense to do? Or are there any problems with doing this? Are there any other ways to deal with this issue?

    I hope this makes sense. Looking forward to any insights!


    Edit:
    I forgot to add that the above method is not what I am doing currently. Currently, I create a dummy equal to one if the variable is missing.

    For example, take a variable mum_ed that is missing (.) for all kids who have a single dad as their only parent. Then:

    generate mum_missing = missing(mum_ed)   // 1 if mum_ed is missing, 0 otherwise
    generate new_mum_ed = mum_ed
    replace new_mum_ed = 0 if mum_missing == 1   // nb it needn't be filled in as 0 if missing; you could use another value

    regress y new_mum_ed mum_missing

    Does this also work? I think it allows for a linear effect of mum_ed for those with real responses to that question, and for a level difference in y, on average, for those with missing mums.
    Last edited by Andrew Mizon; 02 Aug 2024, 00:14.
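    The remark that the fill value need not be 0 can be checked on simulated data. The sketch below is an illustration only: the variable names and the data-generating process are assumptions, and it uses plain OLS without the fixed effects mentioned in the post. With the missingness indicator in the model, the slope on the filled variable should be the same for any fill constant; only the coefficient on the indicator shifts.

    ```stata
    * Toy data; names and parameters are hypothetical
    clear
    set seed 2024
    set obs 500
    generate mum_ed = 8 + floor(8*runiform())
    replace mum_ed = . if runiform() < 0.25
    generate byte mum_missing = missing(mum_ed)
    generate y = 1 + 0.3*cond(mum_missing, 10, mum_ed) + rnormal()

    * Two versions of the filled variable: fill with 0 and fill with -1
    generate ed_fill0  = cond(mum_missing,  0, mum_ed)
    generate ed_fillm1 = cond(mum_missing, -1, mum_ed)

    regress y ed_fill0 mum_missing
    scalar b0 = _b[ed_fill0]

    regress y ed_fillm1 mum_missing
    scalar b1 = _b[ed_fillm1]

    * The two slopes agree up to numerical precision
    display "slope (fill 0) = " b0 "   slope (fill -1) = " b1
    assert reldif(b0, b1) < 1e-6
    ```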

  • #2
    Andrew:
    Welcome to this forum.
    The only reasonable and acceptable approach to dealing with missing values entails diagnosing their mechanism (missing completely at random, missing at random, or missing not at random) and managing them according to that mechanism.
    Among tons of literature, you may want to start with the -mi- suite in the Stata .pdf manual.
    Kind regards,
    Carlo
    (Stata 19.0)
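    As a rough sketch of what a first pass with the -mi- suite might look like, the code below describes the missingness and then imputes one variable on simulated data. The variable names (fin_sat, house_price), the imputation model, and the number of imputations are assumptions; the example treats the data as a cross-section, makes values missing completely at random by construction, and ignores the panel structure and fixed effects of the original problem.

    ```stata
    * Toy data; names, parameters and the MCAR mechanism are hypothetical
    clear
    set seed 42
    set obs 800
    generate house_price = rnormal(500, 100)
    generate fin_sat = 5 + 0.002*house_price + rnormal()
    generate y = 2 + 0.001*house_price + 0.4*fin_sat + rnormal()
    replace fin_sat = . if runiform() < 0.15

    * Describe how much is missing and in which patterns
    misstable summarize
    misstable patterns

    * Declare the mi structure and the role of each variable
    mi set wide
    mi register imputed fin_sat
    mi register regular y house_price

    * Impute fin_sat from a linear regression on the other analysis variables
    mi impute regress fin_sat y house_price, add(10) rseed(42)

    * Fit the analysis model on each imputed dataset and combine the results
    mi estimate: regress y fin_sat house_price
    ```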


    • #3
      Originally posted by Andrew Mizon:
      Note, my regression is a fixed effects regression, with individual, [...] fixed effects.
      Won't individual fixed effects wipe out the parent information anyway? I mean, who your parents are doesn't change much over the life course.

      That aside, a so-called "dummy variable adjustment" approach to missing parental information may be justified and even outperform more advanced methods. Make sure to carefully read Paul Allison's take on the matter. In the basic model setup, you imply that associations between all variables and the outcome are constant across those with two parents and those with a single parent. Given the massive number of observations implied in your original post, separated models might be preferable.
      Last edited by daniel klein; 02 Aug 2024, 02:38.
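      The separate-models idea could be sketched as follows on simulated data. The variable names (two_parent, mum_ed, dad_ed) and the data-generating process are assumptions, single-parent households are simplified to single-mother households, and fixed effects are omitted.

      ```stata
      * Toy data; names and parameters are hypothetical
      clear
      set seed 7
      set obs 1000
      generate byte two_parent = runiform() < 0.7
      generate mum_ed = 8 + floor(8*runiform())
      generate dad_ed = 8 + floor(8*runiform()) if two_parent
      generate y = 1 + 0.2*mum_ed + cond(two_parent, 0.1*dad_ed, 0) + rnormal()

      * Two-parent households: both parents' variables enter the model
      regress y mum_ed dad_ed if two_parent == 1

      * Single-mother households: only the mother's variables are defined
      regress y mum_ed if two_parent == 0
      ```

      Fitting the groups separately lets every coefficient differ by family structure, rather than only the intercept.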


      • #4
        Originally posted by daniel klein:

        Won't individual fixed effects wipe out the parent information anyway? I mean, who your parents are doesn't change much over the life course.

        That aside, a so-called "dummy variable adjustment" approach to missing parental information may be justified and even outperform more advanced methods. Make sure to carefully read Paul Allison's take on the matter. In the basic model setup, you imply that associations between all variables and the outcome are constant across those with two parents and those with a single parent. Given the massive number of observations implied in your original post, separated models might be preferable.
        Hi Daniel,

        Thank you for your comment. Paul Allison's article was very helpful!

        To get a second opinion, do you think my case could work under his example of "Data are missing on X because that variable doesn’t apply or has no meaning for some subset of the sample"? Similar to his marriage and marital satisfaction example, the mum_education variable for a child without a mother seems nonsensical. Or is my case more nuanced given the context?

        Additionally, to address your fixed effects comment, there is variation in the mother and father variables I am working with, such as their mental and physical health, labor force status, income, and so on.

        Looking forward to hearing your thoughts!


        • #5
          First, let me admit that I am not completely clear on one of Allison's (crucial) points. If it is true that we can plug in any constant value to substitute the missing values, and it is also true that the dummy variable adjustment approach (DVA) generally leads to biased estimates, then it cannot be true that
          [...] this is just the DVA model with c = 0. And if the ε satisfies the usual assumptions for linear regression, OLS applied to this model will yield "best unbiased estimates" of all coefficients.
          The solution to this might be buried in the details that DVA
          [...] coefficient estimates tend to be biased
          (my emphasis)

          and that Allison chooses to not
          [...] going into the details of [the] proof

          With that said, I think that, generally, the "doesn't apply" scenario fits your description of missing parental information. I am unsure how the fixed effects framework plays out for DVA. I mean, the missing indicator (dummy) is constant within units, so unit fixed effects will wipe it out. Whether that means that the missingness is already accounted for in FE or cannot be accounted for, I don't know. I would have to think about this, but I will not find the time to do it. If anyone has some insights, I'd highly appreciate their thoughts.
          Last edited by daniel klein; 03 Aug 2024, 13:16.
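          Whether the missingness indicator really is constant within units can be checked directly. The sketch below uses a simulated panel; the variable names (id, year, dad_missing) and the pattern of absence are assumptions. Units whose indicator never changes contribute no within-unit variation, so their indicator is absorbed by the unit fixed effects.

          ```stata
          * Toy panel: 200 units observed for 5 years; all names are hypothetical
          clear
          set seed 99
          set obs 200
          generate id = _n
          generate byte dad_ever_absent = runiform() < 0.3
          expand 5
          bysort id: generate year = 2000 + _n

          * Father absent in all years for some units, and only from 2003 on for others
          generate byte dad_missing = dad_ever_absent & (mod(id, 4) == 0 | year >= 2003)

          * Decompose the variation into between-unit and within-unit parts
          xtset id year
          xtsum dad_missing

          * Count the units in which the indicator changes over time
          bysort id: egen byte mn = min(dad_missing)
          bysort id: egen byte mx = max(dad_missing)
          generate byte varies = mn != mx
          egen byte first = tag(id)
          tabulate varies if first
          ```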


          • #6
            Originally posted by Carlo Lazzaro:
            Andrew:
            -mi- suite, Stata .pdf manual,
            Hi Carlo, thank you for the warm welcome. Do you have any thoughts on the following concern, though? If I impute the missing values (using a prediction from some kind of regression model), the imputation could depend directly on house prices, which is my main regressor of interest (i.e., I am looking at the influence of house prices on youth wellbeing).


            • #7
              Andrew:
              See if -ipolate- can help you out more easily than -mi-.
              Kind regards,
              Carlo
              (Stata 19.0)
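              For reference, a minimal sketch of -ipolate- within panels. The data and the variable names (id, year, mum_income) are made up for illustration. -ipolate- interpolates linearly between observed values, so it only fills gaps that lie between two observed points for the same unit.

              ```stata
              * Hypothetical toy panel with gaps in mum_income
              clear
              input id year mum_income
              1 2001 30
              1 2002 .
              1 2003 36
              2 2001 .
              2 2002 20
              2 2003 22
              end

              * Linear interpolation of mum_income on year, separately for each id
              bysort id: ipolate mum_income year, generate(mum_income_ip)

              * The gap for id 1 in 2002 is filled; the leading gap for id 2 stays
              * missing because the epolate option (extrapolation) is not specified
              list, sepby(id)
              ```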
