Dear Statalist Community,
I am currently working with a panel dataset of 20,082 observations over 19 years, focusing on youth aged 15-24. Each observation includes variables for both the mother and the father, which I merged in from the original raw panel data before restricting my sample to the youth. However, I have run into problems with missing data. If a respondent has a single parent (e.g., the father is absent), the corresponding "dad variables" (e.g., dad_age, dad_schooling) are missing. In addition, some respondents did not answer certain survey questions, which leaves missing values in variables such as financial satisfaction.
To address this, I want to keep observations with missing values for some regressors in the regression analysis. Excluding these observations would restrict my sample to youth with two parents, potentially introducing selection bias, especially since my dependent variables of interest are mental and physical health. Note that my regression is a fixed effects regression with individual, local government area, and time fixed effects.
One approach I am considering is assigning an arbitrary placeholder value of -1 to the missing values so that these observations stay in the regression. However, I do not want these filled values to affect the estimated relationship between my dependent and independent variables.
My plan is to create an indicator for each variable that equals 1 if the variable is not missing and 0 if it is missing. After filling the missing values with -1, I would interact this indicator with the filled variable. This way, when an observation has a missing value, the indicator (0) multiplied by the filled value (-1) equals zero, so the placeholder should not influence the estimated coefficient on that variable. Meanwhile, the observation remains in the regression, allowing its other non-missing variables to contribute to the analysis.
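To make the plan concrete, here is a minimal sketch of what I have in mind for one variable. The names y, id, lga and year are placeholders I made up for this post, and including the indicator itself as a regressor is my own assumption about how the specification should look, not something I am sure is right.

```stata
* Sketch only. y, id, lga and year are hypothetical names; dad_age is from my data.
gen byte dad_age_obs = !missing(dad_age)                // 1 = observed, 0 = missing
gen dad_age_fill = cond(missing(dad_age), -1, dad_age)  // fill missing values with -1
gen dad_age_int = dad_age_obs * dad_age_fill            // 0 when missing, dad_age otherwise

* Individual fixed effects via xtreg, fe; LGA and year effects as factor variables
xtset id
xtreg y dad_age_int dad_age_obs i.lga i.year, fe vce(cluster id)
```

Written this way, the interaction is numerically the same as filling the missing values with 0, since the -1 is always multiplied by 0.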
Does this make sense to do? Are there any problems with doing this? Are there any other ways to deal with this issue?
I hope this makes sense. Looking forward to any insights!
Edit:
I forgot to add that the method above is not what I am currently doing. At the moment, I create a dummy equal to one if the variable is missing.
For example, take a variable mum_ed that is missing (.) for all kids who live with a single father. Then:
gen byte mum_missing = missing(mum_ed)
gen new_mum_ed = mum_ed
* nb it needn't be filled in as 0 if missing - you could use another value
replace new_mum_ed = 0 if mum_missing == 1
regress y new_mum_ed mum_missing
Does this also work? I think it allows for a linear effect of mum_ed among those with real responses to that question, plus a level difference in y, on average, for those whose mum variables are missing.
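One way I thought of checking whether the fill value matters is a small simulation on made-up data. Everything below is simulated for illustration (the sample size, the coefficients and the 30% missingness rate are arbitrary assumptions), and it uses plain OLS rather than my fixed effects specification.

```stata
* Simulated data only; none of these numbers come from my panel.
clear
set seed 12345
set obs 1000
gen mum_ed = rnormal(12, 2)
gen y = 1 + 0.5*mum_ed + rnormal()
replace mum_ed = . if runiform() < 0.3      // make about 30% of mum_ed missing

gen byte mum_missing = missing(mum_ed)
gen new_mum_ed0 = cond(mum_missing, 0, mum_ed)    // fill with 0
gen new_mum_edm1 = cond(mum_missing, -1, mum_ed)  // fill with -1

regress y new_mum_ed0 mum_missing
regress y new_mum_edm1 mum_missing
```

If the approach works as I think, the coefficient on the filled variable should be the same in both regressions, and only the coefficient on mum_missing should change with the fill value.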