Missing data for categorical variables

Tracy Lam

Join Date: Jul 2014

Posts: 91
#1

12 Apr 2015, 16:44

Hi Statalist,

I am using longitudinal survey data and have some missing cases for categorical variables. I'm handling missing data with dummy variable adjustments. For categorical variables with missing data, such as parental level of education (no HS diploma, HS diploma, some college, college degree, advanced degree), does it make sense to create a new category that indicates there is missing data or should I create a missing dummy variable for each of the 5 categories of parental level of education?
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

12 Apr 2015, 19:16

Neither. I suggest you type "h mi" (short for "help mi", Stata's help on multiple imputation).
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#3

13 Apr 2015, 01:32

To expand a bit on Rich's correct answer:

Say you have a linear model with two explanatory variables:

\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]

In that case, \(\beta_1\) is the effect of \(x_1\) while adjusting for \(x_2\). What happens when some of the observations have missing values for \(x_2\) and you adjust for that with an indicator variable \(m_2\) (equal to 1 when \(x_2\) is missing and 0 otherwise), setting the missing values of \(x_2\) to 0? You then have the model:

\[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 m_2\]

What happens when \(x_2\) is not missing?

\[\begin{aligned}
\hat{y} &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 \cdot 0 \\
&= \beta_0 + \beta_1 x_1 + \beta_2 x_2
\end{aligned}\]

So in that case \(\beta_1\) is the effect of \(x_1\) while adjusting for \(x_2\), which is what we wanted.

What happens when \(x_2\) is missing?

\[\begin{aligned}
\hat{y} &= \beta_0 + \beta_1 x_1 + \beta_2 \cdot 0 + \beta_3 \cdot 1 \\
&= \underbrace{\beta_0 + \beta_3}_{\beta_0^*} + \beta_1 x_1
\end{aligned}\]

Now you have a model that does not adjust for \(x_2\), but we have the same parameter for the effect of \(x_1\). So \(\beta_1\) is a mixture of the effect of \(x_1\) while adjusting for \(x_2\) and the effect of \(x_1\) while not adjusting for \(x_2\). If the missing values are genuine missing values, then that does not make sense.
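
The same argument carries over to the categorical case in the original question. As a sketch, assume parental education enters as four indicator variables \(d_2, \dots, d_5\), with "no HS diploma" as the reference category, plus an indicator \(m\) for the added "missing" category (the symbols \(d_k\), \(\gamma_k\), \(\gamma_m\) and \(m\) are introduced here for illustration only):

\[\hat{y} = \beta_0 + \beta_1 x_1 + \sum_{k=2}^{5} \gamma_k d_k + \gamma_m m\]

When parental education is missing, all \(d_k = 0\) and \(m = 1\), so

\[\hat{y} = \underbrace{\beta_0 + \gamma_m}_{\beta_0^*} + \beta_1 x_1\]

which again does not adjust for parental education, yet shares the same \(\beta_1\) with the cases where it was observed.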

An exception could be when the missing values aren't really missing but there is some structural reason why that variable has no value, for example mother's occupational status when she is a homemaker.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
El son

Join Date: Oct 2020

Posts: 4
#4

06 Oct 2020, 00:01

What if we run the regression using the indicator dummy variable for missing versus non-missing values, but then remove the indicator and its beta from our model (the regression itself is still run on the full set of observations)? That way, when we enter values we don't have to consider whether a value is missing, since we will only consider the cases for which we have data. Would that work?
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#5

06 Oct 2020, 01:13

I don't understand what you propose. How can you run a regression with a variable and then remove it from the model?

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17711
#6

06 Oct 2020, 03:18

Tracy:
The authors of https://www.guilford.com/books/Missi.../9781593853938 (thanks once more to Maarten Buis for sharing this reference many years ago on this list) warn, on pages 169-170, against the dummy variable adjustment approach, as it usually produces biased estimates regardless of the underlying missing-data mechanism (MCAR, MAR, MNAR).
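
A small simulation can illustrate the kind of bias the dummy variable adjustment may introduce, even when values are missing completely at random. The sketch below is written in plain Python rather than Stata, purely for illustration. Every setting in it is an assumption and none comes from this thread: true coefficients of 1 on both x1 and x2, a correlation of 0.7 between x1 and x2, 50% MCAR missingness in x2, the seed, and the helper names `ols` and `simulate`.

```python
import random


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(a[row][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [v - f * w for v, w in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]


def simulate(n=20000, seed=12345):
    """Return the estimated x1 coefficient under three strategies."""
    rng = random.Random(seed)
    full, dummy, cc, y, y_cc = [], [], [], [], []
    for _ in range(n):
        x2 = rng.gauss(0, 1)
        # x1 is correlated with x2 (assumed correlation 0.7)
        x1 = 0.7 * x2 + rng.gauss(0, 1) * (1 - 0.7 ** 2) ** 0.5
        yi = 1.0 * x1 + 1.0 * x2 + rng.gauss(0, 1)  # assumed true model
        missing = rng.random() < 0.5  # MCAR: independent of everything else
        y.append(yi)
        full.append([1.0, x1, x2])  # benchmark: x2 fully observed
        # Dummy variable adjustment: missing x2 set to 0, plus indicator m2
        dummy.append([1.0, x1, 0.0 if missing else x2,
                      1.0 if missing else 0.0])
        if not missing:  # complete-case analysis
            cc.append([1.0, x1, x2])
            y_cc.append(yi)
    return ols(full, y)[1], ols(cc, y_cc)[1], ols(dummy, y)[1]


b_full, b_cc, b_dummy = simulate()
print("x1 coefficient, no missing data:   ", round(b_full, 3))
print("x1 coefficient, complete cases:    ", round(b_cc, 3))
print("x1 coefficient, dummy adjustment:  ", round(b_dummy, 3))
```

Under these assumptions the first two estimates should land close to the true value of 1, while the dummy-adjusted estimate should be pulled toward the slope that does not adjust for x2, in line with the "mixture" argument in #3.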

Kind regards,
Carlo
(Stata 19.0)
El son

Join Date: Oct 2020

Posts: 4
#7

06 Oct 2020, 14:27

What if we only had missing values on one of our variables, and that could be relaxed by creating several dummy variables? Could we then somehow run a regression on the full set of observations?
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#8

07 Oct 2020, 01:06

How would you relax that with dummy variables? You have tried in several steps to discuss your method, but none of us understood what you want to do. I think the problem is that you try to be brief. Take your time and describe step by step the procedure you propose.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17711
#9

07 Oct 2020, 01:37

Tracy:
why challenge yourself with methodologically weak approaches (that may well be questioned by any average reviewer) when tons of literature point you to sounder procedures, such as multiple imputation (which Stata supports)?
Besides, have you diagnosed the missing-data mechanism underlying the values that you did not observe?

Kind regards,
Carlo
(Stata 19.0)