Multiple imputation (MICE) and missing values for explanatory scale variables

Niels Henrik Bruun

#1

07 Nov 2018, 04:23

Hi
I have some explanatory scale variables (e.g. with possible values 0 through 10).
They are integers with a limited range.
What is the best way to handle missing values in these?

I started by setting it up with regress, assuming the variable to be continuous.
Then I tried using intreg (censored data) because I found a 2011/2012 presentation doing this.
And now I've found out that there is truncreg (truncated data) as well.
It is the last two in particular that I can't decide between.

I've searched the net without luck.
Obviously, I'm looking in the wrong places.
Does anyone have an answer?

Another question is whether there exists a good book on chained multiple imputation (MICE).

Thank you very much

Kind regards

nhb
Maarten Buis

#2

07 Nov 2018, 04:54

Just to add to the options (rather than reduce them): I sometimes like pmm for such variables. It does a regression, computes predicted values, and uses as imputed value a random observed value from the k observations that are closest with respect to the predicted value. Since it uses actually observed values, it automatically retains the range and any discreteness of that variable.
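
The matching step described above can be sketched in a few lines. This is a simplified illustration in Python, not Stata's implementation: it assumes a single predictor, fits the regression by ordinary least squares, and skips the step in which proper PMM draws the regression coefficients from their posterior before predicting. The names `fit_line` and `pmm_impute` and the toy data are made up for the example.

```python
import random

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pmm_impute(x, y, k=5, rng=None):
    """Return a copy of y in which None entries are filled by predictive mean matching."""
    rng = rng or random.Random()
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([o[0] for o in observed], [o[1] for o in observed])
    # Each donor is (predicted value, actually observed value).
    donors = [(intercept + slope * xi, yi) for xi, yi in observed]
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = intercept + slope * xi
        # Keep the k donors whose predictions are closest to the target prediction.
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        # Impute the observed value of one randomly chosen donor.
        completed.append(rng.choice(nearest)[1])
    return completed

# Toy data (hypothetical): an integer 0-10 scale with three missing values.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [0, 1, None, 3, 4, 4, None, 7, 8, 8, None, 10]
filled = pmm_impute(x, y, k=3, rng=random.Random(1))
print(filled)
```

Because every imputed value is copied from a donor, the result can only contain integers that actually occur in the observed data, which is the property that makes PMM attractive for bounded scale items.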

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Weiwen Ng

#3

07 Nov 2018, 06:35

The variables in question sound like Likert items, and I'd normally default to ordinal logit as the imputation model. Most likely either ordinal logit or PMM will be regarded as valid. Some readers may not have heard of PMM (my main collaborator hadn't), but you could maybe explain that it's a bit like propensity score matching.

Edit: if you use PMM, please be aware that the default is to match to the nearest neighbor. Paul Allison argues that this is a flawed default, and suggests that we reset the default to k = 5 (i.e. in each draw, match randomly to one of the nearest 5 neighbors).

This is just me thinking out loud, so this is optional reading: I'm aware that the Stata manual describes truncated regression as appropriate for imputing variables that are limited in range. I've always found this a bit odd, as it doesn't correspond to the real-life case for truncated regression (that is, when observations above/below a certain value exist in real life, but are not captured in your sample at all). But it is what the manual says, and that part of the manual was written by real statisticians (as opposed to applied statisticians like me). Nonetheless, truncated regression would definitely not retain the discreteness of the data.
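
To make the last point concrete, here is a small Python sketch of what draws from a truncated normal look like. It is only an illustration under assumed numbers (a predicted mean of 6.3, a standard deviation of 2, bounds 0 and 10) and uses simple rejection sampling; it is not how Stata's truncreg-based imputation is implemented. The function name `draw_truncated_normal` is made up for the example.

```python
import random

def draw_truncated_normal(mean, sd, lower, upper, rng):
    """Draw from a normal distribution restricted to [lower, upper] by rejection sampling."""
    while True:
        value = rng.gauss(mean, sd)
        if lower <= value <= upper:
            return value

rng = random.Random(42)
# Hypothetical predicted mean and residual SD for one incomplete case.
draws = [draw_truncated_normal(6.3, 2.0, 0, 10, rng) for _ in range(5)]
print(draws)
```

Every draw respects the 0 to 10 range, but the values are continuous, so the imputed variable no longer consists of the integer scores found in the observed data.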

In principle, interval regression is for cases where the real dependent variable is continuous, but you observe something coded in an interval range (e.g. income from $30k to $40k). When we perform exploratory factor analysis on a bunch of Likert items, best practice is to use the polychoric correlation matrix, which assumes that the observed responses actually stem from a set of normal, latent variables (that are distributed multivariate normal). Taken in that context, interval regression doesn't sound absurd either.
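
As a rough illustration of the kind of data interval regression is meant for, the following Python sketch codes a continuous value into the bracket that would actually be recorded, expressed as a lower and an upper bound. The cutpoints, the incomes, and the helper `to_interval` are invented for the example; an open-ended bracket is marked with `None`.

```python
def to_interval(value, cutpoints):
    """Return (lower, upper) for the bracket containing value; None marks an open end."""
    lower = None
    for cut in cutpoints:
        if value < cut:
            return (lower, cut)
        lower = cut
    return (lower, None)

# Hypothetical income brackets: below 30k, 30k-40k, 40k-50k, 50k and above.
cutpoints = [30_000, 40_000, 50_000]
incomes = [27_500, 34_200, 40_000, 61_000]
brackets = [to_interval(v, cutpoints) for v in incomes]
print(brackets)
```

The analyst sees only the bounds, not the underlying continuous value. Treating a Likert-type score as the recorded bracket of a latent continuous variable follows the same logic.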

Last edited by Weiwen Ng; 07 Nov 2018, 06:38.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Niels Henrik Bruun

#4

09 Nov 2018, 01:45

Thank you both very much for your answers.

Kind regards

nhb