multiple imputation - logit and mlogit imputed values

Eva Graham

Join Date: Oct 2016

Posts: 2
#1

13 Oct 2016, 09:14

Hello,

I am imputing missing categorical data using logit for binary and mlogit for categorical variables.

After running my imputation model, the imputed values are all integers (categories). For example, when I impute the variable female (0 = male, 1 = female), every imputed value is either 0 or 1. This seems strange to me, as I would have expected values between 0 and 1 as the outcome of a logistic regression.

This occurs both with mi impute chained and with the ice package; below is an example of imputing female with each method.

mi impute chained (logit) female = i.hincfel i.emp6 i.marstat i.educ3 agea, add(3) double rseed(10) force noisily
ice female m.hincfel m.emp6 m.marstat o.educ3 agea, gen(miss) m(2)

Having thoroughly searched the mi and ice documentation, I can't find information on how integers / categories are created from logistic or multinomial logistic imputation. For example, does a predicted probability of 0.50 and above generate a 1, while all predictions below 0.50 yield a 0? How does this apply to multinomial logistic regression?

If anyone can point me to documentation explaining this process, I would be very grateful.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

13 Oct 2016, 09:55

Your expectations are incorrect. While the immediate outcome of a logistic regression would be a predicted probability for each observation ranging over the interval from 0 to 1, the imputed values are then selected by Monte Carlo simulation. That is, if the predicted probability for a given observation is 0.25, a random number is drawn from the uniform distribution on the unit interval, and if that number is <= 0.25, the imputed value is set to 1, and otherwise it is set to 0. Analogous considerations apply to multinomial logistic modeling. The idea is that the imputed values should exhibit the kind of variation that one would expect to see in a data set with no missing values. Just setting the imputed value to 0.25 would not accomplish that, as there would be no variation at all.
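To make the draw step concrete, here is a minimal Stata sketch that mimics it by hand on simulated data. It is an illustration under stated assumptions, not the code that mi impute or ice run internally: all variable names (x, female, y3, phat, p1 to p3, u, u3, female_imp, y3_imp) and all data-generating values are made up. As far as I understand the documentation, the official mi impute logit also draws the regression coefficients from their approximate posterior before computing the probabilities; this sketch skips that step and uses the point estimates.

```stata
* Hand-rolled illustration of the draw step; NOT the internal mi impute code.
* All variable names and parameter values below are hypothetical.
clear
set seed 10
set obs 1000
generate x = rnormal()

* --- Binary case ---
generate byte female = rbinomial(1, invlogit(0.5*x))
replace female = . if runiform() < 0.3          // make about 30% missing
logit female x                                  // fit on the observed cases only
predict double phat, pr                         // Pr(female = 1 | x) for all cases
generate double u = runiform()                  // one uniform(0,1) draw per case
generate byte female_imp = female
replace female_imp = (u <= phat) if missing(female)   // 1 if u <= phat, else 0
tabulate female_imp if missing(female)          // only 0s and 1s appear

* --- Three-category case: same idea with cumulative probabilities ---
generate byte y3 = 1 + irecode(runiform(), 0.5, 0.8)  // categories 1, 2, 3
replace y3 = . if runiform() < 0.3
mlogit y3 x
predict double p1 p2 p3, pr                     // one probability per category
generate double u3 = runiform()
generate byte y3_imp = y3
replace y3_imp = cond(u3 <= p1, 1, cond(u3 <= p1 + p2, 2, 3)) if missing(y3)
tabulate y3_imp if missing(y3)
```

In the multinomial case the unit interval is cut into segments whose lengths are the predicted category probabilities, and the category whose segment contains the uniform draw becomes the imputed value.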
Eva Graham

Join Date: Oct 2016

Posts: 2
#3

14 Oct 2016, 08:34

Excellent - that makes sense. Thanks very much.
Martijn Hogerbrugge

Join Date: Feb 2015

Posts: 29
#4

14 Oct 2019, 09:55

Instead of starting a new topic, I decided to post a reply to this related thread.

In my experience, using logistic regressions for binary variables tends to overestimate or underestimate the occurrence of 0s or 1s, depending on the distribution of the binary variable. For instance, when I have a binary variable with 15% of the cases having a value of 1, the mean value of the binary variable is almost always above .15 in the imputed datasets. Conversely, when 80% of the cases of a binary variable have a value of 1, the mean value of the binary variable is almost always below .80 in the imputed datasets. Only when the binary variable has a mean value of around .50 do the means in the imputed datasets look similar (i.e., slightly above or below .50).

Does anyone know why this is the case, and if so, how this can be avoided?
Richard Williams

Join Date: Apr 2014

Posts: 5008
#5

14 Oct 2019, 13:58

I don't know if Martijn's claim is true or not. But if it is, I wonder if this is a regression toward the mean effect. If the observed P is 15%, the imputed P can't go much lower but it can go much higher. Likewise if P = 80%, imputed P can't get much higher but it can go a lot lower.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
daniel klein

Join Date: Mar 2014

Posts: 3859
#6

14 Oct 2019, 14:56

I am not sure I follow this, but I tend to disagree with both Martijn and Richard. The imputed values (should) depend on the values of the predictors alone. We are assuming missing at random and if that assumption is true, then the imputed means (proportions) might differ (by a large extent) in either direction from the observed means (proportions). That is fine. There really is no reason to expect that the imputed means (proportions) should always match the observed means (proportions). Without knowing the true values (e.g., in a simulation study), I do not believe that we can say anything about under- or overestimating means (proportions) or regression to the mean phenomena just by comparing observed and imputed distributions.

Best
Daniel

Last edited by daniel klein; 14 Oct 2019, 15:05.
Richard Williams

Join Date: Apr 2014

Posts: 5008
#7

14 Oct 2019, 15:03

"We are assuming missing at random and if that assumption is true, then the imputed means (proportions) might differ (by a large extent) in either direction from the observed means (proportions)"

If Martijn is correct, though, they consistently differ by being closer to the mean. Of course, his anecdotal observation may not be correct.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
daniel klein

Join Date: Mar 2014

Posts: 3859
#8

14 Oct 2019, 15:09

I have edited my earlier response. I think the key to questions like these is either doing the math (I doubt I could) or running simulations where the true values are known (still pretty tricky, I guess).
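A minimal sketch of the kind of simulation meant here, assuming a single covariate and a binary variable with roughly 15% ones. Everything in it is hypothetical: the variable names (x, y_true, y, miss), the coefficients, the missingness mechanism (MAR given x), and the number of imputations. It only shows how the true values of the deleted cases can be set next to the imputed ones; a real study would repeat this over many replications.

```stata
* Simulation sketch with known truth; all names and values are hypothetical.
clear
set seed 12345
set obs 5000
generate x = rnormal()
generate byte y_true = rbinomial(1, invlogit(-2 + x))   // roughly 15% ones

* MAR: missingness depends only on the fully observed covariate x
generate byte miss = runiform() < invlogit(-1 + x)
generate byte y = y_true
replace y = . if miss

mi set wide
mi register imputed y
mi register regular x
mi impute logit y x, add(20) rseed(2019)

* Truth versus imputations among the cases that were set to missing
summarize y_true if miss
forvalues m = 1/3 {
    summarize _`m'_y if miss        // wide style stores imputation m in _m_y
}

* Observed-case proportion, pooled MI estimate, and full-data truth
summarize y if !miss
mi estimate: mean y
summarize y_true
```

Because missingness here depends on x, and x also drives y, the observed-case proportion and the proportion among the imputed cases need not agree, which is the point made in #6.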

Best
Daniel
Martijn Hogerbrugge

Join Date: Feb 2015

Posts: 29
#9

16 Oct 2019, 02:30

So can I simply ignore the deviations in my imputed data, run my models on the imputed data, and use checks such as Monte Carlo errors to assure myself that the MI results are reliable?
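For reference, a minimal sketch of how Monte Carlo errors can be requested, using the mcerror option of mi estimate on simulated data. The variable names (x, y, z), the coefficients, and the analysis model are assumptions chosen only to make the example self-contained; they are not taken from my actual data.

```stata
* Monte Carlo error sketch; data and model are hypothetical.
clear
set seed 2019
set obs 2000
generate x = rnormal()
generate byte y = rbinomial(1, invlogit(-2 + x))
generate byte z = rbinomial(1, invlogit(-0.5 + 0.8*y + 0.5*x))
replace y = . if runiform() < invlogit(-1 + x)    // MAR given x

mi set wide
mi register imputed y
mi register regular x z
mi impute logit y x z, add(20) rseed(2019)       // include the outcome z

* Pooled estimates with Monte Carlo error estimates reported alongside
mi estimate, mcerror: logit z y x
```

The Monte Carlo errors indicate how much the pooled results would change if the imputation were rerun with a different seed; they say nothing about whether the imputation model itself is appropriate.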
Martijn Hogerbrugge

Join Date: Feb 2015

Posts: 29
#10

21 Oct 2019, 10:24

Anyone?
daniel klein

Join Date: Mar 2014

Posts: 3859
#11

21 Oct 2019, 11:30

Well, I would not ignore the deviations; I would think about the deviations. Conditional on the covariates, what would you expect the true but missing values to be? Do the deviations that you observe confirm your expectations?

Concerning the question of whether MI results are reliable: What is the alternative? Would that alternative be more reliable? Why?

Best
Daniel
