Imputation of missing data

Carole Khairallah

Join Date: Jun 2014

Posts: 19
#1

06 Mar 2017, 09:24

Hello all,

I am looking for some guidance on how to impute categorical variables, both nominal and ordinal (up to 7 categories).
Overall, fewer than 2% of the values in my dataset are missing. These variables are included in a principal component analysis to obtain a socioeconomic status score, which will be used as a covariate in my final model.

1) Is there a threshold below which simple imputation is preferred over multiple imputation? I imagine that when the proportion of missing data is very low, one can do simple imputation, but I can't find any reference that supports the statement "when data has low missing values". Or is multiple imputation the only method used nowadays?

2) First, I wanted to do a simple (or single?) imputation by randomly assigning values according to the distribution of the original variable (please see the commands below, where SESWall indicates the wall material):

Code:

gen rand = uniform()
tab SESWall, nolabel

            |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        171       17.34       17.34
          2 |        103       10.45       27.79
          3 |        117       11.87       39.66
          4 |        595       60.34      100.00
------------+-----------------------------------
      Total |        986      100.00

gen drawwall = cond(rand <.17, 1, cond(rand<.28, 2, cond(rand<.40, 3, 4)))
gen iSESWall=SESWall
replace iSESWall=drawwall if iSESWall==.

Is it a good option? I think it is similar to a mean substitution.
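
A minimal sketch of the same random draw, using the exact cumulative counts from the table above (171, 274 and 391 out of 986) instead of rounded cutoffs. The seed value and the names u and iSESWall2 are assumptions chosen for illustration; runiform() is the current name of the uniform random-number function.

```stata
// Fix the seed so the random draws can be reproduced (12345 is arbitrary)
set seed 12345
gen double u = runiform()

// Start from the observed values, then fill in only the missing ones
gen iSESWall2 = SESWall
replace iSESWall2 = cond(u < 171/986, 1, cond(u < 274/986, 2, cond(u < 391/986, 3, 4))) if missing(SESWall)

// Compare the filled-in variable with the original, showing missing values
tab iSESWall2 SESWall, missing
```

Note that a draw like this reproduces the marginal distribution of SESWall only; it does not use any information from the other variables.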

3) Using the multiple imputation command mi impute chained (mlogit) would allow me to do a multivariate imputation, including the outcome variable (as recommended) and other auxiliary variables. Is a multiple imputation with m=1 equivalent to a simple imputation? I would use only the imputed dataset m=1 (and not m=0 with the observed data).
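
A minimal sketch of what the chained-equations setup described in 3) might look like. The variable names SESFloor, outcome and age are hypothetical placeholders, and the number of imputations and the seed are arbitrary assumptions; this is not a recommended specification for this dataset.

```stata
// Declare the mi storage style and register the variables
mi set wide
mi register imputed SESWall SESFloor     // variables with missing values
mi register regular outcome age          // complete variables used as predictors

// mlogit for the nominal variable, ologit for the ordinal one;
// the outcome and an auxiliary variable enter on the right-hand side
mi impute chained (mlogit) SESWall (ologit) SESFloor = outcome age, add(20) rseed(12345)

// Fit the analysis model on each imputed dataset and pool the results
mi estimate: regress outcome i.SESWall i.SESFloor age
```

Replacing add(20) with add(1) would create a single imputed dataset, which is the m=1 case asked about above.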

Many thanks,
Carole
Richard Williams

Join Date: Apr 2014

Posts: 5008
#2

06 Mar 2017, 19:57

If there is very little missing data any kind of imputation may not gain you much and may not be worth the trouble. But if you are going to impute at all I personally would use multiple imputation. I would use listwise deletion before I used single imputation.

Incidentally, Wisconsin has some very good materials on multiple imputation:

http://www.ssc.wisc.edu/sscc/pubs/stata_mi_intro.htm

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
EMAIL: [email protected]
WWW: https://www3.nd.edu/~rwilliam
daniel klein

Join Date: Mar 2014

Posts: 3859
#3

06 Mar 2017, 23:14

This seems not to be the main issue here, but I would think about whether a PCA really makes sense for categorical variables that are not ordered, no matter how you impute the missing values. It will also not necessarily be straightforward to run this kind of analysis with imputed data. Maybe an option would be to create the score based on complete cases and then impute the missing scores?
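
A rough sketch of that idea: compute the score on complete cases, then impute the missing scores. All variable names (wall2, wall3, wall4, floor, water, outcome, age) are hypothetical, and the imputation model is only an assumption for illustration.

```stata
// PCA uses only observations that are complete on all listed variables
pca wall2 wall3 wall4 floor water, components(1)

// Score on the first component; it is missing wherever an input is missing
predict ses_score, score

// Impute the continuous score with a linear regression imputation model
mi set wide
mi register imputed ses_score
mi impute regress ses_score = outcome age, add(20) rseed(12345)

// Use the imputed score as a covariate in the final model
mi estimate: regress outcome ses_score age
```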

Best
Daniel
Carole Khairallah

Join Date: Jun 2014

Posts: 19
#4

07 Mar 2017, 04:30

Thank you both for your comments.

Richard: I understand that imputation is a lot of trouble, but every case is important for my final analysis, and I can't afford to reduce my sample size. The complete case analysis will be done anyway.

Daniel: You are raising a good option, but I couldn't find any reference on the subject that would recommend imputing the final score rather than the original individual variables. Moreover, I imagined that PCA would take into account the "multivariability" that simple imputation cannot handle by definition. I'll have a look, especially because the score is a continuous variable.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

07 Mar 2017, 04:33

If you use gsem, then considering that you have less than 2% missing values, you may be able to fit a model without imputing any data at all. According to the Stata manual:

gsem’s method ML is sometimes able to use more observations in the presence of missing values than can sem’s method ML. Meanwhile, gsem does not provide the MLMV method provided by sem for explicitly handling missing values.
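
A minimal sketch of what such a model could look like, assuming three hypothetical ordinal indicators (wall, floor, water) and a latent variable called SES. As far as I know, gsem by default drops observations equation by equation rather than listwise, which is why it can use more of the data, but please check the manual for your version.

```stata
// One latent SES factor measured by ordinal indicators;
// latent variable names in gsem must start with a capital letter
gsem (SES -> wall floor water, ologit)

// Predicted latent score, which could serve as the SES covariate
predict ses_latent, latent
```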

Best regards,

Marcos