  • Problem with Multiple Imputation

    Hi all, I am (seemingly) having some trouble with my imputation model. I am trying to impute missing values for cancer and demographic variables such as stage, grade, receptor status, and deprivation.

    I am using multiple imputation with chained equations and, as far as I am aware, am specifying the model for each imputed variable correctly. All the separate models converge, but when I compare the imputed values with the observed values after imputing, I am getting some wild differences between the two. I know there is no formal test to ascertain whether this is a problem, but the differences between the observed and imputed values concern me.

    Can anyone point me in a sensible direction so I can go about correcting my imputation model?

    Many thanks.

  • #2
    Can anyone with any experience with MI please give me a hand? Apologies for the double post.



    • #3
      It is hard to say much with the info given. I suppose you could start by showing what these wild differences are, e.g. do the means of variables after imputation differ greatly from those before?
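      For instance, a rough sketch of that comparison (assuming the mi data are in memory with ten imputations; var1 and var2 are placeholder variable names):

      Code:
      * means in the original data (m=0) versus in the completed datasets (m=1 to 10)
      mi xeq 0: summarize var1 var2
      mi xeq 1/10: summarize var1 var2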
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      Stata Version: 17.0 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Originally posted by Richard Williams
        It is hard to say much with the info given. I suppose you could start by showing what these wild differences are, e.g. do the means of variables after imputation differ greatly from those before?
        Hi Richard,

        Apologies for not specifying my imputation model and commenting on the differences. Variables I imputed included stage, grade, ER status, PR status, public/private status of the healthcare facility, deprivation, and urban/rural status (my research project is concerned with medication use and cancer outcomes). I figured ordinal logistic models would make sense for stage, grade, and deprivation, and I used logit for the others. Explanatory variables included all my main confounders as well as death status and cumulative hazard (as recommended), so my final imputation model looks as follows:

        Code:
        mi impute chained (ologit) stage_grp (ologit) newgrade (logit) er (logit) pr (logit) pubprinew (ologit) deprivation (logit) urban = bcdeathnew HT timetofbetablockerfirstyear i.catdateofdiag i.newage i.ethnicn i.regnew ///
        i.lvi i.her2new i.screendetectednew i.surgery_radio i.chemo_hor_bio timetofnsaidfirstyear timetofaceifirstyear timetofarbfirstyear timetofstatinfirstyear c3score_allsites, add(10) rseed(1000) noisily

        The 'timeto...' variables represent certain medication use in the year after cancer diagnosis (binary yes/no), and most of the others are standard clinical/demographic variables.

        Code:
        foreach var of varlist stage_grp newgrade er pr pubprinew deprivation urban {
            mi xeq 0: tab `var'
            mi xeq 1/10: tab `var' if miss_`var'
        }
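        (The miss_* variables in the loop are missingness indicators. A minimal sketch of one way such flags can be created, before the data are mi set, assuming they are simple flags for originally missing values:)

        Code:
        * flag observations that are originally missing on each variable to be imputed
        foreach var of varlist stage_grp newgrade er pr pubprinew deprivation urban {
            gen byte miss_`var' = missing(`var')
        }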

        [Attached screenshots of the observed vs imputed tabulations: image_13966.png, MI er.PNG, MI pubpri.PNG, MI deprivation.PNG, MI urban.PNG]

        Here are the differences between the observed and imputed data for each variable I imputed (for the first 2 imputations). Some are probably fine, but there appear to be large differences between the observed and imputed data for grade, er, and urban in particular.

        Any advice would be appreciated.

        Cheers, Oliver
        Last edited by Oliver Scott; 02 Apr 2019, 22:05.



        • #5
          [Attached screenshot: MI pr.PNG]

          PR here, which seems fine IMO.



          • #6
            Screen shots are difficult to read and many people won't even try. Code tags are better. See pt. #12 in the FAQ.

            Having said that, it is not obvious to me that there are problems. Perhaps the cases where differences seem large are also cases where you would expect larger differences, e.g. if some group has relatively large amounts of missing data and those people also tend to differ on the variables you are imputing.

            I think I might first try to identify the characteristics of those who tend to have missing data vs those who don't and see if any major differences stand out.
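            For example, a rough sketch of such a check, using the thread's miss_* flags as the missing-data indicators and a few covariates as illustrations:

            Code:
            * compare characteristics of cases with vs without missing grade
            tab er miss_newgrade, col chi2
            tab deprivation miss_newgrade, col chi2
            tab urban miss_newgrade, col chi2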
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            Stata Version: 17.0 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam



            • #7
              Originally posted by Oliver Scott

              ...

              Here are the differences between the observed and imputed data for each variable I imputed (for the first 2 imputations). Some are probably fine, but there appear to be large differences between the observed and imputed data for grade, er, and urban in particular.
              ...
              I think that in your first post, you were saying that the distributions of your variables in the complete data are pretty different from the distributions in the imputed data. That is consistent with the data not being missing completely at random. The people with missing variables are different from the complete cases, e.g. they are less likely to have positive ER. I agree with Richard that nothing really stands out as being erroneous.

              That said, a very minor point is that your code could have been a bit more concise, e.g.

              Code:
              mi impute chained (ologit) stage_grp  newgrade deprivation (logit) er pr  pubprinew urban = bcdeathnew HT timetofbetablockerfirstyear i.catdateofdiag i.newage i.ethnicn i.regnew ///
              i.lvi i.her2new i.screendetectednew i.surgery_radio i.chemo_hor_bio timetofnsaidfirstyear timetofaceifirstyear timetofarbfirstyear timetofstatinfirstyear c3score_allsites, add(10) rseed(1000) noisily
              This affects nothing substantively, but you do type less. Also, about the number of imputations, the last general guide I recall hearing was that you should have about 1 imputation per percent of data missing in the variable with the most missing. I don't know if the guidelines have got more stringent.
              Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

              When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



              • #8
                Originally posted by Richard Williams
                Screen shots are difficult to read and many people won't even try. Code tags are better. See pt. #12 in the FAQ.

                Having said that, it is not obvious to me that there are problems. Perhaps the cases where differences seem large are also cases where you would expect larger differences, e.g. if some group has relatively large amounts of missing data and those people also tend to differ on the variables you are imputing.

                I think I might first try to identify the characteristics of those who tend to have missing data vs those who don't and see if any major differences stand out.
                Hi Richard. Thanks for the reply. I examined how cases with and without missing data on each variable differ on the other variables, and came up with the following findings:

                Slight difference in ER between missing/not missing grade.
                Slight difference in PR between missing/not missing grade.
                Very slight difference in PR between missing/not missing deprivation.
                Very slight difference in PR between missing/not missing urban.
                Very slight difference in public/private between missing/not missing urban.
                Slight difference in deprivation between missing/not missing grade.
                Slight difference in deprivation between missing/not missing ER.
                Slight difference in deprivation between missing/not missing PR.
                Slight difference in deprivation between missing/not missing public/private.
                Very slight difference in urban between missing urban/not missing grade.
                Slight difference in urban between missing/not missing public/private.



                Do you think these differences would be enough to drive the differences between the observed and imputed data for the variables I mentioned above?


                Last edited by Oliver Scott; 03 Apr 2019, 23:18.



                • #9
                  Originally posted by Weiwen Ng

                  I think that in your first post, you were saying that the distributions of your variables in the complete data are pretty different from the distributions in the imputed data. That is consistent with the data not being missing completely at random. The people with missing variables are different from the complete cases, e.g. they are less likely to have positive ER. I agree with Richard that nothing really stands out as being erroneous.

                  That said, a very minor point is that your code could have been a bit more concise, e.g.

                  Code:
                  mi impute chained (ologit) stage_grp newgrade deprivation (logit) er pr pubprinew urban = bcdeathnew HT timetofbetablockerfirstyear i.catdateofdiag i.newage i.ethnicn i.regnew ///
                  i.lvi i.her2new i.screendetectednew i.surgery_radio i.chemo_hor_bio timetofnsaidfirstyear timetofaceifirstyear timetofarbfirstyear timetofstatinfirstyear c3score_allsites, add(10) rseed(1000) noisily
                  This affects nothing substantively, but you do type less. Also, about the number of imputations, the last general guide I recall hearing was that you should have about 1 imputation per percent of data missing in the variable with the most missing. I don't know if the guidelines have got more stringent.
                  Thanks mate, appreciate it. Even if my missing data isn't missing completely at random, am I still able to use multiple imputation to impute missing values? I guess I am able to do so, with the caveat that the imputed values are likely to be less accurate than if the data were missing completely at random?

                  In terms of the number of imputations, the variable with the most missing data in my dataset is deprivation, which has 23% missing data. Do you think this calls for 23 imputations? Perhaps I could run more imputations and see if it produces different results.
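                  If more imputations are wanted, a minimal sketch of topping up the existing 10 to roughly 25 (add() appends new imputations to those already present; the rseed value below is just an illustrative new seed, different from the original rseed(1000)):

                  Code:
                  * same specification as before, adding 15 further imputations (10 + 15 = 25)
                  mi impute chained (ologit) stage_grp newgrade deprivation (logit) er pr pubprinew urban ///
                      = bcdeathnew HT timetofbetablockerfirstyear i.catdateofdiag i.newage i.ethnicn i.regnew ///
                      i.lvi i.her2new i.screendetectednew i.surgery_radio i.chemo_hor_bio timetofnsaidfirstyear ///
                      timetofaceifirstyear timetofarbfirstyear timetofstatinfirstyear c3score_allsites, ///
                      add(15) rseed(2000) noisily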
                  Last edited by Oliver Scott; 03 Apr 2019, 23:17.



                  • #10
                    Originally posted by Oliver Scott
                    Even if my missing data isn't missing completely at random, am I still able to use multiple imputation to impute missing values?
                    Yes. Weiwen's point is that you would not expect the distribution of the imputed values to mirror that of the observed values when your data are MAR. In fact, MAR implies that the (unconditional/marginal) distribution of the missing values is indeed different from that of the observed values; these differences are assumed to be due to differences in other variables, and only conditional on those variables should the distribution of the missing values be close to the distribution of the observed values.

                    I guess I am able to do so, with the caveat that the imputed values are likely to be less accurate than if the data was missing completely at random?
                    I am not sure that is correct. What do you mean by "less accurate"? If your data were MCAR, your only concern would be power, and if that were not a problem, you would not even go for MI.

                    In terms of the number of imputations, the variable with the most missing data in my dataset is deprivation, which has 23% missing data. Do you think this calls for 23 imputations?
                    If I were following rules of thumb, I would look at the FMI (fraction of missing information), reported after mi estimate, and multiply it by 100 to get the number of "required" imputations.

                    Perhaps I could run more imputations and see if it produces different results.
                    That is usually a good idea.
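                    For the FMI rule of thumb above, a minimal sketch (the logit of bcdeathnew below is only a placeholder for the actual analysis model):

                    Code:
                    * the variance-information table reports the FMI per coefficient;
                    * the largest FMI x 100 gives a rough target for the number of imputations
                    mi estimate, vartable: logit bcdeathnew i.stage_grp i.newgrade i.er i.pr i.deprivation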

                    Best
                    Daniel
                    Last edited by daniel klein; 04 Apr 2019, 02:49.



                    • #11
                      After multiple imputation, when our data are binary (for example, 0 and 1), the imputed values come out as decimals or negative numbers. How can I use these values in my analysis?
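                      (A minimal sketch of the usual fix, assuming the binary variable was imputed with a linear (regress) model: declaring it with (logit) in mi impute chained, as in the code earlier in this thread, keeps the imputed values at 0 or 1. Variable names below are placeholders.)

                      Code:
                      * impute a 0/1 variable with a logit model so imputations stay binary
                      mi impute chained (logit) mybinary (regress) mycontinuous = x1 x2 i.x3, add(10) rseed(1234)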

