MI chained impute error and force option

Kate Ennis

Join Date: Jan 2018

Posts: 15
#1

MI chained impute error and force option

04 Jan 2018, 06:54

Hi there,

I am trying to use mi impute chained to fill in missing quality of life data at 5 different time points (discharge, 3 months, 6 months, 12 months and 18 months). There is some missing data at every time point. I was trying to use a number of variables to help determine the missing values, such as age, gender and other measures of quality of life. However, some of these other measures also have some missing data. When I run the commands, I get the following message:

EQ6: missing imputed values produced
This may occur when imputation variables are used as independent variables or when independent variables contain missing values. You can specify option force if you wish to proceed anyway.

Does this mean I can only use variables for which there is complete data? And, so that I can see what the data look like when it is forced, how do I specify the force option mentioned in the message?

The code I have used is below. Any advice would be greatly appreciated, firstly on whether my code is correct!

mi set flong
mi misstable patterns
mi register imputed EQdis EQ3 EQ6 EQ12 EQ18
mi register regular age gender gcs modrandis_e3_1_c8 mrs_3months mrs_12months gosdis gos3months gos12months
mi impute chained (pmm,knn(5)) EQdis EQ3 EQ6 EQ12 EQ18 = gender age gcs modrandis_e3_1_c8 mrs_3months mrs_12months ,add(5) rseed(5)

Thanks
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#2

04 Jan 2018, 07:56

Kate,

In general, I think many health services researchers would caution against imputing your primary outcome (which I assume is the EQ-5D index here?). That said, if you must proceed, it's likely that you've got some missing data in your predictor variables. You should simply put those predictors on the left hand side of the equation. For example, say you have a few missing values on age and gender. In the revised code, age and gender are registered as imputed and moved to the left of the = sign, and add() is raised to 20:

Code:

mi set flong
mi misstable patterns
mi register imputed EQdis EQ3 EQ6 EQ12 EQ18 age gender
mi register regular gcs modrandis_e3_1_c8 mrs_3months mrs_12months gosdis gos3months gos12months
mi impute chained (pmm, knn(5)) EQdis EQ3 EQ6 EQ12 EQ18 gender age = gcs modrandis_e3_1_c8 mrs_3months mrs_12months, add(20) rseed(5)

Some side notes: 1) I've generally seen recommendations to add 20-30 imputations, rather than just 5.

2) Your data look like they're in wide format, and I think most analyses in Stata are much better done in long format. You'd want a time variable taking on, for example, values of 0, 3, 6, 12, and 18.

3) Are your data monotone missing, i.e. is it that people drop out and their data are henceforth always missing, e.g. people make the 3 month measurement, then they always have 6, 12, and 18 missing? Or is it that participants make some but not all of the follow-up measurements without any clear dropout pattern, e.g. some people are seen at discharge, 12, and 18 months (missing months 3 and 6)? If you have monotone missing (the former case), then I'd check the syntax for mi impute monotone as well (I haven't used it personally, so can't offer advice on the syntax easily).

4) You may have thought of this already. But if this is indeed the EQ-5D survey, then if you have each of the individual question responses (instead of the index score), it's perhaps worth checking to see if your missingness is caused by a missing response to one or two of the questions, rather than the person not making the entire follow-up session. If so, it's better to impute at the item level, then use -mi passive- to generate the index score. That said, I don't recall that the EQ-5D survey has a lot of item non-response (as opposed to unit non-response), so this may not help much.

5) It really helps if you post your code in the code delimiters. Look for the # button on your formatting toolbar when you write a post (it's between a double quote button and a <> button).

6) That said, you posting your exact code is really helpful to people trying to help you. So thanks!
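On point 3, here is a minimal sketch of how the missingness pattern could be checked before choosing between mi impute chained and mi impute monotone. It uses the variable names from the original post and assumes the data have already been mi set. The monotone call is untested on these data, assumes gender and age are complete, and is left commented out.

```stata
* Tabulate the missing-value patterns across the five EQ-5D time points
mi misstable patterns EQdis EQ3 EQ6 EQ12 EQ18

* Report whether the missingness is nested, i.e. monotone
mi misstable nested EQdis EQ3 EQ6 EQ12 EQ18

* Only if the pattern turns out to be monotone (assumption: gender and age complete):
* mi impute monotone (pmm, knn(5)) EQdis EQ3 EQ6 EQ12 EQ18 = gender age, add(20) rseed(5)
```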

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Clyde Schechter

Join Date: Apr 2014

Posts: 30191
#3

04 Jan 2018, 08:25

I agree with and strongly endorse everything Weiwen has said in #2 with, ironically, the exception of:

2) Your data look like they're in wide format, and I think most analyses in Stata are much better done in long format. You'd want a time variable taking on, for example, values of 0, 3, 6, 12, and 18.

While I am accustomed to writing long rants on this Forum about why the long layout is better than the wide for most things in Stata, this is one of the things I regard as an exception. It is entirely reasonable to want to impute missing values of EQ at one time from observed values of EQ by the same respondent at other times. Indeed, it is entirely possible, even likely, that the MAR assumption could only be met by including those other observed EQ values in the imputation model. And including them can only be done if the EQ variables are in wide layout.

That said, once they have been imputed, it is probably wise to -mi reshape long- before proceeding to analysis.
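A sketch of that workflow, assuming the imputation has already been run in wide layout and that the data contain a patient identifier, here called patid (a hypothetical name, not shown in the thread). Because mi reshape long expects a numeric suffix by default, EQdis is first renamed to EQ0. The mixed model is only a placeholder for whatever analysis is actually intended.

```stata
* Give the discharge measure a numeric suffix so all five share the stub EQ
mi rename EQdis EQ0

* Reshape the imputed data from wide to long: one row per patient per time point
mi reshape long EQ, i(patid) j(month)

* Placeholder analysis on the long data, combined across imputations
mi estimate: mixed EQ i.month || patid:
```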
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

04 Jan 2018, 08:47

Excellent point. I stand corrected!

Kate Ennis

Join Date: Jan 2018

Posts: 15
#5

04 Jan 2018, 09:04

Thank you very much for your response, I really appreciate it!

Firstly, yes, you are correct that this is the EQ-5D index. I have realised that I have missing data in the gcs score and the modified Rankin scores. However, if I place these on the left hand side, will that mean they are no longer being used to predict the missing values of the EQ-5D? And as it currently is, will EQ-5D scores at 3 months not be used to help impute a missing value for, say, the 6 month EQ-5D?

1) Thank you very much for your comments regarding the number of imputations. I was planning on increasing this once I had done a test run, but it is good to get the feedback.

2) Sorry, I thought I had set my data to long format using mi set flong. Is this correct?
One issue I have noticed when playing around with the data, just to see what it is doing, is that for, say, the EQ-5D index at 18 months, it only imputes missing values using values observed at 18 months among those with complete data at this time point. As I have small numbers, this leaves only a small number of options, and it does not seem to take into account scores from the earlier time points. Is this anything to do with how the data is set? Sorry if this is not very clear.

3) The data is not all monotone missing, this is the case for a few but it is also random for the majority (ie. may have eq5d at discharge missing but have other timepoints).

4) Yes, I looked at the missing individual domain scores when I began, but this was only the case for a couple of patients. I think one of the issues I am having is very small patient numbers. There was actually more missing data for the EQ-5D scores, but I have excluded those patients who had EQ-5D scores missing at more than two of the 5 time points, as it didn't seem like I should be trying to predict more than two missing time points for each patient. So I am now only working with a set of 27 patients, all of whom had EQ-5D responses for at least 3 time points. I understand this is a very small sample, but the sample is from a rare disease area so it is hard to get large patient numbers.

5) Thank you very much for this advice, I will make sure to use delimiters in future posts. Just to check, do you mean to use // to signify the end of a line?

Lastly, if I were to use the force option as the message mentions, how do I go about doing this? I am unsure if it is a particular command and can't seem to find much information on it.

Thanks again for your help, it is greatly appreciated

Last edited by Kate Ennis; 04 Jan 2018, 09:45.
Kate Ennis

Join Date: Jan 2018

Posts: 15
#6

04 Jan 2018, 09:05

And apologies I have only just noticed the additional comments that have been made above
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#7

04 Jan 2018, 11:43

As Clyde mentioned, ignore my long vs wide comment for now. That said, the mi style (flong, wide, etc.) is independent of whether the base data are in wide or long layout. I was talking about reshaping your base data, but it turns out it's best not to do that for now.

To your last point, I don't believe you need to use the -force- option at all. You almost certainly don't want imputed values that are themselves missing. The error message warned you that your code, as written, produced some missing imputed values. This happens either because one of the variables to be imputed was also used as a regular variable (all regular variables must be complete), or because your regular variables had some missing values. If your code is presented exactly as you typed it, then it can't be the former. It has to be that one of your regular variables has some missing data. If that's the case, MI will handle it: just declare the culprit as an imputed variable. To be clear on nomenclature, all imputed variables are on the left of the = sign, and all regular variables are on the right:

Code:

mi impute chained (some method) imputed_variables = regular_variables

When you impute, you are actually using data from all the variables, both imputed and regular: every observed EQ-5D score, plus every observed value of age, gender, and whatever else you have. You're just telling Stata that everything on the right side of the equation (the regular variables) has complete data.
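To see which right-hand-side variable is the culprit, something along these lines could be run before imputing. The variable names are those from the original post. The last, commented-out command only illustrates where the force option would go (it is an option of mi impute itself, not a separate command); it is not a recommendation to use it.

```stata
* List the intended regular predictors that contain missing values
mi misstable summarize age gender gcs modrandis_e3_1_c8 mrs_3months mrs_12months

* Any variable reported above should be registered as imputed and moved to the
* left of the = sign. For completeness, force is given after the comma like
* any other option:
* mi impute chained (pmm, knn(5)) EQdis EQ3 EQ6 EQ12 EQ18 = gender age gcs, add(5) rseed(5) force
```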

You may already know this, but you can even specify a different imputation method for each variable (any univariate method that mi impute supports, which is a fair number of models):

Code:

mi impute chained (pmm, knn(5)) EQ* (regress) age (logit) gender = regular_variables

But PMM seems very reasonable for the EQ-5D score, and I am pretty sure it will cover you for every other variable type that could be missing.

Kate Ennis

Join Date: Jan 2018

Posts: 15
#8

05 Jan 2018, 02:38

Thank you very much for your very clear explanation Weiwen, that has been really helpful for someone new to using MI.

One other query I have is that I also have other outcome data collected at the same time points as the EQ-5D (the VAS score collected as part of the EQ-5D questionnaire, if you are familiar with this), and there is less missing data on this than on the EQ-5D scores, as it is easier to complete. As this is collected at the same time point and is likely to be highly related to the missing EQ-5D score at that time point, is there any way that Stata can know this? Please let me know if this isn't clear.

Thank you both again for your comments and advice
Kate Ennis

Join Date: Jan 2018

Posts: 15
#9

05 Jan 2018, 02:48

Originally posted by Kate Ennis View Post

I think one of the issues I am having is very small patient numbers. There was actually more missing data for the EQ-5D scores, but I have excluded those patients who had EQ-5D scores missing at more than two of the 5 time points, as it didn't seem like I should be trying to predict more than two missing time points for each patient. So I am now only working with a set of 27 patients, all of whom had EQ-5D responses for at least 3 time points. I understand this is a very small sample, but the sample is from a rare disease area so it is hard to get large patient numbers.

Also, do you think it is OK that I have reduced the data set to only those patients with less missing data, or by doing so am I excluding information that could give the imputed values more options to draw on?

Sorry for the multiple questions
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#10

05 Jan 2018, 07:58

Kate,

I'm glad this is helpful. When I first tried using MI, I made a few blunders here and there, because everything is harder in MI. Hopefully this can aid you in clarifying what is going on!

I'm familiar with the Visual Analog Scale. If anyone reading is interested, the EQ-5D is a health-related quality of life (HRQoL) measurement scale. It has 5 ordinal questions, each representing a domain of HRQoL. The 5 questions are preference weighted, so you feed them into a formula to obtain a person's utility. It also has a visual analog scale (VAS), where you mark your health on a thermometer (0 to 100). The VAS isn't used in the preference weights, and I don't think it's used widely. People interested in what it measures may wish to read some work by Nancy Devlin or her colleagues. I haven't extensively read up on the VAS, but I know that Devlin has said that it appears to be measuring something somewhat different from the index score. I recall it's definitely correlated with the index score, but not strongly. In any case, most people typically focus on the index score.

To include the VAS in the imputation routine, simply register it as an imputed variable, then add it to the left side of the imputation command, e.g.:

Code:

mi register imputed EQdis EQ3 EQ6 EQ12 EQ18 age gender VASdis VAS3 VAS6 VAS12 VAS18
mi register regular gcs modrandis_e3_1_c8 mrs_3months mrs_12months gosdis gos3months gos12months
mi impute chained (pmm, knn(5)) EQdis EQ3 EQ6 EQ12 EQ18 VASdis VAS3 VAS6 VAS12 VAS18 gender age = gcs modrandis_e3_1_c8 mrs_3months mrs_12months, add(20) rseed(5)

I am assuming that the VAS scores are also incomplete. You'll be imputing the missing VAS scores alongside the missing EQ-5D index scores. Then you can ignore the VAS scores entirely, or not; it's up to you. But adding it to the imputation equation is the only way for Stata to know that it matters. Stata uses only the variables that appear in the imputation command, on either side of the = sign.

Edit: I realized this after writing the above: is the problem that the subjects missed observations entirely, or that they have the VAS or the EQ index score missing at random at each observation? If the problem is the subjects missing observations altogether, then the VAS score won't help you (I think).

Your second question is harder to answer. I am usually very hesitant to impute an outcome variable. Having the person not make follow up does introduce irrecoverable bias. But, while I think this opinion is well-founded, it's my opinion.

But, you do have multiple measures at 4 time points post discharge. Things do depend on the pattern of missingness, which you would presumably discuss in a paper. Excluding people with 2 missing measures strikes me as OK in a vacuum (I can't see your pattern of missingness). I would probably recommend excluding people with a missing discharge EQ-5D index score - since you can't measure their change in score at all. Last, I'd assume you are trying to estimate total QALYs over 18 months. If the number of missing responses at 18 months are too large, then it would be worth reconsidering that goal and focusing on 12 months.
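If total QALYs are the eventual target, one common approach (an assumption here, not something specified in this thread) is the area under the utility curve by the trapezoidal rule. Below is a sketch with time measured in years, assuming the EQ variables have been imputed in wide layout as above. The variable name qaly18 is hypothetical, and the /// line continuation only works in a do-file.

```stata
* Trapezoidal area under the EQ-5D curve; interval widths are
* 0.25, 0.25, 0.5 and 0.5 years (discharge-3m, 3m-6m, 6m-12m, 12m-18m)
mi passive: generate qaly18 = 0.25*(EQdis + EQ3)/2 + 0.25*(EQ3 + EQ6)/2 ///
    + 0.5*(EQ6 + EQ12)/2 + 0.5*(EQ12 + EQ18)/2

* Mean QALYs over 18 months, combined across imputations
mi estimate: mean qaly18
```

Because qaly18 is created with mi passive, it is computed separately within each imputation, so the uncertainty from the imputed EQ values carries through to the estimate.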

A lot also depends on the standard of practice where you are. There may be different tolerances for imputing data. In my opinion, it's worth finding someone who has more extensive experience doing this, and asking them in person. If there isn't anyone like that at your institution, ask at conferences - hopefully you'll be presenting your work.

Last edited by Weiwen Ng; 05 Jan 2018, 08:23.

Kate Ennis

Join Date: Jan 2018

Posts: 15
#11

09 Jan 2018, 02:51

Hi again Weiwen, and thanks again for your helpful comments.

I will be looking to estimate the total QALYs at some point, so yes, that is a good idea about the 18 month goal. I will look back at the data and see how much is missing at this time point.

However, at this stage my main aim is to have a dataset in which I can observe the EQ-5D index scores of individuals over the time horizon, so I wanted to create one complete dataset in which the missing EQ-5D data have been imputed based on the available data we have. I was hoping that once I had done the MI, I would be able to combine all the imputed datasets into one final complete dataset, in which the individuals who had a missing value now have a single value obtained from the MI. I was hoping that I would be able to use 'mi estimate', which would use Rubin's rules to combine them. However, I have noticed that if I do

Code:

mi estimate: mean eq5ddis

this will give the mean score over all individuals for that time point. However, is it possible for me to get the mean EQ-5D score at each time point for each individual? The same applies when I generate QALYs: I will want to generate each individual's total QALYs and not the mean QALY score over all patients.

I hope I have worded my query properly and this makes sense. I am trying to find someone I can speak to in person about this again but have found your comments to be most helpful.
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#12

09 Jan 2018, 08:20

Hmm ...

The fundamental motivation behind imputation is that we're uncertain about the actual value of something, because it's missing. For people with missing data, there is no EQ-5D score at some time points. There is a possible distribution of EQ-5D scores (which implies a mean and a variance, estimated from the imputation process). I know some cost effectiveness analysis, and I know that it's common to bootstrap a cost-effectiveness acceptability curve around this time. The problem is, when you bootstrap, you need to make sure you account for that uncertainty, e.g. by telling Stata to bootstrap the MI estimate of the ICER. I've only bootstrapped a CEA curve once, and that was without MI estimation. Here's a link to some discussion on bootstrapping CEA curves in Stata.

Nonetheless, if you must manually calculate each individual's MI-estimated EQ-5D score for each time point, then I would convert the data to the wide mi style and calculate from there. In that style, each imputed copy of a variable is stored with a prefix, something like _1_EQdis. You can use -egen- to calculate values. The code should look something like this (not vetted, because my MI data are usually in flong):

Code:

mi convert wide, clear
foreach eq in EQdis EQ3 EQ6 EQ12 EQ18 {
    * row mean across the imputed copies _1_EQdis, _2_EQdis, ...
    egen `eq'_mi = rowmean(_*_`eq')
}

You could then replace the missing values in the original values if you wish. Again, I would not recommend doing this if you are bootstrapping a CEAC.
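If the data are kept in flong style instead, a similar per-person average can be sketched using the system variables that mi maintains: _mi_m (the imputation number, 0 for the original data) and _mi_id (the original observation identifier). Like the code above, this is unvetted, and it discards the between-imputation uncertainty.

```stata
foreach eq in EQdis EQ3 EQ6 EQ12 EQ18 {
    * mean of each person's values over imputations 1..M (m = 0 holds the original data)
    egen `eq'_mi = mean(cond(_mi_m > 0, `eq', .)), by(_mi_id)
}
```

For a person whose value was observed, every imputation holds that same observed value, so the average simply returns it.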

My first response to your query was literally "hmm...". This is a complex issue. I'd really meet with someone in person who has more experience in CEA.

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17748
#13

09 Jan 2018, 08:53

Dear All,
reading about so many health economists dealing with the economic evaluation of health care programmes, I cannot help joining the party.
The following textbook covers many of the issues raised in the original post (https://global.oup.com/academic/prod...?lang=en&cc=it).
The webpages of the same source include Stata code for calculating the 95% CI of the ICER according to Fieller's method and for creating parametric CEACs.
Another inspiring source is the older textbook by Andy Briggs and colleagues (https://global.oup.com/academic/prod...?lang=en&cc=it), with worked examples concerning Monte Carlo simulations, CEAC, the cost-effectiveness acceptability frontier (CEAF) and the Expected Value of Perfect Information (EVPI) (both decisional and for parameters).

Last edited by Carlo Lazzaro; 09 Jan 2018, 08:55.

Kind regards,
Carlo
(Stata 19.0)