  • Filling in Data gap using multiple imputation

    Hello all,

    I have never had the opportunity to deal with handling missing data so I'd appreciate any guidance. Thank you in advance.

    I have a dataset with 10 variables and 399 observations. Gender, Q51 and PTSD are binary. Age, Race, Geography, Year of Residency and Q50 are categorical with more than 3 categories each.
    The main goal of my project is to see whether age, gender, race/ethnicity, geography, year of residency, Q50, and Q51 differ between the PTSD-positive and PTSD-negative groups.

    Since this survey data has some missing information, I need to fill in the data gaps first. I am thinking of doing multiple imputation. What is the best way to do this in Stata? I got as far as -mi set-ting the dataset, registering variables as imputed or regular, and setting the number of imputations to 5, but then got confused about the choice of imputation method. How should I approach this? Also, once I have the imputed dataset, is it valid to run univariate analyses such as chi-squared tests? Almost all the examples talk about fitting a predictive model, so I was curious whether simple analyses can be done on an imputed dataset or not.

    The description of missing data is as follows.
        Variable |  Missing    Total    Percent Missing
    -------------+---------------------------------------
    UniqueIden~r |        0      399        0.00
    Residency_~e |        0      399        0.00
    Gender       |       10      399        2.51
    Age          |        3      399        0.75
    Race         |       19      399        4.76
    Geography    |        4      399        1.00
    Year_Resid~y |        0      399        0.00
    Q50          |        0      399        0.00
    Q51          |        1      399        0.25
    PTSD         |        5      399        1.25
    -------------+---------------------------------------


    Example dataset:
    input int UniqueIdentifier long(Residency_Type Gender Age Race) float Geography long Year_Residency float Q50 long Q51 float PTSD
    1 1 2 3 4 4 1 2 1 0
    3 1 1 6 3 1 2 3 2 0
    4 1 1 2 3 1 2 3 2 0
    5 1 2 2 2 3 4 2 2 1
    6 1 2 6 3 4 1 2 1 0
    7 1 2 2 1 1 2 3 2 0
    8 1 1 2 3 1 1 3 2 0
    9 1 1 2 3 1 1 3 2 1
    13 1 2 2 3 4 1 2 1 0
    14 1 2 2 2 1 4 2 1 1

    end

    label values Residency_Type Residency_Type
    label def Residency_Type 1 "Surgery-General", modify
    label values Gender Gender
    label def Gender 1 "Female", modify
    label def Gender 2 "Male", modify
    label values Age Age
    label def Age 2 "25 - 29", modify
    label def Age 3 "30 - 34", modify
    label def Age 6 "55 - 59", modify
    label def Age 7 "60 - 64", modify
    label values Race Race
    label def Race 1 "African American", modify
    label def Race 2 "Asian/Pacific Islander", modify
    label def Race 3 "Caucasian", modify
    label def Race 4 "Hispanic/ Latino", modify
    label def Race 5 "Other", modify
    label values Geography Geography
    label def Geography 1 "Northeast", modify
    label def Geography 2 "Midwest", modify
    label def Geography 3 "South", modify
    label def Geography 4 "West", modify
    label values Year_Residency Year_Residency
    label def Year_Residency 1 "PGY 1", modify
    label def Year_Residency 2 "PGY 2", modify
    label def Year_Residency 4 "PGY 4", modify
    label def Year_Residency 5 "PGY 5", modify
    label values Q50 Q50
    label def Q50 2 "3 to 5", modify
    label def Q50 3 "6 to 10", modify
    label def Q50 4 "11 to 20", modify
    label values Q51 Q51
    label def Q51 1 "Community", modify
    label def Q51 2 "University", modify
    label values PTSD PTSD
    label def PTSD 0 "Negative (0, 1 or 2)", modify
    label def PTSD 1 "Positive (3 or 4)", modify

    Thanks,
    PA.


  • #2
    Hi all,

    I have been trying different commands to do the imputation, but to no avail; I keep getting error messages. It would be really helpful if anyone could point me in the right direction. Thanks.


    Some of the code that I tried and the error messages generated are as follows:

    . mi set wide

    . mi register regular Year_Residency Q50

    . mi register imputed Gender Age Race Geography Q51 PTSD

    . mi impute chained (logit) Gender (mlogit) Age (mlogit) Race (mlogit) Geography (mlogit) Q51 (logit) PTSD = Year_Residency Q50, add (20) rseed (1234)
    outcome does not vary; remember:
    0 = negative outcome,
    all other nonmissing values = positive outcome
    -- above applies to specification (logit ) Gender = Year_Residency Q50

    r(2000);


    . mi impute chained (logit) Gender (mlogit) Age (mlogit) Race (mlogit) Geography (mlogit) Q51 (logit) PTSD = i.Year_Residency i.Q50, add (20) rseed (1234)
    outcome does not vary; remember:
    0 = negative outcome,
    all other nonmissing values = positive outcome
    -- above applies to specification (logit ) Gender = i.Year_Residency i.Q50
    r(2000);

    . mi impute logit Gender Age Race Geography Q51 PTSD Year_Residency Q50, add(5) rseed(1234)
    note: variables Age Race Geography Q51 PTSD registered as imputed and used to model variable Gender; this may cause some observations to be omitted from the estimation and may lead to missing
    imputed values
    outcome does not vary; remember:
    0 = negative outcome,
    all other nonmissing values = positive outcome
    r(2000);



    • #3
      What Stata is telling you is that the variable Gender is not a suitable outcome for logit, at least not the way you have it coded. The outcome of a -logit- model in Stata must be coded as 0 for negative and non-zero for positive. Usually this is done as 0 vs 1 coding. I suspect your Gender variable is coded as 1 vs 2 or some other scheme. As an outcome for a -logit-, 1 vs 2 coding means no variation because 1 and 2 are both non-zero.

      Check how your gender variable is coded.

      (Alternatively, it is possible that you have it coded 0 vs 1 and all values for which Year_residency and Q50 are non-missing are all the same. That would be surprising, but it is theoretically possible.)
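      If the variable is indeed 1/2 coded, one fix is to build a 0/1 indicator before registering it for imputation. A minimal sketch, assuming the 1 = Female / 2 = Male coding from the posted labels (the name Gender_F is just illustrative):

      ```stata
      * Recode a 1 = Female / 2 = Male variable into a 0/1 indicator
      * suitable as a -logit- outcome; -recode- carries missings through
      recode Gender (1 = 1 "Female") (2 = 0 "Male"), generate(Gender_F)
      label variable Gender_F "Female (1) vs Male (0)"

      * Cross-tabulate against the original to verify the recode
      tabulate Gender_F Gender, missing
      ```

      The same applies to any other 1/2-coded binary (e.g. Q51) before it is used as a -logit- outcome in -mi impute chained-.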



      • #4
        Thank you.

        I fixed the coding of my binary variables to 0 and 1. I ran the following code and got the error message below. Is there an easy way to fix it? I am not trying to do any complex modeling: I have a dataset with 10 variables and <5% missing information, and I am basically trying to do univariate analysis to see whether variables such as gender, race, etc. differ between the PTSD-negative and PTSD-positive groups. I just need to fill in the missing data first. Is there an easy way to do this? Also, if I am comparing between PTSD groups, is it recommended during imputation to keep complete cases of the dependent variable and impute only the independent variables? Thank you once again for the guidance.


        . mi impute chained (logit, augment) Gender_F (mlogit, augment) Age (mlogit, augment) Race (mlogit, augment) Geography (logit, augment) Q51_Community (logit, augment) PTSD = Year_Residency Q50, add (5) rseed (1234)

        Conditional models:
        Q51_Community: logit Q51_Community i.Age i.Geography i.PTSD i.Gender_F i.Race Year_Residency Q50 , augment
        Age: mlogit Age i.Q51_Community i.Geography i.PTSD i.Gender_F i.Race Year_Residency Q50 , augment
        Geography: mlogit Geography i.Q51_Community i.Age i.PTSD i.Gender_F i.Race Year_Residency Q50 , augment
        PTSD: logit PTSD i.Q51_Community i.Age i.Geography i.Gender_F i.Race Year_Residency Q50 , augment
        Gender_F: logit Gender_F i.Q51_Community i.Age i.Geography i.PTSD i.Race Year_Residency Q50 , augment
        Race: mlogit Race i.Q51_Community i.Age i.Geography i.PTSD i.Gender_F Year_Residency Q50 , augment

        Performing chained iterations ...
        convergence not achieved
        convergence not achieved
        mlogit failed to converge on observed data
        error occurred during imputation of Gender_F Age Race Geography Q51_Community PTSD on m = 1
        r(430);



        • #5
          Well, imputation by chained equations involving logit models can be very frustrating indeed. There could be several reasons why you are not getting convergence, and it would take a long time to go through them all and resolve them. And even then, sometimes you still can't get convergence.

          Several people on this Forum have recommended using -ice- (net describe st0067_4, from(http://www.stata-journal.com/software/sj9-3)) for these problems. I have only limited experience with -ice-, but it did rescue me from a very recalcitrant problem, similar to yours, on the one occasion I used it.

          Added: If you are successful in getting your imputed data sets with -ice-, you will need to run -mi import ice- to convert it to a Stata MI data set so that you can then use -mi estimate-.
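          Roughly, the workflow would look like the sketch below; the filename, number of imputations, and seed are illustrative, not prescriptive:

          ```stata
          * Install -ice- from the Stata Journal archive (version in the link above)
          net install st0067_4, from(http://www.stata-journal.com/software/sj9-3)

          * Impute by chained equations with -ice-, saving the imputed copies
          ice Gender_F Age Race Geography Year_Residency Q50 Q51_Community PTSD, ///
              m(10) seed(1234) saving(imputed, replace)

          * Load -ice-'s output and convert it to Stata's mi format
          use imputed, clear
          mi import ice, automatic

          * Analysis commands can now be prefixed with -mi estimate-
          mi estimate, or: logit PTSD i.Gender_F
          ```

          -ice- also accepts a cmd() option to force a particular model for a given variable (e.g. mlogit for a multi-category variable), which may be needed here.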



          • #6
            Thank you Clyde. I will definitely explore -ice- and see if I can solve the problem.

            Also, for my project the specific questions given to me were as follows:

            1. Demographic variations (age, sex, ethnicity) between positive vs. negative for PTSD.
            2. Geographical variations between positive vs. negative for PTSD.
            3. Year of residency between positive vs. negative for PTSD.
            4. Q50 between positive vs. negative for PTSD.
            5. Q51 between positive vs. negative for PTSD.

            Since I was getting the convergence issues, I deleted the 5 observations with missing PTSD (the dependent variable) and then imputed the variables age, sex, race, geography, and Q51 individually, based on the complete variables PTSD, year of residency, and Q50. The command I used was: mi impute logit Gender i.Year i.Q50 i.PTSD, add(5) rseed(1234) augment
            When I impute one variable at a time based on the 3 non-missing variables, the imputation runs. Then I basically just did "mi estimate, or: logit PTSD Variable" to see whether any ORs are significant. Since I am not trying to build a model, is this way of using -mi impute- in Stata valid?

            Thank you for your guidance. I sincerely appreciate it.
            Last edited by Priyanka Acharya; 26 Oct 2016, 17:11.



            • #7
              Well, first let's consider this abstractly. The purpose of multiple imputation is to reduce the bias in estimation that is associated with having missing values of model variables. In order for the process to actually reduce bias, the underlying assumption is that the missing data are missing at random. While there are settings in which this is a reasonable assumption, my impression from my experience as an epidemiologist is that it is seldom plausible when working with clinical data. While nowadays relatively few people seek to conceal their age, the failure to respond to a demographic question (particularly race or ethnicity) is typically informative, and not random even when conditioned on other available information. I don't know what Q50 and Q51 are, but if they are responses to items on some psychometric instrument, I would say there is a good chance that their missingness is, again, informative and not random conditional on other available information. Of course, this is always a matter of conjecture, as the missing at random assumption can never be directly verified in the existing data. But one really should ponder the mechanisms that generate missingness in the data and question whether missingness at random is credible.

              Editorial comment: there is increasing pressure from reviewers and editors in clinical journals to use multiple imputation as if it were some panacea for missing data. This is an unfortunate tendency and is leading to more and more mindless application of the technique in circumstances where its underlying assumption of missingness at random is untenable. If after thinking about how the missing data came to be missing you conclude that a random mechanism is unlikely, I would urge you to resist the pressure to use MI and opt for either accepting the bias that comes with complete case estimation only, or doing a robustness analysis instead of MI.

              By removing the 5 observations with missing values of PTSD before running multiple imputation you have at least partially defeated the purpose of multiple imputation. If you are going to try your luck with -ice- to see if you can get convergence, I strongly urge you to restore those 5 observations to your data set. The whole point is to reduce bias, not introduce more bias! Other than that, your approach seems reasonable as far as it goes. Again, not knowing what Q50 and Q51 are, it is hard for me to comment, but, again, if these are questions on a psychometric instrument, it would surprise me to learn that they are not associated with things like age, sex, and ethnicity, and maybe with geography as well. If that is the case, then odds ratios obtained from crude univariate logistic regression may be biased by confounding. So if that is the case, your results should be presented with plenty of cautions about that.
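              If confounding is the concern, one option is to report the crude odds ratios alongside a covariate-adjusted model. A sketch using the variable names from the posted data (which of the covariates to adjust for is a substantive judgment, not something this code decides):

              ```stata
              * Crude (unadjusted) odds ratio for one exposure at a time
              mi estimate, or: logit PTSD i.Gender_F

              * The same exposure adjusted for the other covariates, for comparison
              mi estimate, or: logit PTSD i.Gender_F i.Age i.Race i.Geography ///
                  i.Year_Residency i.Q50 i.Q51_Community
              ```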



              • #8
                Thank you for the input.
                Q50 is "How many residents are in your program?"; the options are categories such as 1-5, 6-10, and so on.
                Q51 is "What is the setting of your program?"; the answer is binary, community or university.
                I will explore the -ice- option too.
                Thanks.



                • #9
                  I was able to impute the dataset with -ice-. Thank you for the guidance.



                  • #10
                    Hi,

                    I was able to impute the dataset with the -ice- command. I used the original dataset (i.e., I did not delete any observations with missing PTSD, as mentioned in my previous post). Then, when I ran the estimate command, I ran across the following error message.

                    Stata command: ice Gender_F Age Race Geography Year_Residency Q50 Q51_Community PTSD, cmd(Age:mlogit) seed(1234) m(10) saving(imp10, replace)

                    Stata command: mi import ice, automatic
                    (32 m=0 obs. now marked as incomplete)

                    . mi varying
                    Possible problem variable names
                    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                    imputed nonvarying: (none)
                    passive nonvarying: (none)
                    unregistered varying: (none)
                    *unregistered super/varying: (none)
                    unregistered super varying: (none)
                    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                    * super/varying means super varying but would be varying if registered as imputed; variables vary only where equal to soft missing in m=0.

                    . mi convert wide, clear

                    . mi estimate: logit PTSD i.Gender_F i.Age i.Race i.Geography i.Year_Residency i.Q50 i.Q51_Community
                    estimation sample varies between m=1 and m=4; click here for details
                    r(459);
                    When I click it, the following is shown:

                    Estimation sample varies across imputations

                    There is something about the specified model that causes the estimation sample to be different between imputations. Here are several situations when this can happen:

                    1. You are fitting a model on a subsample that changes from one imputation to another. For example, you specified the if expression containing imputed variables.

                    2. Variables used by model-specific estimators contain values varying across imputations. This results in different sets of observations being used for completed-data analysis.

                    3. Variables used in the model (specified directly or used indirectly by the estimator) contain missing values in sets of observations that vary among imputations. Verify that your mi
                    data are proper and, if necessary, use mi update to update them.

                    A varying estimation sample can lead to biased or less efficient estimates. We recommend that you evaluate the differences in records leading to a varying estimation sample before
                    continuing your analysis. To identify the sets of observations varying across imputations, you can specify the esampvaryok option and save the estimation sample as an extra variable in
                    your data (in the flong or flongsep styles only) by using mi estimate's esample() option.


                    Note about a varying estimation sample with mi estimate using

                    mi estimate checks for a varying estimation sample during estimation and stores the result in e(esampvary_mi) equal to 1 if e(sample) varies and 0 otherwise. If saving(miestfile) is used
                    with mi estimate, the varying-sample flag e(esampvary_mi) is also saved to miestfile. mi estimate using checks that flag and displays a warning message if its value is 1. Thus mi estimate
                    using displays the warning message even if you are consolidating results from a subset of imputations for which the estimation sample may be constant; you can suppress the message by
                    specifying the nowarning option.

                    To check whether the estimation sample changes for the selected subset of imputations, you can use mi estimate to refit the model on the specified subset. You can also save the estimation
                    sample as an extra variable by using mi estimate's esample() option during estimation.

                    I did try . mi estimate, esampvaryok: logit PTSD i.Gender_F i.Age i.Race i.Geography i.Year_Residency i.Q50 i.Q51_Community and got the logit model, but it has the following warning at the bottom:
                    Warning: estimation sample varies across imputations; results may be biased. Sample sizes
                    vary between 390 and 391.

                    Am I on the right track? What else should I try?
                    Thank you for your guidance.






                    • #11
                      Here's my best guess as to what is happening. In logistic regression models, observations are omitted for two reasons:

                      1. Missing values of a model variable. This is true in any regression model, but does not apply here because you have imputed all missing values.

                      2. Unique to logit/probit type models, one or more observations may be dropped if there is a predictor variable that perfectly predicts the outcome. For example, if there is some value of Geography which is always associated with outcome = 1, then the indicator for that value of Geography will be omitted from the model and all observations having that value of Geography will be excluded from the estimation sample.

                      Even if condition 2 doesn't arise in the original data, it could occur in an imputed dataset, just due to the chance assignment of values to replace missings. If that happens, then you end up with varying estimation samples. Given that your estimation sample sizes vary only between 390 and 391, I doubt this is going to seriously bias your analyses. But if you want to pursue the problem, follow the advice that Stata gave you, specifically:
                      To identify the sets of observations varying across imputations, you can specify the esampvaryok option and save the estimation sample as an extra variable in
                      your data (in the flong or flongsep styles only) by using mi estimate's esample() option.
                      Then you can identify the particular imputations that have only 390 observations in the estimation sample. If you run -logit- separately on each of those (without the -mi estimate- prefix), Stata will tell you why it is dropping observations, and it will name the variable(s) causing the problem. Then you have a choice: you can remove the problematic variables from your model, or, if they are important, you can exclude the problematic imputations from your MI estimation.
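                      One convenient way to run the model within each imputation is -mi xeq-, which executes a command on each completed dataset in turn. A sketch, assuming 10 imputations as in the -ice- call above:

                      ```stata
                      * Run -logit- separately within each of the 10 imputations;
                      * for any imputation with a smaller sample, Stata's output
                      * will name the variable that perfectly predicts the outcome
                      * and report how many observations were dropped
                      mi xeq 1/10: logit PTSD i.Gender_F i.Age i.Race i.Geography ///
                          i.Year_Residency i.Q50 i.Q51_Community
                      ```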



                      • #12
                        Thank you.

