Multiple Imputation, clustered data, multiple time points

Delphina Gomes

Join Date: Nov 2014

Posts: 35
#1


20 Apr 2015, 09:16

Hello everyone.

I have a long clustered data set with anthropometric data (height and weight) at different time points. It is possible that there are missing values in height and/or weight because the person did not come on a follow-up visit and I need to impute them (50% missing values in each of the two variables).

I have never done imputation before and I tried doing the following:

Code:

*STEP 1: Setting the dataset
mi set mlong

*STEP 2: Registering variables (where znr, pnr = identification variables; description2 = age; gcal_new ze_new zf_new zk_new = nutrient intake)
mi register imputed height weight
mi register regular znr pnr description2 gcal_new ze_new zf_new zk_new gender

*STEP 3: Checking the imputation model, i.variable=indicator variable
regress height i.description2 i.gender
rvfplot, yline(0)
regress weight i.description2 i.gender
rvfplot, yline(0)

*STEP 4: Imputation
mi impute regress weight description2 gender, add(5) rseed(1234)
mi impute regress height description2 gender, add(5) rseed(1234)

My questions are:

1. The main outcome of the analysis is binary (yes/no for being a false reporter). This outcome variable is dependent on the height and the weight of a person. I am not really sure whether I should calculate this outcome variable after imputation or do it before the imputation and later replace it. I am also not sure how to do the replace if I have several imputations.

2. I used linear regression for continuous variables (regress) as the imputation procedure. Will it be a problem if I do multilevel modelling (using MLwiN) in the complete dataset?

3. Is it possible to check if the imputation was done correctly or not (continuous variable)?

4. All the imputed values for both height and weight are generated under the original file. Is it possible to match the imputed values?

5. Lastly, I came across REALCOM impute. Correct me if I am wrong: it is a package that can be installed in Stata so that another piece of software can be used to impute data. Imputation can be done successfully without using REALCOM as well.

Thank you in advance.
JoeSchmidt

Join Date: Jan 2015

Posts: 4
#2

21 Apr 2015, 14:43

Some suggestions:

1. If I understand correctly, it sounds like what you need to do is impute height and weight, and passively impute your binary responder variable, which is a function of the continuous variables. So calculate it after imputation.

2. In principle, you can do multilevel modelling, but you need to respect the imputed nature of the dataset - there will be variation within and between your imputed datasets. This is a developing area of research - you might find some helpful leads in, e.g., M Gomes, K Díaz-Ordaz, R Grieve, MG Kenward, "Multiple imputation methods for handling missing data in cost-effectiveness analyses that use data from hierarchical studies: an application to cluster randomized trials", Medical Decision Making, 0272989X13492203.

3. There is probably no single way to check if the imputation was done "correctly", but sensible things to do would be to make sure that ranges are meaningful (e.g. no one with a negative height), and that the distribution is what you expect, e.g. not all clustered around some extreme value. The literature will suggest more formal methods.

4. Not sure what this means - perhaps you're interested in a comparison between original and imputed values. The m=0 dataset will give you your original data (see http://www.ats.ucla.edu/stat/stata/s..._stata_pt2.htm)
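
Points 1, 2 and 4 above could be sketched in Stata roughly as follows. This is only an illustration on simulated data: the toy values, the bmi formula (which assumes height in cm and weight in kg), the cutoff of 14 used for the flag, and the analysis model are all assumptions made for the example, not the poster's actual definition of a false reporter or the intended analysis.

```stata
* Toy clustered data (simulated; stands in for the real file)
clear
set seed 2015
set obs 30
generate int znr = _n                          // cluster id
generate double u = rnormal(0, 3)              // cluster-level effect
expand 10
bysort znr: generate int pnr = _n              // person id within cluster
generate byte gender = runiform() < 0.5
generate byte description2 = 6 + floor(4*runiform())   // age, 6 to 9 years
generate double height = 95 + 5.5*description2 + 2*gender + u + rnormal(0, 5)
generate double weight = -25 + 0.4*height + rnormal(0, 3)
replace height = . if runiform() < 0.3
replace weight = . if runiform() < 0.3
drop u

* Impute height and weight (this simple model ignores the clustering)
mi set mlong
mi register imputed height weight
mi register regular znr pnr description2 gender
mi impute mvn height weight = description2 gender, add(5) rseed(1234)

* Point 1: derive variables after imputation, in every imputed data set
mi passive: generate double bmi = weight / (height/100)^2
mi passive: generate byte flag = bmi < 14 if !missing(bmi)   // hypothetical rule

* Point 4: m=0 holds the original data, m=1 the first imputation
mi xeq 0 1: summarize height weight bmi

* Point 2: fit a multilevel model in each data set and pool the results
* (placeholder model; substitute the real analysis model)
mi estimate: mixed height i.gender description2 || znr:
```

Using `mi passive` keeps the derived variables consistent across all imputations, so there is no need to replace anything by hand when there are several imputed data sets.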
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

21 Apr 2015, 14:55

The variables involved being heights and weights of people, -mi impute mvn- seems like a reasonable alternative here and might be a bit simpler.

By the way, even if you stick with your current approach, I think using the same rseed() value in both imputation steps will introduce an unintended occult correlation between imputed values of height and weight beyond what would be implied by their relationships to each other and to the other variables in your data. No rseed() specification is needed after the first time.

Specifying add(5) twice is also, I believe, incorrect. The second -mi impute- command should use the -replace- option instead, thereby causing the imputed heights to be placed in the same imputed data sets as the imputed weights.
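
A minimal sketch of the two corrections just described (seed set once, -replace- on the second command), run on simulated data. The toy values are assumptions for illustration only; the variable names follow the original post.

```stata
* Toy data (simulated; stands in for the real file)
clear
set seed 2015
set obs 300
generate byte gender = runiform() < 0.5
generate byte description2 = 6 + floor(4*runiform())   // age, 6 to 9 years
generate double height = 95 + 5.5*description2 + 2*gender + rnormal(0, 5)
generate double weight = -25 + 0.4*height + rnormal(0, 3)
replace height = . if runiform() < 0.3
replace weight = . if runiform() < 0.3

mi set mlong
mi register imputed height weight
mi register regular description2 gender

* First variable: create 5 imputed data sets and set the seed once
mi impute regress weight i.description2 i.gender, add(5) rseed(1234)

* Second variable: fill in the SAME 5 data sets; no add(), no new seed
mi impute regress height i.description2 i.gender, replace

* The summary should report M = 5 rather than 10
mi describe
```

With add(5) on both commands, the second call would instead create five further data sets in which height is imputed but weight is still missing.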

I am confused by OP's question 1. If the outcome variable is simply a deterministic function of height and weight, then why would you bother using it as the dependent variable in a statistical model that includes height and weight? If what you mean is that the outcome variable (false reporter) is associated with both height and weight, and you want to use height and weight, among other things, as predictors of outcome in a model, then you definitely should not passively impute outcome from height and weight. It should be registered as an imputed variable, and its -mi impute- command should include height and weight (and anything else that's relevant). [Note: There are some who think that dependent variables should not be imputed, and that cases with missing values for the dependent variable should just be omitted from analysis. I won't take a position on that here.]

Last edited by Clyde Schechter; 21 Apr 2015, 15:08.
Delphina Gomes

Join Date: Nov 2014

Posts: 35
#4

22 Apr 2015, 00:27

Thank you Joe and Clyde for your great answers.

This time, I tried two types of imputation: one with truncreg and the other with mvn.

I tried chained mi but it gave me implausible imputed values (across different ages of children, and some negative as well). Therefore, I used mean ranges of both height and weight (age and gender specific) from my complete cases and ran the truncreg mi. Main code as follows:

Code:

mi set wide
mi register imputed height weight
mi register regular description2 gender
mi impute truncreg height = i.description2 gender, add(5) rseed(1234) ll(height_min) ul(height_max)
mi impute truncreg weight = i.description2 gender, replace ll(weight_min) ul(weight_max)

Result: This gives me plausible imputed values and fair enough kernel densities.

I also wanted to try mvn mi and I ran the following code:

Code:

mi impute mvn height = description2 gender, add(5) rseed(1234) saveptrace(extrace_mvn, replace) burnin(10)
mi impute mvn weight = description2 gender, replace saveptrace(extrace_mvn, replace) burnin(10)

Result: It does not give me plausible imputed values (if I see individual imputed values for height). For weight, I have 23 negative imputed values.

My questions are:

1. I am totally confused as to which method to opt for imputation. I have tried chained, pmm, truncreg and mvn. In my opinion, truncreg gave plausible results but is it OK to put in age (description2 in data) and gender in the mi command (because the range set is anyway according to age and gender)?

2. What is wrong with mvn code?

Thank you once again.

Last edited by Delphina Gomes; 22 Apr 2015, 00:30.
Robert Lunn

Join Date: Nov 2014

Posts: 4
#5

22 Apr 2015, 06:36

The first question that comes to my mind is whether the presence of missing data is random or systematic. If the presence of missing data is not random, using an imputation process may result in biased estimates. One quick way to check this is to run a discriminant analysis on the key variables with the grouping code being missing or not missing on height/weight. If you see significant differences (for example on gender), that would complicate any imputation procedure. If there are significant differences present, that raises some interesting questions that might merit further study.

Given the amount of missing data present, I suggest you run the analysis twice: once for cases with no missing data, and once for cases with missing data that was imputed. If the results are not the same, you will need to do further investigation, as a non-equivalent result suggests that the two samples are fundamentally different, or that the logic behind the imputation process is flawed, or both.

With respect to whether the imputation was done correctly, there are many ways to check your results. Imputed data should have similar means and variation to the non-imputed data. If they do not, the differences need to be examined. You should also check to ensure that known patterns of co-variation are not significantly modified in your imputed sample. For example, weight and height are usually correlated. If the pattern of correlations of weight and height within gender and age groups is similar for non-imputed and imputed data, that suggests that the imputation process has preserved that signal. If the correlations are significantly different, then the imputation process that was applied will need to be examined in more detail.
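
The checks described in this paragraph might look something like the following in Stata. It is a sketch on simulated data; the toy values and the indicator names miss_h and miss_w are assumptions introduced for the example, and the imputation model is just one possible choice.

```stata
* Toy data (simulated; stands in for the real file)
clear
set seed 2015
set obs 300
generate byte gender = runiform() < 0.5
generate byte description2 = 6 + floor(4*runiform())   // age, 6 to 9 years
generate double height = 95 + 5.5*description2 + 2*gender + rnormal(0, 5)
generate double weight = -25 + 0.4*height + rnormal(0, 3)
replace height = . if runiform() < 0.3
replace weight = . if runiform() < 0.3

* Flag the originally missing values BEFORE imputing
generate byte miss_h = missing(height)
generate byte miss_w = missing(weight)

mi set mlong
mi register imputed height weight
mi register regular description2 gender miss_h miss_w
mi impute mvn height weight = description2 gender, add(5) rseed(1234)

* Means and spread: observed values (m=0) vs. imputed values only (m=1)
mi xeq 0: summarize height weight
mi xeq 1: summarize height if miss_h
mi xeq 1: summarize weight if miss_w

* Count out-of-range values in every imputation
mi xeq 1/5: count if weight < 0 | height < 0

* Height-weight correlation within gender, observed (m=0) vs. completed (m=1)
mi xeq 0 1: correlate height weight if gender == 0
mi xeq 0 1: correlate height weight if gender == 1
```

The same `mi xeq` pattern can be repeated within age groups by adding a condition on description2.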

You did not mention if your study is in an academic field. In my experience, different fields of study have different opinions about imputing large amounts of data, where large means half of the sample.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

22 Apr 2015, 09:18

First, with respect to the -mi impute mvn-, there are two issues. One is that it makes more sense to impute height and weight together so that the imputation accounts for the correlation between height and weight. Thus, a single command:

Code:

mi impute mvn height weight = description2 gender, add(5) rseed(1234) saveptrace(extrace_mvn, replace) burnin(10)

The other issue is that I was assuming that the heights and weights in question were those of adults: in that case the numbers would be large enough that you would seldom, if ever, encounter negative imputed values. But if these are children, then negative imputed values are possible.

As for the implausibility of such values, the statistical theory behind multiple imputation in no way requires that the imputed values be plausible, or even lie within the range of admissible values for the variables in the real world. Some statisticians will say that you should just use the imputations, regardless of whether the imputed values "make sense" or not. Intuitively, of course, that advice is hard to follow. But the theorem that shows the de-biasing effect of multiple imputation and Rubin's rules requires no assumptions about the sensibility of the imputed values. I'm not going to take a position on that here. But I thought you should be aware of that.

Robert Lunn's advice on other ways to look at the suitability of the imputed values is quite reasonable. And he is also correct that in some disciplines, a study that imputes 50% of the data would simply be dismissed out of hand. Of course, again, the theorem that shows the de-biasing effect of multiple imputation and Rubin's rules does not in any way exclude that situation. Once more, it is strong intuition vs mathematical statistics.

I disagree, however, with what Robert Lunn says in his first paragraph. Multiple imputation and Rubin's rules apply when the data are missing at random (MAR). MAR is defined as: the missingness of a value is independent of its true (unobserved) value, conditional on all of the observed data. It does not matter whether the gender distribution of observations with missing values on something else differs from that of observations with non-missing values on that something else. All that would imply is that gender must be included in the imputation model. As long as, separately among males and females, missingness is independent of the unobserved true value, the data would still be MAR.

In fact, there is no way to statistically test the MAR assumption within the data you are analyzing. The only way one could test that would be through some sort of external data that went back and actually observed the missing values. Of course, if you had such data, you would simply use it directly and not do MI! So the MAR assumption is an act of faith, one predicated on a mechanistic understanding of how the missingness arose in the data.
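
What can be examined within the data is only whether observed variables predict missingness, which bears on what belongs in the imputation model and not on whether MAR holds. A rough sketch on simulated data, where missingness in weight is made to depend on gender by construction; the toy values and the indicator name miss_w are assumptions for the example:

```stata
* Toy data (simulated); weight is missing more often when gender == 1
clear
set seed 2015
set obs 300
generate byte gender = runiform() < 0.5
generate byte description2 = 6 + floor(4*runiform())   // age, 6 to 9 years
generate double height = 95 + 5.5*description2 + 2*gender + rnormal(0, 5)
generate double weight = -25 + 0.4*height + rnormal(0, 3)
replace weight = . if runiform() < cond(gender == 1, 0.5, 0.2)

* Which variables have missing values, and how many
misstable summarize height weight

* Do observed covariates predict missingness?
generate byte miss_w = missing(weight)
tabulate gender miss_w, row
logit miss_w i.gender description2

* A gender difference is a reason to keep gender in the imputation model,
* not evidence against MAR
mi set mlong
mi register imputed weight
mi register regular height description2 gender miss_w
mi impute regress weight height i.description2 i.gender, add(5) rseed(1234)
```

Nothing in this output says whether missingness also depends on the unobserved weights themselves; that part remains a judgment about how the missingness arose.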

In your case, missingness arose, as I understand it, through missed clinic appointments. So you need to ponder whether missing a clinic appointment is likely to be related to the patient's actual (unobserved) height and weight. In most situations, I would assume it isn't related: missed clinic appointments relate primarily to things like scheduling conflicts, transportation difficulties, forgetting, etc. And those things probably are independent of height and weight. But one can imagine a scenario where this is not true. If, for example, the clinic is, specifically, a weight loss clinic, those who are not losing weight might feel embarrassed or frustrated and choose to skip the clinic visit as a result. In that case, missingness would be rather strongly related to the unobserved height and weight values and MAR would fail. (Which, in turn, would mean that using MI may not remove bias.)
Robert Lunn

Join Date: Nov 2014

Posts: 4
#7

22 Apr 2015, 18:09

Clyde, I said that it's appropriate to use multiple imputation if the data IS missing at random (MAR). I believe the implementation of all multiple imputation procedures is based on the underlying assumption that the data is missing at random (MAR).

With respect to the usefulness of impossible values from imputation procedures, I would argue it depends on what you want to do with the data. For example, if a researcher were planning to do a cluster analysis, they would want to avoid impossible values as it's likely that a clustering procedure would isolate impossible values.
