Data transformation after multiple imputation

Maxwell Akanbi

Join Date: Nov 2019

Posts: 2
#1

15 Nov 2019, 21:31

Hi Statalist,

I am new to the forum and to multiple imputation. I plan to run a Cox regression model, but three covariates (World Health Organization clinical stage, CD4 T-cell count and HIV viral load) have missing values. I have used the mi command to impute missing values for the three variables.
I have the following questions:
1. When I ran the sum command for CD4 T-cell count and HIV viral load after imputing the missing values, I noted that the newly computed variables include negative numbers, which is odd. Can this be prevented?

sum baseline_cd4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
baseline_cd4 |    100,777    267.6011    226.9528  -498.9758       1580

2. For the Cox regression, I plan to transform both CD4 T-cell count and HIV viral load. I plan to analyze CD4 T-cell count in increments of 100 cells (gen b_cd4_100 = baseline_cd4/100) and categorize HIV viral load (egen viralload0_cat = cut(rnavload_0), at(0, 1000, 10000, 100000, 9914937)).
With these transformations, will the integrity of the multiple imputations be maintained? Or do I need to carry out the transformations before the multiple imputation?

Thank you.

Max
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

15 Nov 2019, 22:10

1. When I ran the sum command for CD4 T-cell count and HIV viral load after imputing the missing values, I noted that the newly computed variables include negative numbers, which is odd. Can this be prevented?

Yes, it can be prevented, but you shouldn't do that. The mathematics behind multiple imputation does not treat the imputed data sets as alternative versions of reality. The imputed values are simply numbers calculated in such a way that running the regression on each imputed data set and then combining the results according to Rubin's rules (i.e. using -mi estimate-) reduces the bias in the coefficients that would result from regressing only complete cases. There is no mathematical or statistical reason why the imputed values have to resemble real data, or even be plausibly realistic, or even be possible in the real world. So leave it alone.
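
To make the combining step concrete, here is a minimal sketch of the pooling arithmetic in Rubin's rules. It is written in Python purely as an illustration, and it is not what -mi estimate- runs internally. The function name pool_rubin and the coefficient values are hypothetical; the inputs stand for a log hazard ratio and its squared standard error from each of five imputed data sets.

```python
import math
import statistics

def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, math.sqrt(total)

# Hypothetical results from m = 5 imputed data sets (not from the poster's data)
coefs = [0.42, 0.38, 0.45, 0.40, 0.35]
ses_sq = [0.010, 0.011, 0.009, 0.010, 0.012]
pooled_coef, pooled_se = pool_rubin(coefs, ses_sq)
print(round(pooled_coef, 4), round(pooled_se, 4))
```

Note that no individual imputed value enters this calculation, only the per-data-set estimates and variances, which is why an implausible imputed value such as a negative CD4 count is not a problem in itself.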

With these transformations, will the integrity of the multiple imputations be maintained? Or do I need to carry out the transformations before the multiple imputation?

Well, if you are going to do these transformations, it is better to do them on the unimputed data set and then impute the transformed data. But these transformations look like a bad idea. Turning continuous variables into discrete categories discards information and introduces bias. Why bother fixing up the usually relatively minor problem of missing values in your data if you are then going to mangle the data in this disinformative way? Unless in the real world something truly qualitative and discrete happens at the cutpoints you show, there are no benefits, and potentially great harms, entailed by discretizing your good, high-quality continuous data.

If your motivation for doing this is concern about possible non-linearity between the log hazard ratio and the predictor variables, a much better solution is to transform the continuous variable in a continuous way (maybe a log, or square root, or exponential, or square, or cube, etc.) to produce a variable that is more linearly related to the log hazard ratio. And, again, if you do this, it is better to transform first on the real data and then impute.
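
The order of operations (transform the observed data first, then impute on the transformed scale) can be sketched as follows. This is a Python illustration only: the imputation step is a deliberately crude stand-in (random draws from a normal distribution fitted to the observed log values), not the model that -mi impute- would fit, and the function name impute_log_scale and the viral load values are hypothetical.

```python
import math
import random
import statistics

def impute_log_scale(values, m, seed):
    """Return m completed copies of log10(values); None marks a missing value."""
    rng = random.Random(seed)
    # Step 1: transform the observed data, leaving missing entries missing.
    logged = [math.log10(v) if v is not None else None for v in values]
    observed = [x for x in logged if x is not None]
    mu, sd = statistics.fmean(observed), statistics.stdev(observed)
    # Step 2: impute on the transformed scale (toy model: draws from N(mu, sd)).
    completed = []
    for _ in range(m):
        completed.append([x if x is not None else rng.gauss(mu, sd) for x in logged])
    return completed

viral_load = [1200.0, None, 85000.0, 430.0, None, 15000.0]  # hypothetical copies/mL
copies = impute_log_scale(viral_load, m=5, seed=2019)
print(len(copies), len(copies[0]))
```

The analysis model would then use the log-scale variable directly in each completed copy, so nothing needs to be transformed after imputation.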
Maxwell Akanbi

Join Date: Nov 2019

Posts: 2
#3

16 Nov 2019, 14:32

Thanks, Clyde. The HIV viral load is skewed and usually log-transformed or categorized into 'clinically relevant' cut points. I will log transform it before imputation.
alessio lombini

Join Date: Dec 2020

Posts: 98
#4

08 Aug 2021, 08:03

Dear Clyde Schechter,

I have a question for you. Do you think it is always better to transform a variable before imputing missing values, or is that true only in this specific situation? For instance, in my case, I am not sure whether to take the log of variable X before or after linearly interpolating its missing values. Also, do you know of any reference on this?
Thank you in advance for your time.
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#5

08 Aug 2021, 12:09

If you are referring specifically to multiple imputation, there is general agreement that you should first transform the variable and then impute the transformed results.

In your next sentence, however, you are referring to linear interpolation, which is a different way of dealing with missing values. Unless you have strong theoretical justifications for using linear interpolation, though, you really should not do it at all. It relies on an extremely strong assumption about the nature of the missing data, one that is usually not true in the real world. If you do have a strong theoretical justification for believing that the missing data lie on straight line segments with the data around them, then do the interpolation first and then take the logarithms. But, as I say, that is seldom justified. So consider your assumptions carefully.
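
A small worked example shows why the order matters for interpolation. It is a Python illustration with made-up numbers: a single missing value midway between observed values of 10 and 1000. The helper name interpolate_midpoint is hypothetical.

```python
import math

def interpolate_midpoint(left, right):
    """Linear interpolation at the midpoint between two neighbouring values."""
    return (left + right) / 2

before, after = 10.0, 1000.0  # hypothetical observed neighbours of the missing value

# Interpolate on the raw scale, then take logs: log10(505)
interp_then_log = math.log10(interpolate_midpoint(before, after))

# Take logs first, then interpolate: midpoint of 1 and 3
log_then_interp = interpolate_midpoint(math.log10(before), math.log10(after))

print(round(interp_then_log, 3), round(log_then_interp, 3))
```

The first order assumes the raw data lie on a straight line (filled-in value 505); the second assumes the logged data do (filled-in value 100 on the raw scale). Which one is appropriate depends entirely on which straight-line assumption, if either, can be justified.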
Maryam Bashir

Join Date: Aug 2021

Posts: 5
#6

19 Aug 2021, 14:52

Dear Clyde Schechter
I am new here and to MI. I use the ice command for my imputations, but I don't know how to change my number of observations back to the original value. I keep converting the style, mostly to wide, but without any change. What is your advice?
Secondly, is it advisable to go beyond 100 imputations?
Thanks a lot.
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#7

19 Aug 2021, 15:02

I don't understand your question. I haven't used -ice- in a long time and my memory of it is a bit rusty. But -ice- does its imputations in long style, and there is no changing that. If you want your imputed data in wide style, then either you can do the imputations with Stata's native -mi- commands in the first place, or you can -mi import- what -ice- produces and then change the style with -mi convert-.

As for your second question, I don't really know. At one time, it was claimed that a very small number of imputations was usually sufficient. Then it was said otherwise. I have also seen guidance suggesting that the number of imputations should be approximately the same as the fraction of missing information expressed as a percentage. I don't honestly know. There are others on this Forum who have much deeper knowledge of multiple imputation than I do. I hope one of them will respond.
Maryam Bashir

Join Date: Aug 2021

Posts: 5
#8

19 Aug 2021, 16:12

Thanks a lot for your prompt reply. I will try the native -mi- commands.
Ralph Santos

Join Date: Dec 2021

Posts: 1
#9

01 Dec 2021, 05:24

Dear Clyde Schechter,

I am trying to fit a logistic regression model (using -logistic PC5years age BMI sex smoking nod pancreatitis-). However, I have missing values for BMI and smoking, and I also found non-linearity between the log odds and the continuous predictors BMI and age. Do you know of any way to combine fractional polynomials, multiple imputation and logistic regression?
Thanks in advance.