Trouble with imputing right skewed data with multiple imputation

layla sarah

Join Date: Mar 2016

Posts: 7
#1

19 Mar 2016, 15:47

Hello Statalist,

I have a dataset that includes parasite data. There are two variables of interest: infection (binary, 1 or 0) and intensity (continuous). Intensity data are only available if a person is infected (i.e., has a 1 for the infection variable). For the purpose of this post, let's say the infection variable is asc_bin and the intensity variable is asc_num.

I have 1010 participants, of which 962 provided a specimen to assess parasites. Of those who provided a specimen, 179 are infected. I need to impute both asc_bin and asc_num for the 48 participants with missing data. I believe the best way to do this would be to impute using negative binomial regression.

Steps:

1) Recode the continuous variable of asc_num to 0 for uninfected individuals (i.e., asc_bin==0), thereby all 962 observed participants have a value for asc_num. [successful]
2) Run a negative binomial regression with complete baseline variables (group, age, education, school, district) [successful]
3) Run the following MI code:
mi set wide
mi register imputed asc_num
mi register regular group, age, education, school, district
mi impute nbreg asc_num group, age, education, school, district, add(20) rseed(11) noisily [unsuccessful]

Error: asc_num: missing imputed values produced
This may occur when imputation variables are used as
independent variables or when independent variables
contain missing values. You can specify option force if
you wish to proceed anyway.

4) Using the force option does not impute any observations.

Question: Is there a workaround for this? I don't understand the error, because none of my independent variables (i.e., group, age, education, school, district) have missing values.

Thanks,
Layla Sarah
Tags: mi impute, multiple imputation, nbreg
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

19 Mar 2016, 16:24

A couple of hunches.

1. If you really ran the commands as written, I'm surprised you even got an error message out of the -mi impute nbreg- command, because all those commas except the one before add(20) are syntax errors. Probably this is just a transcription error in your post, but I can't emphasize enough that there are no unimportant details, and when asking for help troubleshooting code it should always be exactly what you ran, created by copying from the Results window or log file or do-file into the clipboard and then pasting into the forum. (Preferably paste into a code block for enhanced readability; see FAQ #12, 7th paragraph.)

2. Assuming that your -mi impute nbreg- command was actually syntactically correct, I'm concerned about some of your independent variables. In particular, I don't know what group, school, and district are, but my guess is that they are not intended to be continuous variables, but rather are categorical variables. This could be particularly problematic if the numeric codes for those variables range widely. In that situation -nbreg- might be trying to impute wildly large values of asc_num for observations in district 1748, or the wide range of values for these numbers could be forcing -nbreg- to try to estimate coefficients that are so small in magnitude that it can't converge.

So, you need to treat these as categorical variables if that is what they are. I know that -nbreg- in isolation supports factor variable notation, and I believe that it does so when used under -mi impute- as well. If so, try -mi impute nbreg asc_num i.group age education i.school i.district, add(20) rseed(11) noisily-. If age or education are actually categorical variables, then they need to be treated as such also (but their names don't immediately leap out to me as implausible continuous variables).
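A rough sketch of how one might check the coding of the predictors before re-running the imputation. It assumes the variable names from post #1 (group, age, education, school, district, asc_num) and that the data have already been -mi set- and registered; it is an illustration, not the original poster's code.

```stata
* Compact overview: distinct values, range, and missing counts per predictor
codebook group age education school district, compact

* Confirm that none of the predictors contains missing values
misstable summarize group age education school district

* Imputation model with the presumed categorical predictors entered
* as factor variables (the /// line continuation requires a do-file)
mi impute nbreg asc_num i.group age education i.school i.district, ///
    add(20) rseed(11) noisily
```

If -codebook- shows only a handful of distinct values with arbitrary numeric codes, that is a hint the variable is categorical and belongs in the model with the i. prefix.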

See if that makes a difference. If not, post back. If you do, please show exact commands and Stata output as requested above.
layla sarah

Join Date: Mar 2016

Posts: 7
#3

19 Mar 2016, 16:40

Dear Clyde, thank you very much for the reply. I apologize, but I did make some copy and paste errors in my first post. I promise to be more careful.

I have re-run the code with your suggestions and get the same problem.

CODE:

mi set wide
mi register imputed asc_num
mi register regular group age education district

mi impute nbreg asc_num i.group age i.education i.district, add(20) rseed(12) noisily

OUTPUT:

Running nbreg on observed data:

Fitting Poisson model:

Iteration 0: log likelihood = -3994066.3
Iteration 1: log likelihood = -3994056.1
Iteration 2: log likelihood = -3994056.1

Fitting constant-only model:

Iteration 0: log likelihood = -8193.4207
Iteration 1: log likelihood = -2396.386
Iteration 2: log likelihood = -2344.653
Iteration 3: log likelihood = -2344.6387
Iteration 4: log likelihood = -2344.6387

Fitting full model:

Iteration 0: log likelihood = -2340.0459
Iteration 1: log likelihood = -2339.1547
Iteration 2: log likelihood = -2339.0579
Iteration 3: log likelihood = -2339.0577

Negative binomial regression                    Number of obs     =        962
                                                LR chi2(4)        =      11.16
Dispersion     = mean                           Prob > chi2       =     0.0248
Log likelihood = -2339.0577                     Pseudo R2         =     0.0024

------------------------------------------------------------------------------
     asc_num |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |
          0  |          0  (empty)
          1  |  -1.181287   .4750133    -2.49   0.013    -2.112296   -.2502783
             |
         age |  -.0197683   .0340822    -0.58   0.562    -.0865682    .0470316
             |
   education |
          0  |          0  (empty)
          1  |    .547247    .537541     1.02   0.309     -.506314    1.600808
             |
    district |
          0  |          0  (empty)
          1  |   .4868574   .4903156     0.99   0.321    -.4741436    1.447858
             |
       _cons |   7.788868   1.119938     6.95   0.000     5.593831    9.983906
-------------+----------------------------------------------------------------
    /lnalpha |   3.939018   .0789629                      3.784253    4.093782
-------------+----------------------------------------------------------------
       alpha |   51.36812   4.056175                       44.0028    59.96627
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 8.0e+06  Prob>=chibar2 = 0.000

asc_num: missing imputed values produced
This may occur when imputation variables are used as independent variables or when
independent variables contain missing values. You can specify option force if you
wish to proceed anyway.

Last edited by layla sarah; 19 Mar 2016, 16:44.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

19 Mar 2016, 18:03

Well, what jumps out at me right away is the alpha = 51.36812. The random effects in -nbreg- are distributed exp(nu) ~ gamma(1/alpha, alpha). So, gamma(1/51.36812, 51.36812) is a very bizarre looking distribution which is extremely dense near zero and also has a very, very long tail. In a sample of 10000 random draws from that distribution, I got a median value of 8.86e-15! So, the corresponding values of nu have a median value of -32.35733, and range from about -500 to +5.6.

The nu values get added to the fixed effects predictor portion, and then that gets exponentiated to provide a predicted mean for a Poisson distribution. Then, to predict an imputed value for asc_num, it has to make a random draw from a Poisson distribution with that predicted mean. If I assume that district is always 1, and education is always 1 (which are the most favorable circumstances) and that age is typically around 15 (I'm guessing these are students), then the predicted means range from 1.7e-216 to 29864.04, and Stata is unable to sample from Poisson distributions with means in the lower end of that range. In particular, when I try to do that, 6,543 of the 10,000 observations are missing, those being the ones with predicted means below about 1e-6. (The online help says that 1e-6 is the lower limit of the domain for rpoisson().) The inability to sample Poisson distributions with means that small is the source of the missing imputed values. The -force- option won't fix that.
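The simulation described above could be reproduced roughly along the following lines. This is a sketch under assumptions, not the code actually used for the numbers quoted: the seed is arbitrary, group is left at its base level, and age is fixed at 15, so the exact counts and ranges will differ from those reported.

```stata
clear
set seed 2016            // arbitrary seed, chosen only for reproducibility
set obs 10000

* Estimated overdispersion parameter from the -nbreg- output
local alpha = 51.36812

* exp(nu) ~ gamma(shape = 1/alpha, scale = alpha), which has mean 1
generate double expnu = rgamma(1/`alpha', `alpha')
generate double nu    = ln(expnu)
summarize expnu nu, detail

* Linear predictor with district = 1, education = 1, age = 15
* (group at its base level is an assumption of this sketch)
local xb = 7.788868 + .4868574 + .547247 - .0197683*15

* Predicted Poisson mean for each draw, then an attempted Poisson draw
generate double mu = exp(`xb' + nu)
generate double y  = rpoisson(mu)

* rpoisson() returns missing when its argument is outside its domain
count if mu < 1e-6
count if missing(y)
```

The two counts at the end should agree closely, which is the point of the diagnosis: the draws with a predicted mean below the lower limit of rpoisson() are the ones that come back missing.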

So the bottom line here is that your distribution for asc_num is just too wild to be simulated by -nbreg-. I'm guessing that the distribution of asc_num in your data is, on the one hand, immensely zero heavy, but on the other hand also has frequent very large observed values.

So that's the diagnosis. I don't really know what treatment to suggest. Perhaps there is some replacement variable you could use in lieu of asc_num. Or perhaps you could transform it to make it a bit more manageable, such as cube root, or arctan, and then try imputing the transform. A suitable approach might also depend on the role that asc_num is going to play in the model that you ultimately -mi estimate-, and what kind of model that is. So, I don't think I can offer any more specific advice.
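To make the transformation idea concrete, here is a minimal sketch of imputing the cube root and back-transforming. It assumes the original, not yet -mi set- data, and it uses -mi impute regress- for the transformed variable; the names asc_cbrt and asc_num_bt are made up for illustration. Whether a linear model is adequate for such a zero-heavy variable is an open question.

```stata
* Cube-root transform; asc_num is nonnegative, so the root is well defined
generate double asc_cbrt = asc_num^(1/3)

mi set wide
mi register imputed asc_cbrt
mi register regular group age education district

* Impute the transformed variable with a linear regression model
mi impute regress asc_cbrt i.group age i.education i.district, ///
    add(20) rseed(11)

* Back-transform to the original scale in every imputation
mi passive: generate double asc_num_bt = asc_cbrt^3
```

Note that a linear model can impute negative values of asc_cbrt, which back-transform to negative intensities, so the imputed values would need checking before use.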

Perhaps others on the Forum have some experience trying to impute variables with wild distributions like this and can chime in with better guidance about solving this problem.
layla sarah

Join Date: Mar 2016

Posts: 7
#5

19 Mar 2016, 21:24

Thank you Clyde, you have been incredibly helpful. There are many zeros in asc_num because the majority of the population is uninfected. I have since decided to impute with pmm. It worked! Thanks again.
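For readers who land here with the same error, a predictive mean matching run might look roughly like the sketch below. This is not the exact command used in this thread: the knn(5) setting, the seed, and the derived variable name asc_bin_imp are assumptions for illustration.

```stata
mi set wide
mi register imputed asc_num
mi register regular group age education district

* PMM fills each missing value with an observed asc_num borrowed from
* one of the 5 nearest neighbours in terms of the linear prediction
mi impute pmm asc_num i.group age i.education i.district, ///
    add(20) rseed(12) knn(5)

* Derive infection status from the imputed intensity in each imputation
mi passive: generate byte asc_bin_imp = (asc_num > 0) if !missing(asc_num)

mi describe
```

Because PMM only ever imputes values that were actually observed, the zeros and the long right tail are preserved without having to draw from a fitted count distribution.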
