Impute missing data with uniform() command

Ronan Ollivier

Join Date: Jan 2017

Posts: 2
#1

Impute missing data with uniform() command

18 Jan 2017, 09:24

Hello,

I am new to Stata.

I am currently analyzing a data set of influenza cases hospitalized in intensive care units in an administrative region. Among the variables studied, I have the virus type (A or B) and subtype (A(H1N1), A(H3N2) or B). The percentage of missing data is relatively small for the type (3.2%), but it is 32% for the subtype.

As we know the distribution of subtypes among outpatients in the country for each season, is it correct to impute the missing type and subtype using the proportion of A viruses, “pr_A”, and the proportion of the A(H1N1) subtype, “pr_h1n1”, found in the general population?

Below are the Stata commands I use to implement the procedure.

There are two variables: “typ” and “styp”:
“typ” values: 0 for B virus, 1 for A virus

“styp” values: 0 for A(H3N2), 1 for A(H1N1), 2 for B

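* Flag observations with a missing type or subtype before imputing
gen Rtyp=cond(typ==.,1,0)
gen Rstyp=cond(styp==.,1,0)
* Bring in the outpatient proportions for each season
merge m:1 i_season using $chemin\outpatient.dta, keepusing(pr_A pr_h1n1)
* Draw type A with probability pr_A; an imputed type B implies subtype B
replace typ=uniform()<=pr_A if Rtyp==1
replace styp=2 if Rstyp==1 & typ==0
* For A viruses, draw A(H1N1) with probability pr_h1n1
replace styp=uniform()<=pr_h1n1 if Rstyp==1 & typ==1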

Or is it better to use multiple imputation with ice?
Thanks for your help.
Ronan.
Tim Morris

Join Date: Apr 2014

Posts: 92
#2

18 Jan 2017, 10:09

Hi Ronan

Welcome. As far as I know, there isn't a set answer to this question, but a couple of thoughts:
If your study is representative of the population, then anchoring your imputed proportions to the population is a sensible idea. If you have no reason to believe it is representative, using this approach could go wrong.

There is no reason to choose between your approach and MI: you could do both (multiply impute using your method). I co-supervise a PhD student who is using population-level information with multiple imputation inference. Multiple imputation is likely to be better than single imputation because you can get sensible standard errors and confidence intervals, which will be smaller than with single imputation.

Have you tried using ice (or mi impute) as you mention above? Does it return reasonable (to your mind) proportions of each type and subtype?
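For readers who want to try the official mi route mentioned above, here is a minimal sketch on simulated data. The data, the variable names season and styp, the number of imputations and the seeds are all assumptions made for illustration; this is not Ronan's data set or his model.

```stata
* Toy data (hypothetical): styp 0 = A(H3N2), 1 = A(H1N1), 2 = B
clear
set seed 2017
set obs 500
gen byte season = 1 + mod(_n, 3)
gen byte styp = floor(3*runiform())
* Delete about 30% of the subtypes completely at random
replace styp = . if runiform() < 0.3

* Declare the mi structure and the variable to be imputed
mi set wide
mi register imputed styp

* Impute the three-category subtype from season, 20 imputations
mi impute mlogit styp i.season, add(20) rseed(2017)

* Pooled subtype proportions across the imputed data sets
mi estimate: proportion styp
```

The pooled proportions can then be compared with the outpatient proportions to see whether the imputation model returns plausible values.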

Tim
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

18 Jan 2017, 10:11

Well, using random imputation, whether single or multiple, is only reasonable if you believe the data are missing at random. Only you know enough about why the missing values are missing to judge whether this is a plausible assumption or not.

Putting that aside, multiple imputation is preferable to single. With any method of single imputation, even one that is completely unbiased, the variance of the variables that are imputed is underestimated, and that in turn leads to upward bias in the estimates of regression coefficients (or comparable statistics such as correlation coefficients, t-test results, etc.) produced in analyses of that data.

If, however, you believe that your data are not missing at random, then no imputation technique really does the job and you are better off just performing sensitivity analyses using reasonable best and worst-case (for whatever hypotheses you are testing) scenarios for the missing data.
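A best and worst-case sensitivity analysis of the kind described above could be sketched as follows. The data are simulated and the two scenarios (all missing subtypes are A(H1N1), or none are) are only one possible choice of extremes, assumed here for illustration.

```stata
* Toy data (hypothetical): styp 0 = A(H3N2), 1 = A(H1N1), 2 = B
clear
set seed 2017
set obs 559
gen byte styp = floor(3*runiform())
* About 32% missing subtype
replace styp = . if runiform() < 0.32

* Scenario 1: every missing subtype is A(H1N1)
gen byte styp_hi = cond(missing(styp), 1, styp)
* Scenario 2: no missing subtype is A(H1N1) (all set to A(H3N2) here)
gen byte styp_lo = cond(missing(styp), 0, styp)

* Proportion of A(H1N1) under each scenario
gen byte h1n1_hi = styp_hi == 1
gen byte h1n1_lo = styp_lo == 1
summarize h1n1_hi, meanonly
scalar p_hi = r(mean)
summarize h1n1_lo, meanonly
scalar p_lo = r(mean)
display "A(H1N1) proportion ranges from " %5.3f p_lo " to " %5.3f p_hi
```

If the substantive conclusion is the same at both ends of the range, the missing values matter little for that conclusion.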
Tim Morris

Join Date: Apr 2014

Posts: 92
#4

18 Jan 2017, 10:38

Clyde Schechter Your point about missing at random is true to the extent that you can 'just do it'. But multiple imputation also does the job under missing not at random; it's just much harder. Life becomes easier if you can base the imputation on some external data (like here, and Ronan's suggestion). A multiple imputation version of Ronan's suggestion could work, I think. (We've been developing something similar in spirit and it does work.) Specifically, it should be consistent if:
The probability of being missing depends only on the underlying value for both typ and styp

The study is sampled from the population, so it's reasonable to want the proportions of typ and styp to match the population

When it is consistent, the SE may be slightly overestimated, leading to coverage >95%.
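One way to read "a multiple imputation version of Ronan's suggestion" is to repeat the random draw M times and pool the results with Rubin's rules. The sketch below does this by hand on simulated data. It is not the method Tim refers to; the value of pr_h1n1, the number of imputations M and the toy data are all assumptions.

```stata
* Toy data (hypothetical): typ 1 = A, 0 = B; styp 0 = A(H3N2), 1 = A(H1N1), 2 = B
clear
set seed 12345
set obs 559
gen byte typ = runiform() < 0.8
gen byte styp = cond(typ == 0, 2, runiform() < 0.6)
replace styp = . if runiform() < 0.32
* Assumed population proportion of A(H1N1) among A viruses
gen double pr_h1n1 = 0.6

local M 20
scalar sumQ = 0
scalar sumQ2 = 0
scalar sumU = 0
forvalues m = 1/`M' {
    * One random imputation of the missing subtypes
    quietly gen byte styp_m = styp
    quietly replace styp_m = 2 if missing(styp) & typ == 0
    quietly replace styp_m = runiform() <= pr_h1n1 if missing(styp) & typ == 1
    * Estimate the A(H1N1) proportion and its variance in this completed data set
    quietly gen byte h1n1 = styp_m == 1
    quietly mean h1n1
    scalar sumQ = sumQ + _b[h1n1]
    scalar sumQ2 = sumQ2 + _b[h1n1]^2
    scalar sumU = sumU + _se[h1n1]^2
    drop styp_m h1n1
}
* Rubin's rules: pooled estimate, within (W), between (B) and total (T) variance
scalar Qbar = sumQ/`M'
scalar W = sumU/`M'
scalar B = (sumQ2 - `M'*Qbar^2)/(`M' - 1)
scalar T = W + (1 + 1/`M')*B
display "Pooled A(H1N1) proportion = " %5.3f Qbar "  SE = " %6.4f sqrt(T)
```

Because pr_h1n1 is treated as known in every draw, this sketch ignores the uncertainty in the population proportion itself.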

Tim
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

18 Jan 2017, 11:39

Tim, that looks very interesting. Thank you.
Ronan Ollivier

Join Date: Jan 2017

Posts: 2
#6

19 Jan 2017, 10:03

Thank you for your answers.

Yes, I think the distribution of influenza virus subtypes in the general population is representative of that among severe cases hospitalized in intensive care units. For the 2009-2010 season, all A viruses were treated as A(H1N1). We know, however, that the A(H1N1) subtype gives a more serious clinical picture and affects younger patients than A(H3N2) does.

On the other hand, I have a hard time understanding the notion of data missing at random. My understanding was that data are missing at random when the missingness depends on other variables in the dataset. In our dataset, we found that missingness was related to the season and the health facility (tab order). Subtyping practices varied between years and hospitals; in fact, whether subtyping is performed depends on the funding and human resources available in the virology laboratories. So I do not know whether this counts as missing at random.
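A relationship between missingness and an observed variable, as described above, can be checked with a cross-tabulation or a logistic regression of the missingness indicator. The sketch below uses simulated data with a hypothetical season variable; it shows only dependence on observed variables and cannot by itself tell whether missingness also depends on the unobserved subtype.

```stata
* Toy data (hypothetical): missingness of styp increases with season
clear
set seed 2017
set obs 559
gen byte season = 1 + mod(_n, 5)
gen byte styp = floor(3*runiform())
replace styp = . if runiform() < 0.1*season

* Missingness indicator: 1 if subtype is missing
gen byte Rstyp = missing(styp)

* Share of missing subtypes by season, with a chi-squared test
tabulate season Rstyp, row chi2

* The same question as a regression, which extends to several predictors
logit Rstyp i.season
```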
I tried the ice command and got 60 imputed data sets. But "mi estimate, noisily: proportion styp, over(season)" gave a result that did not match the distribution of subtypes in the general population (outpatients).

Maybe the data are not substantial enough: too few records (559) and too much missing data (30% for subtype and vaccination). Too bad ...

Until next time on this forum.

Ronan.
Tim Morris

Join Date: Apr 2014

Posts: 92
#7

20 Jan 2017, 06:30

Ronan Ollivier Thanks for the details you have included here. It sounds like you have a degree of missing at random (i.e. the probability of a missing subtype partly depends on variables you have recorded), as well as some missing not at random (indicated by the fact that when you use ice it doesn't recover the population proportions of the subtypes). Tra Pham has been working on a method for exactly this situation! Would you be interested in talking further and trying our method on your influenza data? She has a Stata command.

I'm not sure of the best way to get in touch with someone on Statalist, so if you're interested, email me at: tim [dot] morris [at] ucl.ac.uk

Thanks, Tim
