Initializing MICE to avoid patchy imputation

Jonathan Afilalo

Join Date: Nov 2016

Posts: 42
#1


29 Jun 2023, 06:27

When using multiple imputation by chained equations to impute several MAR variables, some sources suggest the following steps:
1. Use mean imputation for all variables with missing values, as placeholders

2. Set the placeholders back to missing for one variable to be imputed ("var")

3. Use regression imputation for "var" (benefiting from complete data on the predictors thanks to step 1)

4. Repeat steps 2-3 for each variable you want to impute

5. Repeat steps 2-4 for a given number of cycles, updating the imputations each cycle, resulting in one imputed dataset

6. Repeat steps 1-5 for a given number of imputations

https://onlinelibrary.wiley.com/doi/...0.1002/mpr.329
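For concreteness, the steps listed above can be sketched by hand in Stata. This is a simplified illustration only: it uses the shipped auto dataset with artificially induced missingness, the seed and the 10 cycles are arbitrary choices, and each draw adds residual noise without accounting for parameter uncertainty, so it is neither proper multiple imputation nor a description of what mi impute chained does internally.

```stata
sysuse auto, clear
keep price mpg weight
set seed 2023
replace mpg    = . if runiform() < .2   // induce missingness for illustration
replace weight = . if runiform() < .2

* remember which cells were originally missing
generate byte miss_mpg    = missing(mpg)
generate byte miss_weight = missing(weight)

* step 1: mean placeholders for every incomplete variable
foreach v in mpg weight {
    summarize `v', meanonly
    quietly replace `v' = r(mean) if miss_`v'
}

* steps 2-5: cycle over the incomplete variables
forvalues cycle = 1/10 {
    foreach v in mpg weight {
        local other = cond("`v'" == "mpg", "weight", "mpg")
        quietly replace `v' = . if miss_`v'        // step 2: reset placeholder
        quietly regress `v' `other' price          // step 3: fit on observed cases
        quietly predict double xb, xb
        quietly replace `v' = xb + rnormal(0, e(rmse)) if miss_`v'
        drop xb
    }
}
```

Running the whole block again from step 1 with a different seed would give a further imputed dataset (step 6).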

The problem is that mi impute chained does not do steps 1 and 2!

As a result, many values cannot be imputed because of missingness in the independent / auxiliary variables.

Is there a way to get Stata's mi suite or a third-party package to do steps 1 and 2 as part of the mi workflow?
daniel klein

Join Date: Mar 2014

Posts: 3862
#2

29 Jun 2023, 07:15

Originally posted by Jonathan Afilalo:

As a result, many values cannot be imputed because of missingness in the independent / auxiliary variables.

I think this is a misunderstanding. Please show syntax (and example data, if possible).

My guess is that you are typing something like

Code:

mi impute chained ... varname ... = varname ...

and have missing values in variables on both sides of the equals sign. You should register all variables with missing values as imputed and include them to the left of the equals sign. Variables to the right of the equals sign should not have missing values.
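A minimal sketch of that advice, using Stata's shipped auto dataset rather than the original poster's data. The induced missingness in mpg, the seed, and the choice of pmm for rep78 are illustrative assumptions; the point is only which variables go on which side of the equals sign.

```stata
sysuse auto, clear
set seed 12345
replace mpg = . if runiform() < .2      // induce missingness for illustration

* list the variables that actually contain missing values
misstable summarize price mpg weight rep78

mi set wide
mi register imputed mpg rep78           // incomplete: left of the equals sign
mi register regular price weight        // complete: right of the equals sign
mi impute chained (regress) mpg (pmm, knn(5)) rep78 = price weight, add(5) rseed(12345)
```

misstable summarize reports only the variables that have missing values, which makes it a quick check of what belongs on the left-hand side.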

Last edited by daniel klein; 29 Jun 2023, 07:17.
Jonathan Afilalo

Join Date: Nov 2016

Posts: 42
#3

29 Jun 2023, 07:36

Originally posted by daniel klein:

Variables to the right of the equals sign should not have missing values.

Of course. In the example below, imputation of hospitalized is incomplete because of missing values in bmi. This is precisely why some experts suggest initializing the imputation procedure with a simple mean imputation of all missing values as a "placeholder", and then resetting one "placeholder" at a time back to missing in order to impute it using a complete dataset. See steps 1-2 in the aforementioned workflow. How can I achieve this with Stata's mi commands?

Code:

mi set wide
mi register hospitalized bmi
mi impute chained (logit) hospitalized = age female bmi (regress) bmi = age female, add(20)

Last edited by Jonathan Afilalo; 29 Jun 2023, 07:44.
daniel klein

Join Date: Mar 2014

Posts: 3862
#4

29 Jun 2023, 07:50

Simple. Code:

Code:

[...] mi impute chained (logit) hospitalized (regress) bmi = age female, add(20)
Jonathan Afilalo

Join Date: Nov 2016

Posts: 42
#5

29 Jun 2023, 08:22

Originally posted by daniel klein:

Simple. Code:

Code:

mi impute chained (logit) hospitalized (regress) bmi = age female, add(20)

In this example, which independent variables will Stata use to impute the dependent variables hospitalized and bmi? I assume that it will use the stated auxiliary variables age and female, but will it also use other variables in the dataset? (I may be missing something basic here or just not getting through...)
Jonathan Afilalo

Join Date: Nov 2016

Posts: 42
#6

29 Jun 2023, 08:41

Code:

mi impute chained (logit) hospitalized (regress) bmi = age female, add(20)

I am pretty sure that the independent variables in each imputation model are a combination of (a) the auxiliary variables + (b) the other imputed variables, equivalent to:

Code:

logit hospitalized age female bmi
regress bmi age female hospitalized

Naturally, since there are missing values for bmi, I get this error:

Code:

hospitalized: missing imputed values produced
    This may occur when imputation variables are used as independent variables or when
    independent variables contain missing values. You can specify option force if you wish
    to proceed anyway.

If I specify force, the command runs but generates an incomplete imputed dataset. This is exactly what steps 1-2 in the aforementioned workflow aim to circumvent!

Still unsolved...
daniel klein

Join Date: Mar 2014

Posts: 3862
#7

29 Jun 2023, 08:45

It's documented.

For starters, type

Code:

mi impute chained ... , dryrun

to see the model specifications. You will find something like

Code:

logit hospitalized bmi age female
regress bmi i.hospitalized age female

meaning that by default all variables are used in all equations. And, Stata does use steps 1 and 2 to fill in missing values in bmi and hospitalized where these variables appear as predictors.

btw. female should probably be i.female

Edit: The conditional specifications are shown differently from what I suggest here (see #9); the information is the same

Last edited by daniel klein; 29 Jun 2023, 09:05.
daniel klein

Join Date: Mar 2014

Posts: 3862
#8

29 Jun 2023, 08:51

Originally posted by Jonathan Afilalo:

Still unsolved...

Really? Did you change your syntax as I have advised? Please show the exact commands that you type and also the output that you get. Ideally, provide example data to reproduce the problem.

daniel klein

Join Date: Mar 2014
Posts: 3862
#9

29 Jun 2023, 08:59

Here is an example that should replicate your case:

Code:

sysuse auto

summarize price
generate expensive = price > r(mean)

keep expensive mpg weight foreign

set seed 42

replace expensive = . if runiform() < .2
replace mpg = . if runiform() < .2

mi set wide
mi register imputed expensive mpg
mi impute chained (logit) expensive (regress) mpg = weight i.foreign , add(5)

Here is the relevant output

Code:

. mi impute chained (logit) expensive (regress) mpg = weight i.foreign , add(5)

Conditional models:
               mpg: regress mpg i.expensive weight i.foreign
         expensive: logit expensive mpg weight i.foreign

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        5
Imputed: m=1 through m=5                        updated =        0

Initialization: monotone                     Iterations =       50
                                                burn-in =       10

         expensive: logistic regression
               mpg: linear regression

------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
         expensive |         60           14        14 |        74
               mpg |         62           12        12 |        74
------------------------------------------------------------------
(Complete + Incomplete = Total; Imputed is the minimum across m
 of the number of filled-in observations.)

Note how i.expensive is a predictor for mpg and mpg is a predictor for expensive. Note that both expensive and mpg have missing values. Note that all missing values are imputed.
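To check the last point directly rather than reading it off the table, one option is to count the remaining missing values in each completed dataset. The sketch below rebuilds the same example data so that it runs on its own; mi describe and mi xeq are standard mi commands, and the expectation of a zero count in every imputation follows from the output shown above.

```stata
sysuse auto, clear
summarize price, meanonly
generate byte expensive = price > r(mean)
keep expensive mpg weight foreign

set seed 42
replace expensive = . if runiform() < .2
replace mpg = . if runiform() < .2

mi set wide
mi register imputed expensive mpg
quietly mi impute chained (logit) expensive (regress) mpg = weight i.foreign, add(5)

* overview of the mi data, then missing values left in each of m = 1, ..., 5
mi describe
mi xeq 1/5: count if missing(expensive, mpg)
```

A nonzero count in any imputation would indicate the patchy result that option force produces when a right-hand-side variable has missing values.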


Jonathan Afilalo

Join Date: Nov 2016

Posts: 42
#10

29 Jun 2023, 09:34

I think it's OK now, thanks! (I may have missed a variable with missing values on my right-hand side.)

Was Stata doing steps 1-2 all along? Is this built into the standard mi impute workflow?
daniel klein

Join Date: Mar 2014

Posts: 3862
#11

30 Jun 2023, 01:07

Originally posted by Jonathan Afilalo:

Was Stata doing steps 1-2 all along? Is this built into the standard mi impute workflow?

Yes, since Stata 12, which introduced mi impute chained.

I believe the misunderstanding in syntax might arise because the help file for mi impute chained (more precisely, the syntax diagram in that help file) does not further explain the term indepvars. The documentation might benefit from adding something along the lines of: indepvars are names of variables with no missing values that are used as predictors in all equations.
Jonathan Afilalo

Join Date: Nov 2016

Posts: 42
#12

30 Jun 2023, 06:34

Yes indeed, the documentation could have benefitted from an example and explanation like this:

Code:

mi set wide
mi register imputed IMP_VAR1_CONT IMP_VAR2_CONT IMP_VAR3_DICHOT
mi impute chained (regress) IMP_VAR1_CONT IMP_VAR2_CONT (logit) IMP_VAR3_DICHOT = AUX_VAR1_CONT i.AUX_VAR2_DICHOT, add(10)
mi estimate : regress IMP_VAR1_CONT IMP_VAR2_CONT AUX_VAR1_CONT

IMP_VARs = variables that have missing data

AUX_VARs = variables that cannot have missing data

IMP_VARs and AUX_VARs can each be dependent variables or independent variables in the analysis model, or variables used only for imputation

A given IMP_VAR is imputed from (a) all the other IMP_VARs and (b) all the AUX_VARs, unless omit(VAR) is specified for its equation

For example, the imputation models in the example above are equivalent to these 3 regression commands:
regress IMP_VAR1_CONT IMP_VAR2_CONT i.IMP_VAR3_DICHOT AUX_VAR1_CONT i.AUX_VAR2_DICHOT

regress IMP_VAR2_CONT IMP_VAR1_CONT i.IMP_VAR3_DICHOT AUX_VAR1_CONT i.AUX_VAR2_DICHOT

logit IMP_VAR3_DICHOT IMP_VAR1_CONT IMP_VAR2_CONT AUX_VAR1_CONT i.AUX_VAR2_DICHOT

The equations are ordered from the most observed to the least observed imputed variable, unless orderasis is specified
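The effect of omit() and orderasis can be inspected without imputing anything by adding the dryrun option mentioned in #7. A minimal sketch, assuming the shipped auto dataset with artificially induced missingness (the variables, seed, and missingness rates are illustrative choices, not part of the example above):

```stata
sysuse auto, clear
set seed 42
replace mpg    = . if runiform() < .2   // induce missingness for illustration
replace weight = . if runiform() < .3

mi set wide
mi register imputed mpg weight

* default: each imputed variable is predicted by the other one plus all indepvars
mi impute chained (regress) mpg weight = price i.foreign, dryrun

* omit price from the mpg equation only, and impute in the order typed
mi impute chained (regress, omit(price)) mpg (regress) weight = price i.foreign, dryrun orderasis
```

Each call prints the conditional models that would be used, so the two listings can be compared side by side before committing to add().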