Multiple imputation confusion

Anne Todd

Join Date: Dec 2018

Posts: 163
#1


28 Aug 2021, 13:42

Hello, I was hoping that someone could explain something going on behind the scenes with -mi- that I'm not sure I understand. I have a dataset where "treatment_var" is my main predictor variable, and it has no missing values so I do not want/need to impute that variable at all. This is just a 0/1 variable, with 743 total observations (0 = 400, 1 = 343).

This is the code I'm using for doing mi, which seemingly works perfectly:

Code:

mi set mlong
mi register imputed depression race gender gpa p_educ age
mi impute chained (regress) p_educ depression gpa age (logit) gender (ologit) race = treatment_var, add(20) rseed(100)
mi estimate: regress depression i.treatment_var i.race i.gender gpa c.age#c.age p_educ, robust

When I do the -mi estimate: regress- command, I see my "Number of obs" is equal to 743, which is my original number of total observations, so that seems to make sense. But then if I do -tab treatment_var- afterwards (on this imputed dataset), there is something like 3,000 total responses.

But I thought I was telling Stata not to impute that variable, as it has no missings, and indeed it seems like the actual regression output itself still has the correct original number of observations.

Am I just overlooking something? What is happening with what that -tab- is showing me?

Sorry for not providing data here, it is on a different server that I cannot access at the moment. Hopefully the question will still be clear otherwise.

William Lisowski

Join Date: Dec 2014
Posts: 10150
#2

28 Aug 2021, 14:09

Your dataset has had observations added to it, based on the observations that had missing values. Open your dataset in the Data Editor window and scroll down past the initial 743 observations. You will see observations with _mi_m = 1 and higher (identified by _mi_id), which are copies of the original observations with the missing values filled in (compare them to the original observation with _mi_m = 0).

Code:

webuse mheart5                                                          
mi set mlong                                                            
mi register imputed age bmi                                             
set seed 29390                                                          
mi impute mvn age bmi = attack smokes hsgrad female, add(10)
list if _mi_id==14

Code:

. list if _mi_id==14, clean

       attack   smokes        age        bmi   female   hsgrad   _mi_m   _mi_id   _mi_miss  
 14.        0        0          .          .        0        1       0       14          1  
156.        0        0   38.57524   31.18536        0        1       1       14          .  
184.        0        0   58.34894   19.88316        0        1       2       14          .  
212.        0        0   68.46573   22.72963        0        1       3       14          .  
240.        0        0   48.14063   25.00218        0        1       4       14          .  
268.        0        0   66.52374   24.27379        0        1       5       14          .  
296.        0        0   44.67178   23.36431        0        1       6       14          .  
324.        0        0   60.70895    19.9942        0        1       7       14          .  
352.        0        0   73.13823   25.92297        0        1       8       14          .  
380.        0        0   63.83153   24.19435        0        1       9       14          .  
408.        0        0   50.23097   28.49728        0        1      10       14          .
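
The same thing can be checked without scrolling through the Data Editor. What follows is only a minimal sketch, assuming the mheart5 data are still in memory after the imputation above; attack stands in for your treatment_var, since it is also a complete 0/1 variable.

```stata
* Observations per imputation: m = 0 is the original data (154 observations
* in mheart5); in mlong style each m > 0 holds only the observations that
* had missing values in m = 0
tabulate _mi_m

* Tabulating a complete variable over everything in memory counts the
* added observations too, so the total is inflated
tabulate attack

* Restrict to the original data, either directly or through mi xeq
tabulate attack if _mi_m == 0
mi xeq 0: tabulate attack

* Summary of the mi setup: style, number of imputations, registered variables
mi describe
```

The restricted tabulations should show the original number of observations, which is what I would expect for treatment_var in your data as well.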


Anne Todd

Join Date: Dec 2018

Posts: 163
#3

28 Aug 2021, 14:17

Ah, I think this makes more sense--so the non-missing -treatment_var- is not being imputed, there are just "new" observations incorporating the other imputed variables? And so when I'm doing the -mi estimate: regress-, is it sort of collapsing, in a sense, those all back into the original number of observations--which is why I'm still seeing the 743 in the regression output?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

28 Aug 2021, 16:02

Your understanding is correct, and I owe you an apology for omitting a critical reference from post #2. As I drafted post #2, I had included a recommendation that you read the discussion at

Code:

help mi##example

to understand the spirit of mi. Somehow that sentence fell off my screen, and I didn't notice the loss.

But on looking further, the real advice for understanding mi is to look at the Stata Multiple-Imputation Reference Manual PDF included in your Stata installation and accessible from Stata's Help menu. The very first section, although forbiddingly titled "Intro substantive", is in fact an overview of the substance of multiple imputation, a sort of prerequisite reading before leaping into the documentation for the commands.
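
To make the "collapsing" idea a little more concrete: as I understand it, mi estimate does not merge rows. It fits the model once on each completed dataset, each of which has the original number of observations, and then pools the estimates using Rubin's rules. A rough sketch, assuming the mheart5 example from my earlier post (the regression itself is just an illustrative model, not a recommendation):

```stata
webuse mheart5, clear
mi set mlong
mi register imputed age bmi
mi impute mvn age bmi = attack smokes hsgrad female, add(10) rseed(29390)

* Fit the model separately on imputations 1 and 2; each fit uses one
* completed dataset, so each should report the original number of observations
mi xeq 1 2: regress bmi age attack smokes female hsgrad

* Fit the model on all 10 imputations and pool the results
mi estimate: regress bmi age attack smokes female hsgrad
```

The "Number of obs" in the pooled output is the size of one completed dataset, which is why you see 743 in your own results.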

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712
#5

29 Aug 2021, 03:11

Anne:
as an aside to William's helpful advice, why did you code your interaction as:

Code:

c.age#c.age

instead of:

Code:

c.age##c.age

?
Just exploiting William's code, in the following toy example neither the linear nor the squared term for -age- reaches statistical significance (which is neither a good nor a bad finding), but I would investigate a possible turning point in your dataset:

Code:

. mi estimate: logistic attack smokes bmi female hsgrad c.age##c.age

Multiple-imputation estimates                   Imputations       =         10
Logistic regression                             Number of obs     =        154
                                                Average RVI       =     0.0803
                                                Largest FMI       =     0.2618
DF adjustment:   Large sample                   DF:     min       =     142.24
                                                        avg       =  13,394.40
                                                        max       =  56,532.86
Model F test:       Equal FMI                   F(   6, 6678.2)   =       2.84
Within VCE type:          OIM                   Prob > F          =     0.0093

------------------------------------------------------------------------------
      attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      smokes |   1.206959   .3659681     3.30   0.001     .4895212    1.924398
         bmi |   .1110131   .0520222     2.13   0.035     .0081766    .2138496
      female |   -.048182   .4162727    -0.12   0.908    -.8640789     .767715
      hsgrad |   .1854987   .4077014     0.45   0.649    -.6136161    .9846135
         age |   .0987561   .1178324     0.84   0.402    -.1323856    .3298978
             |
 c.age#c.age |  -.0005969   .0010289    -0.58   0.562    -.0026148    .0014211
             |
       _cons |  -7.257825   3.892616    -1.86   0.063    -14.91629    .4006389
------------------------------------------------------------------------------

.
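
One way to look at a possible turning point is to ask -mi estimate- for a transformation of the coefficients. This is only a sketch, assuming the data have been imputed with William's code; tp is a label I made up, and with a squared term this close to zero the estimate will be very imprecise:

```stata
* tp = age at which the fitted quadratic in age peaks (or bottoms out),
* computed as -b_age / (2 * b_age_squared) on the log-odds scale
mi estimate (tp: -_b[age]/(2*_b[c.age#c.age])): logistic attack smokes bmi female hsgrad c.age##c.age
```

If the estimated turning point falls well outside the observed range of -age-, the quadratic term is probably not telling you much.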

Kind regards,
Carlo
(Stata 19.0)


Anne Todd

Join Date: Dec 2018

Posts: 163
#6

29 Aug 2021, 13:04

Thank you, William Lisowski, I appreciate the help very much.
Anne Todd

Join Date: Dec 2018

Posts: 163
#7

29 Aug 2021, 13:07

Carlo Lazzaro, you are correct that it should be ##, as I want the model to include age along with its squared term. I just mistyped it in the original post.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#8

29 Aug 2021, 14:18

Anne:
thanks for clarifying.

Kind regards,
Carlo
(Stata 19.0)
