Multiple Imputation Results

Stephen Okiya

Join Date: Jul 2025

Posts: 280
#1

Multiple Imputation Results

16 Sep 2022, 02:23

Hi Stata Users,

First, let me apologize if this is not strictly a Stata question. However, I believe I can still benefit from the vast knowledge of the members of this group.
I am performing multiple imputation using the code below

Code:

mi set wide
mi register imputed pr_attend
mi impute chained (regress) pr_attend, add(20) by(age)
mi estimate: regress pr_attend hhsize hh_head_no_educ clust_literacy num_children hh_member_formal_empl dep_ratio hh_orphan i.hv024

and the attached dataset.
I then ran some robustness checks. The mean of the imputed distribution is accurate (we know this because we have the population estimate), but a Kolmogorov-Smirnov test of equality of distributions shows that the original and imputed variables differ. Visual exploration with kdensity shows the same.

I notice that while the means of the two distributions are similar (dotted green line superimposed on the continuous red line), the standard deviations seem to differ, and I am wondering whether there is a way to address this.
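One way to run the comparison described above is sketched here on simulated data, because the attached dataset is not reproduced in the thread. The variables y and x and the flag was_missing are hypothetical stand-ins (y plays the role of pr_attend). The sketch assumes wide style, where _1_y holds the first completed copy of y, and it is meant to be run from a do-file because of the /// line continuations.

```stata
* Simulated stand-in data (hypothetical; not the poster's dataset)
clear
set seed 12345
set obs 500
generate x = rnormal()
generate y = 1 + 0.8*x + rnormal()
replace y = . if runiform() < 0.3        // roughly 30% of y set to missing

* Impute y from x, keeping 5 completed copies in wide style
mi set wide
mi register imputed y
mi register regular x
mi impute regress y x, add(5) rseed(2022)

* y keeps the observed values; _1_y is the first completed copy
generate byte was_missing = missing(y)

* Two-sample Kolmogorov-Smirnov test: observed versus imputed values
ksmirnov _1_y, by(was_missing)

* Overlaid kernel densities of observed and imputed values
twoway (kdensity _1_y if !was_missing)                    ///
       (kdensity _1_y if was_missing, lpattern(dash)),    ///
       legend(order(1 "Observed" 2 "Imputed (m=1)"))
```

The same check can be repeated for the other completed copies (_2_y, _3_y, and so on) to see whether any difference is stable across imputations.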

Thanks in advance!
Attached Files

exa_data.dta (4.96 MB, 1 view)

Last edited by Stephen Okiya; 16 Sep 2022, 02:25.
Tags: None
daniel klein

Join Date: Mar 2014

Posts: 3860
#2

19 Sep 2022, 02:38

I am reluctant to open binary attachments. I can tell that this code

Code:

mi impute chained (regress) pr_attend, add(20) by(age)

is most likely not what you want. Here, the imputed values depend only on age. Thus, you are badly underestimating the relationship between pr_attend and all other predictors, which probably explains the underestimated variance. You want to include all variables in your analysis model in the imputation model.

Moreover, if you are using a linear model, then imputing only the outcome is not really necessary; if missingness of the outcome depends only on the predictors, then the linear model remains consistent. You might lose a bit of power with a complete-case analysis, but you are also likely to add unnecessary noise during imputation.
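The point about complete-case analysis can be illustrated with a small simulation. This is only a sketch under stated assumptions: x and y are hypothetical variables, the true slope is set to 0.5, and the outcome is made missing with a probability that depends on the predictor alone.

```stata
* Simulated data (hypothetical): true model is y = 2 + 0.5*x + e
clear
set seed 2022
set obs 5000
generate x = rnormal()
generate y = 2 + 0.5*x + rnormal()

* Missingness in the outcome depends only on the predictor x
replace y = . if runiform() < invlogit(x)

* Complete-case regression: rows with missing y are dropped automatically
regress y x
```

Under this missingness mechanism the estimated slope should stay close to 0.5 even though roughly half of the outcomes are missing; only the standard errors grow.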
Stephen Okiya

Join Date: Jul 2025

Posts: 280
#3

19 Sep 2022, 02:53

Thanks daniel klein for the great insights. The reasons for performing imputation by age are:
Conceptually, the estimates should be age specific

Enrollment patterns differ across ages

Is there a way I can perform the imputation with this in mind?

Thanks in advance
daniel klein

Join Date: Mar 2014

Posts: 3860
#4

19 Sep 2022, 03:09

Ignoring my point about imputation probably being unnecessary here, there is nothing wrong with imputing by age. It is, however, unlikely that you want to impute only by age. What you probably want is something like

Code:

mi impute regress pr_attend hhsize hh_head_no_educ clust_literacy num_children hh_member_formal_empl dep_ratio hh_orphan i.hv024 , add(20) by(age)

If the predictors also have missing values, you want chained equations (or a multivariate normal approach) to impute those missing values, too.
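A minimal sketch of chained equations with by(age), using simulated data because the real variable types are not known from the thread. Here y, x1 (continuous) and x2 (binary) are hypothetical stand-ins for pr_attend and two predictors with missing values; the choice of regress and logit as imputation methods is an assumption that would need to match the actual variable types.

```stata
* Simulated stand-in data (hypothetical; ages 6 to 13)
clear
set seed 2022
set obs 1000
generate age = 6 + floor(8*runiform())
generate x1 = rnormal()
generate byte x2 = runiform() < 0.4
generate y = 0.5 + 0.3*x1 - 0.2*x2 + 0.02*age + rnormal(0, 0.5)

* Introduce missing values in the outcome and in both predictors
replace y  = . if runiform() < 0.2
replace x1 = . if runiform() < 0.1
replace x2 = . if runiform() < 0.1

* Register every variable that needs imputing, then impute them jointly,
* separately within each age group
mi set wide
mi register imputed y x1 x2
mi impute chained (regress) y x1 (logit) x2, add(5) by(age) rseed(2022)

* Fit the analysis model on the imputed data, combining with Rubin's rules
mi estimate: regress y x1 x2
```

Fully observed predictors would go after an equals sign in the mi impute chained call, for example (regress) y x1 (logit) x2 = z, where z is a hypothetical complete variable.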
Stephen Okiya

Join Date: Jul 2025

Posts: 280
#5

19 Sep 2022, 03:15

daniel klein thanks a bunch for your insights. They are indeed helpful!!

Stephen Okiya

Join Date: Jul 2025
Posts: 280

19 Sep 2022, 03:30

daniel klein Could you kindly guide me on how to implement chained equations (or a multivariate normal approach)?

I believe that would resolve the error below

Code:

pr_attend: missing imputed values produced
    This may occur when imputation variables are used as independent variables or when independent variables contain missing values. You can
    specify option force if you wish to proceed anyway.
 -- above applies to age = 6

Thanks in advance!


Felix Bittmann

Join Date: Aug 2018

Posts: 711
#7

19 Sep 2022, 03:33

In general, I think the rule of thumb is that the imputation model should contain at least all variables of the analysis model, and possibly more (auxiliary variables). The error you receive usually occurs when a variable in the imputation model contains extended missing values (such as .a or .b), because only system missing values (.) are imputed; extended missing values never are. I would check all variables carefully and either recode the extended missing values to system missing or drop the affected cases from the data.
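The check suggested above could look roughly like this. It is a sketch on a tiny made-up dataset; x1 and x2 are hypothetical names, and the varlist in the loop would be replaced by the variables of the actual imputation model.

```stata
* Tiny made-up dataset with system (.) and extended (.a, .b) missing values
clear
input x1 x2
1  .a
2  3
.b 4
.  5
end

* misstable reports system missing (Obs=.) and extended missing (Obs>.) separately
misstable summarize x1 x2

* Count extended missing values, then recode them to system missing
foreach v of varlist x1 x2 {
    quietly count if `v' > .
    display "`v': " r(N) " extended missing value(s)"
    replace `v' = . if `v' > .
}
```

The condition `v' > . works because Stata sorts missing values as . < .a < ... < .z, so anything greater than . is an extended missing value.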

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Stephen Okiya

Join Date: Jul 2025

Posts: 280
#8

19 Sep 2022, 03:46

Felix Bittmann Thanks so much! A closer look reveals an issue with my code: none of the explanatory variables should be missing, since that information is available.
