Mi impute pmm nearest neighbor matching

Yujin Kwon

Join Date: Feb 2021

Posts: 2
#1


23 Feb 2021, 21:39

Hello,
I am using the mi impute pmm command and I have a question regarding nearest neighbor matching.
I want to fill in the missing data with the observation from the single nearest neighbor, hence the knn(1) option.
However, I have noticed that my results vary each time I run the code, and I don't understand why the imputed values are random when I am almost certain that there can be only one observation with the nearest predicted value.
Any help/advice would be very much appreciated.

Code:

foreach y in 2013 2014 2015 2016 2017 2018 2019 {
    use "data.dta", clear
    keep if year==`y' | year==`y'+1
    mi set fl
    mi register imputed lnhw
    mi xtset pid year
    mi impute pmm lnhw lnimphw age edudum1 edudum2 edudum3 marital_simp head ///
        i.jobhourt i.jobkind_simp i.jobarea_simp i.wsize regular sex region ///
        i.howwage if year==`y', replace knn(1) by(year) add(1) force
    save "impute_`y'.dta", replace
}
daniel klein

Join Date: Mar 2014

Posts: 3860
#2

24 Feb 2021, 02:53

Randomness in the imputed values is a core property of MI; it is essential for getting statistical inference right. With the pmm model, there are two sources of randomness. First, the linear prediction is obtained using parameters drawn randomly from the underlying regression model. Second, the final imputed value is chosen randomly from the set of nearest neighbors. Setting knn(1) (which used to be the default in earlier releases of Stata) removes the randomness from step 2 but not from step 1. Note that you do not want to completely remove the randomness from any step (see Paul Allison's discussion of the subject).
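
To see the first source of randomness at work, here is a minimal sketch using Stata's shipped auto data instead of the original panel. The variable was_missing and the two seed values are made up for this illustration. Both imputations use knn(1), so any difference between them can only come from the random draw of the regression parameters.

```stata
* Sketch only: auto data stands in for the poster's data
sysuse auto, clear
generate byte was_missing = missing(rep78)   // flag the cases to be imputed
mi set flong
mi register imputed rep78
mi register regular was_missing

* Two single imputations with knn(1); only the random-number seed differs
mi impute pmm rep78 mpg weight, add(1) knn(1) rseed(101)
mi impute pmm rep78 mpg weight, add(1) knn(1) rseed(202)

* Show the imputed values in m=1 and m=2 for the originally missing cases
mi xeq 1 2: list make rep78 if was_missing
```

The two listings will typically not agree on every observation, even though each imputed value is taken from a single nearest donor.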

There seem to be a couple of other potential problems with your code that I lack the time to address in detail. I will leave it at bullet points:

- use set seed to get reproducible results
- you only register two variables as imputed but then go on and list a series of additional variables; you do not want that
- you use the force option; you do not want that
- you only create one complete dataset; this will lead to incorrect inference (significance tests, p-values, confidence intervals) in the analyses because there is no variation in the (one) imputed value
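
The points about the seed and the number of imputations can be sketched as follows, again with the auto data as a stand-in. The seed value, add(20), and knn(5) are arbitrary choices for illustration, not recommendations for the original analysis.

```stata
* Sketch only: auto data stands in for the poster's data
sysuse auto, clear
mi set flong
mi register imputed rep78
mi register regular mpg weight price

* Setting the seed once, up front, makes the whole run reproducible
set seed 20210224

* Several imputations instead of one, and no force option;
* knn(5) keeps some randomness in the donor-selection step
mi impute pmm rep78 mpg weight, add(20) knn(5)

* Fit the analysis model in every imputation and pool with Rubin's rules
mi estimate: regress price rep78 mpg weight
```

Because mi estimate combines the results across all 20 completed datasets, the standard errors reflect the uncertainty about the imputed values, which a single completed dataset cannot do.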
Yujin Kwon

Join Date: Feb 2021

Posts: 2
#3

25 Feb 2021, 22:18

Thank you so much for your helpful comment. I had completely missed step 1, but the idea became much clearer after reading the article you linked.
I have edited my code, but can you tell me why using the force option will lead to misleading results? The variables I used in the regression model have some missing values here and there, so using the force option seemed to be the only choice...
daniel klein

Join Date: Mar 2014

Posts: 3860
#4

26 Feb 2021, 01:21

Originally posted by Yujin Kwon

But can you tell me why using the force option will lead to misleading results? The variables I used in the regression model have some missing values here and there, so using the force option seemed to be the only choice...

You have answered your first question yourself: your predictors have missing values which causes missing imputed values. Instead of forcing Stata to ignore this, why not impute the missing values in the predictor variables as well?
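
One way to impute the predictors along with the outcome is mi impute chained. The following is a minimal sketch on the auto data, in which some mpg values are blanked out artificially so that a predictor has missing values too. The seed, add(10), and knn(5) are arbitrary illustrative choices.

```stata
* Sketch only: auto data stands in for the poster's data
sysuse auto, clear

* Artificially create missing values in a predictor (illustration only)
replace mpg = . in 1/6

mi set flong
* Every variable with missing values that should be filled in is registered
mi register imputed rep78 mpg
mi register regular weight price

set seed 20210226

* Chained equations: each incomplete variable is imputed by pmm,
* conditional on the other incomplete variable and the complete ones
mi impute chained (pmm, knn(5)) rep78 mpg = weight price, add(10)
```

With both variables imputed, no imputed value is left missing, so there is nothing for the force option to override.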