  • Mi impute pmm nearest neighbor matching

    Hello,
    I am using the mi impute pmm command and I have a question regarding nearest neighbor matching.
    I want to fill in the missing data with the observation from one nearest neighbor, hence using the knn(1) option.
    However I have noticed that my outcomes vary each time I run the code, and I don't quite understand why there is randomness to the imputed values when I am almost certain that there must be only one observation with the nearest predicted value.
    Any help/advice would be very much appreciated.


    Code:
    foreach y in 2013 2014 2015 2016 2017 2018 2019{  
    use "data.dta",clear
    keep if year==`y'|year==`y'+1
    mi set fl
    mi register imputed lnhw
    mi xtset pid year
    mi impute pmm lnhw lnimphw age edudum1 edudum2 edudum3 marital_simp head i.jobhourt i.jobkind_simp i.jobarea_simp i.wsize regular sex region i.howwage if year==`y', replace knn(1) by(year) add(1) force
    save "impute_`y'.dta",replace
    }

  • #2
    Randomness in the imputed values is a core property of MI; it is essential for getting statistical inference right. With the pmm model, there are two sources of randomness: first, the linear prediction is obtained using randomly drawn parameters from the underlying regression model; second, the final imputed value is chosen randomly from the set of nearest-neighbors. Setting knn(1) (which used to be the default in earlier releases of Stata) removes the randomness from step 2 but not from step 1. Note that you do not want to completely remove the randomness from any step (see Paul Allison's discussion of the subject).

    There seem to be a couple of other potential problems with your code that I lack the time to address in detail. I will leave it at bullet points:

    - use set seed to get reproducible results
    - you register only one variable (lnhw) as imputed but then go on to list a series of additional variables; you do not want that
    - you use the force option; you do not want that
    - you only create one complete dataset; this will lead to incorrect inference (significance tests, p-values, confidence intervals) in the analyses because there is no variation in the (one) imputed value
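
    As a rough illustration, the bullet points above might translate into something like the following for a single year. This is an untested sketch: the seed value (12345), the number of imputations (20), the shortened predictor list, and the example analysis model are all arbitrary assumptions, and it assumes that age, sex, and region have no missing values.

    ```stata
    * Sketch only: seed, number of imputations, and predictor list are assumptions
    use "data.dta", clear
    keep if year == 2013 | year == 2014

    mi set flong
    mi register imputed lnhw

    * rseed() makes the random draws reproducible across runs;
    * add(20) creates 20 completed datasets instead of one;
    * no force option, because the predictors are assumed complete
    mi impute pmm lnhw age sex region if year == 2013, knn(1) add(20) rseed(12345)

    * Analyses go through mi estimate, which pools the results
    * across the imputations
    mi estimate: regress lnhw age sex region if year == 2013
    ```

    With the seed fixed, rerunning the code should give identical imputed values; changing the seed changes them, even with knn(1), because the regression parameters are still drawn at random.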


    • #3
      In reply to daniel klein's post (#2):

      Thank you so much for your helpful comment. I had completely missed step 1, but the idea became much clearer after reading the article you linked.
      I have edited my code. But can you tell me why using the force option will lead to misleading results? The variables I used in the regression model have some missing values here and there, so using the force option seemed to be the only choice...


      • #4
        Originally posted by Yujin Kwon
        But can you tell me why using the force option will lead to misleading results? The variables I used in the regression model have some missing values here and there, so using the force option seemed to be the only choice...
        You have answered your first question yourself: your predictors have missing values which causes missing imputed values. Instead of forcing Stata to ignore this, why not impute the missing values in the predictor variables as well?
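
        One way to do this is mi impute chained, which imputes several variables in a single run. The following is an untested sketch. Purely for illustration, it assumes that age and edudum1 are the predictors with missing values and that sex and region are complete. The seed and number of imputations are arbitrary, and pmm is used for every variable only to keep the example short; another method may suit some variables better.

        ```stata
        * Sketch only: which predictors have missing values is an assumption
        use "data.dta", clear
        keep if year == 2013

        mi set flong
        * Register every variable that is to be imputed, not only lnhw
        mi register imputed lnhw age edudum1

        * Impute the outcome and the incomplete predictors together;
        * complete predictors go to the right of the equals sign
        mi impute chained (pmm, knn(1)) lnhw age edudum1 = sex region, add(20) rseed(12345)
        ```

        Because the incomplete predictors are imputed along with lnhw, the force option should no longer be needed.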
