
  • Imputation of missing data

    Hello all,

    I am looking for some guidance on how to impute categorical variables, both nominal and ordinal (up to 7 categories).
    Overall, fewer than 2% of the values in my dataset are missing. These variables are included in a principal component analysis to obtain a socioeconomic status score, which will be used as a covariate in my final model.

    1) Is there a threshold below which single imputation is preferred over multiple imputation? I imagine that when the share of missing data is very low, one can use single imputation, but I can't find any reference that puts a number on "a low proportion of missing values". Or is multiple imputation the only method used nowadays?

    2) First, I wanted to do a single imputation by randomly assigning values drawn from the observed distribution of the original variable (please see the commands below, where SESWall indicates the wall material):

    Code:
    * runiform() is the current name of the older uniform() function
    gen rand = runiform()
    
    tab SESWall, nolabel
                |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        171       17.34       17.34
              2 |        103       10.45       27.79
              3 |        117       11.87       39.66
              4 |        595       60.34      100.00
    ------------+-----------------------------------
          Total |        986      100.00
    
    gen drawwall = cond(rand <.17, 1, cond(rand<.28, 2, cond(rand<.40, 3, 4)))
    
    gen iSESWall=SESWall
    replace iSESWall=drawwall if iSESWall==.
    Is this a good option? I think it is similar to mean substitution.

    3) Using the multiple imputation command mi impute chained (mlogit) would allow me to do a multivariate imputation that includes the outcome variable (as recommended) and other auxiliary variables. Is multiple imputation with m=1 equivalent to single imputation? I would use only the imputed dataset m=1 (and not m=0, which holds the observed data).

    Many thanks,
    Carole

  • #2
    If there is very little missing data, any kind of imputation may not gain you much and may not be worth the trouble. But if you are going to impute at all, I personally would use multiple imputation. I would use listwise deletion before I used single imputation.

    Incidentally, Wisconsin has some very good materials on multiple imputation:

    http://www.ssc.wisc.edu/sscc/pubs/stata_mi_intro.htm
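
For reference, here is a minimal sketch of what a chained-equations setup along these lines could look like. It runs on simulated data: SESFloor, outcome, and age are hypothetical names, the 20 imputations and the seeds are arbitrary choices, and only SESWall is taken from the question above.

```stata
* Sketch only: simulated data; every variable except SESWall is hypothetical
clear
set seed 12345
set obs 1000
gen outcome = rnormal()
gen age = rnormal(40, 10)
* nominal variable with four categories (values 1 to 4)
gen SESWall = 1 + (runiform() < .8) + (runiform() < .7) + (runiform() < .6)
* ordinal variable with three categories, related to the outcome
gen SESFloor = 1 + (outcome + rnormal() > 0) + (outcome + rnormal() > 1)
* set roughly 2% of each categorical variable to missing
replace SESWall = . if runiform() < .02
replace SESFloor = . if runiform() < .02

* declare the mi structure and the role of each variable
mi set wide
mi register imputed SESWall SESFloor
mi register regular outcome age

* mlogit for the nominal variable, ologit for the ordinal one;
* the outcome and age enter every imputation model as predictors
mi impute chained (mlogit) SESWall (ologit) SESFloor = outcome age, add(20) rseed(2718)

* fit the analysis model in each imputed dataset and pool with Rubin's rules
mi estimate: regress outcome i.SESWall i.SESFloor age
```

With only one imputation there is no between-imputation variance to pool, so the standard errors could not reflect the uncertainty due to imputation; that is the practical difference from using several imputed datasets.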
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    WWW: https://www3.nd.edu/~rwilliam



    • #3
      This does not seem to be the main issue here, but I would think about whether a PCA really makes sense for categorical variables that are not ordered, no matter how you impute the missing values. It will also not necessarily be straightforward to run this kind of analysis with imputed data. Maybe an option would be to create the score based on complete cases and then impute the missing scores?
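
A rough sketch of that last idea, assuming simulated data and hypothetical names throughout (item1 to item4, sesscore, outcome, age); the continuous items, the 20 imputations, and the seeds are illustrative choices, not taken from the original data.

```stata
* Sketch only: simulated data; all variable names are hypothetical
clear
set seed 2024
set obs 1000
gen latent = rnormal()
gen outcome = 0.5*latent + rnormal()
gen age = rnormal(40, 10)
forvalues j = 1/4 {
    gen item`j' = latent + rnormal()
}
* set roughly 2% of two of the items to missing
replace item1 = . if runiform() < .02
replace item3 = . if runiform() < .02

* pca uses complete cases only; the predicted score is missing
* wherever at least one item is missing
pca item1-item4, components(1)
predict sesscore, score

* impute the score itself rather than the individual items
mi set wide
mi register imputed sesscore
mi register regular outcome age
mi impute regress sesscore = outcome age, add(20) rseed(99)

* final model with the imputed score as a covariate
mi estimate: regress outcome sesscore age
```

Because the score is continuous, a linear imputation model is enough here, and no PCA has to be run within the imputed datasets.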

      Best
      Daniel



      • #4
        Thank you both for your comments.

        Richard: I understand that imputation is a lot of trouble, but every case is important for my final analysis; I can't afford to reduce my sample size. The complete-case analysis will be done anyway.

        Daniel: You raise a good option, but I couldn't find any reference that recommends imputing the final score rather than the original individual variables. Moreover, I imagined that PCA would take into account the multivariate structure that single imputation cannot handle by definition. I'll have a look, especially because the score is a continuous variable.



        • #5
          If you use gsem, then given that you have less than 2% missing values, and going by the Stata manual, you might be able to fit a model without imputing any data at all:

          gsem’s method ML is sometimes able to use more observations in the presence of missing values than can sem’s method ML. Meanwhile, gsem does not provide the MLMV method provided by sem for explicitly handling missing values.
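
As a rough illustration only, a gsem model of this kind might look like the sketch below. The data are simulated and every name (item1 to item3, SES, outcome, age) is hypothetical; whether a single latent SES factor is a reasonable substitute for the PCA score is an assumption, not something established in this thread.

```stata
* Sketch only: simulated data; all variable names are hypothetical
clear
set seed 4321
set obs 1000
gen latent = rnormal()
gen age = rnormal(40, 10)
gen outcome = 0.5*latent + 0.02*age + rnormal()
* three ordinal indicators with three categories each
forvalues j = 1/3 {
    gen double z`j' = latent + rnormal()
    gen item`j' = 1 + (z`j' > -0.5) + (z`j' > 0.5)
}
* set roughly 2% of two of the indicators to missing
replace item1 = . if runiform() < .02
replace item2 = . if runiform() < .02

* latent SES measured by the ordinal items (ordered logit) and used
* as a predictor of the outcome; latent names must start with a capital.
* An observation with a missing item still contributes to the equations
* whose variables it has observed.
gsem (item1 item2 item3 <- SES, ologit) (outcome <- SES age)
```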
          Best regards,

          Marcos

