
  • Multiple imputation (MICE) and missing values for explanatory scale variables

    I have some explanatory scale variables (e.g. with possible values 0 through 10).
    They are integers with a limited range.
    What is the best way to handle these in the imputation?

    I started by setting it up with regress, treating the variable as continuous.
    Then I tried intreg (censored data) because I found a 2011/2012 presentation doing this.
    And now I've found out that there is truncreg (truncated data) as well.
    It is especially the last two that I can't decide between.

    I've searched the net without luck.
    Obviously, I'm looking in the wrong places.
    Does anyone have an answer?

    Another question is whether there exists a good book on chained multiple imputation (MICE).

    Thank you very much
    Kind regards

    nhb (Moved to Stata 15.1)

  • #2
    Just to add to the options (rather than reduce them): I sometimes like pmm for such variables. It does a regression, computes predicted values, and uses as imputed value a random observed value from the k observations that are closest with respect to the predicted value. Since it uses actually observed values, it automatically retains the range and any discreteness of that variable.
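
    A minimal sketch of what this could look like in Stata. It assumes the auto dataset that ships with Stata as a stand-in, since its rep78 is an integer 1 to 5 scale with a few missing values. The predictors (mpg, weight), knn(5), add(20) and the seed are illustrative assumptions, not recommendations.

    ```stata
    * Stand-in data: rep78 is an integer scale (1-5) with some missing values
    sysuse auto, clear
    mi set wide
    mi register imputed rep78

    * PMM: fit a linear regression, then draw each imputed value at random
    * from the 5 observed cases whose predictions are closest
    mi impute pmm rep78 mpg weight, knn(5) add(20) rseed(12345)

    * Inspect the first imputed dataset: only observed scale values should appear
    mi xeq 1: tabulate rep78
    ```

    Because the donors are observed cases, the imputed values should stay on the original integer scale.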
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz


    • #3
      The variables in question sound like Likert items, and I'd normally default to ordinal logit as the imputation model. Most likely either ordinal logit or PMM will be regarded as valid. Some readers may not have heard of PMM (my main collaborator hadn't), but you could maybe explain that it's a bit like propensity score matching.
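
      As a rough sketch of the ordinal logit option, again assuming Stata's bundled auto dataset as a stand-in (rep78 is an integer 1 to 5 item with missing values). The predictors, add(20) and the seed are illustrative assumptions only.

      ```stata
      * Stand-in data: treat rep78 as an ordered categorical item
      sysuse auto, clear
      mi set wide
      mi register imputed rep78

      * Ordinal logit imputation model: imputed values are drawn from the
      * predicted category probabilities, so they stay on the observed categories
      mi impute ologit rep78 mpg weight, add(20) rseed(12345)

      * Inspect the first imputed dataset
      mi xeq 1: tabulate rep78
      ```

      With several incomplete variables, the same method can be named per variable inside mi impute chained.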

      Edit: if you use PMM, please be aware that the default is to match to the nearest neighbor. Paul Allison argues that this is a flawed default, and suggests that we reset the default to k = 5 (i.e. in each draw, match randomly to one of the nearest 5 neighbors).

      This is just me thinking out loud, so this is optional reading: I'm aware that the Stata manual describes truncated regression as appropriate for imputing variables that are limited in range. I've always found this a bit odd, as it doesn't correspond to the real-life case for truncated regression (that is, when observations above/below a certain value exist in real life, but are not captured in your sample at all). But it is what the manual says, and that part of the manual was written by real statisticians (as opposed to applied statisticians like me). Nonetheless, truncated regression would definitely not retain the discreteness of the data.
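
      To see the loss of discreteness in practice, here is a small sketch under stated assumptions: Stata's bundled auto dataset as a stand-in, rep78 as the 1 to 5 scale, and limits of 0 and 6 placed just outside the observed range. The limits, predictors and seed are all illustrative choices.

      ```stata
      * Stand-in data: rep78 is an integer scale (1-5) with some missing values
      sysuse auto, clear
      mi set wide
      mi register imputed rep78

      * Truncated regression imputation: draws are restricted to lie between
      * the lower limit ll() and the upper limit ul()
      mi impute truncreg rep78 mpg weight, ll(0) ul(6) add(5) rseed(12345)

      * In wide style, _1_rep78 holds the first imputation; list the values
      * filled in for the originally missing cases
      list make _1_rep78 if missing(rep78)
      ```

      The listed values should respect the limits but will generally not be integers, so some rounding or a different method is needed if whole-number responses matter.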

      In principle, interval regression is for cases where the real dependent variable is continuous, but you observe something coded in an interval range (e.g. income from $30k to $40k). When we perform exploratory factor analysis on a bunch of Likert items, best practice is to use the polychoric correlation matrix, which assumes that the observed responses actually stem from a set of normal, latent variables (that are distributed multivariate normal). Taken in that context, interval regression doesn't sound absurd either.
      Last edited by Weiwen Ng; 07 Nov 2018, 06:38.


      • #4
        Thank you both very much for your answers.
        Kind regards

        nhb (Moved to Stata 15.1)