  • Multiple imputation using mi impute regress

    I have a question about using mi impute regress for multiple imputation. I am using survey data, and the outcome variable that I want to use for subsequent analysis contains missing data (c. 10% of the sample). As the outcome variable is univariate and continuous, I am using a linear regression model as the imputation model, as advised by the Stata "mi impute" help page, to obtain m=20 imputed values of my outcome variable (hence "mi impute regress").

    However, although the code runs properly and I obtain m=20 vectors of completed data for my outcome variable, I don't fully understand how this works: if the imputed outcome values stem from a linear regression based on other observed variables, how is it possible to obtain m=20 different vectors? If the imputed values are predicted values from a linear regression, I should only be able to get a single imputed vector for my outcome variable, as the prediction from a linear regression cannot give me several different outputs.

    Thank you for your help.

  • #2
    There are books written about this, but the place for you to start, I think, is the "Methods and formulas" section of the mi manual (mi.pdf); then go to some of the citations, especially the Rubin one.



    • #3
      They are not predicted values from a linear model in the sense of what you would get from using, say, the -predict- command.

      For concreteness, suppose that the variable being imputed is X and the variables from which it is being imputed are A, B, and C. First, a linear model is fit to the data:
      Code:
      X = _cons + _b[A]*A + _b[B]*B + _b[C]*C + error term
      If you were to use -predict-, it would give you, deterministically, the values of the linear predictor: _cons + _b[A]*A + _b[B]*B + _b[C]*C. Note the absence of the error term.

      What -mi impute- does is draw random values of the error term from a normal distribution with an appropriate standard deviation (based on the standard deviation of the residuals from the regression). It then adds these random error draws to the linear predictor. That is why you get different results in each of the 20 imputed data sets: the error terms are random draws and differ from one imputation to the next.
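      To make the mechanism concrete, here is a hypothetical Python sketch of stochastic regression imputation: fit a linear model on the complete cases, then fill each missing value with the linear predictor plus a fresh random error draw, repeated m=20 times. All names here (rng, design, imputations, etc.) are illustrative, and this is a simplification of what Stata actually does: -mi impute regress- is "proper" imputation in Rubin's sense, so it also perturbs the coefficients and residual variance between imputations, but the core intuition (random noise makes each completed vector different) is the same.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Toy data: outcome X depends on predictors A, B, C; ~10% of X is missing.
n = 500
A, B, C = rng.normal(size=(3, n))
X = 1.0 + 2.0 * A - 1.0 * B + 0.5 * C + rng.normal(scale=1.5, size=n)
missing = rng.random(n) < 0.10
X_obs = X.copy()
X_obs[missing] = np.nan

# Fit the linear model on the complete cases only.
obs = ~missing
design = np.column_stack([np.ones(n), A, B, C])
beta, *_ = np.linalg.lstsq(design[obs], X_obs[obs], rcond=None)
resid = X_obs[obs] - design[obs] @ beta
sigma = resid.std(ddof=design.shape[1])  # residual standard deviation

# Create m = 20 completed vectors: linear predictor + a fresh error draw
# for each missing entry, in each imputation.
m = 20
imputations = []
for _ in range(m):
    completed = X_obs.copy()
    noise = rng.normal(scale=sigma, size=missing.sum())
    completed[missing] = design[missing] @ beta + noise
    imputations.append(completed)
```

      Each of the 20 completed vectors agrees with the observed data wherever X was observed, and they differ from one another only in the imputed positions, purely because of the random error draws.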

      Added: Crossed with #2.

