
  • Missing values imputation - conclusions

    Hi, Stata community, I hope you can help with advice or point me to the relevant literature.
    In my randomised study I took 4 measurements of functional tests. There are, of course, some missing values: the whole population is 370 persons, and many of them lack one or more values. I fitted a mixed model to estimate the effect of my intervention in two ways: a complete case analysis and an analysis after worst value carried forward imputation.
    For some functional tests I get a significant improvement with worst value imputation that becomes non-significant in the complete cases, and vice versa for others: non-significant in the complete cases and significant with worst value imputation.
    How do I draw the right conclusion: should I use the complete cases or the worst value imputed dataset?

    Thanks a lot in advance for your replies.

    Sincerely, Natallia

  • #2
    Probably neither. Worst value imputation, and other forms of single imputation, are deprecated as ways of dealing with missing data. They worsen the bias that is associated with missing data. They can be useful as a sensitivity analysis, but should never be the basis for drawing conclusions.

    You need an understanding of why you have missing data. If the missing data are, in effect, acts of God, then you have data that is missing completely at random, and just analyzing the complete cases is fine. But if the missing observations are anything but a purely random sample of the data, then just analyzing complete cases will lead to bias in analyses. If the missingness is such that the missing values can be estimated without bias from the observed data, then your data may be missing at random, and you could use multiple imputation to get unbiased analysis estimates. Multiple imputation is a fairly complicated technique with a pretty steep learning curve, so this is probably not something you should do on your own for a real-world project if you are not familiar with it.

    If the data are missing not at random, then you are likely facing an intractable situation unless you have a good statistical model of the missingness process itself (and that is seldom the case). So in that case, you would probably just use the complete cases analysis, acknowledge the intractability of the missing data as a serious study limitation, and do some sensitivity analyses (of which worst value is one possibility) to set some sort of bounds on the amount of error that results from the missingness.
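
    As a starting point for understanding why values are missing, the missingness can be described and related to observed variables. The following is only a minimal sketch: it assumes a hypothetical wide-format dataset with one row per person and variables treat (0/1 randomised arm), age, and score1 to score4 (the four measurements). None of these names come from the original post.

    ```stata
    * Hypothetical variables: treat (0/1), age, score1-score4 (wide format)

    * How many values are missing, and in which combinations?
    misstable summarize score1 score2 score3 score4
    misstable patterns score1 score2 score3 score4, frequency

    * Is missingness at the last measurement related to observed variables?
    generate byte miss4 = missing(score4)
    logit miss4 i.treat c.age c.score1
    ```

    If treat, age or score1 predict miss4, the data are unlikely to be missing completely at random. The reverse does not hold, and no check on the observed data can distinguish missing at random from missing not at random.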


    • #3
      both approaches have their downsides:
      - usually, complete cases are not a random subsample of the whole sample: hence, your coefficients, standard errors and the like are in all likelihood biased (up or down; it is practically impossible to predict which);
      - worst-case imputation. I guess that you mean the lowest value among those recorded in the different waves of data for the same person, when that person has both missing and observed values. Be that as it may, the methodological problem here is that there's no randomness in your imputation.
      In brief, the best approach for dealing with missing data is:
      - check the mechanism underlying the missingness (MCAR; MAR; MNAR);
      - act according to the missingness mechanism.
      I would recommend that you take a look at the -mi- entries in the Stata .pdf manual and at my favourite textbook on this topic: S. van Buuren (2018). Flexible Imputation of Missing Data. Second Edition. Boca Raton, FL: Chapman & Hall/CRC.
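
      For orientation, a bare-bones -mi- workflow might look like the sketch below. It assumes the data are plausibly missing at random and uses hypothetical variable names (treat, age, score1 to score4 in wide format, each score with some missing values). The number of imputations, the seed and the analysis model are arbitrary illustrative choices, not recommendations.

      ```stata
      * Hypothetical variables: treat (0/1), age, score1-score4 (wide format)
      mi set wide
      mi register imputed score1 score2 score3 score4
      mi register regular treat age

      * Chained equations: each score is imputed from the other scores,
      * treatment arm and age; 20 imputations and the seed are arbitrary
      mi impute chained (regress) score1 score2 score3 score4 = i.treat age, ///
          add(20) rseed(12345)

      * Fit the analysis model in every completed dataset and pool the results
      mi estimate: regress score4 i.treat score1
      ```

      Unlike single imputation, the pooled standard errors from -mi estimate- reflect the uncertainty about the imputed values.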

      PS: Glad to see that Clyde and I crossed posts in cyberspace!
      Kind regards,
      (Stata 15.1 SE)


      • #4
        I am currently working on a team that's doing a systematic review of dementia treatments. We specified that outcomes for cognition and function at 24 weeks were the primary time point we were interested in. So far, all the studies I've reviewed used last observation carried forward (LOCF), generally in an ANOVA framework. Some also presented a complete case analysis as a sensitivity analysis. I believe that most studies had attrition over 20% by 24 weeks, although at least half also made observations before 24 weeks (e.g. at 12 weeks). I believe we require attrition to be under 20% to unconditionally consider the study to be high quality (actually, low risk of bias is the preferred term). With attrition between 20% and 30%, we required multiple imputation accounting for at least some demographic covariates plus earlier observations. This would not have been perfect, but it would have been better than what we saw.

        A minority of studies stated that they used linear mixed effects modeling to handle the missing data. I believe that linear mixed models are subject to the missing at random assumption (i.e. missingness is random conditional on the covariates and observed outcomes in the model). Analyzing complete cases invokes the much more stringent missing completely at random assumption. Both are unrealistic, but properly done linear mixed modeling should be better than complete cases or LOCF. I believe the paper by Molenberghs et al. (2004) explains the issue. However, to use a mixed model properly in this fashion, you would not want to impute the data via LOCF, and I'm pretty sure that you would want to add other covariates you measured. I emphasize: this shouldn't relieve you of the need to conduct a sensitivity analysis for data missing not at random, but if I understand correctly, this is definitely better than LOCF. Do others agree, or am I misunderstanding something?
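
        A minimal sketch of such a mixed model on the observed data only, with no LOCF or worst value imputation beforehand. The variable names (id, treat, score1 to score4 in wide format) are hypothetical, and the random-intercept structure is just one simple choice.

        ```stata
        * Hypothetical variables: id, treat (0/1), score1-score4 (wide format)
        reshape long score, i(id) j(visit)

        * Random intercept per person. Rows with a missing score are dropped,
        * but the same person's observed visits still contribute.
        mixed score i.treat##i.visit || id:, reml

        * Estimated treatment effect at each visit
        margins visit, dydx(treat)
        ```

        Further covariates thought to predict dropout would be added to the fixed part of the model, in line with the point above about including other measured covariates.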

        Meanwhile, it's easy to say that we should minimize missingness, but I will nonetheless say it. Also, I wonder why people don't report the treatment effect at each time point measured, using all cases available at that time point. We might decide that attrition is too great at 24 weeks to consider those data reliable, but we would be willing to accept estimated treatment effects at 12 weeks if missingness is low enough that even a complete case analysis won't bias the results too much (say, under 10% attrition at least, preferably under 5%).
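
        Reporting the effect at each time point with all available cases could be sketched roughly as follows, again with hypothetical variable names (treat, score2 to score4 in wide format) and an unadjusted comparison chosen only for brevity.

        ```stata
        * Hypothetical variables: treat (0/1), score2-score4 (wide format)
        foreach v of varlist score2 score3 score4 {
            display as text _newline "Outcome: `v'"
            * Number of available cases per arm at this time point
            tabulate treat if !missing(`v')
            * Treatment effect among the cases observed at this time point
            regress `v' i.treat
        }
        ```

        The tabulation shows how many persons per arm remain at each time point, so attrition can be judged alongside each estimate.
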
        Please use the code delimiters to show code and results - use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

        Please use the command -dataex- to show a representative sample of data; it is installed already if you have Stata 14.2 or 15.1, else you can install it by typing

        ssc install dataex


        • #5
          Thanks a lot for all your valuable advice! Kindly, Natallia