  • options for missing values

    Hi,

    I know that missing values in Stata can be replaced with the mean using a few simple lines of code.

    But I wonder whether the same can be done with an option added to the regression command, or with a user-written command.

    Thanks,
    Navid

  • #2
    I don't think so. I can't think of any command that includes an option to replace missing values with means. As you say, it is simple enough to code this directly first.
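
    For reference, a minimal sketch of coding it directly, assuming Stata's shipped auto dataset (where rep78 has some missing values) stands in for your data. The variable name rep78_filled is made up for illustration.

    ```stata
    * Illustration only: auto.data and rep78 are stand-ins for your own data
    sysuse auto, clear

    * Copy the variable so the original stays untouched
    generate rep78_filled = rep78

    * Store the mean of the observed values in r(mean)
    summarize rep78, meanonly

    * Fill the missing values with that mean
    replace rep78_filled = r(mean) if missing(rep78_filled)

    * The filled variable has more observations but a smaller standard deviation
    summarize rep78 rep78_filled
    ```

    The last command makes the cost visible: the mean is unchanged, but the spread of the filled variable shrinks.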

    But before doing that, think carefully. It is a strong assumption that the mean is even an unbiased predictor of the missing values and it is frequently untrue. Moreover, even if the mean is a good proxy for the missing values, using it fails to capture variation. This is particularly salient in regression analyses. So I would think twice, three times, and more before doing this in most situations. Look into multiple imputation, interpolation, or other approaches to the management of missing data.
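
    As a rough sketch of what multiple imputation looks like in Stata, again assuming the auto dataset as a stand-in: the choice of 20 imputations, the seed, and the linear imputation model for rep78 (which treats an ordinal variable as continuous) are all assumptions for illustration, not recommendations.

    ```stata
    * Illustration only: auto.dta stands in for your own data
    sysuse auto, clear

    * Declare the mi storage style and the variable to be imputed
    mi set wide
    mi register imputed rep78

    * Imputation model: includes the outcome (price) and the other predictors
    mi impute regress rep78 price mpg weight, add(20) rseed(12345)

    * Fit the analysis model in each imputed dataset and combine the results
    mi estimate: regress price rep78 mpg weight
    ```

    Unlike mean substitution, the combined standard errors reflect the uncertainty about the imputed values.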



    • #3
      Navid (as per FAQ, please note the preference on this forum for full real names. Just click on the Contact us button, bottom-right of this page, and re-register accordingly. Thanks):
      As Clyde warned, replacing missing values with the mean of the existing observations is, in general, a methodologically risky approach.
      If you have a substantial number of missing values (as far as I know, nobody has set a quantitative cut-off), that approach would, at best, reduce the variance in your data. That in turn affects the standard errors, t statistics, and p-values of your regression coefficients, making your estimates biased and potentially useless.
      Other seemingly easy methods, like last observation carried forward (LOCF) and next observation carried backwards (NOCB) are questionable as well for their methodological weaknesses.
      An interesting website on this topic is www.missingdata.org.uk, which is maintained by Jonathan Bartlett (London School of Hygiene & Tropical Medicine), whose posts appear on this forum from time to time.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        Clyde and Carlo make excellent comments and give good advice. Yet another good way to deal with missing data is to re-frame your model as a SEM (most regression models can be). Then you can use Full Information Maximum Likelihood (FIML) to get a model with unbiased estimates and proper standard errors, assuming MAR (Missing At Random). Setting up the SEM can seem like a hassle, but setting up a good model for Multiple Imputation can be a hassle as well.
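
        A minimal sketch of the idea, assuming the shipped auto dataset as a stand-in for your data. In Stata's -sem-, FIML is requested with the method(mlmv) option, which assumes joint normality of the observed variables as well as MAR.

        ```stata
        * Illustration only: auto.dta stands in for your own data
        sysuse auto, clear

        * Ordinary regression: drops every observation with a missing rep78
        regress price rep78 mpg weight

        * Same model written as a SEM and fitted by FIML:
        * observations with a missing rep78 still contribute to the likelihood
        sem (price <- rep78 mpg weight), method(mlmv)
        ```

        Comparing the number of observations reported by the two commands shows how many cases the regression discarded.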



        • #5
          What I'd really like is for regress and other commands to add a fiml option -- and for gsem to support fiml as well. I've never used fiml but I've heard several people say it is better than mi if you have software that supports it.
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            Well, I'd be surprised if FIML became a simple option: it takes an ML approach, while -regress- uses a closed-form solution. But making it available outside a SEM framework seems like a good idea, maybe as an -fimlregress- command, or as a fiml: prefix for a whole family of commands. And, to some degree, I trust MI more, since it takes in information from variables outside of the model.

            Anyway, MI or FIML, with interpolation under certain circumstances. No good reason for listwise deletion or mean substitution to be around. Listwise deletion is handy for convenience, but given the sensitivity tests needed to justify it, one might as well use a better method to begin with.



            • #7
              Response to Rich W (#5): (1) note that -mixed-, and many other multi-level models, use FIML (often as the default; see -h mixed-); (2) IIRC, you have previously cited Paul Allison about this; his main example is longitudinal; while I generally agree with his argument there (but see below), I think it does not generally hold in the cross-sectional multi-level situation; (3) even in the longitudinal case, there may be situations where MI is preferred, including (a) the use of "auxiliary" variables (variables that are not relevant to the final outcome but do help in predicting what the missing data should be); Allison argues that these could be included in the final model, but not all readers will accept a final model with predictor variables that have "high" p-values (and some journals won't accept this either); (b) in some cases one can weight the MI replications as a method of approximating the "not missing at random" situation (both MI and ML multi-level models assume "missing at random"); see, e.g., Carpenter, JR, Kenward, MG and White, IR (2007), "Sensitivity analysis after multiple imputation under missing at random: a weighting approach", _Statistical Methods in Medical Research_, 16: 259-275.





              • #8
                Originally posted by ben earnhart
                No good reason for listwise deletion or mean substitution to be around.
                I agree that there is no good reason for mean substitution, but I find listwise deletion preferable in many situations. All listwise deletion requires for unbiased estimates is that missingness is independent of the explained/dependent/left-hand-side/y-variable (Allison 2002, footnote 1), while other methods require additional assumptions. Especially in large datasets, listwise deletion is a reasonable default choice. (Having said that, next semester I will be teaching a course on missing data, where I will spend a lot of time on MI, EM, and FIML.)
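
                Listwise deletion is what -regress- does by default, so it needs no extra code. A small sketch for checking how many observations it costs, assuming the shipped auto dataset as a stand-in for your data:

                ```stata
                * Illustration only: auto.dta stands in for your own data
                sysuse auto, clear

                * Tabulate missing values among the model variables
                misstable summarize price rep78 mpg weight

                * -regress- silently drops any observation with a missing value
                regress price rep78 mpg weight
                display "complete cases used: " e(N)

                * Number of observations dropped because at least one variable is missing
                count if missing(price, rep78, mpg, weight)
                ```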

                Paul D. Allison (2002) Missing Data. Thousand Oaks: Sage.
                ---------------------------------
                Maarten L. Buis
                University of Konstanz
                Department of history and sociology
                box 40
                78457 Konstanz
                Germany
                http://www.maartenbuis.nl
                ---------------------------------



                • #9
                  Along the lines of Maarten's post, another interesting contribution on this topic by Paul Allison can be found at: http://www.statisticalhorizons.com/l...n-its-not-evil
                  Kind regards,
                  Carlo
                  (Stata 19.0)
