options for missing values

Navid Asgari

Join Date: Jul 2025

Posts: 30
#1

10 Jan 2015, 18:14

Hi,

I know that missing values in Stata can be replaced with the variable's mean using a few simple lines of code.

But I wonder whether the same can be done through an option added to a regression command, or through a user-written command.

Thanks,
Navid
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

10 Jan 2015, 21:04

I don't think so. I can't think of any command that includes an option to replace missing values with means. As you say, it is simple enough to code this directly first.

But before doing that, think carefully. It is a strong assumption that the mean is even an unbiased predictor of the missing values and it is frequently untrue. Moreover, even if the mean is a good proxy for the missing values, using it fails to capture variation. This is particularly salient in regression analyses. So I would think twice, three times, and more before doing this in most situations. Look into multiple imputation, interpolation, or other approaches to the management of missing data.
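
To make the point about lost variation concrete, here is a minimal sketch. It assumes the auto dataset that ships with Stata as a stand-in (rep78 has some missing values); the variable name rep78_fill and the scalar names are made up for illustration.

```stata
* Sketch: mean substitution leaves the mean unchanged but shrinks the spread.
sysuse auto, clear

* Summary of the observed (non-missing) values only
summarize rep78
scalar m_obs  = r(mean)
scalar sd_obs = r(sd)

* Copy the variable and fill the gaps with the observed mean
generate rep78_fill = rep78
replace rep78_fill = m_obs if missing(rep78_fill)

* Same sum of squared deviations, larger N, hence a smaller SD
summarize rep78_fill
display "SD before: " sd_obs "   SD after: " r(sd)
```

The filled-in values sit exactly at the mean, so they add observations without adding any spread; a regression on rep78_fill treats them as if they were real data.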
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#3

11 Jan 2015, 05:22

Navid (as per the FAQ, please note the preference on this forum for full real names. Just click on the Contact us button, bottom-right of this page, and re-register accordingly. Thanks):
as Clyde warned you, replacing missing values with the mean of the existing observations is, in general, a methodologically risky approach.
As is easy to see, if you have a substantial number of missing values (although, as far as I know, nobody has set a quantitative cut-off), that approach would, at best, reduce the variance across your data. That in turn affects the standard errors, t-statistics and p-values of your regression coefficients, making your estimates biased and potentially of little use.
Other seemingly easy methods, like last observation carried forward (LOCF) and next observation carried backwards (NOCB) are questionable as well for their methodological weaknesses.
An interesting website on this topic is www.missingdata.org.uk, which is maintained by Jonathan Bartlett (London School of Hygiene & Tropical Medicine), whose posts appear on this forum from time to time.

Kind regards,
Carlo
(Stata 19.0)
ben earnhart

Join Date: May 2014

Posts: 1027
#4

11 Jan 2015, 13:56

Clyde and Carlo make excellent comments and give good advice. Yet another good way to deal with missing data is to re-frame your model as an SEM (most regression models can be). Then you can use Full Information Maximum Likelihood (FIML) to get a model with unbiased estimates and proper standard errors, assuming MAR (Missing At Random). Setting up the SEM can seem like a hassle, but setting up a good model for Multiple Imputation can be a hassle as well.
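
A minimal sketch of what that re-framing can look like, assuming the auto dataset that ships with Stata and an arbitrary illustrative model (price on mpg and rep78). In -sem-, FIML is requested with method(mlmv), which assumes joint normality of the observed variables in addition to MAR.

```stata
sysuse auto, clear

* Ordinary regression: cars with missing rep78 are dropped (listwise deletion)
regress price mpg rep78
scalar n_listwise = e(N)

* The same linear model written as an SEM and fit by FIML,
* so cars with missing rep78 still contribute what they have
sem (price <- mpg rep78), method(mlmv)
display "N listwise: " n_listwise "   N with FIML: " e(N)
```

The path specification (price <- mpg rep78) mirrors the -regress- varlist, so for a plain linear model the extra setup is small.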
Richard Williams

Join Date: Apr 2014

Posts: 4992
#5

11 Jan 2015, 16:40

What I'd really like is for regress and other commands to add a fiml option -- and for gsem to support fiml as well. I've never used fiml but I've heard several people say it is better than mi if you have software that supports it.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
ben earnhart

Join Date: May 2014

Posts: 1027
#6

11 Jan 2015, 17:53

Well, I'd be surprised if FIML became a simple option -- it takes an ML approach, while -regress- uses a closed-form solution. But making it available outside an SEM framework seems like a good idea, maybe as an -fimlregress- command, or as a fiml: prefix for a whole family of commands. And, to some degree, I trust MI more, since it takes in information from variables outside of the model.
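
A rough sketch of how MI can draw on variables outside the analysis model, again assuming the auto dataset as a stand-in. The choice of weight and foreign as extra predictors, the 20 imputations and the seed are arbitrary illustrative choices, not recommendations.

```stata
sysuse auto, clear

* Declare the mi data style and the variable to be imputed
mi set wide
mi register imputed rep78

* Imputation model: includes the analysis variables (price, mpg) plus
* weight and foreign, which do not appear in the analysis model below
mi impute regress rep78 price mpg weight foreign, add(20) rseed(12345)

* Analysis model: fit on each imputed dataset and combine with Rubin's rules
mi estimate: regress price mpg rep78
```

Here -mi impute regress- treats rep78 as continuous purely to keep the sketch short; a real imputation model would need to suit the variable's type.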

Anyway, MI or FIML, with interpolation under certain circumstances. No good reason for listwise deletion or mean substitution to be around. Listwise deletion is handy for convenience, but given the sensitivity tests needed to justify it, one might as well use a better method to begin with.
Rich Goldstein

Join Date: Mar 2014

Posts: 4463
#7

12 Jan 2015, 00:37

Response to Rich W (#5):
(1) Note that -mixed-, and many other multi-level models, use FIML (often as the default; see -h mixed-).
(2) IIRC, you have previously cited Paul Allison about this. His main example is longitudinal; while I generally agree with his argument there (but see below), I think it does not generally hold in the cross-sectional multi-level situation.
(3) Even in the longitudinal case, there may be situations where MI is preferred, including:
(a) the use of "auxiliary" variables (variables that are not relevant to the final outcome but do help in predicting what the missing data should be); Allison argues that these could be included in the final model, but not all readers will accept a final model with predictor variables that have "high" p-values (and some journals won't accept this either);
(b) in some cases one can weight the MI replications as a method of approximating the "not missing at random" situation (both MI and ML multi-level models assume "missing at random"); see, e.g., Carpenter, JR, Kenward, MG and White, IR (2007), "Sensitivity analysis after multiple imputation under missing at random: a weighting approach", _Statistical Methods in Medical Research_, 16: 259-275.
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#8

12 Jan 2015, 01:32

Originally posted by ben earnhart:

No good reason for listwise deletion or mean substitution to be around.

I agree that there is no good reason for mean substitution, but I find listwise deletion preferable in many situations. All listwise deletion requires for unbiased estimates is that missingness is independent of the explained/dependent/left-hand-side/y-variable (Allison 2002, footnote 1), while other methods require additional assumptions. Especially in large datasets, listwise deletion is a reasonable default choice. (Having said that, next semester I will be teaching a course on missing data, where I will spend a lot of time on MI, EM, and FIML.)

Paul D. Allison (2002) Missing Data. Thousand Oaks: Sage.
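
For reference, listwise deletion is what -regress- does by default, so it is worth checking how many observations it costs. A minimal sketch, assuming the auto dataset that ships with Stata and an arbitrary illustrative model:

```stata
sysuse auto, clear

* Where are the missing values among the model variables?
misstable summarize price mpg rep78

* regress silently drops any observation with a missing model variable
regress price mpg rep78
display "Estimation sample: " e(N) " of " _N " observations"

* e(sample) marks the observations actually used
count if !e(sample)
```

This only reports how much data is lost; whether missingness is independent of the y-variable is a substantive judgment that the counts alone cannot settle.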

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#9

12 Jan 2015, 02:05

Following Maarten's lines, another interesting contribution on this topic written by Paul Allison is reported at: http://www.statisticalhorizons.com/l...n-its-not-evil

Kind regards,
Carlo
(Stata 19.0)
