Missing data, multiple imputation alternatives?

William Brewer

Join Date: Mar 2018

Posts: 10
#1

16 Mar 2018, 05:54

I would greatly appreciate it if you could give me some quick advice about missing values.

A bit of context: I have a panel dataset with 19 variables and approximately 250k observations.

Less than 10% of the values in my dataset are missing, but because Stata drops every observation that has a missing value in any model variable, the regression uses only 20% of the observations. I have written code for multiple imputation, which seems like the best option; however, our university computers simply aren’t powerful enough, and the computation would take weeks. Having looked through the alternatives, substituting the mean (perhaps with a dummy variable indicating missing data) seems the best remaining solution, but it has been criticised for artificially decreasing the standard errors and therefore leading to invalid inference.

Are there any alternatives (or simplifications) to multiple imputation which will give unbiased estimates for the missing data without adversely affecting the variance?

Many thanks,

WB
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

16 Mar 2018, 06:10

I gather the main concern here is identifying the pattern of missing data (MCAR, MAR, MNAR).
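A minimal sketch of how one might start looking at this in Stata. It uses the nlswork example dataset as a stand-in (an assumption, since no sample data was posted), so the variable names are not those of the original dataset. Note that misstable only describes which values are missing; the logit at the end is a rough check that can give evidence against MCAR but cannot tell MAR from MNAR.

Code:

```stata
* Stand-in data: NLS panel shipped with Stata (not the poster's data)
webuse nlswork, clear

* How many values are missing in each variable?
misstable summarize ln_wage ttl_exp tenure union occ_code

* Which variables tend to be missing together?
misstable patterns ln_wage ttl_exp tenure union occ_code, frequency

* Rough check: is missingness in union related to observed variables?
generate byte miss_union = missing(union)
logit miss_union ln_wage ttl_exp i.race
```

If the observed variables clearly predict missingness, treating the data as MCAR is hard to defend.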

That said, I don't know whether there is an "alternative strategy" that provides "unbiased estimates" for all patterns of missing data (MI is not capable of that either), but using previous or following nonmissing values may be helpful.

There is a FAQ on this, here.
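For what it's worth, carrying the last observed value forward within each panel unit can be sketched as below. This again assumes the nlswork example data as a stand-in, with idcode as the person identifier and year as the time variable; union_cf is a hypothetical name for the filled-in copy. Whether carrying values forward is defensible depends entirely on the variable.

Code:

```stata
* Stand-in data: NLS panel shipped with Stata (not the poster's data)
webuse nlswork, clear

* Work on a copy so the original variable stays untouched
bysort idcode (year): generate union_cf = union

* Fill each missing value with the previous value for the same person;
* the first observation of a person is never filled from someone else
by idcode: replace union_cf = union_cf[_n-1] if missing(union_cf)

* Compare missing counts before and after
count if missing(union)
count if missing(union_cf)
```

Because replace works through the observations in order, a run of consecutive missing values is filled from the last nonmissing value before the run.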

Best regards,

Marcos
William Brewer

Join Date: Mar 2018

Posts: 10
#3

16 Mar 2018, 06:19

That is a bit of a tricky one: most of the missing data is MCAR, but there are one or two variables which aren't, such as a dummy variable which is only present every other time period.

Copying over previous observations may solve some of the problem, but it would often not be appropriate (e.g. it could copy one person's characteristics to another person).

My model is:

ln(hourly wages) = b0 + b1*exp + b2*exp^2 + b3*occupation + ... + u
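As a rough sketch of how a model like this is fitted in Stata, and of how to see how many observations survive once rows with missing values are dropped, the following uses the nlswork example dataset as a stand-in (an assumption: ln_wage, ttl_exp and occ_code only approximate the variables above, and tenure and union are added purely to show the effect of missing covariates).

Code:

```stata
* Stand-in data: NLS panel shipped with Stata (not my actual data)
webuse nlswork, clear

* Experience enters linearly and squared; occupation as a set of dummies
regress ln_wage c.ttl_exp##c.ttl_exp i.occ_code tenure union

* e(N) is the estimation sample after listwise deletion, _N the full data
display "Used " e(N) " of " _N " observations"
```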
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

16 Mar 2018, 12:06

I think that full information maximum likelihood estimation, which is available in Stata's linear sem command:

Code:

sem (wages <- x1 x2 x3 ...), method(mlmv)

is another option for missing covariates, but it is also very processor intensive. That and MI are the only two options I know of to get (asymptotically) unbiased estimates from data missing at random.
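For comparison, a minimal MI sketch with chained equations is shown below. It assumes the nlswork example data as a stand-in, with union and tenure as the incomplete variables and fully observed predictors; the number of imputations and the seed are arbitrary choices for illustration. It ignores the panel structure when imputing, so treat it as a starting point rather than a recommended specification.

Code:

```stata
* Stand-in data: NLS panel shipped with Stata (not the poster's data)
webuse nlswork, clear

* Declare the MI storage style and the variables to be imputed
mi set wide
mi register imputed union tenure

* Impute the binary variable by logit and the continuous one by regression,
* using fully observed predictors; add(5) creates five imputed datasets
mi impute chained (logit) union (regress) tenure ///
    = ln_wage ttl_exp i.race, add(5) rseed(12345)

* Fit the model in each imputed dataset and combine with Rubin's rules
mi estimate: regress ln_wage c.ttl_exp##c.ttl_exp tenure i.union, ///
    vce(cluster idcode)
```

Timing a run like this on a random subsample of persons first gives a rough idea of how long the full dataset would take.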

Another problem is that efficient multiple imputation for panel/clustered data is a theoretical problem that doesn't appear to have been resolved very well. If you have relatively few clusters, or you're willing to assume multivariate normality among the missing variables, then the options described here may work. I somehow don't think your use case falls into any of the 3 approaches described.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.