Missing Not at Random (MNAR): Heckman correction

Slovenius Mappy

Join Date: Jun 2021

Posts: 4
#1

25 Jun 2021, 20:05

Hi Stata users:

I have a case of data that are missing not at random (MNAR) in Rubin's classification. One of my covariates is unobserved for many observations: the main data set has about 3k observations, and the covariate is missing for 2.2k of them.

My problem could be written in this way:

regression equation: y_i = x1_i*β1 + x2_i*β2 + u_i

selection equation: x2_i observed <-> z_i*γ + v_i > 0

where x2 is a single variable, i indexes observations (this is cross-section data), and cov(u,v) <> 0 (this is the source of the bias).

Now this looks quite a bit like a standard Heckman selection model, except that the selection equation is for a missing covariate (x2) rather than the dependent variable (y).
I was therefore thinking it might be possible to use Stata's standard -heckman- command,
https://www.stata.com/manuals/rheckman.pdf
and to create a depvar_s variable for select(): an indicator equal to 1 for observations where x2 is observed and 0 where it is missing.

Would this be appropriate or is there anything in the Heckman command which is specific to it being the dependent variable with missing values?

Thanks for any assistance on this,
Slov
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2188
#2

25 Jun 2021, 20:57

I believe the -heckman- command is reserved for cases when data on y are missing. But you can get what you want by setting y to missing whenever x2 is missing -- I think. The drawback to this method is that it's not consistent if the reason the data are missing on x2 is a function of x2: you can't include x2 in z. But if you're willing to assume the missingness of x2 is due to z_i (and v_i) then this trick should work.
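A minimal sketch of the "set y to missing" trick, for readers following along. The variable names (y, x1, x2, z1, z2, y_sel) are hypothetical, and putting x1 in the selection equation alongside z1 and z2 is an assumption of this sketch, not something stated in the thread. It is worth checking the selected and nonselected observation counts in the output header against the number of cases with x2 observed.

```stata
* Hypothetical names: y (outcome), x1 (always observed), x2 (partly missing),
* z1 z2 (always observed drivers of missingness; x2 itself cannot appear here)

* Treat the outcome as unobserved whenever x2 is unobserved
generate double y_sel = y if !missing(x2)

* Maximum likelihood Heckman model; with no depvar_s in select(),
* selection is inferred from whether y_sel is missing
heckman y_sel x1 x2, select(x1 z1 z2)

* Two-step (Heckit) version of the same model
heckman y_sel x1 x2, select(x1 z1 z2) twostep
```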

You can also do it "by hand" and then bootstrap the two-step procedure. Estimate the selection equation by probit and then construct the inverse Mills ratio, imr_i. Then run the regression of y_i on x1_i, x2_i, imr_i using the selected sample. The heteroskedasticity-robust t statistic on imr_i is valid to test the null of no selection bias. It's better to use -heckman- to get the proper standard errors without bootstrapping. But, again, consistency of the estimators is subject to the caveat I raised above.
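The "by hand" route could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the variable names and the program name heck2step are made up, and the selection equation is assumed to contain x1, z1 and z2. The inverse Mills ratio is computed as normalden(zγ)/normal(zγ) from the probit index.

```stata
* Hypothetical names: y, x1, x2 (partly missing), z1 z2 (always observed)
capture program drop heck2step
program define heck2step
    * Fixed variable names so coefficient names match across bootstrap draws
    capture drop s_ind
    capture drop zg_hat
    capture drop imr
    generate byte s_ind = !missing(x2)          // 1 if x2 is observed
    quietly probit s_ind x1 z1 z2               // step 1: selection equation
    predict double zg_hat, xb                   // estimated probit index
    generate double imr = normalden(zg_hat) / normal(zg_hat)
    regress y x1 x2 imr if s_ind == 1           // step 2: selected sample only
end

* One pass, then the heteroskedasticity-robust t statistic on imr
* as a test of the null of no selection bias
heck2step
regress y x1 x2 imr if s_ind == 1, vce(robust)

* Bootstrap both steps together so the standard errors reflect
* the estimation of the probit in step 1
bootstrap _b, reps(500) seed(12345): heck2step
```

The number of replications and the seed are arbitrary choices for the sketch.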
Slovenius Mappy

Join Date: Jun 2021

Posts: 4
#3

26 Jun 2021, 09:57

Thanks so much Jeff. I will print this thread out and put it in my copy of your textbook!

I will give both a try and see how it goes. I will try to report back as well.

And just to follow up on your main point:
If missingness is a function of x2, it seems there is not much I can do. Since there is a possibility of this in my case, are there any options you would suggest I consider?
Alternatively (and mimicking Heckman), could I consider my selection equation to be based on a latent value of x2, x2*:

x2*_i := z_i*γ + v_i
with
x2_i observed <-> x2*_i > 0

Under this approach it is factors influencing x2 which determine missingness.

Last edited by Slovenius Mappy; 26 Jun 2021, 10:15. Reason: Updated with question at end
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2188
#4

26 Jun 2021, 11:06

Hi Slov: Unfortunately, one's options are limited. The problem when data are missing on explanatory variables is that it's possible that the complete cases estimator is consistent when imputation or a Heckman correction is not. That's because if E(y|x1,x2,s) = E(y|x1,x2), where s is the indicator that x2 is observed, then you should just use OLS on the complete cases.

There is one other possibility, but it requires an instrumental variable (always observed) for x2. I discuss this as Procedure 19.2 in my MIT Press book. Then all exogenous variables that you fully observe go into the selection equation and you instrument for x2. The IMR acts as its own IV.
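A rough sketch of how the IV variant might look in Stata, under stated assumptions. The names (w as the always-observed instrument for x2, z1 as another fully observed exogenous variable, and the _iv suffixes) are hypothetical, and both z1 and w are assumed to be excluded from the outcome equation. The book should be consulted for the exact procedure and its conditions.

```stata
* Hypothetical names: y, x1, x2 (partly missing, treated as endogenous),
* z1 (exogenous, always observed), w (instrument for x2, always observed)
generate byte s_iv = !missing(x2)

* Selection probit on all fully observed exogenous variables, full sample
probit s_iv x1 z1 w
predict double zg_iv, xb
generate double imr_iv = normalden(zg_iv) / normal(zg_iv)

* 2SLS on the selected sample: x2 is instrumented, while imr_iv enters
* as an included regressor and so serves as its own instrument
ivregress 2sls y x1 imr_iv (x2 = z1 w) if s_iv == 1, vce(robust)
```

Because imr_iv is a generated regressor, the reported standard errors are only a starting point; bootstrapping the probit and 2SLS steps together is one way to account for that.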
Slovenius Mappy

Join Date: Jun 2021

Posts: 4
#5

26 Jun 2021, 14:37

Originally posted by Jeff Wooldridge:

There is one other possibility, but it requires an instrumental variable (always observed) for x2. I discuss this as Procedure 19.2 in my MIT Press book. Then all exogenous variables that you fully observe go into the selection equation and you instrument for x2. The IMR acts as its own IV.

Thanks Jeff. I have the first edition of Econometric Analysis of Cross Section and Panel Data which has slightly different chapters. Is this Procedure 17.2 in the first edition of your book (Section 17.4 A Probit Selection Equation > 17.4.2 Endogenous Explanatory Variables)? This seems to be along the lines of your point here, especially the reference to Example 17.4 (Nonrandomly Missing IQ Scores) on p567.

My apologies for not having the current edition.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2188
#6

26 Jun 2021, 14:52

Yes, that’s it. Thanks for owning the first edition!
Slovenius Mappy

Join Date: Jun 2021

Posts: 4
#7

26 Jun 2021, 15:59

OK great thanks so much Jeff. I really appreciate you taking time out of your Saturday for this!