Correct for Selection on Independent Variables

Kailin Gao

Join Date: Sep 2019

Posts: 31
#1


30 Apr 2020, 23:17

Dear Statalists,
I am confused about how to correct for selection on one independent variable.

I want to estimate Y_ft = beta*Certified_ft + Z_ft on a sample covering 2000-2013. Here f and t index firm and year respectively. Z_ft are exogenous variables, and Certified_ft indicates whether firm f gets a certification in year t. However, Certified_ft is only observed for firms that survived to 2018. (I combined two datasets: one reports Y for 2000-2013; the other reports when firms got certified, but only covers firms that survived to 2018.)

So I face two selection issues: 1) the endogeneity of Certified_ft: some factors affect Certified and Y at the same time. I developed an instrument z1 for it; 2) survivor bias: I only observe Certified_ft for firms that survived to 2018. I wonder how to correct for these biases. I have thought of two possibilities:

1) Semykina & Wooldridge (2010) correct for both endogeneity and selection, using a method similar to the Heckman two-step method. However, it applies to selection on the dependent variable rather than on independent variables.
2) The control function approach in Imbens & Wooldridge (2007) (page 4). First estimate a probit model of Prob(Certified_ft) on instruments z2 (hoping to correct for the survivor bias) and obtain the predicted probabilities p2; then estimate Y on Certified, Z and p2, probably using 2SLS (with z1 as the instrument for Certified).
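To make the two issues concrete, here is one way to write the setup down. This is only a sketch: the coefficient vector \gamma, the error term u_{ft}, and the survival indicator s_f are notation assumed for illustration and are not in the equation above.

```latex
% Outcome equation (gamma and u_{ft} are assumed notation)
y_{ft} = \beta\, C_{ft} + Z_{ft}\gamma + u_{ft}, \qquad t = 2000,\dots,2013

% Issue 1: endogeneity of C_{ft}, with z_{1,ft} as the instrument
\operatorname{Cov}(C_{ft}, u_{ft}) \neq 0, \qquad
\operatorname{Cov}(z_{1,ft}, u_{ft}) = 0, \qquad
\operatorname{Cov}(z_{1,ft}, C_{ft}) \neq 0

% Issue 2: C_{ft} is observed only for survivors
s_f = \mathbf{1}[\text{firm } f \text{ survives to 2018}], \qquad
C_{ft} \text{ observed only if } s_f = 1

% A standard sufficient condition for IV on the survivor sample to be consistent
E\left[u_{ft} \mid Z_{ft},\, z_{1,ft},\, s_f\right] = 0
```

If survival depends on unobservables that also enter u_{ft}, the last condition fails, which is the survivor bias I am worried about.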

It is a little complicated because I face two layers of selection. Do you think method 2 can help me address this problem? Or what other approach would you recommend?
Any comments would be appreciated! Thank you.
Best,
K
Tags: endogeneity, selection
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

04 May 2020, 12:32

If you want to control for endogeneity and selection, you should look at eregress and the related estimators. That is exactly what they're built to do. I don't know whether they can handle two layers of selection. There is a panel version of eregress, although I think it only uses random effects. You might be able to do a Mundlak estimator.
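For readers unfamiliar with the Mundlak device mentioned here, the following is a minimal simulation sketch in plain Python (not Stata, and not eregress). It assumes a simple linear panel with no selection and no instrument, where the firm effect is correlated with the regressor; all names and parameter values are hypothetical. It shows that adding the firm-level mean of x as an extra regressor recovers the same slope as the within (fixed-effects) estimator in a balanced panel, while pooled OLS does not.

```python
import random

random.seed(2020)
N_FIRMS, T, BETA = 500, 5, 1.0  # hypothetical panel size and true slope

# Simulate a balanced panel in which the firm effect a_f is correlated with x.
firm, x, y = [], [], []
for f in range(N_FIRMS):
    a_f = random.gauss(0, 1)
    for _ in range(T):
        x_ft = a_f + random.gauss(0, 1)
        firm.append(f)
        x.append(x_ft)
        y.append(BETA * x_ft + a_f + random.gauss(0, 1))


def mean(values):
    return sum(values) / len(values)


def firm_means(values):
    """Return, for every observation, the mean of `values` within its firm."""
    totals = [0.0] * N_FIRMS
    for f, v in zip(firm, values):
        totals[f] += v
    return [totals[f] / T for f in firm]


def cross(a, b):
    """Sum of cross products of a and b around their overall means."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b))


xbar, ybar = firm_means(x), firm_means(y)

# Pooled OLS of y on x: biased because x is correlated with the firm effect.
b_pooled = cross(x, y) / cross(x, x)

# Mundlak regression: y on an intercept, x, and the firm mean of x.
# The slope on x comes from solving the 2x2 normal equations.
s11, s22, s12 = cross(x, x), cross(xbar, xbar), cross(x, xbar)
s1y, s2y = cross(x, y), cross(xbar, y)
b_mundlak = (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Within (fixed-effects) estimator: regress demeaned y on demeaned x.
xw = [p - q for p, q in zip(x, xbar)]
yw = [p - q for p, q in zip(y, ybar)]
b_within = sum(p * q for p, q in zip(xw, yw)) / sum(p * p for p in xw)

print("pooled :", round(b_pooled, 3))
print("mundlak:", round(b_mundlak, 3))
print("within :", round(b_within, 3))
```

In a random-effects estimator, the analogous step would be to add the firm means of the time-varying covariates to the covariate list; whether that carries over cleanly to a model with endogeneity and selection is not something this sketch establishes.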

In general, selection on the right-hand-side variables is not a problem as long as it does not create an association between the right-hand-side variables and the error term. I suspect it can become a problem with serial correlation, but I'm not sure.
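A small simulation can illustrate the point about selection and the error term. The sketch below is plain Python with hypothetical names and parameter values, and it assumes a single exogenous regressor with no instrument. It compares the OLS slope on the full sample, on a sample selected on x alone, and on a sample selected on x plus the error term.

```python
import random


def ols_slope(x, y):
    """Slope from an OLS regression of y on an intercept and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


random.seed(12345)
BETA, N = 2.0, 20000  # hypothetical true slope and sample size

x = [random.gauss(0, 1) for _ in range(N)]
u = [random.gauss(0, 1) for _ in range(N)]  # error term, independent of x
y = [BETA * a + e for a, e in zip(x, u)]

# Full sample: the benchmark.
slope_full = ols_slope(x, y)

# Selection on the regressor only: keep observations with x > 0.
keep_x = [i for i in range(N) if x[i] > 0]
slope_sel_x = ols_slope([x[i] for i in keep_x], [y[i] for i in keep_x])

# Selection that depends on the error term as well: keep x + u > 0.
# Among the kept observations, x and u become negatively correlated.
keep_xu = [i for i in range(N) if x[i] + u[i] > 0]
slope_sel_xu = ols_slope([x[i] for i in keep_xu], [y[i] for i in keep_xu])

print("full sample      :", round(slope_full, 3))
print("selected on x    :", round(slope_sel_x, 3))
print("selected on x + u:", round(slope_sel_xu, 3))
```

Under these assumptions the first two slopes stay close to the true value, while the third is pulled toward zero, because the selection rule induces a correlation between x and the error within the retained sample.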

I don't see that either of the options you describe is really designed for two stages of selection. I'm not sure why you have or think you have two layers of selection. If you treat certified as endogenous and then have one layer of selection for survivor bias, that might do it.
Kailin Gao

Join Date: Sep 2019

Posts: 31
#3

04 May 2020, 22:01

Originally posted by Phil Bromiley

If you want to control for endogeneity and selection, you should look at eregress and the related estimators. That is exactly what they're built to do. I don't know whether they can handle two layers of selection. There is a panel version of eregress, although I think it only uses random effects. You might be able to do a Mundlak estimator.

In general, selection on the right-hand-side variables is not a problem as long as it does not create an association between the right-hand-side variables and the error term. I suspect it can become a problem with serial correlation, but I'm not sure.

I don't see that either of the options you describe is really designed for two stages of selection. I'm not sure why you have or think you have two layers of selection. If you treat certified as endogenous and then have one layer of selection for survivor bias, that might do it.

Thank you very much Phil! I just checked eregress and it is very useful.

Sorry that I may have used the wrong wording. By "two layers of selection", I mean:
First, Certified is endogenous because unobserved variables such as firm ability may affect both the outcome and whether the firm got certified (i.e. Certified);
Second, the status of Certified is only observed for firms that survived and remained certified in 2018. This causes a survivor bias: I only observe the status for what are possibly the relatively stronger firms.

In sum, there is one endogeneity issue and one issue of selection on a right-hand-side variable. Both concern the same regressor, Certified.

Would you still suggest treating Certified as endogenous and modeling the two issues above at the same time?

Thanks again. Any comments would be appreciated.
Best,
Kailin