Hi all,
I have been asked by a reviewer to estimate a Heckman selection model because of a concern that my variable of interest may be subject to selection bias. Before the reviewer's comment I knew little about this methodology, and even less about how to estimate the model in Stata. I have tried to get up to speed, but I have not managed to get the model to estimate correctly.
The problem is that I have not been able to estimate, with the Heckman correction, the same equation I use in my journal submission. I cannot tell whether the issue lies in my data or in the requirements of the Heckman model. I am hoping some of you with more experience can point me in the right direction.
I believe the core of the issue is that the selection variable, which is also my variable of interest, is categorical (levels 1-3). Specifically, I cannot include it both in the regression as i.<categorical variable> and as the selection variable in select(<categorical variable> = <other variables>) without Stata dropping one of its levels. The equation I am estimating is as follows (dv = dependent variable; cv = continuous variable; fev = fixed effect/categorical variable): reg dv i.fev1 i.fev2 cv1 cv2. Assuming two other variables might help explain the selection, the Heckman model might be: heckman dv i.fev1 i.fev2 cv1 cv2, select(fev1 = fev3 cv3).
The issue is that one of the categories of the variable of interest is dropped in the Heckman model and, I suspect not coincidentally, the estimates for the variables that ARE included change dramatically between the reg and heckman models.
I have created a test case using the auto dataset (sysuse auto) that reproduces the problem.
****start****
sysuse auto, clear
set seed 12345 //arbitrary seed so the random recodes below are reproducible
generate hrint = int(headroom)
replace foreign = 2 if runiform() < 0.33 //to create a third level
label drop origin //value label not needed
replace rep78 = int(runiform()*5) + 1 if rep78 == . //to fill in the 5 missing cases
reg price ib0.foreign mpg length ib1.rep78 //this works fine
heckman price ib0.foreign mpg length ib1.rep78, ///
select(foreign = hrint length) //one of the levels of foreign is omitted
****end****
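In case it helps with diagnosis, here is a minimal check, meant to be run right after the block above, of how the observations are spread across the three levels of foreign. It is only a sketch: the counts will differ from run to run unless a seed is set, and I am assuming (without being sure) that these tabulations are the right thing to compare with the observation counts heckman reports in its output header.

```stata
* Frequency of each level of the selection variable / variable of interest
tabulate foreign

* Cross-tabulate against the other categorical regressor to spot empty cells
tabulate foreign rep78

* Number of observations with a nonzero value of foreign
count if foreign != 0
```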
I realize I am way out in uncharted territory here, with a model I do not completely understand and Stata code I am not very familiar with, but hopefully the issue is a simple one that, once clued in, I can correct and move on.
So, thanks, in advance, for any help anyone can offer.
Ben