Heckman and Binary (or Categorical) Selection Variable

Ben Hoen

Join Date: May 2014

Posts: 85
#1

21 May 2015, 13:11

Hi all,

I have been asked by a reviewer to estimate a Heckman selection model, because there is a concern that my variable of interest might have selection bias. Prior to the reviewer's comment I knew little of this methodology, and I know even less about how to estimate the model using Stata. I have tried to get up to speed, but I have failed to get the model to estimate correctly in Stata.

The issue I am having is that I have not been able to estimate the same equation I use in the journal paper submission with the Heckman correction. I cannot understand if the issue is in my data or in the requirements for the Heckman model. I wondered if any of you with more experience might lead me in the right direction.

I believe the core of the issue is that the selection variable, the variable of interest, is a categorical variable (levels 1-3). Specifically, I cannot include the selection variable both in the regression as i.<categorical variable> and as the selection variable in "select(<categorical variable>=<other variables>)" without having it drop one of the levels. The equation I am estimating is represented as follows (dv=dependent variable; cv=continuous variable; fev=fixed effect/categorical variable): reg dv i.fev1 i.fev2 cv1 cv2. Assuming two other variables might help explain the selection bias, the Heckman model might be: heckman dv i.fev1 i.fev2 cv1 cv2, select(fev1=fev3 cv3).

The issue is: one of the categories of the variable of interest is being dropped in the Heckman model, and, I suspect not coincidentally, the levels of the variables that ARE included change dramatically between the reg and Heckman models.

I have created a test using the auto dataset (sysuse auto) to duplicate the problems I am having.

****start****

sysuse auto, clear
set seed 12345 // any fixed seed, so the random draws below are reproducible
generate hrint = int(headroom)
replace foreign = 2 if runiform() < 0.33 // to create a third level
label drop origin // do not need
replace rep78 = int(runiform()*5) + 1 if rep78 == . // to fill in the 5 missing cases

regress price ib0.foreign mpg length ib1.rep78 // this works fine
heckman price ib0.foreign mpg length ib1.rep78, ///
select(foreign = hrint length) // one of the categories of the first IV (foreign) is omitted

****end****

I realize I am way out in uncharted territory here, with a model I do not completely understand and Stata code I am not very familiar with, but hopefully the issue is a simple one that, once clued in, I can correct and move on.

So, thanks, in advance, for any help anyone can offer.

Ben
Ben Hoen

Join Date: May 2014

Posts: 85
#2

25 May 2015, 13:35

Just pinging the group again...

If anyone has any insights my day would be made.

Or, if anyone has advice on how I might reframe the question to better tap into the list's collective knowledge, I would welcome that too.

Thanks, as always, in advance.

Ben
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2148
#3

26 May 2015, 06:24

Ben: Can you provide a little more information? Is this more like a regime switching model as opposed to real missing data? What I mean by "regime switching" is that you can observe someone's income, say, at three levels of participation in a job training program: none, part time, full time. By missing data I mean that the wage offer is not observed for people not in the work force.
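One quick way to see which of the two situations a dataset is in is to check whether the outcome is ever unobserved. The lines below are only a sketch on the unmodified auto data from the test in #1, where foreign has just two levels; with real data, the outcome and the category variable would be substituted.

```stata
sysuse auto, clear
* Regime-switching pattern: the outcome is observed at every level of the category.
tabulate foreign, summarize(price)
* A missing-data pattern would show up here as a nonzero count of unobserved outcomes.
count if missing(price)
```

In the auto data the count is 0: price is observed for every car, so nothing is "selected out" in the sense the Heckman model has in mind.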

The difference is important for determining whether you really have to do a Heckman correction. If you do, there are papers that allow for having a multinomial "treatment" and then using a two-step method. The -heckman- command in Stata won't do it for you. It's possible that the -cmp- command (user written) will.
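For reference, here is a minimal sketch of the two cases using Stata's documentation datasets rather than Ben's data, so the variables are purely illustrative. It assumes internet access for -webuse- and Stata 13 or later for -etregress-. -heckman- expects the outcome to be missing for unselected observations and a 0/1 selection indicator, while -etregress- handles a binary endogenous treatment when the outcome is observed for everyone. Neither command covers a three-level treatment.

```stata
* Case 1: true missing data. wage is missing for women who do not work,
* so selection is inferred from whether wage is observed.
webuse womenwk, clear
heckman wage educ age, select(married children educ age)

* Case 2: binary "regime switching". wage is observed for everyone,
* and union membership is the endogenous binary treatment.
webuse union3, clear
etregress wage age grade smsa black tenure, treat(union = south black tenure)
```

In both commands the variable modeled in select() or treat() is binary, which is consistent with a level being dropped when a three-level variable is used there.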
Ben Hoen

Join Date: May 2014

Posts: 85
#4

12 Jun 2015, 12:35

Jeff,

Thanks for your response. I was away from my desk for 2 weeks with work and have just now seen this. As I mentioned in the post, I am very much a neophyte in terms of Heckman and its usefulness and application (and, of course, ability to apply it using Stata), but I will try to answer your questions.

The research question we are hoping to answer is whether home prices near large-scale wind energy facilities (in MA) are adversely affected as compared to either their prices prior to the announcement and eventual installation of the facility or average prices further away from the facility. FYI, we have ~1,500 transactions that occurred near (within 1/2 mile of) MA wind facilities after the facilities were installed, a set of ~5,000 transactions that occurred before the announcement of installations but near where they were eventually installed, and over 300,000 transactions of homes far from where turbines were installed/are located. We use a difference-in-difference specification to: 1st, compare prices of homes near where turbines eventually are located with transactions further away that occurred in the same time period (this measures any pre-existing price differences, because turbines might be sited in areas with lower pre-existing prices); and 2nd, compare prices of homes close to the turbines that sold after installation with transactions far from where turbines were located that occurred in the same period. And then we compare these two differences. (We also examined adding a 3rd difference, comparing homes far away that sold before installation to homes far away that sold after installation, but did not find any significant difference after controlling for inflation/deflation, which is controlled for separately in the model.) So that is the main research approach.
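The difference-in-difference comparison described above can be sketched in Stata roughly as follows. This is an illustration on simulated data only: every variable name (near, period, sqft, lnprice) and every number is hypothetical and not taken from the actual study, and the simulated data contain no true turbine effect.

```stata
* Simulated data only; every variable name and coefficient here is hypothetical.
clear
set seed 2015
set obs 5000
generate byte near = runiform() < 0.2        // 1 = close to an eventual turbine site
generate byte period = floor(3*runiform())   // 0 = pre-announcement, 1 = post-announcement, 2 = post-installation
generate sqft = 1500 + 500*rnormal()
generate lnprice = 12 - 0.05*near + 0.03*period + 0.0003*sqft + 0.2*rnormal()

* near##period expands to both sets of main effects plus their interactions.
regress lnprice i.near##i.period sqft, vce(robust)

* DiD estimate: change in the near/far price gap from pre-announcement to post-installation.
lincom 1.near#2.period
```

The coefficient on 1.near captures the pre-existing near/far price difference, and the interaction term is the additional difference after installation.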

As a counterfactual, we also included in the models information on whether the homes were close to a variety of other amenities and disamenities (roads, high voltage transmission lines, beaches, open space, highways, prisons, etc.). All of these were "installed" prior to the sample period, so we did not construct a DiD specification for them, but rather included them in the models as fixed effects.

Our analysis found that homes near turbines do indeed sell for less than homes further away in both the pre-announcement/installation and post-installation phases, but that the post-installation difference was not statistically different from the pre-announcement/installation difference. Therefore we concluded we were unable to find evidence of an impact. In contrast, we did find strong evidence of impacts from the set of amenities and disamenities (A&D). These A&D effects were very similar to those that other researchers have found separately and therefore met our a priori assumptions. They also allowed us to say that although we found effects from living near a set of A&D, we did not find similar evidence from the "treatment" of turbines. We have various ideas as to why that is occurring, which I will not go into detail about here (but I am glad to if you want to know).

OK, that's the backstory....

The reason the reviewer wanted us to include the Heckman model was that he believed there was something inherently different, and unobserved, about the set of transactions near the turbines. (We agree with that: there is, and our model seemed to pick that up.) He asked that we seek out additional data to see if we could control for these differences. We did: we collected census data at the block group level on income, age, and education, all of which might be correlated with sales prices. We re-ran the model with them included and found statistically significant coefficients, but their inclusion did not affect our main results. He had asked that we include these variables in a Heckman model. So we did, including these variables, along with one of the originally included IVs, in the "select" statement. Remember that our DV is binary (a home is either close to the turbines or far away) and interacted with the categorical time period (either prior to announcement of installation, after announcement but prior to installation, or after installation). We also created a single categorical variable (i.e., not interacted) to simplify matters. Anyway, this is where the analysis went off the rails. We were never able to get our model to estimate correctly. It dropped large sections of variables.

So, your point that -heckman- will not handle a multinomial (categorical) selection variable appears to be the issue. I will look into -cmp- (user written) to see what I come up with. Any other comments you want to offer are welcome.

Thanks,

Ben
Ben Hoen

Join Date: May 2014

Posts: 85
#5

12 Jun 2015, 12:45

Jeff,

One follow-up question...

Can you offer any citation/explanation for why the Heckman model will not work with multinomial DVs? That might be helpful to know in terms of responding to the reviewer comments.

BTW, we used your textbook in grad school, and I have a copy in my office. (I just double-checked that Heckman was not mentioned in there, but maybe the model is referred to differently.)

Ben
Ben Hoen

Join Date: May 2014

Posts: 85
#6

25 Jun 2015, 15:05

Hi all,

Just trying again...

If anyone has any additional insights it would be great. I added some more information because one respondent was curious.

Thanks, as always, in advance.

Ben