2SLS Panel Data Regression with endogenous First-Stage Binary Variable

Robert Niewoehner

Join Date: Jun 2018

Posts: 29
#1

20 Mar 2020, 08:37

Hello everyone, I hope you all are staying safe out there.

I am trying to build a two-stage model in which the first stage has a binary dependent variable. I have chosen a probit model to estimate it.

My setup:
X_it = 1[ γW_it + e_it > 0 ] with e_it standard Normal, so that P(X_it = 1 | W_it) = Φ(γW_it), where Φ is the standard Normal cumulative distribution function
Y_it = αZ_it + βX_it + u_it, where Z may also contain elements of W

I realize this subject has been discussed ad nauseam in this forum, but it is hard to extract a single recommendation. First, let me link to the relevant threads; my questions follow.

1. 2SLS with Binary Endogenous Variable and linear second stage:
https://www.statalist.org/forums/for...enous-variable
The recommended solution is to use either plain 2SLS with a linear first stage (which ignores the fact that X is binary) or ...

the solution from Wooldridge (2002, 2010), which is a three-step process: estimate a probit, then run 2SLS using the predicted probabilities from the probit as the instrument for X (so 2SLS still runs its own linear first stage).
This process is also discussed here https://www.statalist.org/forums/for...-in-panel-data

There is a bit more detail here recommending the use of xtivreg

I assume either version of 2SLS is appropriate, depending on data type (ivreg for cross-sections or xtivreg for panels)

2SLS is consistent in both cases, though you lose some precision in the first case as it ignores the binary nature of X
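
For readers who want to see the two options in item 1 side by side, here is a minimal Stata sketch on simulated data. The data-generating process is entirely made up for illustration (the variable names y, w, x1, z, the coefficients, the sample size, and the true effect of 2 are all assumptions, not anything from the linked threads). It is cross-sectional only and says nothing about the panel case.

```stata
* Hypothetical simulated data: w is a binary endogenous regressor,
* z is an excluded instrument, x1 is an exogenous control.
clear
set seed 20200320
set obs 5000
generate double x1 = rnormal()
generate double z  = rnormal()
generate double e  = rnormal()
generate double u  = 0.6*e + rnormal()      // u is correlated with e, so w is endogenous
generate byte   w  = (0.5*x1 + z + e > 0)   // probit-type first stage
generate double y  = 1 + 2*w + 0.5*x1 + u   // assumed true effect of w is 2

* Option (a): plain 2SLS with a linear first stage that ignores that w is binary
ivregress 2sls y x1 (w = z), vce(robust)
estimates store linear_fs

* Option (b): probit first, then 2SLS with the fitted probability as the instrument
probit w x1 z
predict double phat, pr
ivregress 2sls y x1 (w = phat), vce(robust)
estimates store probit_iv

* Compare the coefficient on w and its standard error across the two options
estimates table linear_fs probit_iv, keep(w) b se
```

Note that in option (b) phat enters only as an instrument; w itself stays in the outcome equation. That is different from replacing w with phat as a regressor.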

2. 2SLS: Binary Second Stage with Binary Endogenous Variable: https://www.statalist.org/forums/for...ndent-variable
The recommended solution is to use either 2SLS (again) ... this is what Angrist and Pischke recommend in "Mostly Harmless" or ...

use biprobit to jointly estimate both equations by maximum likelihood.

Wooldridge notes in that post: "A method that plugs in fitted values into nonlinear second stages should be assumed inconsistent unless you prove otherwise."
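
As a rough illustration of the biprobit route in item 2, the sketch below fits a recursive bivariate probit on simulated data. All names and numbers (y, w, x1, z, the coefficients, the error correlation of 0.6) are hypothetical choices for the example; this is the binary-outcome case, not the linear second stage that my own setup has.

```stata
* Hypothetical simulated data: both the outcome y and the regressor w are binary.
clear
set seed 2020
set obs 5000
generate double x1 = rnormal()
generate double z  = rnormal()
generate double e  = rnormal()
generate double u  = 0.6*e + 0.8*rnormal()  // Var(u) = 1 and corr(e, u) = 0.6
generate byte   w  = (0.5*x1 + z + e > 0)
generate byte   y  = (-0.5 + w + 0.5*x1 + u > 0)

* Joint ML estimation: outcome equation includes w, and the
* equation for w includes the excluded instrument z.
biprobit (y = w x1) (w = x1 z)
```

The reported rho is the estimated correlation between the two error terms; a value near zero would suggest little evidence of endogeneity in this setup.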

3. Probit 2SLS: https://stats.stackexchange.com/ques...t-squares-2sls
We cannot use the probit model as its own first stage because "neither the conditional expectation nor the linear projection operator passes through nonlinear functions", as discussed in Wooldridge (2010, p267).

4. The other question that comes up is whether we can use a control function approach. But, as Wooldridge notes here on page 10 (https://www.nber.org/WNE/Slides7-31-...ntrolfuncs.pdf): "CF approaches are more difficult to apply to nonlinear models, even relatively simple ones. Methods are available when the endogenous explanatory variables are continuous, but few if any results apply to cases with discrete first stages."
Despite this, we also have access to the etregress command in Stata. This was first mentioned in the link in #1 above.

Also mentioned here in a Statalist archive: https://www.stata.com/statalist/arch.../msg00339.html

etregress gives the option to use MLE, two-step estimation, or a control function approach (as of Stata 14, I think)

OK, with this information out there (and countless other posts that I read through), here are my questions:

1. People commonly refer to the procedure in Wooldridge (2010), but I cannot find an explicit page reference for it in the 2010 edition. Section 9.5.2 on page 268 has a similar discussion of a squared first-stage covariate, but not a binary first-stage covariate... perhaps this is what everyone is referring to? I have combed through the book and cannot find it. Can someone provide the exact reference so I can cite it correctly?

2. Similarly, is there a parallel discussion of this procedure for panel data in the book? The context on p268 is cross-sectional.

3. Given Wooldridge's comments about the difficulty of a CF approach with a nonlinear model, how do I trust the output of etregress if I select the CF option?

Thank you all for your time... I hope my post is helpful for aggregating some of this information and can be useful going forward.

Best regards,
RJ
Robert Niewoehner

#2

20 Mar 2020, 10:09

I believe I may have answered my own Question #3: etregress with the control function option implements Procedure 21.4 from Wooldridge (2010, p949). This procedure relies on a couple of key assumptions, most notably a trivariate normality assumption on several error terms. So it is a different flavor of CF approach (commonly called the "endogenous switching regression" model).

Questions 1 and 2 remain with respect to Wooldridge's exact reference to the 3-step process.
Robert Niewoehner

#3

31 Mar 2020, 09:39

I have also found reference to Wooldridge (2010) Chapter 20 here: https://www.stata.com/statalist/arch.../msg00188.html

Although the procedure discussed on pages 892-894 seems similar, it does not quite match my case, because my final model is linear rather than a binary response model. Any further thoughts are welcome, as always.
Robert Niewoehner

#4

31 Mar 2020, 13:39

After staring at this for some time, I believe I have finally answered my own question.

The procedure I was looking for is Procedure 21.1 in Wooldridge (2010) page 939 under the discussion of “Estimating the Average Treatment Effect Using IV.” In Stata, we can manually use Probit and 2SLS (using the predicted probabilities as instruments) OR we can also leverage the command etregress using the MLE approach. The other option with etregress is to use the two-step approach of Maddala (1983) which augments the regression equation with the hazard.

The final option under etregress is to use the Control Function approach, which may correspond to Procedure 21.4 on page 949, though the Stata help file for etregress under CF also makes reference to Wooldridge (2010), Section 14.2. Hopefully this thread is helpful down the line for someone else with a similar question.
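
To make the three etregress estimators mentioned above concrete, here is a minimal sketch on simulated data. The data-generating process (names y, w, x1, z, the coefficients, jointly normal errors, a true effect of 2) is a hypothetical setup chosen so that etregress's normality assumptions hold; real data need not cooperate. The cfunction option assumes Stata 14 or later.

```stata
* Hypothetical simulated data with jointly normal errors
clear
set seed 20200331
set obs 5000
generate double x1 = rnormal()
generate double z  = rnormal()
generate double e  = rnormal()
generate double u  = 0.6*e + rnormal()      // u correlated with e
generate byte   w  = (0.5*x1 + z + e > 0)   // binary endogenous treatment
generate double y  = 1 + 2*w + 0.5*x1 + u   // assumed true effect of w is 2

* Default estimator: full maximum likelihood
etregress y x1, treat(w = x1 z)
estimates store ml

* Two-step estimator, which augments the outcome equation with the hazard
etregress y x1, treat(w = x1 z) twostep
estimates store twostep

* Control-function estimator
etregress y x1, treat(w = x1 z) cfunction
estimates store cf

* Side-by-side coefficients and standard errors
estimates table ml twostep cf, b se
```

With a correctly specified design like this one, the three sets of estimates should be close to each other; large differences on real data would be a reason to look harder at the distributional assumptions.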
Sunshine bae

Join Date: Jul 2022

Posts: 9
#5

14 Jul 2022, 02:25

Hello, if I use the code below, would I get the same result?

    probit w x1-xn z
    predict ghat
    ivreg y x1-xn (w = ghat)

where:
    y      ==> outcome
    x1-xn  ==> exogenous variables
    w      ==> endogenous binary variable
    z      ==> instrument
William Greenland

Join Date: Apr 2023

Posts: 1
#6

27 Apr 2023, 21:00

Hi Robert,

I'm also attempting to implement this approach (i.e. Procedure 21.1 in Wooldridge (2010), pg. 939) with cross-sectional survey data. One uncertainty I have concerns the calculation of standard errors. Given that we are using a generated instrument in the 2SLS command, should we not be bootstrapping the standard errors as well? Interested to hear your thoughts on this.

Will
Robert Niewoehner

#7

28 Apr 2023, 08:08

Hi William Greenland ... yes, I believe that would not be a bad idea. My impression, based largely on what I have read in Wooldridge's other articles, is that you basically can never go wrong by bootstrapping your standard errors. You may not need to, but I don't think it will hurt. If you're doing 21.1, then 2SLS should take care of your SEs. If you're doing 21.4, then perhaps use etregress (and read the manual on what it does with SEs) or just bootstrap. Hope this helps!
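
In case it is useful, here is one way the bootstrap could be set up so that the probit step is re-estimated in every replication. This is only a sketch on simulated data: the program name probit_iv, the variable names, the coefficients, and the 200 replications are all hypothetical choices, and the analytic robust standard error is shown just for comparison.

```stata
* Hypothetical simulated cross-section (assumed true effect of w is 2)
clear
set seed 20230428
set obs 2000
generate double x1 = rnormal()
generate double z  = rnormal()
generate double e  = rnormal()
generate double u  = 0.6*e + rnormal()
generate byte   w  = (0.5*x1 + z + e > 0)
generate double y  = 1 + 2*w + 0.5*x1 + u

* Wrap both steps in one program so that each bootstrap replication
* re-estimates the probit as well as the 2SLS regression.
capture program drop probit_iv
program define probit_iv, rclass
    tempvar phat
    quietly probit w x1 z
    quietly predict double `phat', pr
    quietly ivregress 2sls y x1 (w = `phat')
    return scalar b_w = _b[w]
end

* Analytic (robust) 2SLS standard error, for comparison
probit w x1 z
predict double phat, pr
ivregress 2sls y x1 (w = phat), vce(robust)

* Bootstrap standard error for the coefficient on w
bootstrap b_w = r(b_w), reps(200) seed(1): probit_iv
```

For panel data, one would presumably want to resample whole panels rather than individual rows, for example through bootstrap's cluster() option, but I have not checked that against the book.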
