forbidden regression (quadratic regression)

River Huang

Join Date: Mar 2016

Posts: 1908
#1


15 Mar 2021, 22:15

Dear All, Suppose that I regress y on x and x^2, along with other covariates, that x is endogenous, and that I have an IV, say, z. I just learned that it may be incorrect to use z and its squared term z^2 as IVs for x and x^2. If this is correct, does it mean that I need to find an additional IV for x^2? Thanks for any comments.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

16 Mar 2021, 01:03

It is a common practice to instrument for x and x^2 with z and z^2, where z is a valid instrument for x. There is nothing wrong with what you are doing.
River Huang

Join Date: Mar 2016

Posts: 1908
#3

16 Mar 2021, 01:40

Dear Joro, Thank you for your reply. I thought that it is a common practice to do so. However, Angrist and Pischke (2009) argue that

Did I miss (misunderstand) the point?

Joro Kolev

Join Date: Aug 2018

Posts: 3050
#4

16 Mar 2021, 02:10

Yes, I think you misunderstood their point.

They are saying that we should not plug predicted values into a nonlinear model. This is because the expectation operator does not pass through nonlinear functions: E[g(X)] is generally not equal to g(E[X]) for a nonlinear function g(.).

What they are saying is that the following regression is forbidden:

Y = a + b*Xhat + c*Xhat^2 + e.

They are not saying that you cannot use z and z^2, or for that matter xhat and xhat^2 as instruments for x and x^2.
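One way to see why the plug-in regression fails is sketched below. It assumes a linear first stage in which x is split into its fitted value from a regression on z, written \hat{x}, and the first-stage residual v. The symbol v and this decomposition are introduced here for illustration only.

```latex
\begin{aligned}
x   &= \hat{x} + v, \\
x^2 &= (\hat{x} + v)^2 = \hat{x}^2 + 2\hat{x}v + v^2, \\
y   &= a + b x + c x^2 + e
     = a + b\hat{x} + c\hat{x}^2
       + \underbrace{\bigl(b v + 2c\,\hat{x}v + c\,v^2 + e\bigr)}_{\text{composite error}} .
\end{aligned}
```

OLS in the first stage makes v orthogonal to \hat{x}, but nothing makes the terms 2c\hat{x}v and cv^2 orthogonal to \hat{x} and \hat{x}^2, so the plug-in regression is generally inconsistent whenever c is nonzero. Using z and z^2 (or \hat{x} and \hat{x}^2) as instruments for the actual x and x^2 avoids this, because the second stage then projects x^2 itself on the instruments rather than squaring a projection.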
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#5

16 Mar 2021, 02:11

In the above Xhat is the predicted value from a regression of X on Z.
River Huang

Join Date: Mar 2016

Posts: 1908
#6

16 Mar 2021, 02:21

Dear Joro, Got it and thanks a lot. To be sure, suppose that we first obtain Xhat (the predicted value from a regression of X on Z, i.e., using only one IV, Z). It is incorrect to plug in Xhat and "its" squared term Xhat^2 into the second stage regression. Am I right? Thanks.


Joro Kolev

Join Date: Aug 2018

Posts: 3050
#7

16 Mar 2021, 03:07

Yes, exactly. Let's say you have only the two endogenous variables x and x^2, and you have an instrument z. Let's say that xhat is the predicted value from

regress x z

Then, you can do

ivregress 2sls y (x c.x#c.x = z c.z#c.z)

or you can do

ivregress 2sls y (x c.x#c.x = xhat c.xhat#c.xhat)

but you should not do

regress y xhat c.xhat#c.xhat

Note that the forbidden regression gives different estimates. Let's say that in the auto data mpg is endogenous, and we want to instrument it with headroom:

Code:

. sysuse auto
(1978 Automobile Data)

. ivregress 2sls price (mpg c.mpg#c.mpg = headroom c.headroom#c.headroom)

Instrumental variables (2SLS) regression          Number of obs   =         74
                                                  Wald chi2(2)    =       2.59
                                                  Prob > chi2     =     0.2743
                                                  R-squared       =          .
                                                  Root MSE        =     3707.2

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -3003.265   2002.773    -1.50   0.134    -6928.627    922.0975
             |
 c.mpg#c.mpg |   64.63511   44.86007     1.44   0.150    -23.28901    152.5592
             |
       _cons |   38675.59   21011.68     1.84   0.066    -2506.554    79857.73
------------------------------------------------------------------------------
Instrumented:  mpg c.mpg#c.mpg
Instruments:   headroom c.headroom#c.headroom

. qui reg mpg headroom

. predict mpghat
(option xb assumed; fitted values)

. ivregress 2sls price (mpg c.mpg#c.mpg = mpghat c.mpghat#c.mpghat)

Instrumental variables (2SLS) regression          Number of obs   =         74
                                                  Wald chi2(2)    =       2.59
                                                  Prob > chi2     =     0.2743
                                                  R-squared       =          .
                                                  Root MSE        =     3707.2

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -3003.265   2002.773    -1.50   0.134    -6928.628     922.097
             |
 c.mpg#c.mpg |   64.63512   44.86007     1.44   0.150      -23.289    152.5592
             |
       _cons |   38675.59   21011.68     1.84   0.066    -2506.549    79857.74
------------------------------------------------------------------------------
Instrumented:  mpg c.mpg#c.mpg
Instruments:   mpghat c.mpghat#c.mpghat

.

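What we did so far is allowed. Both commands give the same estimates (up to rounding), but only because the model is exactly identified with a single instrument: mpghat is a linear function of headroom, so the two instrument sets span the same space. If the model were overidentified, the two sets of estimates would generally differ.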
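The reason the two allowed commands agree can be written out in a few lines. The first-stage coefficients \hat{\pi}_0 and \hat{\pi}_1 are notation introduced here, not something that appears in the output above.

```latex
\begin{aligned}
\hat{x}   &= \hat{\pi}_0 + \hat{\pi}_1 z, \\
\hat{x}^2 &= \hat{\pi}_0^2 + 2\hat{\pi}_0\hat{\pi}_1 z + \hat{\pi}_1^2 z^2,
\end{aligned}
\qquad\Longrightarrow\qquad
\operatorname{span}\{1, \hat{x}, \hat{x}^2\} = \operatorname{span}\{1, z, z^2\}
\quad \text{if } \hat{\pi}_1 \neq 0 .
```

2SLS depends on the instruments only through the space they span, so (z, z^2) and (\hat{x}, \hat{x}^2) give the same point estimates here. With two instruments z_1 and z_2, \hat{x}^2 would contain a cross term z_1 z_2 and the pair (\hat{x}, \hat{x}^2) would span a smaller space than (z_1, z_2, z_1^2, z_2^2), so the estimates would generally differ.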

The following is forbidden:

Code:

. reg price mpghat c.mpghat#c.mpghat

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =      2.11
       Model |  35556811.7         2  17778405.8   Prob > F        =    0.1293
    Residual |   599508584        71  8443782.88   R-squared       =    0.0560
-------------+----------------------------------   Adj R-squared   =    0.0294
       Total |   635065396        73  8699525.97   Root MSE        =    2905.8

-----------------------------------------------------------------------------------
            price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
           mpghat |   4017.599   2320.138     1.73   0.088    -608.6247    8643.822
                  |
c.mpghat#c.mpghat |  -98.40761   54.79896    -1.80   0.077    -207.6736    10.85842
                  |
            _cons |  -34207.12   24345.77    -1.41   0.164     -82751.2    14336.97
-----------------------------------------------------------------------------------

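Note that these estimates are rather different from the 2SLS estimates above: the coefficient on mpghat is 4017.6 versus -3003.3 for mpg, and the coefficient on the squared term is -98.4 versus 64.6.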


River Huang

Join Date: Mar 2016

Posts: 1908
#8

16 Mar 2021, 03:20

Dear Joro, Thanks again for your helpful clarification.

FernandoRios

Join Date: Apr 2014

Posts: 2459
#9

16 Mar 2021, 04:57

Hi River
For situations like this, you also have the option to use the control function approach, in which the residuals, rather than the predicted values, of the first stage are used in your second stage.
Of course, the main problem with this is that ready-made commands for its application are not yet available, although last year Enrique Pinzon hinted that there could be a "CFregress" command coming to Stata.
Fernando
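A minimal sketch of the control function idea, done by hand on the same auto data used earlier in the thread. It keeps the thread's purely illustrative assumption that mpg is endogenous and headroom is a valid instrument. The variable name vhat is made up for this example.

```stata
* Sketch only: control-function version of the quadratic example
sysuse auto, clear

* First stage: regress the endogenous variable on the instrument
regress mpg headroom
predict double vhat, residuals

* Second stage: keep the actual mpg and mpg^2, and add the first-stage
* residual as an extra regressor (the "control function")
regress price mpg c.mpg#c.mpg vhat

* A test on vhat is a simple check for endogeneity of mpg
test vhat
```

The second-stage standard errors ignore the fact that vhat is itself estimated, so in practice both stages would usually be bootstrapped together. The approach also relies on stronger assumptions than 2SLS, roughly that the structural error depends on the first-stage error only linearly.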
River Huang

Join Date: Mar 2016

Posts: 1908
#10

16 Mar 2021, 16:37

Dear FernandoRios, Thanks for the additional information.

Duong Le

Join Date: Apr 2020

Posts: 66
#11

16 Sep 2021, 00:29

Originally posted by Joro Kolev:

In the above Xhat is the predicted value from a regression of X on Z.

Dear Joro Kolev and other members,

I would also like to ask a question about the forbidden regression. I have read some papers whose authors state that 2SLS produces inconsistent estimates when the dependent variable is continuous and the endogenous variable is binary (the IV used to instrument for the endogenous variable is also binary), and they suggest using a three-stage procedure instead. Here is what I am confused about, and I hope to have your insightful advice:

1) If both the dependent variable and independent variable are binary, using 2SLS would be appropriate, is my understanding correct? Additionally, are there any conditions on how the IV is measured (e.g., should it be binary or continuous)?
2) If the dependent variable is continuous and independent variable is binary, using 2SLS might not be appropriate and the three-stage procedure should be used instead. Is this correct?
3) I would highly appreciate your advice on how to interpret the results in this case. Take the following example: the dependent variable Y is the log of income; edu is a binary endogenous variable (0=high school or above; 1=lower than high school); and Z (1=exposed; 0 otherwise) is an education reform that is assumed to be exogenous. I am interested in the effect of education on income, and Z is used to instrument for edu. Below is the result:

----------------------------------------------------------------------------------
           lninc | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-----------------+----------------------------------------------------------------
             edu |  -1.315783   .4834666    -2.72   0.006     -2.26336   -.3682062

Is it accurate to interpret this as saying that being in the lower-than-high-school group leads to [exp(-1.316)-1]*100 = -73.2%, i.e., 73.2% lower income than individuals with high school or above education?

Thank you.

Last edited by Duong Le; 16 Sep 2021, 00:32.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#12

16 Sep 2021, 01:21

2SLS is always appropriate, regardless of the nature of any of the variables, in the same sense in which the Linear Probability Model is appropriate for estimating probabilities. Neither properly takes into account the nature of the variables involved, but both provide a reasonable approximation of the conditional mean at the means of the variables.

There are never any conditions on the nature of your instrumental variable.

Therefore:

1) “If both the dependent variable and independent variable are binary, using 2SLS would be appropriate.”

Yes, you can use 2SLS (2SLS is appropriate) in the same sense in which you can use a Linear Probability Model without instrumentation. If you want to properly take into account the binary nature of the dependent and independent variables, you can also try -biprobit-.

2) “If the dependent variable is continuous and independent variable is binary, using 2SLS might not be appropriate and the three-stage procedure should be used instead.”

No, 2SLS is still appropriate. You can also do the following procedure (which you probably mean by “three-stage procedure”): i) Fit a probit or logit model of the endogenous regressor on the instruments, and generate the predictions. ii) Use these generated predictions as instrumental variables in the 2SLS analysis. Do not plug the predictions in directly (this is the forbidden regression); use them as instruments.

3) Yes, your interpretation of the estimated parameter is correct, according to Professor Wooldridge. In his “Introductory Econometrics” he has a section on “More on using logarithmic functional forms” where he states this interpretation without proof, and the same interpretation appears in his “Econometric Analysis of Cross Section and Panel Data,” again without proof.

There is also another formula adjusting for curvature, which I cannot recall ever seeing used in published research (so I guess you can safely disregard it, although this looks more like the correct formula to me):
if b is the estimated coefficient on a dummy variable and V(b) is the estimated variance of b then:

g = 100 (exp(b - V(b)/2) - 1)

gives an estimate of the percentage impact of the dummy variable on the variable being explained.

http://www.econometrics.com/intro/dumlog.htm
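As a quick check, the two formulas can be evaluated with the coefficient and standard error reported in #11. The scalar names below are made up for this example.

```stata
* Inputs copied from the output in #11
scalar b_edu  = -1.315783      // coefficient on edu
scalar se_edu = .4834666       // its standard error, so V(b) = se_edu^2

* Simple formula: 100*(exp(b) - 1)
scalar g_naive = 100*(exp(b_edu) - 1)

* Variance-adjusted formula quoted above: 100*(exp(b - V(b)/2) - 1)
scalar g_adj = 100*(exp(b_edu - se_edu^2/2) - 1)

display "simple formula:   " %6.2f g_naive " percent"
display "adjusted formula: " %6.2f g_adj   " percent"
```

With these inputs the simple formula gives about -73.2 percent and the adjusted one about -76.1 percent. The gap is noticeable here only because the standard error is large relative to the coefficient.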
Duong Le

Join Date: Apr 2020

Posts: 66
#13

16 Sep 2021, 08:05

Dear Prof. Joro,

Thank you so much for your explicit explanations and insightful advice. I appreciate that.

As for question 2, I would like to ask a follow-up question:

No, 2SLS is still appropriate

Do you mean here that using either 2SLS or the three-stage procedure is appropriate in this case?

Additionally, I would like to seek your advice on the interpretation of an interaction term in IV regressions. Specifically, I want to examine the heterogeneous effects of education by location of residence (rural and urban areas). Let area (1=urban; 0=rural) denote whether an individual lives in an urban or a rural area. My regression is as follows:

Code:

xtivreg2 Y x1 area (x2 x2*area = Z Z*area), fe robust cluster(id)

Results:

Code:

----------------------------------------------------------------------------------
                 |               Robust
           lninc | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-----------------+----------------------------------------------------------------
             edu |   .0209683   .1946696     0.11   0.914    -.3605771    .4025137
        edu*area |   .4151423   .2424001     1.71   0.087    -.0599531    .8902377

Since xtivreg2 from SSC does not support factor-variable notation, I created the interaction terms (x2*area and Z*area) manually. I am not sure which group in my sample the interaction term edu*area refers to. I guess it is individuals living in urban areas and with a higher education (high school and above), so my interpretation is that individuals living in urban areas and with higher education have a higher income than their counterparts living in rural and with lower income by [exp(0.415)-1]*100. Am I correct about the group identification of the interaction term?

Thank you.
P.S. I am sorry for the messy output. I tried to format the table nicely but failed.

Last edited by Duong Le; 16 Sep 2021, 08:07.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#14

17 Sep 2021, 01:39

“Do you mean here that using either 2SLS or the three-stage procedure is appropriate in this case?”

Yes, both procedures are consistent. The three-step procedure might be more “natural” than plain 2SLS, and therefore preferable, but 2SLS is consistent in this case too, so there is nothing wrong with just using 2SLS.
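The three-step procedure can be sketched as follows. The data are simulated purely for illustration. All variable names and parameter values are hypothetical and are not taken from the papers or the data discussed in this thread.

```stata
* Hypothetical simulated data: binary endogenous regressor d, binary
* instrument z, exogenous covariate x, continuous outcome y
clear
set seed 12345
set obs 5000
generate x = rnormal()                               // exogenous covariate
generate z = rbinomial(1, 0.5)                       // binary instrument
generate u = rnormal()                               // unobserved confounder
generate d = (0.8*z + 0.5*x + u + rnormal() > 0.4)   // endogenous dummy
generate y = 1 + 0.5*d + 0.3*x + u + rnormal()       // outcome

* Step 1: probit of the endogenous dummy on the instrument and covariates
probit d z x
predict double phat, pr

* Steps 2 and 3: ordinary 2SLS, using the fitted probability as the
* instrument for d (phat is NOT plugged in as a regressor)
ivregress 2sls y x (d = phat), vce(robust)
```

Replacing the last line with regress y x phat would be the forbidden regression, because its consistency would depend on the probit first stage being correctly specified.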

“my interpretation is that individuals living in urban areas and with higher education have a higher income than their counterparts living in rural and with lower income by [exp(0.415)-1]*100. ”

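I am not sure what you are saying. What you have is two dummy variables and their interaction. The base (omitted) category is (edu = 0 & rural). The coefficient on edu*area is not by itself the gap between the (edu = 1 & urban) group and the base category; that gap is the sum of the coefficients on edu, area, and edu*area. The coefficient on the interaction measures how much the effect of edu differs between urban and rural areas: the coefficient on edu (0.021) is the effect on log income in rural areas, and 0.021 + 0.415 is the effect in urban areas. So the estimate 0.415 says that the effect of edu on log income is 0.415 higher in urban areas than in rural areas.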
Duong Le

Join Date: Apr 2020

Posts: 66
#15

17 Sep 2021, 02:32

Thank you, Professor. I got it. Really appreciate your enthusiastic support.

Sorry to add one more question. Is this a standard way to examine the heterogeneous effects of education by area? I am thinking of another approach: I split my data into two groups, rural and urban areas. Then, I estimate the effects of education separately for each group. Finally, I test whether the coefficients on education estimated from the two groups are significantly different from each other.

Thank you.

Last edited by Duong Le; 17 Sep 2021, 02:38.