
  • Generated regressor problem and bootstrapping?

    Hello,

    I have the fitted values of a regression to be used as a regressor in another set of regressions. I've been told that I need to be careful about the generated regressor problem and that I might need to correct my standard errors by bootstrapping. The regressor I generate is the difference between the fitted/predicted values and the actual values, i.e., the residuals.

    I tried reading the related section in the Stata documentation but did not really understand it. How and why do I need to perform the bootstrapping?

    Thanks!

  • #2
    As for the why, the usual calculation of standard errors in a regression assumes that all of the regressors (predictors) are actually fixed constants and that the only random variation in the data is in the outcome variable. But when your regressor is actually a random variable (and, in your case, is actually defined to be the random variation in the outcome of another regression) that assumption is severely violated. The usual standard error calculations completely overlook that important variation, and may greatly underestimate the actual sampling variation in your second regression results.

    As for the how, without seeing your code and some example data, I can't give you specific advice. But here's an example of something that is a bit like what I think you are trying to do:

    Code:
    sysuse auto, clear
    
    capture program drop one_rep
    program define one_rep
        * first stage: the regression whose residuals become the generated regressor
        regress displacement weight length
        capture drop new_var
        predict new_var, resid
        * second stage: the regression that uses the generated regressor
        regress price mpg new_var
        exit
    end
    
    * bootstrapping the whole program resamples both stages together
    bootstrap, reps(50): one_rep



    • #3
      Another way to look at the generated regressor problem is that your generated regressor is not the actual variable that you want to include in your regression, but an estimator of that variable. As an estimator, the generated regressor has additional sampling variance that needs to be taken into account when we calculate the variance of our final parameter estimates.

      The key to the excellent example Clyde provides is that Clyde bootstraps both stages of the procedure, which is the correct way to overcome the generated regressor problem by bootstrapping. (Bootstrapping only the second stage, as is sometimes incorrectly done, does not resolve the problem.)





      • #4
        Clyde, thank you very much for the example, I understand the concept now. Thank you Joro for the explanation.



        • #5
          Thanks to Clyde and Joro for helpful comments on this issue; I have a minor question related to this. Is this method valid for time-series data?



          • #6
            If I use an -ivregress- command for the endogeneity problem instead of running the two stages separately, do I need to include vce(bootstrap) in the command? How can I correct the standard errors in both stages in that case?
            Last edited by Sayoree Gooptu; 24 Jun 2021, 22:25.



            • #7
              Yes, bootstrapping both stages to resolve the generated regressor problem is valid as long as one uses a valid bootstrap scheme.

              Time-series data are harder to bootstrap; the simplest scheme for time-series data is probably the residual bootstrap. See
              Kolev, Gueorgui I., and Rasa Karapandza. "Out-of-sample equity premium predictability and sample split–invariant inference." Journal of Banking & Finance 84 (2017): 188-201.
              for an application in assessing stock returns predictability.
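
              For what it's worth, here is a minimal sketch of a residual bootstrap for a single-equation regression; the variable names y, x, and time are placeholders, and it assumes the regressors can be held fixed across replications (it also replaces the data in memory at the end):

              Code:
              tsset time
              regress y x
              predict double yhat, xb
              predict double ehat, resid
              
              tempname sims
              postfile `sims' b_x using resboot, replace
              forvalues b = 1/500 {
                  * draw residuals with replacement and rebuild the outcome
                  capture drop estar ystar
                  quietly gen double estar = ehat[ceil(runiform()*_N)]
                  quietly gen double ystar = yhat + estar
                  quietly regress ystar x
                  post `sims' (_b[x])
              }
              postclose `sims'
              
              use resboot, clear
              summarize b_x   // the SD of b_x is the residual-bootstrap standard error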


              Originally posted by Victoria Consolvo View Post
              Thanks to Clyde and Joro for helpful comments on this issue; I have a minor question related to this. Is this method valid for time-series data?



              • #8
                If you can use an IV procedure for your problem, this typically resolves the generated regressor problem.

                So no, you do not need to use vce(bootstrap).
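
                For concreteness, using the same auto data as Clyde's example in #2, an IV specification might look like this (purely illustrative; no claim that these are valid instruments):

                Code:
                sysuse auto, clear
                * displacement is treated as endogenous, instrumented by weight and length;
                * the reported standard errors already account for the first stage
                ivregress 2sls price mpg (displacement = weight length)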

                Originally posted by Sayoree Gooptu View Post
                If I use an -ivregress- command for the endogeneity problem instead of running the two stages separately, do I need to include vce(bootstrap) in the command? How can I correct the standard errors in both stages in that case?



                • #9
                  Actually, I am addressing both sample selection and endogeneity, following Wooldridge (2010), section 19.6.2. I ran a probit for the selection model, computed the inverse Mills ratio (IMR), and incorporated it in -ivregress 2sls-. In that case, bootstrapping is mentioned because the IMR is a generated regressor. How do I use my bootstrap command in that situation? Also, my data are weighted. When I run the regression with vce(bootstrap, cluster(id)), it says that weights are not supported.



                  • #10
                    See the example Clyde shows in #2: you need to bootstrap both of your stages. The way you are doing it does not resolve the generated regressor problem, because you are bootstrapping only the second stage.

                    About the weights: you need to say more about what weights you are using and why your data are weighted.
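
                    To sketch what bootstrapping both stages might look like for the procedure described in #9 (every variable name below is a placeholder for your selection indicator, outcome, regressors, and instruments):

                    Code:
                    capture program drop two_stage
                    program define two_stage
                        * stage 1: selection probit and the inverse Mills ratio
                        probit selected z1 z2
                        capture drop xbhat imr
                        predict double xbhat, xb
                        gen double imr = normalden(xbhat)/normal(xbhat)
                        * stage 2: IV regression including the generated IMR
                        ivregress 2sls y x1 imr (x2 = z3) if selected
                        exit
                    end
                    
                    bootstrap, reps(50): two_stage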

                    Originally posted by Sayoree Gooptu View Post
                    Actually, I am addressing both sample selection and endogeneity, following Wooldridge (2010), section 19.6.2. I ran a probit for the selection model, computed the inverse Mills ratio (IMR), and incorporated it in -ivregress 2sls-. In that case, bootstrapping is mentioned because the IMR is a generated regressor. How do I use my bootstrap command in that situation? Also, my data are weighted. When I run the regression with vce(bootstrap, cluster(id)), it says that weights are not supported.



                    • #11
                      Dear Joro,

                      Following the above bootstrap discussion, I have one question.
                      In my two-stage regressions, I have multinomial logit as the first stage and OLS as the second stage. I use the predicted probabilities estimated from -mlogit- as independent variables in the second stage OLS regression. The estimated probabilities are generated regressors because they come from another model. Therefore, I plan to use bootstrap to correct the standard errors.

                      Please find my code below:
                      Code:
                      1. mlogit Choice x1 x2 x3      //Choice has three categories
                      2. predict Prob0 Prob1 Prob2  //generated regressors
                      
                      3. capture program drop myboot  //a user program should not share the name of -bootstrap- itself
                      4. program define myboot
                      5. mlogit Choice x1 x2 x3
                      6. capture drop new_var
                      7. predict new_var, resid
                      8. reg Y Prob1 Prob2 x1 x2 
                      9. exit
                      10. end
                      11. bootstrap, reps(50): myboot
                      Lines 1 and 2 generate the predicted probabilities.
                      I then use lines 3-11 to set up the bootstrap program. However, there is an error when I run line 7: it seems that option -resid- is not allowed after -mlogit-. How should I obtain residuals from -mlogit- to proceed with the bootstrap?

                      I appreciate your kind help!






                      • #12
                        Hi Mengqian
                        The procedure you are using is incorrect for a few reasons.
                        1. It isn't a good idea to use predicted probabilities from a first model as regressors in a second model. This is akin to the forbidden regression problem.
                        2. While your lines 1 and 2 are correct (they predict probabilities), lines 5-7 are not, because there you are not predicting those probabilities; you are trying to predict residuals.
                        3. -mlogit- and other nonlinear models do not have residuals as we are accustomed to seeing them (y - xb).
                        4. I think in this case the best option may be a control function.
                        This means: change line 6 to
                        capture drop r1 r2 r3
                        change line 7 to
                        predict r*, scores
                        change line 8 to
                        reg Y x1 x2 i.Choice r1 r2 r3

                        This may do what you need to do.

                        Best wishes



                        • #13
                          Dear FernandoRios,

                          Thanks a lot for your kind suggestion. I realized that I should predict probabilities rather than residual.

                          I adjusted my codes according to your comments:
                          Code:
                          1. mlogit Choice x1 x2 x3      //Choice has three categories
                          2. predict Prob0 Prob1 Prob2  //generated regressors
                          
                          3. capture program drop myboot  //a user program should not share the name of -bootstrap- itself
                          4. program define myboot
                          5. mlogit Choice x1 x2 x3
                          6. capture drop r1 r2 r3
                          7. predict r*, scores
                          8. reg Y x1 x2 i.Choice r1 r2 r3
                          9. exit
                          10. end
                          11. bootstrap, reps(50): myboot
                          Q1: I note that line 7 predicts scores rather than probabilities. Is that because of your first comment that "it isn't a good idea to use predicted probabilities from a first model as regressors of the second model"?

                          Q2: The independent variables of the second-stage model in line 8 include x1 x2 i.Choice r1 r2 r3. My variables of interest are r2 and r3, so can I just drop r1? Also, may I ask why i.Choice should be included in the regression?

                          Many thanks!

                          Originally posted by FernandoRios View Post
                          Hi Mengqian
                          The procedure you are using is incorrect for a few reasons.
                          1. It isn't a good idea to use predicted probabilities from a first model as regressors in a second model. This is akin to the forbidden regression problem.
                          2. While your lines 1 and 2 are correct (they predict probabilities), lines 5-7 are not, because there you are not predicting those probabilities; you are trying to predict residuals.
                          3. -mlogit- and other nonlinear models do not have residuals as we are accustomed to seeing them (y - xb).
                          4. I think in this case the best option may be a control function.
                          This means: change line 6 to
                          capture drop r1 r2 r3
                          change line 7 to
                          predict r*, scores
                          change line 8 to
                          reg Y x1 x2 i.Choice r1 r2 r3

                          This may do what you need to do.

                          Best wishes



                          • #14
                            Hi Mengqian
                            Q1. Yes. That is why it isn't getting probabilities but rather scores, which I think are akin to generalized residuals. Otherwise you run into the problem I mentioned before.
                            Q2. r1, r2, and r3 are just residuals, not probabilities. They help address endogeneity by "controlling" for the endogenous component. That is why you include i.Choice, since that is the variable you are really interested in.
                            That being said, what I am suggesting is very general, and I am unaware of empirical or theoretical work that handles endogeneity when the endogenous variable is categorical and the first step is a multinomial logit. So it may be that this strategy isn't valid.
                            HTH

