
  • Generated regressor problem and bootstrapping?

    Hello,

    I have the fitted values of a regression to be used as a regressor in another set of regressions. I've been told that I need to be careful about the generated regressor problem and that I might need to correct my standard errors by bootstrapping. The regressor I generate is the difference between the fitted/predicted values and the actual values, i.e., the residuals.

    I tried reading the related section in the Stata documentation but did not really understand it. How and why do I need to perform the bootstrapping?

    Thanks!

  • #2
    As for the why, the usual calculation of standard errors in a regression assumes that all of the regressors (predictors) are actually fixed constants and that the only random variation in the data is in the outcome variable. But when your regressor is actually a random variable (and, in your case, is actually defined to be the random variation in the outcome of another regression) that assumption is severely violated. The usual standard error calculations completely overlook that important variation, and may greatly underestimate the actual sampling variation in your second regression results.

    As for the how, without seeing your code and some example data, I can't give you specific advice. But here's an example of something that is a bit like what I think you are trying to do:

    Code:
    sysuse auto, clear
    
    capture program drop one_rep
    program define one_rep
        * first stage: the regression whose residuals become the generated regressor
        regress displacement weight length
        capture drop new_var
        predict new_var, resid
        * second stage: the regression that uses the generated regressor
        regress price mpg new_var
        exit
    end
    
    * bootstrapping the whole program resamples both stages together
    bootstrap, reps(50): one_rep



    • #3
      Another way to look at the generated regressor problem is that your generated regressor is not the actual variable that you want to include in your regression, but an estimator of that variable. As an estimator, the generated regressor has additional sampling variance that needs to be taken into account when we calculate the variance of our final parameter estimates.

      The key to the excellent example Clyde provides is that Clyde bootstraps both stages of the procedure, which is the correct way to overcome the generated regressor problem by bootstrapping. (Bootstrapping only the second stage, as is sometimes incorrectly done, does not resolve the problem.)





      • #4
        Clyde, thank you very much for the example, I understand the concept now. Thank you Joro for the explanation.



        • #5
          Thanks to Clyde and Joro for helpful comments on this issue; I have a minor question related to this. Is this method valid for time-series data?



          • #6
            If I use an -ivregress- command for the endogeneity problem instead of running the two stages separately, do I need to include vce(bootstrap) in the command? How can I correct the standard errors in both stages in that case?
            Last edited by Sayoree Gooptu; 24 Jun 2021, 22:25.



            • #7
              Yes, bootstrapping both stages to resolve the generated regressor problem is valid as long as one uses a valid bootstrap scheme.

              Time-series data are harder to bootstrap; the simplest scheme for time-series data is probably the residual bootstrap. See
              Kolev, Gueorgui I., and Rasa Karapandza. "Out-of-sample equity premium predictability and sample split–invariant inference." Journal of Banking & Finance 84 (2017): 188-201.
              for an application in assessing stock returns predictability.
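
              For what it's worth, here is a minimal sketch of a residual bootstrap for a single-equation regression; the variable names y, x, and time are placeholders, and it assumes the regressors can be held fixed across replications (it also replaces the data in memory at the end):

              Code:
              tsset time
              regress y x
              predict double yhat, xb
              predict double ehat, resid
              
              tempname sims
              postfile `sims' b_x using resboot, replace
              forvalues b = 1/500 {
                  * draw residuals with replacement and rebuild the outcome
                  capture drop estar ystar
                  quietly gen double estar = ehat[ceil(runiform()*_N)]
                  quietly gen double ystar = yhat + estar
                  quietly regress ystar x
                  post `sims' (_b[x])
              }
              postclose `sims'
              
              use resboot, clear
              summarize b_x   // the SD of b_x is the residual-bootstrap standard error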


              Originally posted by Victoria Consolvo View Post
              Thanks to Clyde and Joro for helpful comments on this issue; I have a minor question related to this. Is this method valid for time-series data?



              • #8
                If you can use an IV procedure for your problem, this typically resolves the generated regressor problem.

                So no, you do not need to use vce(bootstrap).
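
                For concreteness, using the same auto data as Clyde's example in #2, an IV specification might look like this (purely illustrative; no claim that these are valid instruments):

                Code:
                sysuse auto, clear
                * displacement is treated as endogenous, instrumented by weight and length;
                * the reported standard errors already account for the first stage
                ivregress 2sls price mpg (displacement = weight length)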

                Originally posted by Sayoree Gooptu View Post
                If I use an -ivregress- command for the endogeneity problem instead of running the two stages separately, do I need to include vce(bootstrap) in the command? How can I correct the standard errors in both stages in that case?



                • #9
                  Actually, I am addressing both sample selection and endogeneity, following Wooldridge (2010), section 19.6.2. I ran a probit for the selection model, computed the inverse Mills ratio (IMR), and incorporated it in -ivregress 2sls-. In that case, bootstrapping is mentioned because the IMR is a generated regressor. How do I use my bootstrap command in that situation? Also, my data are weighted. When I run the regression with vce(bootstrap, cluster(id)), it says that weights are not supported.



                  • #10
                    See the example Clyde shows in #2: you need to bootstrap both of your stages. The way you are doing it does not resolve the generated regressor problem, because you are bootstrapping only the second stage.

                    About the weights: you need to say more about what weights you are using and why your data are weighted.
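
                    To sketch what bootstrapping both stages might look like for the procedure described in #9 (every variable name below is a placeholder for your selection indicator, outcome, regressors, and instruments):

                    Code:
                    capture program drop two_stage
                    program define two_stage
                        * stage 1: selection probit and the inverse Mills ratio
                        probit selected z1 z2
                        capture drop xbhat imr
                        predict double xbhat, xb
                        gen double imr = normalden(xbhat)/normal(xbhat)
                        * stage 2: IV regression including the generated IMR
                        ivregress 2sls y x1 imr (x2 = z3) if selected
                        exit
                    end
                    
                    bootstrap, reps(50): two_stage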

                    Originally posted by Sayoree Gooptu View Post
                    Actually, I am addressing both sample selection and endogeneity, following Wooldridge (2010), section 19.6.2. I ran a probit for the selection model, computed the inverse Mills ratio (IMR), and incorporated it in -ivregress 2sls-. In that case, bootstrapping is mentioned because the IMR is a generated regressor. How do I use my bootstrap command in that situation? Also, my data are weighted. When I run the regression with vce(bootstrap, cluster(id)), it says that weights are not supported.



                    • #11
                      Dear Joro,

                      Following the above bootstrap discussion, I have one question.
                      In my two-stage regressions, I have multinomial logit as the first stage and OLS as the second stage. I use the predicted probabilities estimated from -mlogit- as independent variables in the second stage OLS regression. The estimated probabilities are generated regressors because they come from another model. Therefore, I plan to use bootstrap to correct the standard errors.

                      Please find my code below:
                      Code:
                      1. mlogit Choice x1 x2 x3      //Choice has three categories
                      2. predict Prob0 Prob1 Prob2  //generated regressors
                      
                      3. capture program drop myboot  //a user program should not share the name of -bootstrap- itself
                      4. program define myboot
                      5. mlogit Choice x1 x2 x3
                      6. capture drop new_var
                      7. predict new_var, resid
                      8. reg Y Prob1 Prob2 x1 x2 
                      9. exit
                      10. end
                      11. bootstrap, reps(50): myboot
                      Lines 1 and 2 generate the predicted probabilities.
                      I then use lines 3-11 to set up the bootstrap program. However, there is an error when I run line 7: it seems that option -resid- is not allowed after -mlogit-. How should I obtain residuals from -mlogit- to proceed with the bootstrap?

                      I appreciate your kind help!






                      • #12
                        Hi Mengqian
                        The procedure you are using is incorrect for a few reasons.
                        1. It isn't a good idea to use predicted probabilities from a first model as regressors in a second model. This is akin to the forbidden regression problem.
                        2. While your lines 1 and 2 are correct (they predict probabilities), lines 5-7 are not, because there you are not predicting those probabilities; you are trying to predict residuals.
                        3. -mlogit- and other nonlinear models do not have residuals as we are accustomed to seeing them (y - xb).
                        4. I think in this case the best option may be a control function.
                        This means: change line 6 to
                        capture drop r1 r2 r3
                        change line 7 to
                        predict r*, scores
                        change line 8 to
                        reg Y x1 x2 i.Choice r1 r2 r3

                        This may do what you need to do.

                        Best wishes



                        • #13
                          Dear FernandoRios,

                          Thanks a lot for your kind suggestion. I realized that I should predict probabilities rather than residual.

                          I adjusted my codes according to your comments:
                          Code:
                          1. mlogit Choice x1 x2 x3      //Choice has three categories
                          2. predict Prob0 Prob1 Prob2  //generated regressors
                          
                          3. capture program drop myboot  //a user program should not share the name of -bootstrap- itself
                          4. program define myboot
                          5. mlogit Choice x1 x2 x3
                          6. capture drop r1 r2 r3
                          7. predict r*, scores
                          8. reg Y x1 x2 i.Choice r1 r2 r3
                          9. exit
                          10. end
                          11. bootstrap, reps(50): myboot
                          Q1: I note that line 7 predicts scores rather than probabilities. Is that because of your first comment that "it isn't a good idea to use predicted probabilities from a first model as regressors of the second model"?

                          Q2: The independent variables of the second-stage model in line 8 include x1 x2 i.Choice r1 r2 r3. My variables of interest are r2 and r3, so can I just drop r1? Also, may I ask why i.Choice should be included in the regression?

                          Many thanks!

                          Originally posted by FernandoRios View Post
                          Hi Mengqian
                          The procedure you are using is incorrect for a few reasons.
                          1. It isn't a good idea to use predicted probabilities from a first model as regressors in a second model. This is akin to the forbidden regression problem.
                          2. While your lines 1 and 2 are correct (they predict probabilities), lines 5-7 are not, because there you are not predicting those probabilities; you are trying to predict residuals.
                          3. -mlogit- and other nonlinear models do not have residuals as we are accustomed to seeing them (y - xb).
                          4. I think in this case the best option may be a control function.
                          This means: change line 6 to
                          capture drop r1 r2 r3
                          change line 7 to
                          predict r*, scores
                          change line 8 to
                          reg Y x1 x2 i.Choice r1 r2 r3

                          This may do what you need to do.

                          Best wishes



                          • #14
                            Hi Mengqian
                            Q1. Yes. That is why it isn't getting probabilities but rather scores, which I think are akin to generalized residuals. Otherwise you run into the problem I mentioned before.
                            Q2. r1, r2, and r3 are just residuals, not probabilities. They help address endogeneity by "controlling" for the endogenous component. That is why you include i.Choice, since that is the variable you are really interested in.
                            That being said, what I am suggesting is very general, and I am unaware of empirical or theoretical work that handles endogeneity when the endogenous variable is categorical and the first step is a multinomial logit. So it may be that this strategy isn't valid.
                            HTH

