
  • IV with control function method - controls and interactions with endogenous regressor

    Hi all,

    Apologies if I am double-posting, but I have struggled to find clear guidance or a post directly related to the IV-related questions below. If someone could point me in the right direction, I would greatly appreciate it.

    I am implementing an IV estimation using the control function method, with different outcome variables Y. Some models are Poisson, others OLS, but the endogenous regressor and the instrument are the same. I have always thought that
    - in all IV estimations, the first and second stage need to include the same control variables
    - if the second stage includes an interaction term between the endogenous regressor and some other variable, I need to have two first stages: one on the endogenous regressor itself, the other on the interaction term, where the instrument is interacted with the same variable.

    I would like to confirm if the above are necessarily always correct. When using the control function method, I am including the residual from the first stage in the second stage, so would this not control for the endogenous part of the endogenous regressor as well as its interacted version? I understand that each additional endogenous regressor requires an additional instrument, but is an interaction of the same variable really an additional endogenous regressor? Moreover, if the control variables used in the second stage are not required for the validity of the instrument, do they need to be in the first stage?

    So, if the second stage is

    Y = b0 + b1 X1 + b2 X1*X2 + b3 X2 + b4 X3 + b5 u + e

    where u is the residual from a regression

    X1 = a0 + a1 Z + u

    would I need to also run

    X1*X2 = a0 + a1 Z*X2 + u_int

    and include u_int in the second stage? and would the first stage regressions have to necessarily include X2 and X3?
    Last edited by Pia Andres; 21 Jul 2025, 07:30.

  • #2
    I think you only need one first stage. That's a nice thing about the CF approach when you have interactions.
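
A minimal sketch of what a single first stage could look like when the second stage contains an interaction, reusing Stata's hsng2 example data. Treating pcturban as the interacting variable X2 is purely an illustrative assumption, as is the program name cf_interact. Both steps sit inside one program so that the bootstrap re-estimates the first stage in every replication, because the standard errors from the second-stage regress ignore that vhat is itself estimated. Note that this leans on the usual CF assumption that the structural error is related to the endogenous regressor only through the first-stage residual, which is stronger than what 2SLS needs.

```stata
webuse hsng2, clear

* Both steps go inside one program so the bootstrap re-estimates the
* first stage in every replication.
capture program drop cf_interact
program define cf_interact
    * Single first stage: endogenous regressor on instruments and controls
    regress hsngval faminc i.region pcturban
    predict double vhat, residuals
    * Second stage: main effects, the interaction, and the one residual
    regress rent c.hsngval##c.pcturban vhat
    drop vhat
end

bootstrap _b, reps(50) seed(12345) nodots nowarn nodrop: cf_interact
```

The coefficient on vhat doubles as a rough check on endogeneity: if it is far from zero, the residual is picking up variation that plain OLS would have attributed to hsngval.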



    • #3
      Thank you, that's very helpful! And would the first stage need to include the same control variables as the second stage, or could it have a different set depending on what is needed for the instrument to be valid?



      • #4
        In the first stage, you regress the endogenous variable on the instruments plus the exogenous variables. In the second stage, you regress the outcome on the endogenous regressor, the exogenous variables, and the first-stage residual. So the exogenous variables have to be in both stages. In a linear model, 2SLS and the CF approach give identical point estimates. Here is an example:

        Code:
        webuse hsng2, clear
        *IV2SLS
        ivregress 2sls rent pcturban (hsngval = faminc i.region), vce(bootstrap, seed(07212025))
        
        *CF
        cap prog drop mybootstrap_prog
        prog mybootstrap_prog
        regress hsngval faminc i.region pcturban
        predict vhat, resid
        regress rent hsngval pcturban vhat
        drop vhat
        end
        bootstrap _b , reps(50) nowarn nodots nodrop seed(07212025): mybootstrap_prog
        Results:

        Code:
        . ivregress 2sls rent pcturban (hsngval = faminc i.region), vce(bootstrap, seed(07212025))
        (running ivregress on estimation sample)
        
        Bootstrap replications (50): .........10.........20.........30.........40.........50 done
        
        Instrumental-variables 2SLS regression            Number of obs   =         50
                                                          Wald chi2(2)    =      36.58
                                                          Prob > chi2     =     0.0000
                                                          R-squared       =     0.5989
                                                          Root MSE        =     22.166
        
        ------------------------------------------------------------------------------
                     |   Observed   Bootstrap                         Normal-based
                rent | coefficient  std. err.      z    P>|z|     [95% conf. interval]
        -------------+----------------------------------------------------------------
             hsngval |   .0022398   .0005969     3.75   0.000     .0010699    .0034098
            pcturban |    .081516    .437207     0.19   0.852    -.7753941     .938426
               _cons |   120.7065   17.98488     6.71   0.000      85.4568    155.9562
        ------------------------------------------------------------------------------
        Endogenous: hsngval
        Exogenous:  pcturban faminc 2.region 3.region 4.region
        
        . 
        . 
        . 
        . *CF
        
        . 
        . cap prog drop mybootstrap_prog
        
        . 
        . prog mybootstrap_prog
          1. 
        . regress hsngval faminc i.region pcturban
          2. 
        . predict vhat, resid
          3. 
        . regress rent hsngval pcturban vhat
          4. 
        . drop vhat
          5. 
        . end
        
        . 
        . bootstrap _b , reps(50) nowarn nodots nodrop seed(07212025): mybootstrap_prog
        
        Linear regression                                      Number of obs =      50
                                                               Replications  =      50
                                                               Wald chi2(3)  =   51.76
                                                               Prob > chi2   =  0.0000
                                                               R-squared     =  0.7542
                                                               Adj R-squared =  0.7382
                                                               Root MSE      = 18.0903
        
        ------------------------------------------------------------------------------
                     |   Observed   Bootstrap                         Normal-based
                rent | coefficient  std. err.      z    P>|z|     [95% conf. interval]
        -------------+----------------------------------------------------------------
             hsngval |   .0022398   .0005969     3.75   0.000     .0010699    .0034098
            pcturban |    .081516    .437207     0.19   0.852    -.7753941     .938426
                vhat |  -.0015889     .00073    -2.18   0.030    -.0030196   -.0001582
               _cons |   120.7065   17.98488     6.71   0.000      85.4568    155.9562
        ------------------------------------------------------------------------------
        
        .
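
For the Poisson outcomes mentioned in the original question, the same two-step recipe can be sketched as follows. Everything here is an assumption made for illustration: the data are simulated, the variable names (z, x1, x2, v, y) and the program name cf_poisson are hypothetical, and the true coefficient on x1 is set to 0.3 by construction. Stata's built-in ivpoisson cfunction command is a canned alternative for exponential-mean models and may be the more convenient route in practice.

```stata
clear
set seed 12345
set obs 2000

* Simulated data (illustrative only): z is the instrument, x2 a control,
* and v the first-stage error that also drives y, making x1 endogenous.
generate double z  = rnormal()
generate double x2 = rnormal()
generate double v  = rnormal()
generate double x1 = 0.8*z + 0.5*x2 + v
generate long   y  = rpoisson(exp(0.2 + 0.3*x1 + 0.2*x2 + 0.4*v))

capture program drop cf_poisson
program define cf_poisson
    * First stage: same controls as the second stage, plus the instrument
    regress x1 z x2
    predict double vhat, residuals
    * Second stage: Poisson with the first-stage residual as a regressor
    poisson y x1 x2 vhat
    drop vhat
end

* Bootstrap both steps together so the standard errors reflect that
* vhat is estimated.
bootstrap _b, reps(50) seed(12345) nodots nowarn nodrop: cf_poisson
```

Unlike the linear case above, there is no exact equivalence with 2SLS here, so the result depends on the exponential-mean specification being reasonable.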



        • #5
          Theoretically, I don't think the two sets have to be exactly the same (a control that is irrelevant in one stage would simply have a coefficient of zero there), but matching them is the usual practice and is typically how Stata does it in its canned commands. Leaving some out may lead to bias, so I'd stick with the standard format.
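
A small simulation can illustrate the bias concern. This is only a sketch under assumed values: the data-generating process, the variable names, and the scalars b_full and b_short are all hypothetical. The instrument z is deliberately correlated with the control x2, and the true coefficient on x1 is 0.3 by construction. The CF estimate is computed twice, once with x2 in the first stage and once without it.

```stata
clear
set seed 12345
set obs 5000

* Simulated data (illustrative only): the instrument z is correlated
* with the control x2, and x1 is endogenous through v.
generate double x2 = rnormal()
generate double z  = 0.6*x2 + rnormal()
generate double v  = rnormal()
generate double x1 = 0.8*z + 0.5*x2 + v
generate double y  = 1 + 0.3*x1 + 0.2*x2 + 0.5*v + rnormal()

* (a) First stage includes the control x2
regress x1 z x2
predict double vhat_full, residuals
regress y x1 x2 vhat_full
scalar b_full = _b[x1]

* (b) First stage omits the control x2
regress x1 z
predict double vhat_short, residuals
regress y x1 x2 vhat_short
scalar b_short = _b[x1]

display "x1 coefficient, control in first stage: " b_full
display "x1 coefficient, control left out:       " b_short
```

In this setup the first estimate should land close to 0.3, while the second tends to drift away from it because the short first-stage residual no longer isolates v. If z were uncorrelated with x2, the two versions would agree asymptotically, which is the "coefficient of zero" case.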

