Interaction with instrumented variable

Javier Gutierrez

Join Date: May 2018

Posts: 14
#1

09 May 2019, 03:44

I am measuring the impact of distance to the nearest city on employment outcomes.

Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 + controls + e
Y is a binary variable.

X1 is a categorical variable (income categories)

X2 is a continuous endogenous variable (distance), instrumented by Z.

I would like to interact the categorical variable with the instrumented variable to see how the impact of distance on employment outcomes varies across income categories. When I run an ivprobit regression with this interaction, the coefficients increase a lot. Am I doing this correctly, or is there a specific command I should use?

Thank you very much!
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

10 May 2019, 10:38

You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output and sample data using dataex.

You don't even tell us exactly what you ran which makes it impossible to tell you if you're doing it correctly. Generally, for interactions with endogenous variables, you need to create the interaction before the estimation and include the interaction among the endogenous variables. If there is endogeneity, then controlling for it should change the parameters.
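To illustrate the advice above, here is a minimal sketch using Stata's shipped laborsup example data rather than the poster's data. The names otherinckids and maleeduckids are made up for this example, and using the product of the instrument and the exogenous variable as a second instrument is an assumption, not something established in this thread. Because there are two endogenous terms, at least two excluded instruments are needed. Whether the resulting ivprobit model is well founded is a separate question.

```stata
* Sketch only: laborsup is a Stata example dataset, not the poster's data
webuse laborsup, clear

* Build the interaction terms by hand (hypothetical variable names)
generate double otherinckids = other_inc * kids
generate double maleeduckids = male_educ * kids

* Two endogenous terms, so at least two excluded instruments are required
ivprobit fem_work fem_educ kids (other_inc otherinckids = male_educ maleeduckids), nolog
```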

Javier Gutierrez

Join Date: May 2018
Posts: 14
#3

21 May 2019, 07:57

Dear Phil,

Thanks for answering. Sorry, I did not report a sample dataset because it's too big. Hope this code will help to understand my question:

This is the simple ivprobit I am running:

Code:

ivprobit empl age agesq edu gender i.wealth (log_dist = log_dist2), vce(robust)

Fitting exogenous probit model

Iteration 0:   log likelihood = -5964.9268  
Iteration 1:   log likelihood = -4845.5347  
Iteration 2:   log likelihood = -4825.4807  
Iteration 3:   log likelihood = -4825.4677  
Iteration 4:   log likelihood = -4825.4677  

Fitting full model

Iteration 0:   log pseudolikelihood = -20854.293  
Iteration 1:   log pseudolikelihood =  -20854.29  
Iteration 2:   log pseudolikelihood =  -20854.29  

Probit model with endogenous regressors         Number of obs     =      8,868
                                                Wald chi2(9)      =    2301.30
Log pseudolikelihood =  -20854.29               Prob > chi2       =     0.0000

-----------------------------------------------------------------------------------------
                        |               Robust
                        |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------+----------------------------------------------------------------
               log_dist |   .2800374    .047623     5.88   0.000     .1866981    .3733768
                    age |   .2348998    .013861    16.95   0.000     .2077328    .2620667
                  agesq |  -.0028701   .0001952   -14.70   0.000    -.0032528   -.0024874
                    edu |   .1254677   .0071261    17.61   0.000     .1115009    .1394346
                 gender |  -.7290411   .0406152   -17.95   0.000    -.8086454   -.6494369
                        |
             wealth_cat |
                poorer  |   .0908315   .0452128     2.01   0.045     .0022162    .1794469
                middle  |   .1695209   .0508201     3.34   0.001     .0699153    .2691264
                richer  |   .3361315   .0557483     6.03   0.000     .2268668    .4453963
               richest  |   .3973981   .0584359     6.80   0.000     .2828659    .5119302
                        |
                  _cons |  -5.596588   .2586441   -21.64   0.000    -6.103521   -5.089654
------------------------+----------------------------------------------------------------
 corr(e.log_dist,e.empl)|  -.4346023   .0708039                     -.5626107   -.2862428
          sd(e.log_dist)|   1.474843   .0160344                      1.443749    1.506607
-----------------------------------------------------------------------------------------
Instrumented:  log_dist
Instruments:   age agesq edu gender 2.wealth_cat 3.wealth_cat 4.wealth_cat 5.wealth_cat
               log_dist2
-----------------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 28.44       Prob > chi2 = 0.0000

Then I would like to interact a binary variable (gender, which is simpler) with the instrumented variable to see how the impact of distance on employment outcomes differs by gender:

Code:

ivprobit empl age agesq edu i.gender#c.log_dist  i.wealth (log_dist = log_dist2), vce(robust)

Fitting exogenous probit model

Iteration 0:   log likelihood = -5964.9268  
Iteration 1:   log likelihood = -5123.7107  
Iteration 2:   log likelihood = -5107.2074  
Iteration 3:   log likelihood = -5107.1729  
Iteration 4:   log likelihood = -5107.1729  

Fitting full model

Iteration 0:   log pseudolikelihood = -16489.935  
Iteration 1:   log pseudolikelihood = -16489.932  
Iteration 2:   log pseudolikelihood = -16489.932  

Probit model with endogenous regressors         Number of obs     =      8,868
                                                Wald chi2(9)      =    2202.72
Log pseudolikelihood = -16489.932               Prob > chi2       =     0.0000

-----------------------------------------------------------------------------------------
                        |               Robust
                        |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------+----------------------------------------------------------------
               log_dist |   .5569142   .1021977     5.45   0.000     .3566104     .757218
                    age |   .1996109   .0172447    11.58   0.000      .165812    .2334099
                  agesq |  -.0024388    .000229   -10.65   0.000    -.0028876   -.0019901
                    edu |   .0855257   .0083696    10.22   0.000     .0691216    .1019298
                        |
      gender#c.log_dist |
                     1  |  -.4911309   .1000764    -4.91   0.000     -.687277   -.2949849
                        |
             wealth_cat |
                poorer  |   .0859501   .0418564     2.05   0.040      .003913    .1679871
                middle  |   .1418404   .0447074     3.17   0.002     .0542155    .2294653
                richer  |   .2961563   .0478771     6.19   0.000     .2023189    .3899936
               richest  |   .3370489   .0497717     6.77   0.000     .2394981    .4345998
                        |
                  _cons |  -4.960362   .3857274   -12.86   0.000    -5.716374    -4.20435
------------------------+----------------------------------------------------------------
 corr(e.log_dist,e.empl)|  -.5738943   .0843105                     -.7161557   -.3858481
          sd(e.log_dist)|   .8733999   .0176681                      .8394486    .9087244
-----------------------------------------------------------------------------------------
Instrumented:  log_dist
Instruments:   age agesq edu 1.gender#c.log_dist 2.wealth_cat 3.wealth_cat 4.wealth_cat
               5.wealth_cat log_dist2
-----------------------------------------------------------------------------------------
Wald test of exogeneity (corr = 0): chi2(1) = 27.01       Prob > chi2 = 0.0000

Here we see that the log_dist coefficient increases a lot. Also since it's an instrumented variable, I was not sure I was doing it right.

As you suggested, I also created the interaction before the estimation and included it among the endogenous variables. However, since I have only one instrument, Stata will not estimate the model.

What would you suggest? Is either of these approaches correct?

Thanks a lot!


FernandoRios

Join Date: Apr 2014

Posts: 2469
#4

21 May 2019, 08:43

Hi Javier,
I don't think your setup is correct. I don't think I have seen polynomials of an endogenous variable used as instruments for that same variable before. This basically violates the assumption that the instrument is uncorrelated with the error in the outcome equation.
Assuming your instrument is valid, I have seen previous research, and threads here on the forum for linear regressions, that use something like:

Code:

ivprobit y x1 x2 (y2 y2#x2=z1 z2 z3)

In other words, you need to account for the endogeneity of both the original endogenous variable and the interaction.
HTH
Fernando
Javier Gutierrez

Join Date: May 2018

Posts: 14
#5

21 May 2019, 08:54

Hi Fernando,

Thanks for your answer. The instrument is not a polynomial, it's an exogenous variable and a quite strong instrument. I kept the name like that just for simplicity.

As I understand, you suggest something like this?

Code:

ivprobit empl age agesq edu gender i.wealth (gender#c.log_dist log_dist = log_dist2), vce(robust)

when I run it, it shows an error:

depvars may not be interactions
The endogenous variables are incorrectly specified

Have I understood it correctly?
FernandoRios

Join Date: Apr 2014

Posts: 2469
#6

21 May 2019, 09:07

I see. Sorry for the confusion. Since you named your IV as log_dist2, my first impression was that you were using log_dist^2 as the instrument.
As for your specification: yes, that is what I have seen done before, but only for linear IV.
So this is just a hunch, but what about estimating your baseline model separately by gender, restricting the sample each time?
HTH
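A rough sketch of the split-sample idea, again on Stata's laborsup example data because the poster's data are not available. The grouping variable haskids is hypothetical and only stands in for a binary variable such as gender. Each subsample gets its own coefficient on the endogenous regressor, but nothing here tests whether the two coefficients differ.

```stata
* Sketch only: haskids is a made-up binary grouping standing in for gender
webuse laborsup, clear
generate byte haskids = kids > 0

* Baseline model estimated separately on each subsample
ivprobit fem_work fem_educ (other_inc = male_educ) if haskids == 0, nolog
ivprobit fem_work fem_educ (other_inc = male_educ) if haskids == 1, nolog
```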
Javier Gutierrez

Join Date: May 2018

Posts: 14
#7

21 May 2019, 09:46

Sure, I could run separate regressions for each gender but then I can't answer the question I am asking, whether there is any gender-related difference in the -empl-, other things being equal.
Stephanie Funk

Join Date: Apr 2021

Posts: 1
#8

16 Apr 2021, 02:13

Hello Javier,
have you or anyone else found a solution to this problem since? I have the same problem (wanting to run ivprobit with interactions of the endogenous regressor, as specified in your code), and I get the same error message from Stata:

depvars may not be interactions
The endogenous variables are incorrectly specified

Thank you very much!

Andrew Musau

Join Date: Oct 2014
Posts: 10195
#9

16 Apr 2021, 02:41

I do not think that the error message implies any econometric problem with the model. It is just a convention in Stata not to allow dependent variables to be explicit interactions. In 2SLS (two-stage least squares), the endogenous variables are dependent variables in the first-stage regressions. Therefore, just create the interacted variables yourself.

Code:

sysuse auto, clear
regress c.mpg#c.weight turn disp
gen mpg_weight= mpg*weight
regress mpg_weight turn disp

Res.:

Code:

. regress c.mpg#c.weight turn disp
depvar may not be an interaction
r(198);

. 
. gen mpg_weight= mpg*weight

. 
. regress mpg_weight turn disp

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =      1.80
       Model |   301554849         2   150777425   Prob > F        =    0.1729
    Residual |  5.9503e+09        71  83806483.6   R-squared       =    0.0482
-------------+----------------------------------   Adj R-squared   =    0.0214
       Total |  6.2518e+09        73  85641303.9   Root MSE        =    9154.6

------------------------------------------------------------------------------
  mpg_weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        turn |   37.75763    386.716     0.10   0.922    -733.3322    808.8474
displacement |    20.6968   18.52517     1.12   0.268    -16.24134    57.63495
       _cons |   55145.48   12748.49     4.33   0.000     29725.71    80565.26
------------------------------------------------------------------------------

.


Joro Kolev

Join Date: Aug 2018
Posts: 3050

#10

16 Apr 2021, 03:07

This approach does not work with -ivprobit-

Code:

. webuse laborsup

. ivprobit fem_work fem_educ kids (other_inc c.other_inc#c.kids = male_educ c.male_educ#c.kids), nolog
depvars may not be interactions
    The endogenous variables are incorrectly specified
r(198);

However it works perfectly fine in linear IV

Code:

. ivregress 2sls fem_work fem_educ kids (other_inc c.other_inc#c.kids = male_educ c.male_educ#c.kids)

Instrumental variables (2SLS) regression          Number of obs   =        500
                                                  Wald chi2(4)    =     135.82
                                                  Prob > chi2     =     0.0000
                                                  R-squared       =     0.2579
                                                  Root MSE        =     .42905

------------------------------------------------------------------------------------
          fem_work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
         other_inc |  -.0215009   .0043956    -4.89   0.000    -.0301161   -.0128858
                   |
c.other_inc#c.kids |   .0023376   .0017589     1.33   0.184    -.0011098    .0057851
                   |
          fem_educ |   .0633497   .0073805     8.58   0.000     .0488843    .0778152
              kids |  -.1749977   .0892116    -1.96   0.050    -.3498493   -.0001462
             _cons |   .8728716   .2404446     3.63   0.000     .4016089    1.344134
------------------------------------------------------------------------------------
Instrumented:  other_inc c.other_inc#c.kids
Instruments:   fem_educ kids male_educ c.male_educ#c.kids

I think that Professor Jeff Wooldridge explained that it does not make sense to mimic the strategy that works with linear -ivregress- and apply it in -ivprobit- (which is not IV; it is a control function approach). Here is the thread:
https://www.statalist.org/forums/for...quadratic-term

You can mechanically get around the Stata -ivprobit- limitation by generating the variables manually and avoiding factor-variable notation. But my understanding is that Professor Wooldridge explained that the assumptions necessary for this to work are internally inconsistent. That is, the conditions this approach needs cannot all hold at once.

Code:

. gen otherinckids =  c.other_inc#c.kids

. ivprobit fem_work fem_educ kids (other_inc otherinckids = male_educ c.male_educ#c.kids), nolog

Probit model with endogenous regressors         Number of obs     =        500
                                                Wald chi2(4)      =     174.04
Log likelihood = -4607.7924                     Prob > chi2       =     0.0000

--------------------------------------------------------------------------------------------------
                                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------------------+----------------------------------------------------------------
                       other_inc |  -.0764133   .0129582    -5.90   0.000    -.1018109   -.0510158
                    otherinckids |   .0100711   .0058323     1.73   0.084    -.0013599    .0215021
                        fem_educ |   .2081473   .0277055     7.51   0.000     .1538455    .2624491
                            kids |  -.6832745   .2934867    -2.33   0.020    -1.258498   -.1080511
                           _cons |   1.506261   .7630166     1.97   0.048     .0107759    3.001746
---------------------------------+----------------------------------------------------------------
     corr(e.other_inc,e.fem_work)|   .3919935   .1283349                      .1164238    .6115242
  corr(e.otherinckids,e.fem_work)|   .2879554    .128871                      .0209125    .5166471
 corr(e.otherinckids,e.other_inc)|   .8314174   .0138075                      .8023083    .8565813
                  sd(e.other_inc)|   16.66556   .5270111                      15.66399    17.73116
               sd(e.otherinckids)|   38.69173    1.22354                      36.36644     41.1657
--------------------------------------------------------------------------------------------------
Instrumented:  other_inc otherinckids
Instruments:   fem_educ kids male_educ c.male_educ#c.kids
--------------------------------------------------------------------------------------------------
Wald test of exogeneity: chi2(2) = 7.48                   Prob > chi2 = 0.0237

So I managed to "make it work" mechanically, but whether this makes sense as a model is a whole different story.
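For readers who want to see what a control function approach looks like when done by hand, the following is a rough two-step sketch on the same laborsup data. It is an illustration under assumptions, not necessarily the procedure recommended in the linked thread. The residual name v2hat is made up, only one first stage is estimated (for other_inc itself), and the standard errors reported by probit ignore the fact that v2hat is estimated, so both steps would have to be bootstrapped together for inference.

```stata
* Sketch only: manual two-step control function, standard errors not corrected
webuse laborsup, clear

* Step 1: reduced form for the single endogenous regressor
regress other_inc male_educ fem_educ kids
predict double v2hat, residuals   // hypothetical name for the first-stage residual

* Step 2: probit with the interaction, adding the residual as a control function
probit fem_work c.other_inc##c.kids fem_educ v2hat, nolog

* Average marginal effect of other_inc at selected values of kids
margins, dydx(other_inc) at(kids = (0 1 2))
```

The interaction enters the second step as an ordinary function of the regressors, so no separate first stage is fitted for it.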


Ibai Ostolozaga Falcon

Join Date: May 2021
Posts: 36

#11

22 May 2024, 04:31

Originally posted by Joro Kolev

This approach does not work with -ivprobit- [...] So I managed to "make it work" mechanically, but whether this makes sense as a model is a whole different story.

Hi Joro! So, as you mentioned, that approach does not really work with ivprobit. What would you recommend instead to solve the problem in #1?

Thank you in advance.


Ibai Ostolozaga Falcon

Join Date: May 2021

Posts: 36
#12

22 May 2024, 06:20

Originally posted by Javier Gutierrez

I am measuring the impact of distance to the nearest city on employment outcomes. [...]

Hello Javier.
Did you solve this issue?