
  • Potential bug in IVREGRESS: instruments dropped due to collinearity when they should not be dropped

    In a private conversation, haiyan lin notified me of a potential bug in ivreg2. The same problem occurs in ivregress, so I will use the latter to demonstrate it.

    In the following simplified example, I want to estimate a linear dynamic panel data model with 2 lags of the dependent variable instrumented by 2 lags of the first-differenced dependent variable.
    (Using first differences as instruments is a standard procedure as part of a system GMM estimator in dynamic panel models, where it is assumed that the levels are correlated with the unobserved unit-specific effects, but the first differences are not.)
    Code:
    . webuse abdata, clear
    . ivregress 2sls n (L.n L2.n = DL.n DL2.n)
    note: LD.n dropped due to collinearity
    equation not identified; must have at least as many instruments not in
    the regression as there are instrumented variables
    r(481);
    As you can see, ivregress drops one of the instruments, reportedly due to collinearity. As a consequence, the model becomes underidentified and ivregress exits with error.

    Yet, if you manually run the first-stage regressions, there is no evidence of any collinearity problem:
    Code:
    . regress L.n DL.n DL2.n
    
          Source |       SS           df       MS      Number of obs   =       611
    -------------+----------------------------------   F(2, 608)       =      5.50
           Model |  19.4241409         2  9.71207047   Prob > F        =    0.0043
        Residual |  1074.20669       608  1.76678733   R-squared       =    0.0178
    -------------+----------------------------------   Adj R-squared   =    0.0145
           Total |  1093.63084       610  1.79283744   Root MSE        =    1.3292
    
    ------------------------------------------------------------------------------
             L.n |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               n |
             LD. |   .7793204   .4007435     1.94   0.052    -.0076892     1.56633
            L2D. |   .9261032   .4324795     2.14   0.033     .0767682    1.775438
                 |
           _cons |   1.101659   .0575023    19.16   0.000     .9887318    1.214586
    ------------------------------------------------------------------------------
    
    . regress L2.n DL.n DL2.n
    
          Source |       SS           df       MS      Number of obs   =       611
    -------------+----------------------------------   F(2, 608)       =      2.29
           Model |  8.10427021         2  4.05213511   Prob > F        =    0.1018
        Residual |  1074.20669       608  1.76678733   R-squared       =    0.0075
    -------------+----------------------------------   Adj R-squared   =    0.0042
           Total |  1082.31096       610  1.77428027   Root MSE        =    1.3292
    
    ------------------------------------------------------------------------------
            L2.n |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               n |
             LD. |  -.2206796   .4007435    -0.55   0.582    -1.007689    .5663299
            L2D. |   .9261032   .4324795     2.14   0.033     .0767682    1.775438
                 |
           _cons |   1.101659   .0575023    19.16   0.000     .9887318    1.214586
    ------------------------------------------------------------------------------
    Also, if you estimate the same model with xtdpdgmm, there is no problem and all coefficients are identified:
    Code:
    . xtdpdgmm n L.n L2.n, iv(DL.n DL2.n)
    note: standard errors may not be valid
    
    Generalized method of moments estimation
    
    Fitting full model:
    Step 1         f(b) =  1.545e-27
    
    Group variable: id                           Number of obs         =       751
    Time variable: year                          Number of groups      =       140
    
    Moment conditions:     linear =       3      Obs per group:    min =         5
                        nonlinear =       0                        avg =  5.364286
                            total =       3                        max =         7
    
    ------------------------------------------------------------------------------
               n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               n |
             L1. |    1.25092   .0366347    34.15   0.000     1.179117    1.322722
             L2. |  -.2253948    .050786    -4.44   0.000    -.3249334   -.1258561
                 |
           _cons |  -.0743238   .0497764    -1.49   0.135    -.1718837    .0232362
    ------------------------------------------------------------------------------
    Instruments corresponding to the linear moment conditions:
     1, model(level):
       LD.n L2D.n
     2, model(level):
       _cons
    The reason why ivregress flags the instrument DL.n as collinear is the following:
    There is a perfect linear relationship between the two (!) endogenous variables and the instrument: L.n - L2.n = DL.n.

    However, I would argue that it should not be of concern that the instrument is collinear with a combination of endogenous regressors. There should only be a concern if there is collinearity among the instruments themselves (or among the regressors themselves), or if any single endogenous regressor is perfectly predicted by one or more of the instruments. The latter is not the case as we could see above in the first-stage regressions.
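To see the argument in linear-algebra terms: the stacked matrix of regressors and instruments is rank deficient, but the rank condition for IV identification — rank(Z'X) equal to the number of endogenous regressors — still holds. A minimal numpy sketch (the simulated data and variable names are mine, chosen only to reproduce the identity L.n - L2.n = DL.n):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)                 # stands in for L2.n
x1 = 0.8 * x2 + rng.normal(size=n)      # stands in for L.n
z1 = x1 - x2                            # exact identity, like L.n - L2.n = DL.n
z2 = 0.5 * x2 + rng.normal(size=n)      # a second instrument

X = np.column_stack([x1, x2])
Z = np.column_stack([z1, z2])

# The stacked matrix [X Z] is rank deficient, because z1 is an exact
# linear combination of the two endogenous regressors...
print(np.linalg.matrix_rank(np.column_stack([X, Z])))   # 3, not 4

# ...but identification only requires the rank condition rank(Z'X) = 2,
# which still holds: no single regressor is perfectly predicted by Z.
print(np.linalg.matrix_rank(Z.T @ X))                   # 2
```

This is exactly the distinction drawn above: collinearity of an instrument with a *combination* of endogenous regressors does not violate the rank condition.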

    Hence, I would call this a bug in ivregress.

    I am already in a private conversation with Kit Baum and Mark Schaffer about the same problem in ivreg2, but I want to invite others on Statalist to add their opinions to the discussion before I send an e-mail to StataCorp Tech Support.
    https://www.kripfganz.de/stata/

  • #2
    I must admit I haven't thought about this situation before. It's peculiar to instrument levels with lags, but it shouldn't be ruled out. I guess what Stata is picking up is the fact that there really are not two endogenous explanatory variables; there is only one. You can re-parameterize the model to include L.n and DL.n and then DL.n is exogenous and does not need an IV. Then, ivregress will not allow you to include DL.n as an IV because it (rightly) suspects that you're doing something unintended. You can include DL2.n as the IV for L.n and that's it. Stata's error message is cryptic but it does make one think about the nature of the endogeneity of the explanatory variables.

    So I'm not convinced it's a bug. Stata does not allow you to list the same variables as endogenous and exogenous (included or excluded) and this example is a slight variation on that theme because it is the same problem if you re-parameterize the model.

    I'm glad I don't have to make the call ....



    • #3
      Here is a minimal example using the built-in toy dataset. Two endogenous regressors, two excluded instruments, no exogenous regressors, no constant as a regressor. One of the excluded instruments is the constant. (The example is not completely crazy: Friedman's famous IV estimation for the permanent income hypothesis used the constant as an excluded instrument; there's a discussion in Hayashi's 2000 textbook.)

      Stata's ivregress gets it wrong: it drops the constant and then exits with an error because it thinks (incorrectly) that the equation is underidentified. (ivreg2 also gets it wrong, but in a different way.) But Stata's regress, using the old-fashioned syntax for IV estimation, gets it right. This can be verified by doing IV by hand in Mata.

      Code:
      sysuse auto, clear
      gen domestic = 1-foreign
      gen one = 1
      cap noi ivregress 2sls mpg (domestic foreign = one weight), nocons
      putmata y=mpg X=(domestic foreign) Z=(one weight), replace
      mata: pinv(Z'X)*Z'y
      regress mpg domestic foreign (one weight), nocons
      ivregress output:
      [Screenshot: ivregress output (Capture1.PNG)]


      By hand using Mata:
      [Screenshot: results computed by hand in Mata (Capture2.PNG)]


      regress, using the old-fashioned IV syntax:
      [Screenshot: regress output, old-fashioned IV syntax (Capture3.PNG)]
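For readers without Stata at hand, the pinv(Z'X)*Z'y computation above can be mirrored in numpy. This is only a sketch with simulated data shaped like the auto example — two endogenous regressors that sum to one, instrumented by the constant plus one outside instrument — with an invented data-generating process:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
w = rng.normal(size=n)                  # outside instrument (stands in for weight)
u = rng.normal(size=n)                  # structural error
x2 = 0.5 * w + u + rng.normal(size=n)   # endogenous (correlated with u)
x1 = 1.0 - x2                           # the two regressors sum to one, like domestic/foreign

X = np.column_stack([x1, x2])
Z = np.column_stack([np.ones(n), w])    # the constant is one of the two excluded instruments
y = 2.0 * x1 + 3.0 * x2 + u

# Exactly identified IV, computed as in the Mata snippet: beta = pinv(Z'X) * Z'y
beta = np.linalg.pinv(Z.T @ X) @ (Z.T @ y)
print(beta)                             # approximately [2, 3]
```

Even though x1 + x2 is collinear with the constant instrument, Z'X has full rank and the IV estimate is well defined — the same point the regress-by-hand results make.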



      • #4
        Originally posted by Jeff Wooldridge View Post
        So I'm not convinced it's a bug. Stata does not allow you to list the same variables as endogenous and exogenous (included or excluded) and this example is a slight variation on that theme because it is the same problem if you re-parameterize the model.
        That the model can be re-parameterized is a good point. Yet, consider an overidentified model by adding DL3.n as another instrument to the initial example. Now, ivregress still drops DL.n from the list of instruments but no longer exits with error because the parameters are still identified. But the resulting estimates are now definitely wrong:
        Code:
        . ivregress 2sls n (L.n L2.n = DL.n DL2.n DL3.n)
        note: LD.n dropped due to collinearity
        
        Instrumental variables (2SLS) regression          Number of obs   =        471
                                                          Wald chi2(2)    =    1036.63
                                                          Prob > chi2     =     0.0000
                                                          R-squared       =     0.9872
                                                          Root MSE        =      .1512
        
        ------------------------------------------------------------------------------
                   n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                   n |
                 L1. |   1.309919   .3175556     4.13   0.000     .6875211    1.932316
                 L2. |   -.335349   .3507221    -0.96   0.339    -1.022752    .3520536
                     |
               _cons |  -.0394477   .0666491    -0.59   0.554    -.1700775    .0911821
        ------------------------------------------------------------------------------
        Instrumented:  L.n L2.n
        Instruments:   L2D.n L3D.n
        
        . xtdpdgmm n L.n L2.n if e(sample), iv(DL.n DL2.n DL3.n)
        note: standard errors may not be valid
        
        Generalized method of moments estimation
        
        Fitting full model:
        Step 1         f(b) =  .00004035
        
        Group variable: id                           Number of obs         =       471
        Time variable: year                          Number of groups      =       140
        
        Moment conditions:     linear =       4      Obs per group:    min =         3
                            nonlinear =       0                        avg =  3.364286
                                total =       4                        max =         5
        
        ------------------------------------------------------------------------------
                   n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                   n |
                 L1. |   1.153779   .0455171    25.35   0.000     1.064567    1.242991
                 L2. |  -.1632875   .0551489    -2.96   0.003    -.2713773   -.0551976
                     |
               _cons |  -.0669634   .0363085    -1.84   0.065    -.1381267    .0041999
        ------------------------------------------------------------------------------
        Instruments corresponding to the linear moment conditions:
         1, model(level):
           LD.n L2D.n L3D.n
         2, model(level):
           _cons
        (You can replicate the ivregress results by dropping DL.n from the xtdpdgmm instrument list.)

        Edit: With the perfect option, ivregress produces the correct estimates.
        Last edited by Sebastian Kripfganz; 25 Aug 2020, 09:28.
        https://www.kripfganz.de/stata/
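To make the danger of silent dropping concrete outside the panel setting, here is a generic numpy sketch (not the dynamic panel model above; the data-generating process and names are mine): a textbook 2SLS with three excluded instruments, estimated once with the instrument set the user specified and once after silently dropping one column. Both happen to be consistent in this toy DGP, but the point estimates differ — the user silently gets a different estimator than the one requested.

```python
import numpy as np

def tsls(y, X, Z):
    """Two-stage least squares: regress X on Z, then y on the fitted values."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fitted values
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=(n, 3))                                  # three excluded instruments
u = rng.normal(size=n)
x = z @ np.array([1.0, 0.5, 0.2]) + u + rng.normal(size=n)   # endogenous regressor
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + u

Z_full = np.column_stack([np.ones(n), z])        # the instrument set the user asked for
Z_drop = np.column_stack([np.ones(n), z[:, 1:]]) # after silently dropping one instrument

b_full = tsls(y, X, Z_full)
b_drop = tsls(y, X, Z_drop)
print(b_full)   # close to [1, 2]
print(b_drop)   # also close here, but a numerically different estimate
```

In the thread's example the discrepancy is much larger, because the dropped instrument DL.n carries identifying information that the remaining lags do not.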



        • #5
          Jeff,

          Does what you say above:

          Originally posted by Jeff Wooldridge View Post
          I guess what Stata is picking up is the fact that there really are not two endogenous explanatory variables; there is only one.
          apply to the minimal example I posted above using the toy auto dataset? I don't think so but would be happy to be shown wrong.

          --Mark



          • #6
            Mark:
            Yes, Jeff's comments do apply to your example:
            Code:
            . ivregress 2sls mpg (foreign = weight)
            
            Instrumental variables (2SLS) regression          Number of obs   =         74
                                                              Wald chi2(1)    =      27.05
                                                              Prob > chi2     =     0.0000
                                                              R-squared       =          .
                                                              Root MSE        =     7.6721
            
            ------------------------------------------------------------------------------
                     mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                 foreign |    17.1176   3.291427     5.20   0.000     10.66653    23.56868
                   _cons |   16.20828   1.323985    12.24   0.000     13.61332    18.80324
            ------------------------------------------------------------------------------
            Instrumented:  foreign
            Instruments:   weight
            
            . lincom foreign + _cons
            
             ( 1)  foreign + _cons = 0
            
            ------------------------------------------------------------------------------
                     mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     (1) |   33.32588   2.478889    13.44   0.000     28.46735    38.18442
            ------------------------------------------------------------------------------
            https://www.kripfganz.de/stata/



            • #7
              Sebastian - your example isn't a different parameterisation of the same model. They are different models.

              Original model:

              Code:
              regress mpg domestic foreign (one weight), nocons
              Root MSE = 7.7779

              Your model:

              Code:
              ivregress 2sls mpg (foreign = weight)
              Root MSE = 7.6721



              • #8
                Mark:
                This difference is just due to different degrees-of-freedom corrections. If you add the small option to ivregress, the Root MSEs coincide.
                https://www.kripfganz.de/stata/



                • #9
                  Ah - now that is interesting!



                  • #10
                    So to take the toy auto dataset example further, and using the old-fashioned syntax for regress, start again with

                    Code:
                    regress mpg domestic foreign (one weight), nocons
                    and we detect that there is a collinearity in there ("there really are not two endogenous explanatory variables; there is only one").

                    Is there a way to reparameterise the model by declaring that one of the two provided endogenous explanatory variables is actually exogenous? If there were, it would be the kind of thing that an estimator could do automatically, but I don't think it's possible without in effect changing the model (unlike the usual case where dropping collinear variables doesn't change the model). For example, continuing in the old-fashioned regress syntax, reassigning foreign to be exogenous would be

                    Code:
                    regress mpg domestic foreign (one weight foreign), nocons
                    but it is a different model (RMSE is much different). Same applies if you reassign domestic instead (of course).



                    • #11
                      I do not think you can easily automate such a re-parameterization, and probably you should not. But that leads to a practical problem:
                      1. If ivregress exits with error because the model is no longer identified after dropping an instrument, that might be acceptable, because the user then needs to think about what is going on.
                      2. If ivregress in the overidentified case just drops the instrument and then moves on to produce estimates for the model with the original endogenous variables, can we really expect users to think about the reason for the dropped instrument, and to figure out by themselves that they should either re-parameterize the model or use the option perfect to avoid the collinearity check? In larger models, these linear dependencies may no longer be obvious. I might seriously consider using the perfect option by default whenever I run an ivregress estimation. I find it difficult to imagine a situation where this option does any harm.
                      https://www.kripfganz.de/stata/



                      • #12
                        Originally posted by Mark Schaffer View Post
                        Is there a way to reparameterise the model by declaring that one of the two provided endogenous explanatory variables is actually exogenous? [...] For example, continuing in the old-fashioned regress syntax, reassigning foreign to be exogenous would be

                        Code:
                        regress mpg domestic foreign (one weight foreign), nocons
                        but it is a different model (RMSE is much different). Same applies if you reassign domestic instead (of course).
                        Logically, the second example doesn't make sense. If one and foreign are exogenous then domestic must be exogenous, too. So then 2SLS should (and does) reduce to OLS.

                        At some point, the user has to be responsible for doing something sensible.
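Jeff's observation that 2SLS reduces to OLS in this case is easy to verify numerically: when the instrument set equals the regressor set, the first-stage fitted values are the regressors themselves, so the two estimators coincide. A small numpy sketch with simulated data (illustrative only, not the auto example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# OLS
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# "2SLS" where every regressor is its own instrument: the first-stage
# projection of X on itself returns X, so the estimator collapses to OLS.
X_hat = X @ np.linalg.lstsq(X, X, rcond=None)[0]
b_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

print(np.allclose(b_ols, b_2sls))   # True
```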



                        • #13
                          Originally posted by Jeff Wooldridge View Post
                          At some point, the user has to be responsible for doing something sensible.
                          Absolutely! The issue for me is how to code ivreg2 to respond in these cases. (Right now it detects a problem and reports it, and then tries to reclassify variables across lists of endogenous, exogenous and IV, but there's a bug in the code for the latter.)

                          Should ivreg2 report the errors, reclassify variables with a warning, and rely on the user to sort it out? Or exit with error straight away? The advantage of the former is that the intermediate or final estimation results might help the user work out what the problem is. When Sebastian, Kit and I were trying to diagnose what was going on, the first-stage estimations turned out to be informative in identifying the source of the problem. These wouldn't be available if the program exited with an error right away.

                          An alternative would be to exit with error but recommend that the nocollin option (the equivalent of the perfect option of ivregress) be used, perhaps along with examining the first-stage estimations to see if they help explain what is going on.



                          • #14
                            Originally posted by Sebastian Kripfganz View Post
                            In larger models, these linear dependencies may not be obvious any more. I might seriously consider to use the perfect option by default whenever I run an ivregress estimation. I find it difficult to imagine a situation where this option does some harm.
                            I think your point about larger models is what makes the perfect option of ivregress (or the nocollin option of ivreg2) a bit risky. For example, in some cross-section applications, people sometimes use lots of interactions of categorical variables. They could turn out to be collinear with the single endogenous regressor. More generally, there's something to be said for having checks like this in place, and asking the user to override them if they are sure about it. (Sort of the same philosophy is behind the omnipresent "replace" option.)



                            • #15
                              I see your point.
                              It would be nice then to have the option to check for collinearity between any single endogenous variable and the instruments, but not for collinearity that involves multiple endogenous variables.
                              Last edited by Sebastian Kripfganz; 25 Aug 2020, 14:03.
                              https://www.kripfganz.de/stata/
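Sebastian's proposal could look something like the following sketch: flag a problem only when the instruments perfectly predict some single endogenous regressor, not when an instrument is collinear with a combination of them. This is a hypothetical illustration (the function name and tolerance are mine), not how ivregress or ivreg2 actually implement their checks:

```python
import numpy as np

def flag_perfectly_predicted(X_endog, Z, tol=1e-8):
    """Return indices of endogenous columns that the instruments Z predict
    perfectly (first-stage residuals numerically zero)."""
    flagged = []
    for j in range(X_endog.shape[1]):
        xj = X_endog[:, j]
        resid = xj - Z @ np.linalg.lstsq(Z, xj, rcond=None)[0]
        if np.sum(resid**2) < tol * np.sum(xj**2):
            flagged.append(j)
    return flagged

rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

# z1 = x1 - x2 is an exact combination of the endogenous regressors, but
# neither x1 nor x2 alone is perfectly predicted: nothing is flagged.
Z = np.column_stack([x1 - x2, rng.normal(size=n)])
print(flag_perfectly_predicted(X, Z))    # []

# If the instrument set does span a single endogenous regressor, it is flagged.
Z2 = np.column_stack([x1, rng.normal(size=n)])
print(flag_perfectly_predicted(X, Z2))   # [0]
```

Checks for collinearity among the instruments themselves, and among the regressors themselves, would remain separate and unaffected.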

