
  • Potential bug in IVREGRESS: instruments dropped due to collinearity when they should not be dropped

    In a private conversation, haiyan lin notified me of a potential bug in ivreg2. The same problem occurs in ivregress, so I will use the latter to demonstrate it.

    In the following simplified example, I want to estimate a linear dynamic panel data model with 2 lags of the dependent variable instrumented by 2 lags of the first-differenced dependent variable.
    (Using first differences as instruments is a standard procedure as part of a system GMM estimator in dynamic panel models, where it is assumed that the levels are correlated with the unobserved unit-specific effects, but the first differences are not.)
    Code:
    . webuse abdata, clear
    . ivregress 2sls n (L.n L2.n = DL.n DL2.n)
    note: LD.n dropped due to collinearity
    equation not identified; must have at least as many instruments not in
    the regression as there are instrumented variables
    r(481);
    As you can see, ivregress drops one of the instruments, reportedly due to collinearity. As a consequence, the model becomes underidentified and ivregress exits with error.

    Yet, if you manually run the first-stage regressions, there is no evidence of any collinearity problem:
    Code:
    . regress L.n DL.n DL2.n
    
          Source |       SS           df       MS      Number of obs   =       611
    -------------+----------------------------------   F(2, 608)       =      5.50
           Model |  19.4241409         2  9.71207047   Prob > F        =    0.0043
        Residual |  1074.20669       608  1.76678733   R-squared       =    0.0178
    -------------+----------------------------------   Adj R-squared   =    0.0145
           Total |  1093.63084       610  1.79283744   Root MSE        =    1.3292
    
    ------------------------------------------------------------------------------
             L.n |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               n |
             LD. |   .7793204   .4007435     1.94   0.052    -.0076892     1.56633
            L2D. |   .9261032   .4324795     2.14   0.033     .0767682    1.775438
                 |
           _cons |   1.101659   .0575023    19.16   0.000     .9887318    1.214586
    ------------------------------------------------------------------------------
    
    . regress L2.n DL.n DL2.n
    
          Source |       SS           df       MS      Number of obs   =       611
    -------------+----------------------------------   F(2, 608)       =      2.29
           Model |  8.10427021         2  4.05213511   Prob > F        =    0.1018
        Residual |  1074.20669       608  1.76678733   R-squared       =    0.0075
    -------------+----------------------------------   Adj R-squared   =    0.0042
           Total |  1082.31096       610  1.77428027   Root MSE        =    1.3292
    
    ------------------------------------------------------------------------------
            L2.n |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               n |
             LD. |  -.2206796   .4007435    -0.55   0.582    -1.007689    .5663299
            L2D. |   .9261032   .4324795     2.14   0.033     .0767682    1.775438
                 |
           _cons |   1.101659   .0575023    19.16   0.000     .9887318    1.214586
    ------------------------------------------------------------------------------
    Also, if you estimate the same model with xtdpdgmm, there is no problem and all coefficients are identified:
    Code:
    . xtdpdgmm n L.n L2.n, iv(DL.n DL2.n)
    note: standard errors may not be valid
    
    Generalized method of moments estimation
    
    Fitting full model:
    Step 1         f(b) =  1.545e-27
    
    Group variable: id                           Number of obs         =       751
    Time variable: year                          Number of groups      =       140
    
    Moment conditions:     linear =       3      Obs per group:    min =         5
                        nonlinear =       0                        avg =  5.364286
                            total =       3                        max =         7
    
    ------------------------------------------------------------------------------
               n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               n |
             L1. |    1.25092   .0366347    34.15   0.000     1.179117    1.322722
             L2. |  -.2253948    .050786    -4.44   0.000    -.3249334   -.1258561
                 |
           _cons |  -.0743238   .0497764    -1.49   0.135    -.1718837    .0232362
    ------------------------------------------------------------------------------
    Instruments corresponding to the linear moment conditions:
     1, model(level):
       LD.n L2D.n
     2, model(level):
       _cons
    The reason why ivregress flags the instrument DL.n as collinear is the following:
    There is a perfect linear relationship between the two (!) endogenous variables and the instrument: L.n - L2.n = DL.n.

    However, I would argue that it should not be of concern that the instrument is collinear with a combination of endogenous regressors. There should only be a concern if there is collinearity among the instruments themselves (or among the regressors themselves), or if any single endogenous regressor is perfectly predicted by one or more of the instruments. The latter is not the case as we could see above in the first-stage regressions.
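To see the argument in linear-algebra terms: the stacked matrix of regressors and instruments is rank deficient, but the rank condition for IV identification — rank(Z'X) equal to the number of endogenous regressors — still holds. A minimal numpy sketch (the simulated data and variable names are mine, chosen only to reproduce the identity L.n - L2.n = DL.n):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)                 # stands in for L2.n
x1 = 0.8 * x2 + rng.normal(size=n)      # stands in for L.n
z1 = x1 - x2                            # exact identity, like L.n - L2.n = DL.n
z2 = 0.5 * x2 + rng.normal(size=n)      # a second instrument

X = np.column_stack([x1, x2])
Z = np.column_stack([z1, z2])

# The stacked matrix [X Z] is rank deficient, because z1 is an exact
# linear combination of the two endogenous regressors...
print(np.linalg.matrix_rank(np.column_stack([X, Z])))   # 3, not 4

# ...but identification only requires the rank condition rank(Z'X) = 2,
# which still holds: no single regressor is perfectly predicted by Z.
print(np.linalg.matrix_rank(Z.T @ X))                   # 2
```

This is exactly the distinction drawn above: collinearity of an instrument with a *combination* of endogenous regressors does not violate the rank condition.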

    Hence, I would call this a bug in ivregress.

    I am already in a private conversation with Kit Baum and Mark Schaffer about the same problem in ivreg2, but I want to invite others on Statalist to add their opinions to the discussion before I send an e-mail to StataCorp Tech Support.
    https://www.kripfganz.de/stata/

  • #2
    I must admit I haven't thought about this situation before. It's peculiar to instrument levels with lags, but it shouldn't be ruled out. I guess what Stata is picking up is the fact that there really are not two endogenous explanatory variables; there is only one. You can re-parameterize the model to include L.n and DL.n and then DL.n is exogenous and does not need an IV. Then, ivregress will not allow you to include DL.n as an IV because it (rightly) suspects that you're doing something unintended. You can include DL2.n as the IV for L.n and that's it. Stata's error message is cryptic but it does make one think about the nature of the endogeneity of the explanatory variables.

    So I'm not convinced it's a bug. Stata does not allow you to list the same variables as endogenous and exogenous (included or excluded) and this example is a slight variation on that theme because it is the same problem if you re-parameterize the model.

    I'm glad I don't have to make the call ....



    • #3
      Here is a minimal example using the built-in toy dataset. Two endogenous regressors, two excluded instruments, no exogenous regressors, no constant as a regressor. One of the excluded instruments is the constant. (The example is not completely crazy: Friedman's famous IV estimation for the permanent income hypothesis used the constant as an excluded instrument; there's a discussion in Hayashi's 2000 textbook.)

      Stata's ivregress gets it wrong: it drops the constant and then exits with an error because it thinks (incorrectly) that the equation is underidentified. (ivreg2 also gets it wrong, but in a different way.) But Stata's regress, using the old-fashioned syntax for IV estimation, gets it right. This can be verified by doing IV by hand in Mata.

      Code:
      sysuse auto, clear
      gen domestic = 1-foreign
      gen one = 1
      cap noi ivregress 2sls mpg (domestic foreign = one weight), nocons
      putmata y=mpg X=(domestic foreign) Z=(one weight), replace
      mata: pinv(Z'X)*Z'y
      regress mpg domestic foreign (one weight), nocons
      ivregress output:
      [Screenshot: ivregress output (Capture1.PNG)]


      By hand using Mata:
      [Screenshot: results computed by hand in Mata (Capture2.PNG)]


      regress, using the old-fashioned IV syntax:
      [Screenshot: regress output, old-fashioned IV syntax (Capture3.PNG)]
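For readers without Stata at hand, the pinv(Z'X)*Z'y computation above can be mirrored in numpy. This is only a sketch with simulated data shaped like the auto example — two endogenous regressors that sum to one, instrumented by the constant plus one outside instrument — with an invented data-generating process:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10000
w = rng.normal(size=n)                  # outside instrument (stands in for weight)
u = rng.normal(size=n)                  # structural error
x2 = 0.5 * w + u + rng.normal(size=n)   # endogenous (correlated with u)
x1 = 1.0 - x2                           # the two regressors sum to one, like domestic/foreign

X = np.column_stack([x1, x2])
Z = np.column_stack([np.ones(n), w])    # the constant is one of the two excluded instruments
y = 2.0 * x1 + 3.0 * x2 + u

# Exactly identified IV, computed as in the Mata snippet: beta = pinv(Z'X) * Z'y
beta = np.linalg.pinv(Z.T @ X) @ (Z.T @ y)
print(beta)                             # approximately [2, 3]
```

Even though x1 + x2 is collinear with the constant instrument, Z'X has full rank and the IV estimate is well defined — the same point the regress-by-hand results make.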



      • #4
        Originally posted by Jeff Wooldridge View Post
        So I'm not convinced it's a bug. Stata does not allow you to list the same variables as endogenous and exogenous (included or excluded) and this example is a slight variation on that theme because it is the same problem if you re-parameterize the model.
        That the model can be re-parameterized is a good point. Yet, consider an overidentified model by adding DL3.n as another instrument to the initial example. Now, ivregress still drops DL.n from the list of instruments but no longer exits with error because the parameters are still identified. But the resulting estimates are now definitely wrong:
        Code:
        . ivregress 2sls n (L.n L2.n = DL.n DL2.n DL3.n)
        note: LD.n dropped due to collinearity
        
        Instrumental variables (2SLS) regression          Number of obs   =        471
                                                          Wald chi2(2)    =    1036.63
                                                          Prob > chi2     =     0.0000
                                                          R-squared       =     0.9872
                                                          Root MSE        =      .1512
        
        ------------------------------------------------------------------------------
                   n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                   n |
                 L1. |   1.309919   .3175556     4.13   0.000     .6875211    1.932316
                 L2. |   -.335349   .3507221    -0.96   0.339    -1.022752    .3520536
                     |
               _cons |  -.0394477   .0666491    -0.59   0.554    -.1700775    .0911821
        ------------------------------------------------------------------------------
        Instrumented:  L.n L2.n
        Instruments:   L2D.n L3D.n
        
        . xtdpdgmm n L.n L2.n if e(sample), iv(DL.n DL2.n DL3.n)
        note: standard errors may not be valid
        
        Generalized method of moments estimation
        
        Fitting full model:
        Step 1         f(b) =  .00004035
        
        Group variable: id                           Number of obs         =       471
        Time variable: year                          Number of groups      =       140
        
        Moment conditions:     linear =       4      Obs per group:    min =         3
                            nonlinear =       0                        avg =  3.364286
                                total =       4                        max =         5
        
        ------------------------------------------------------------------------------
                   n |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                   n |
                 L1. |   1.153779   .0455171    25.35   0.000     1.064567    1.242991
                 L2. |  -.1632875   .0551489    -2.96   0.003    -.2713773   -.0551976
                     |
               _cons |  -.0669634   .0363085    -1.84   0.065    -.1381267    .0041999
        ------------------------------------------------------------------------------
        Instruments corresponding to the linear moment conditions:
         1, model(level):
           LD.n L2D.n L3D.n
         2, model(level):
           _cons
        (You can replicate the ivregress results by dropping DL.n from the xtdpdgmm instrument list.)

        Edit: With the perfect option, ivregress produces the correct estimates.
        Last edited by Sebastian Kripfganz; 25 Aug 2020, 09:28.
        https://www.kripfganz.de/stata/
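To make the danger of silent dropping concrete outside the panel setting, here is a generic numpy sketch (not the dynamic panel model above; the data-generating process and names are mine): a textbook 2SLS with three excluded instruments, estimated once with the instrument set the user specified and once after silently dropping one column. Both happen to be consistent in this toy DGP, but the point estimates differ — the user silently gets a different estimator than the one requested.

```python
import numpy as np

def tsls(y, X, Z):
    """Two-stage least squares: regress X on Z, then y on the fitted values."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fitted values
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=(n, 3))                                  # three excluded instruments
u = rng.normal(size=n)
x = z @ np.array([1.0, 0.5, 0.2]) + u + rng.normal(size=n)   # endogenous regressor
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + u

Z_full = np.column_stack([np.ones(n), z])        # the instrument set the user asked for
Z_drop = np.column_stack([np.ones(n), z[:, 1:]]) # after silently dropping one instrument

b_full = tsls(y, X, Z_full)
b_drop = tsls(y, X, Z_drop)
print(b_full)   # close to [1, 2]
print(b_drop)   # also close here, but a numerically different estimate
```

In the thread's example the discrepancy is much larger, because the dropped instrument DL.n carries identifying information that the remaining lags do not.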



        • #5
          Jeff,

          Does what you say above:

          Originally posted by Jeff Wooldridge View Post
          I guess what Stata is picking up is the fact that there really are not two endogenous explanatory variables; there is only one.
          apply to the minimal example I posted above using the toy auto dataset? I don't think so but would be happy to be shown wrong.

          --Mark



          • #6
            Mark:
            Yes, Jeff's comments do apply to your example:
            Code:
            . ivregress 2sls mpg (foreign = weight)
            
            Instrumental variables (2SLS) regression          Number of obs   =         74
                                                              Wald chi2(1)    =      27.05
                                                              Prob > chi2     =     0.0000
                                                              R-squared       =          .
                                                              Root MSE        =     7.6721
            
            ------------------------------------------------------------------------------
                     mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                 foreign |    17.1176   3.291427     5.20   0.000     10.66653    23.56868
                   _cons |   16.20828   1.323985    12.24   0.000     13.61332    18.80324
            ------------------------------------------------------------------------------
            Instrumented:  foreign
            Instruments:   weight
            
            . lincom foreign + _cons
            
             ( 1)  foreign + _cons = 0
            
            ------------------------------------------------------------------------------
                     mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     (1) |   33.32588   2.478889    13.44   0.000     28.46735    38.18442
            ------------------------------------------------------------------------------
            https://www.kripfganz.de/stata/



            • #7
              Sebastian - your example isn't a different parameterisation of the same model. They are different models.

              Original model:

              Code:
              regress mpg domestic foreign (one weight), nocons
              Root MSE = 7.7779

              Your model:

              Code:
              ivregress 2sls mpg (foreign = weight)
              Root MSE = 7.6721



              • #8
                Mark:
                This difference is just due to different degrees-of-freedom corrections. If you add the small option to ivregress, the Root MSEs coincide.
                https://www.kripfganz.de/stata/



                • #9
                  Ah - now that is interesting!



                  • #10
                    So to take the toy auto dataset example further, and using the old-fashioned syntax for regress, start again with

                    Code:
                    regress mpg domestic foreign (one weight), nocons
                    and we detect that there is a collinearity in there ("there really are not two endogenous explanatory variables; there is only one").

                    Is there a way to reparameterise the model by declaring that one of the two provided endogenous explanatory variables is actually exogenous? If there were, it would be the kind of thing that an estimator could do automatically, but I don't think it's possible without in effect changing the model (unlike the usual case where dropping collinear variables doesn't change the model). For example, continuing in the old-fashioned regress syntax, reassigning foreign to be exogenous would be

                    Code:
                    regress mpg domestic foreign (one weight foreign), nocons
                    but it is a different model (RMSE is much different). Same applies if you reassign domestic instead (of course).



                    • #11
                      I do not think you can easily automate such a re-parameterization, and probably you should not. But that leads to a practical problem:
                      1. If ivregress exits with error because the model is no longer identified after dropping an instrument, that might be acceptable, because the user then needs to think about what is going on.
                      2. If ivregress in the overidentified case just drops the instrument and then moves on to produce estimates for the model with the original endogenous variables, can we really expect users to think about the reason for the dropped instrument, and to figure out by themselves that they should either re-parameterize the model or use the option perfect to avoid the collinearity check? In larger models, these linear dependencies may no longer be obvious. I might seriously consider using the perfect option by default whenever I run an ivregress estimation. I find it difficult to imagine a situation where this option does any harm.
                      https://www.kripfganz.de/stata/



                      • #12
                        Originally posted by Mark Schaffer View Post
                        Is there a way to reparameterise the model by declaring that one of the two provided endogenous explanatory variables is actually exogenous? [...] For example, continuing in the old-fashioned regress syntax, reassigning foreign to be exogenous would be

                        Code:
                        regress mpg domestic foreign (one weight foreign), nocons
                        but it is a different model (RMSE is much different). Same applies if you reassign domestic instead (of course).
                        Logically, the second example doesn't make sense. If one and foreign are exogenous then domestic must be exogenous, too. So then 2SLS should (and does) reduce to OLS.

                        At some point, the user has to be responsible for doing something sensible.
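Jeff's observation that 2SLS reduces to OLS in this case is easy to verify numerically: when the instrument set equals the regressor set, the first-stage fitted values are the regressors themselves, so the two estimators coincide. A small numpy sketch with simulated data (illustrative only, not the auto example):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# OLS
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# "2SLS" where every regressor is its own instrument: the first-stage
# projection of X on itself returns X, so the estimator collapses to OLS.
X_hat = X @ np.linalg.lstsq(X, X, rcond=None)[0]
b_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

print(np.allclose(b_ols, b_2sls))   # True
```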



                        • #13
                          Originally posted by Jeff Wooldridge View Post
                          At some point, the user has to be responsible for doing something sensible.
                          Absolutely! The issue for me is how to code ivreg2 to respond in these cases. (Right now it detects a problem and reports it, and then tries to reclassify variables across lists of endogenous, exogenous and IV, but there's a bug in the code for the latter.)

                          Should ivreg2 report the errors, reclassify variables with a warning, and rely on the user to sort it out? Or exit with error straight away? The advantage of the former is that the intermediate or final estimation results might help the user work out what the problem is. When Sebastian, Kit and I were trying to diagnose what was going on, the first-stage estimations turned out to be informative in identifying the source of the problem. These wouldn't be available if the program exited with an error right away.

                          An alternative would be to exit with error but recommend that the nocollin option (the equivalent of the perfect option of ivregress) be used, perhaps along with examining the first-stage estimations to see if they help explain what is going on.



                          • #14
                            Originally posted by Sebastian Kripfganz View Post
                            In larger models, these linear dependencies may not be obvious any more. I might seriously consider to use the perfect option by default whenever I run an ivregress estimation. I find it difficult to imagine a situation where this option does some harm.
                            I think your point about larger models is what makes the perfect option of ivregress (or the nocollin option of ivreg2) a bit risky. For example, in some cross-section applications, people sometimes use lots of interactions of categorical variables. They could turn out to be collinear with the single endogenous regressor. More generally, there's something to be said for having checks like this in place, and asking the user to override them if they are sure about it. (Sort of the same philosophy is behind the omnipresent "replace" option.)



                            • #15
                              I see your point.
                              It would be nice then to have the option to check for collinearity between any single endogenous variable and the instruments, but not for collinearity that involves multiple endogenous variables.
                              Last edited by Sebastian Kripfganz; 25 Aug 2020, 14:03.
                              https://www.kripfganz.de/stata/
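Sebastian's proposal could look something like the following sketch: flag a problem only when the instruments perfectly predict some single endogenous regressor, not when an instrument is collinear with a combination of them. This is a hypothetical illustration (the function name and tolerance are mine), not how ivregress or ivreg2 actually implement their checks:

```python
import numpy as np

def flag_perfectly_predicted(X_endog, Z, tol=1e-8):
    """Return indices of endogenous columns that the instruments Z predict
    perfectly (first-stage residuals numerically zero)."""
    flagged = []
    for j in range(X_endog.shape[1]):
        xj = X_endog[:, j]
        resid = xj - Z @ np.linalg.lstsq(Z, xj, rcond=None)[0]
        if np.sum(resid**2) < tol * np.sum(xj**2):
            flagged.append(j)
    return flagged

rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

# z1 = x1 - x2 is an exact combination of the endogenous regressors, but
# neither x1 nor x2 alone is perfectly predicted: nothing is flagged.
Z = np.column_stack([x1 - x2, rng.normal(size=n)])
print(flag_perfectly_predicted(X, Z))    # []

# If the instrument set does span a single endogenous regressor, it is flagged.
Z2 = np.column_stack([x1, rng.normal(size=n)])
print(flag_perfectly_predicted(X, Z2))   # [0]
```

Checks for collinearity among the instruments themselves, and among the regressors themselves, would remain separate and unaffected.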

