
  • Panel Data Multicollinearity

    Hello everybody!
    I want to check my panel data for multicollinearity. Can I run -regress- and then -estat vif-?
    . estat vif

        Variable |       VIF       1/VIF
    -------------+----------------------
        Deposits |      3.16    0.316788
        LEVERAGE |      2.46    0.406408
    DebtMaturity |      1.38    0.723251
             OPM |      1.36    0.734592
            MTBV |      1.31    0.761493
           Tier1 |      1.21    0.825978
      LnLossprov |      1.14    0.874342
     CDfromBanks |      1.06    0.939530
    -------------+----------------------
        Mean VIF |      1.64

    Thank you!
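
    For reference, a minimal sketch of the sequence being asked about, using the regressor names from the output above; the dependent variable name depvar is a placeholder, and note that -estat vif- is available after -regress- but not after -xtreg-:
    Code:
    * Pooled-OLS VIF check; depvar is a placeholder for the actual dependent variable.
    regress depvar Deposits LEVERAGE DebtMaturity OPM MTBV Tier1 LnLossprov CDfromBanks
    estat vif    // variance inflation factors for the pooled regression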

  • #2
    Did you have a look at the various results on Statalist that come up when searching for "panel data multicollinearity"?
    This link for example: http://www.statalist.org/forums/foru...ity-panel-data



    • #3
      Thank you, yes I did, and I read the chapter in the book. But the chapter wasn't clear enough for me to write about it in my exam. The sentences in that link are good, but since I didn't find the corresponding page in the book, I wanted to try a multicollinearity test. Or isn't there a way to do that?



      • #4
        Why are you looking at VIFs, or any multicollinearity diagnostics? It's like wringing your hands about only having 1,000 rather than 2,000 observations. What you need to show is your estimation results, including the standard errors. Are the SEs too large to do anything with? The standard errors tell you exactly what you need to know. The VIFs might give you a reason the standard errors are large, but they don't tell you what to do, or whether any assumptions are violated.

        More and more I regret including a discussion of VIFs in my introductory book. People seem to not know that they should rarely even look at them.
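
        A minimal sketch of the kind of check described above, with hypothetical variable and panel-identifier names; the point is to read the standard-error and confidence-interval columns of the estimation output rather than to compute VIFs:
        Code:
        * Hypothetical fixed-effects regression; depvar, x1, x2, id, and year are placeholders.
        xtset id year
        xtreg depvar x1 x2, fe vce(cluster id)
        * Judge precision from the reported std. err. and 95% CI columns:
        * are the intervals narrow enough to be informative for the question at hand?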



        • #5
          Okay thanks!



          • #6
            Hi Stata members, I have a question I would like to ask concerning multicollinearity. I have panel data in which the outcome variable is a firm-level variable and the variable of interest is a country-level variable. However, my key variable of interest is collinear with other country-level variables (the correlation coefficients are very high). See for instance my output:

            Code:
             reghdfe DEP_VAR KEY_COUNTRY_VAR if year<=2019, absorb (id year) cluster (id)
            (dropped 1921 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    366,014
            Absorbing 2 HDFE groups                           F(   1,  30872) =      35.58
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2318
                                                              Adj R-squared   =     0.1610
                                                              Within R-sq.    =     0.0002
            Number of clusters (id)      =     30,873         Root MSE        =     0.0754
            
                                               (Std. err. adjusted for 30,873 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |   .2390971   .0400812     5.97   0.000     .1605363    .3176579
                      _cons |   -.054489   .0147976    -3.68   0.000    -.0834929   -.0254851
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30873       30873           0    *|
                    year |        21           1          20     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation

            Code:
            reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1  if year<=2019, absorb (id year) cluster (id)
            (dropped 2105 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    277,108
            Absorbing 2 HDFE groups                           F(   2,  30262) =     232.37
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2607
                                                              Adj R-squared   =     0.1700
                                                              Within R-sq.    =     0.0028
            Number of clusters (id)      =     30,263         Root MSE        =     0.0729
            
                                               (Std. err. adjusted for 30,263 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |  -.2073782   .0637695    -3.25   0.001    -.3323691   -.0823873
              COUNTRY_VAR_1 |  -.0483314   .0023594   -20.48   0.000    -.0529559   -.0437068
                      _cons |    .596736   .0404757    14.74   0.000      .517402    .6760701
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30263       30263           0    *|
                    year |        15           1          14     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation
            
            . reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_2  if year<=2019, absorb (id year) cluster (id)
            (dropped 1921 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    366,014
            Absorbing 2 HDFE groups                           F(   2,  30872) =      42.80
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2319
                                                              Adj R-squared   =     0.1611
                                                              Within R-sq.    =     0.0004
            Number of clusters (id)      =     30,873         Root MSE        =     0.0754
            
                                               (Std. err. adjusted for 30,873 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |   .2104902   .0402811     5.23   0.000     .1315375    .2894428
              COUNTRY_VAR_2 |  -.0003427   .0000479    -7.15   0.000    -.0004366   -.0002487
                      _cons |  -.0172973   .0156442    -1.11   0.269    -.0479605    .0133659
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30873       30873           0    *|
                    year |        21           1          20     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation
            
            . reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2  if year<=2019, absorb (id year) cluster (id)
            (dropped 2105 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    277,108
            Absorbing 2 HDFE groups                           F(   3,  30262) =     157.81
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2608
                                                              Adj R-squared   =     0.1701
                                                              Within R-sq.    =     0.0029
            Number of clusters (id)      =     30,263         Root MSE        =     0.0729
            
                                               (Std. err. adjusted for 30,263 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |  -.1859862   .0637477    -2.92   0.004    -.3109345   -.0610379
              COUNTRY_VAR_1 |  -.0457914   .0025188   -18.18   0.000    -.0507284   -.0408544
              COUNTRY_VAR_2 |  -.0001702   .0000611    -2.79   0.005    -.0002899   -.0000505
                      _cons |   .5762671   .0408166    14.12   0.000     .4962648    .6562694
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30263       30263           0    *|
                    year |        15           1          14     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation


            Code:
            pwcorr DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2,sig
            
                         |  DEP_VAR KEY_CO~R COUNTR~1 COUNTR~2
            -------------+------------------------------------
                 DEP_VAR |   1.0000
                         |
                         |
            KEY_COUNTR~R |   0.0261   1.0000
                         |   0.0000
                         |
            COUNTRY_VA~1 |  -0.0487  -0.8712   1.0000
                         |   0.0000   0.0000
                         |
            COUNTRY_VA~2 |  -0.0443  -0.7492   0.8682   1.0000
                         |   0.0000   0.0000   0.0000
                         |
            My doubts:
            1) Is multicollinearity evident in my results?
            2) According to Jeff Wooldridge: "Are the SEs too large to do anything with? The standard errors tell you exactly what you need to know." What does this mean in my context?
            The same point is made by Clyde Schechter: "The hallmark of that is that the standard errors of the coefficients of x1 and x2 in the regression output are large. How large is unreasonably large? There is no hard and fast cut-off. But, one might see a situation where the standard error is a large multiple of the magnitude of the coefficient for each of these variables. That is the regression's way of telling you that it cannot figure out with any precision the association of x1 or x2 with y." (https://www.statalist.org/forums/forum/general-stata-discussion/general/1297526-multicollinearity-panel-data?p=1297657#post1297657)
            Can someone help me understand how the standard errors reveal the impact of multicollinearity in this context? I am happy to provide more information if required.
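
            One way to line the specifications above up side by side is to store each reghdfe fit and tabulate the coefficients with their standard errors; a minimal sketch, assuming the same variable names as above and that reghdfe is installed:
            Code:
            * Compare KEY_COUNTRY_VAR's coefficient and std. err. across specifications.
            * Note: the estimation samples differ across columns because of missing values in the controls.
            reghdfe DEP_VAR KEY_COUNTRY_VAR if year<=2019, absorb(id year) cluster(id)
            estimates store m1
            reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 if year<=2019, absorb(id year) cluster(id)
            estimates store m2
            reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2 if year<=2019, absorb(id year) cluster(id)
            estimates store m3
            estimates table m1 m2 m3, b(%9.4f) se(%9.4f) stats(N)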
            Last edited by Neelakanda Krishna; 19 Nov 2023, 04:55.



            • #7
              Hello everyone,

              I'm checking for any updates regarding the matter above. If my inquiry has already been discussed elsewhere on this forum, kindly point me to the relevant thread, and if further details are needed to address my question, please let me know. Please note that my concern about multicollinearity arises because the coefficient of "KEY_COUNTRY_VAR" reverses sign when collinear country-level variables are included in the regression. As the variance inflation factor (VIF) is not considered a reliable way to assess collinearity in this context, I am seeking help with identifying multicollinearity issues from the standard errors and coefficients.

              Thank you.
              Last edited by Neelakanda Krishna; 19 Nov 2023, 20:30.



              • #8
                A few things. First, if you're mainly interested in KEY_COUNTRY_VAR then the correlation between the two control variables, COUNTRY_VAR_1 and COUNTRY_VAR_2, is irrelevant. They are individually and jointly significant. You have to decide whether they are "good" controls or not. Does it make sense to hold them fixed while varying KEY_COUNTRY_VAR? If so, you should control for them, even if they are highly correlated with KEY_COUNTRY_VAR. In fact, such correlation means you must control for them -- unless they are "bad" controls and you're overcontrolling.
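
                A minimal sketch of how the joint significance of the two controls mentioned above could be checked after the full reghdfe specification (same variable names as in #6, assuming reghdfe is installed):
                Code:
                * Fit the specification with both controls, then test them jointly.
                reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2 if year<=2019, absorb(id year) cluster(id)
                test COUNTRY_VAR_1 COUNTRY_VAR_2    // Wald test that both control coefficients are zero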



                • #9
                  Dear Jeff Wooldridge, I appreciate your prompt response; your guidance directly addresses my concerns and is of great value to me. May I inquire further about the relationship between multicollinearity and standard errors using a simplified example? Any insights you could provide would greatly aid my understanding.

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input int(DEP_VAR INDEP_VAR1 INDEP_VAR3)
                   35  469  6979
                  110 1594 23854
                  135 1969 29479
                   21  259  3829
                  140 2044 30604
                  124 1804 27004
                  147 2149 32179
                   60  844 12604
                  111 1609 24079
                  135 1969 29479
                   43  589  8779
                  143 2089 31279
                   68  964 14404
                   67  949 14179
                   99 1429 21379
                   33  439  6529
                   25  319  4729
                   15  169  2479
                   58  814 12154
                   45  619  9229
                  104 1504 22504
                   81 1159 17329
                   88 1264 18904
                   27  349  5179
                  138 2014 30154
                  140 2044 30604
                   33  439  6529
                  107 1549 23179
                  105 1519 22729
                  136 1984 29704
                   43  589  8779
                   55  769 11479
                   50  694 10354
                  135 1969 29479
                    8   64   904
                  129 1879 28129
                   77 1099 16429
                   13  139  2029
                   36  484  7204
                   21  259   133
                   90  114   129
                   32   25   149
                   61   87    80
                   72   62   133
                   68   83    96
                  109   97    56
                    1   86    97
                   88   75    29
                  125   17    37
                   69  120   122
                  119  100     3
                  111  134    33
                   22   37    94
                   44   59   102
                   98  135   130
                  103   29   379
                   26   87  1249
                  104  109  1579
                  end
                  Code:
                   *Step 1 Correlation
                  . pwcorr DEP_VAR INDEP_VAR1 INDEP_VAR3
                  
                               |  DEP_VAR INDEP_~1 INDEP_~3
                  -------------+---------------------------
                       DEP_VAR |   1.0000
                    INDEP_VAR1 |   0.7001   1.0000
                    INDEP_VAR3 |   0.6853   0.9985   1.0000
                  
                  .
                  Code:
                  *Step 2 Regression
                  . * With full specification (Full Model)
                  . reg DEP_VAR INDEP_VAR1 INDEP_VAR3 ,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(2, 55)          =      41.24
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.5519
                                                                  Root MSE          =     29.231
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR1 |   .2976248    .203238     1.46   0.149    -.1096733    .7049228
                    INDEP_VAR3 |  -.0166915   .0132443    -1.26   0.213    -.0432337    .0098507
                         _cons |   32.84352   10.56997     3.11   0.003     11.66082    54.02621
                  ------------------------------------------------------------------------------
                  
                  .
                  My doubt: Do the coefficients and standard errors of INDEP_VAR1 and INDEP_VAR3 in the regression results above reveal anything peculiar that might suggest multicollinearity? In other words, are the standard errors unusual in this case?

                  Now, though it is not recommended, I also compute the VIFs here:

                  Code:
                  estat vif      
                  
                      Variable |       VIF       1/VIF  
                  -------------+----------------------
                    INDEP_VAR1 |    326.56    0.003062
                    INDEP_VAR3 |    326.56    0.003062
                  -------------+----------------------
                      Mean VIF |    326.56
                  To shed further light, let me also show the results of regressions with each independent variable considered one at a time.


                  Code:
                  . reg DEP_VAR INDEP_VAR1,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(1, 56)          =      65.58
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.4902
                                                                  Root MSE          =     30.899
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR1 |   .0402113   .0049654     8.10   0.000     .0302644    .0501582
                         _cons |   45.16507   7.688899     5.87   0.000     29.76235    60.56778
                  ------------------------------------------------------------------------------
                  
                  . reg DEP_VAR INDEP_VAR3,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(1, 56)          =      58.79
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.4697
                                                                  Root MSE          =     31.515
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR3 |   .0025483   .0003323     7.67   0.000     .0018826    .0032141
                         _cons |   47.77402   7.595271     6.29   0.000     32.55886    62.98918
                  ------------------------------------------------------------------------------

                  The coefficients obtained from the individual regressions indicate that each independent variable is positively and significantly associated with the dependent variable. However, when both variables are included in the full model, the coefficient for INDEP_VAR3 becomes negative, and neither coefficient is statistically significant. In light of these findings, could you help me examine whether the standard errors, particularly in the full model (reg DEP_VAR INDEP_VAR1 INDEP_VAR3, vce(r)), serve as a red flag indicating potential multicollinearity? I understand your time constraints, but any guidance would be greatly appreciated.

                  Edit:
                  I read Richard Williams's tutorial on multicollinearity (https://www3.nd.edu/~rwilliam/stats2/l11.pdf). As a prima facie check (though he does not strongly recommend it), he suggests: "Ergo, examining the tolerances or VIFs is probably superior to examining the bivariate correlations. Indeed, you may want to actually regress each X on all of the other X's, to help you pinpoint where the problem is. Look at the correlations of the estimated coefficients (not the variables). High correlations between pairs of coefficients indicate possible collinearity problems. In Stata you get it by running the vce, corr command after a regression."
                  Continuing with the example (a short sketch of the auxiliary-regression check follows the output below):
                  Code:
                  reg DEP_VAR INDEP_VAR1 INDEP_VAR3,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(2, 55)          =      41.24
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.5519
                                                                  Root MSE          =     29.231
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR1 |   .2976248    .203238     1.46   0.149    -.1096733    .7049228
                    INDEP_VAR3 |  -.0166915   .0132443    -1.26   0.213    -.0432337    .0098507
                         _cons |   32.84352   10.56997     3.11   0.003     11.66082    54.02621
                  ------------------------------------------------------------------------------
                  
                  . vce, corr
                  
                  Correlation matrix of coefficients of regress model
                  
                          e(V) | INDEP_~1  INDEP_~3     _cons 
                  -------------+------------------------------
                    INDEP_VAR1 |   1.0000                     
                    INDEP_VAR3 |  -0.9997    1.0000           
                         _cons |  -0.7517    0.7369    1.0000 
                  
                  .
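
                  A minimal sketch of the auxiliary-regression check mentioned above: regress one regressor on the other and convert the resulting R-squared into a VIF (same example variables as before):
                  Code:
                  * Auxiliary regression: how much of INDEP_VAR1 is explained by INDEP_VAR3?
                  regress INDEP_VAR1 INDEP_VAR3
                  display "R-squared = " %6.4f e(r2) "   implied VIF = 1/(1 - R2) = " %8.2f 1/(1 - e(r2))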
                  Last edited by Neelakanda Krishna; 20 Nov 2023, 22:21.



                  • #10
                    The results you show in #9 are actually an excellent example of the principles involved. Assuming that one or both of INDEP_VAR1 and INDEP_VAR3 are the key variables whose effects need to be estimated, you can see that either variable alone is estimated with high precision: the standard error of each coefficient is about an order of magnitude smaller than the coefficient estimate. Looking, for example, at the analysis with INDEP_VAR3 alone, the confidence interval runs from 0.002 to 0.003. In almost any context you can imagine, there would be no real-world consequential difference if the true value of the effect were at either end of that interval. (In the rare situation where that is not the case, the analysis would be indeterminate and further investigation required.)

                    By contrast, when you put the two INDEP_VARs into the model together, the standard errors are almost equal to the estimates, and, consequently, the confidence intervals are sufficiently wide that one would probably draw entirely different conclusions and make different decisions depending on whether the true effects were at one end of the interval or the other. So this is a classic example where multicollinearity among the predictors (the presence of which is demonstrated by your correlation matrices and the VIF results) creates a serious problem and leaves the analysis inconclusive.

                    Your examples also illustrate the key point made by Arthur Goldberger in A Course in Econometrics, where he argues that the term multicollinearity should be abandoned in favor of the more appropriate "micronumerosity." With a sample size of 58, you are nowhere close to being able to distinguish the effects of two highly correlated predictors. A much larger sample would be required to do that.
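
                    A minimal sketch of one way to quantify the point above, comparing the ratio of the standard error to the magnitude of the coefficient in the joint model versus a single-regressor model (same example variables; the ratio is only a rough, illustrative gauge of precision):
                    Code:
                    * Ratio of std. err. to |coefficient| as a rough precision gauge.
                    regress DEP_VAR INDEP_VAR1 INDEP_VAR3, vce(robust)
                    display "joint model,  INDEP_VAR1: se/|b| = " %6.3f _se[INDEP_VAR1]/abs(_b[INDEP_VAR1])
                    regress DEP_VAR INDEP_VAR1, vce(robust)
                    display "single model, INDEP_VAR1: se/|b| = " %6.3f _se[INDEP_VAR1]/abs(_b[INDEP_VAR1])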



                    • #11
                      Dear Clyde Schechter, I want to express my sincere gratitude for your outstanding and insightful explanation of standard errors and confidence intervals in the context of multicollinearity. It has been immensely helpful, and I have benefited greatly from it. Thank you for your excellent guidance.

