
  • Panel Data Multicollinearity

    Hello everybody!
    I want to check my panel data for multicollinearity. Can I run -regress- and then -estat vif-?
    . estat vif

        Variable |       VIF       1/VIF
    -------------+----------------------
        Deposits |      3.16    0.316788
        LEVERAGE |      2.46    0.406408
    DebtMaturity |      1.38    0.723251
             OPM |      1.36    0.734592
            MTBV |      1.31    0.761493
           Tier1 |      1.21    0.825978
      LnLossprov |      1.14    0.874342
     CDfromBanks |      1.06    0.939530
    -------------+----------------------
        Mean VIF |      1.64

    Thank you!
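
    For reference, a minimal sketch of the sequence being asked about, using the regressor names from the output above; the dependent variable name depvar is a placeholder, and note that -estat vif- is available after -regress- but not after -xtreg-:
    Code:
    * Pooled-OLS VIF check; depvar is a placeholder for the actual dependent variable.
    regress depvar Deposits LEVERAGE DebtMaturity OPM MTBV Tier1 LnLossprov CDfromBanks
    estat vif    // variance inflation factors for the pooled regression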

  • #2
    Did you have a look at the various results on Statalist that come up when searching for "panel data multicollinearity"?
    This link for example: http://www.statalist.org/forums/foru...ity-panel-data



    • #3
      Thank you, yes I did, and I read the chapter in the book. But the chapter wasn't clear enough for me to write about it in my exam. The sentences in that link are good, but since I didn't find the corresponding page in the book, I wanted to try a multicollinearity test. Or isn't there a way to do that?



      • #4
        Why are you looking at VIFs, or any multicollinearity diagnostics? It's like wringing your hands about only having 1,000 rather than 2,000 observations. What you need to show is your estimation results, including the standard errors. Are the SEs too large to do anything with? The standard errors tell you exactly what you need to know. The VIFs might give you a reason the standard errors are large, but they don't tell you what to do, or whether any assumptions are violated.

        More and more I regret including a discussion of VIFs in my introductory book. People seem to not know that they should rarely even look at them.
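
        A minimal sketch of the kind of check described above, with hypothetical variable and panel-identifier names; the point is to read the standard-error and confidence-interval columns of the estimation output rather than to compute VIFs:
        Code:
        * Hypothetical fixed-effects regression; depvar, x1, x2, id, and year are placeholders.
        xtset id year
        xtreg depvar x1 x2, fe vce(cluster id)
        * Judge precision from the reported std. err. and 95% CI columns:
        * are the intervals narrow enough to be informative for the question at hand?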



        • #5
          Okay thanks!



          • #6
            Hi Stata members, I have a question I would like to ask concerning multicollinearity. I have panel data in which the outcome variable is a firm-level variable and the variable of interest is a country-level variable. However, my key variable of interest is collinear with other country-level variables (the correlation coefficients are very high). See for instance my output:

            Code:
             reghdfe DEP_VAR KEY_COUNTRY_VAR if year<=2019, absorb (id year) cluster (id)
            (dropped 1921 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    366,014
            Absorbing 2 HDFE groups                           F(   1,  30872) =      35.58
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2318
                                                              Adj R-squared   =     0.1610
                                                              Within R-sq.    =     0.0002
            Number of clusters (id)      =     30,873         Root MSE        =     0.0754
            
                                               (Std. err. adjusted for 30,873 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |   .2390971   .0400812     5.97   0.000     .1605363    .3176579
                      _cons |   -.054489   .0147976    -3.68   0.000    -.0834929   -.0254851
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30873       30873           0    *|
                    year |        21           1          20     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation

            Code:
            reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1  if year<=2019, absorb (id year) cluster (id)
            (dropped 2105 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    277,108
            Absorbing 2 HDFE groups                           F(   2,  30262) =     232.37
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2607
                                                              Adj R-squared   =     0.1700
                                                              Within R-sq.    =     0.0028
            Number of clusters (id)      =     30,263         Root MSE        =     0.0729
            
                                               (Std. err. adjusted for 30,263 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |  -.2073782   .0637695    -3.25   0.001    -.3323691   -.0823873
              COUNTRY_VAR_1 |  -.0483314   .0023594   -20.48   0.000    -.0529559   -.0437068
                      _cons |    .596736   .0404757    14.74   0.000      .517402    .6760701
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30263       30263           0    *|
                    year |        15           1          14     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation
            
            . reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_2  if year<=2019, absorb (id year) cluster (id)
            (dropped 1921 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    366,014
            Absorbing 2 HDFE groups                           F(   2,  30872) =      42.80
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2319
                                                              Adj R-squared   =     0.1611
                                                              Within R-sq.    =     0.0004
            Number of clusters (id)      =     30,873         Root MSE        =     0.0754
            
                                               (Std. err. adjusted for 30,873 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |   .2104902   .0402811     5.23   0.000     .1315375    .2894428
              COUNTRY_VAR_2 |  -.0003427   .0000479    -7.15   0.000    -.0004366   -.0002487
                      _cons |  -.0172973   .0156442    -1.11   0.269    -.0479605    .0133659
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30873       30873           0    *|
                    year |        21           1          20     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation
            
            . reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2  if year<=2019, absorb (id year) cluster (id)
            (dropped 2105 singleton observations)
            (MWFE estimator converged in 7 iterations)
            
            HDFE Linear regression                            Number of obs   =    277,108
            Absorbing 2 HDFE groups                           F(   3,  30262) =     157.81
            Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                              R-squared       =     0.2608
                                                              Adj R-squared   =     0.1701
                                                              Within R-sq.    =     0.0029
            Number of clusters (id)      =     30,263         Root MSE        =     0.0729
            
                                               (Std. err. adjusted for 30,263 clusters in id)
            ---------------------------------------------------------------------------------
                            |               Robust
                    DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
            ----------------+----------------------------------------------------------------
            KEY_COUNTRY_VAR |  -.1859862   .0637477    -2.92   0.004    -.3109345   -.0610379
              COUNTRY_VAR_1 |  -.0457914   .0025188   -18.18   0.000    -.0507284   -.0408544
              COUNTRY_VAR_2 |  -.0001702   .0000611    -2.79   0.005    -.0002899   -.0000505
                      _cons |   .5762671   .0408166    14.12   0.000     .4962648    .6562694
            ---------------------------------------------------------------------------------
            
            Absorbed degrees of freedom:
            -----------------------------------------------------+
             Absorbed FE | Categories  - Redundant  = Num. Coefs |
            -------------+---------------------------------------|
                      id |     30263       30263           0    *|
                    year |        15           1          14     |
            -----------------------------------------------------+
            * = FE nested within cluster; treated as redundant for DoF computation


            Code:
            pwcorr DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2,sig
            
                         |  DEP_VAR KEY_CO~R COUNTR~1 COUNTR~2
            -------------+------------------------------------
                 DEP_VAR |   1.0000
                         |
                         |
            KEY_COUNTR~R |   0.0261   1.0000
                         |   0.0000
                         |
            COUNTRY_VA~1 |  -0.0487  -0.8712   1.0000
                         |   0.0000   0.0000
                         |
            COUNTRY_VA~2 |  -0.0443  -0.7492   0.8682   1.0000
                         |   0.0000   0.0000   0.0000
                         |
            My doubts:
            1) Is multicollinearity evident in my results?
            2) According to Jeff Wooldridge: "Are the SEs too large to do anything with? The standard errors tell you exactly what you need to know." What does this mean in my context?
            The same point is made by Clyde Schechter: "The hallmark of that is that the standard errors of the coefficients of x1 and x2 in the regression output are large. How large is unreasonably large? There is no hard and fast cut-off. But, one might see a situation where the standard error is a large multiple of the magnitude of the coefficient for each of these variables. That is the regression's way of telling you that it cannot figure out with any precision the association of x1 or x2 with y." (https://www.statalist.org/forums/forum/general-stata-discussion/general/1297526-multicollinearity-panel-data?p=1297657#post1297657)
            Can someone help me understand how the standard errors reveal the impact of multicollinearity in this context? I am happy to provide more information if required.
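
            One way to line the specifications above up side by side is to store each reghdfe fit and tabulate the coefficients with their standard errors; a minimal sketch, assuming the same variable names as above and that reghdfe is installed:
            Code:
            * Compare KEY_COUNTRY_VAR's coefficient and std. err. across specifications.
            * Note: the estimation samples differ across columns because of missing values in the controls.
            reghdfe DEP_VAR KEY_COUNTRY_VAR if year<=2019, absorb(id year) cluster(id)
            estimates store m1
            reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 if year<=2019, absorb(id year) cluster(id)
            estimates store m2
            reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2 if year<=2019, absorb(id year) cluster(id)
            estimates store m3
            estimates table m1 m2 m3, b(%9.4f) se(%9.4f) stats(N)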
            Last edited by Neelakanda Krishna; 19 Nov 2023, 04:55.



            • #7
              Hello everyone,

              I'm checking for any updates regarding the matter above. If my inquiry has already been discussed elsewhere on this forum, kindly point me to the relevant thread, and if further details are needed to address my question, please let me know. Please note that my concern about multicollinearity arises because the coefficient of "KEY_COUNTRY_VAR" reverses sign when collinear country-level variables are included in the regression. As the variance inflation factor (VIF) is not considered a reliable way to assess collinearity in this context, I am seeking help with identifying multicollinearity issues from the standard errors and coefficients.

              Thank you.
              Last edited by Neelakanda Krishna; 19 Nov 2023, 20:30.



              • #8
                A few things. First, if you're mainly interested in KEY_COUNTRY_VAR then the correlation between the two control variables, COUNTRY_VAR_1 and COUNTRY_VAR_2, is irrelevant. They are individually and jointly significant. You have to decide whether they are "good" controls or not. Does it make sense to hold them fixed while varying KEY_COUNTRY_VAR? If so, you should control for them, even if they are highly correlated with KEY_COUNTRY_VAR. In fact, such correlation means you must control for them -- unless they are "bad" controls and you're overcontrolling.
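
                A minimal sketch of how the joint significance of the two controls mentioned above could be checked after the full reghdfe specification (same variable names as in #6, assuming reghdfe is installed):
                Code:
                * Fit the specification with both controls, then test them jointly.
                reghdfe DEP_VAR KEY_COUNTRY_VAR COUNTRY_VAR_1 COUNTRY_VAR_2 if year<=2019, absorb(id year) cluster(id)
                test COUNTRY_VAR_1 COUNTRY_VAR_2    // Wald test that both control coefficients are zero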



                • #9
                  Dear Jeff Wooldridge, I appreciate your prompt response; your guidance directly addresses my concerns and is of great value to me. May I inquire further about the relationship between multicollinearity and standard errors using a simplified example? Any insights you could provide would greatly aid my understanding.

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input int(DEP_VAR INDEP_VAR1 INDEP_VAR3)
                   35  469  6979
                  110 1594 23854
                  135 1969 29479
                   21  259  3829
                  140 2044 30604
                  124 1804 27004
                  147 2149 32179
                   60  844 12604
                  111 1609 24079
                  135 1969 29479
                   43  589  8779
                  143 2089 31279
                   68  964 14404
                   67  949 14179
                   99 1429 21379
                   33  439  6529
                   25  319  4729
                   15  169  2479
                   58  814 12154
                   45  619  9229
                  104 1504 22504
                   81 1159 17329
                   88 1264 18904
                   27  349  5179
                  138 2014 30154
                  140 2044 30604
                   33  439  6529
                  107 1549 23179
                  105 1519 22729
                  136 1984 29704
                   43  589  8779
                   55  769 11479
                   50  694 10354
                  135 1969 29479
                    8   64   904
                  129 1879 28129
                   77 1099 16429
                   13  139  2029
                   36  484  7204
                   21  259   133
                   90  114   129
                   32   25   149
                   61   87    80
                   72   62   133
                   68   83    96
                  109   97    56
                    1   86    97
                   88   75    29
                  125   17    37
                   69  120   122
                  119  100     3
                  111  134    33
                   22   37    94
                   44   59   102
                   98  135   130
                  103   29   379
                   26   87  1249
                  104  109  1579
                  end
                  Code:
                   *Step 1 Correlation
                  . pwcorr DEP_VAR INDEP_VAR1 INDEP_VAR3
                  
                               |  DEP_VAR INDEP_~1 INDEP_~3
                  -------------+---------------------------
                       DEP_VAR |   1.0000
                    INDEP_VAR1 |   0.7001   1.0000
                    INDEP_VAR3 |   0.6853   0.9985   1.0000
                  
                  .
                  Code:
                  *Step 2 Regression
                  . * With full specification (Full Model)
                  . reg DEP_VAR INDEP_VAR1 INDEP_VAR3 ,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(2, 55)          =      41.24
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.5519
                                                                  Root MSE          =     29.231
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR1 |   .2976248    .203238     1.46   0.149    -.1096733    .7049228
                    INDEP_VAR3 |  -.0166915   .0132443    -1.26   0.213    -.0432337    .0098507
                         _cons |   32.84352   10.56997     3.11   0.003     11.66082    54.02621
                  ------------------------------------------------------------------------------
                  
                  .
                  My doubt: Do the coefficients and standard errors of INDEP_VAR1 and INDEP_VAR3 in the regression results above reveal anything peculiar that might suggest multicollinearity? In other words, are the standard errors unusual in this case?

                  Now, though it is not recommended, I also compute the VIFs here:

                  Code:
                  estat vif      
                  
                      Variable |       VIF       1/VIF  
                  -------------+----------------------
                    INDEP_VAR1 |    326.56    0.003062
                    INDEP_VAR3 |    326.56    0.003062
                  -------------+----------------------
                      Mean VIF |    326.56
                  To shed further light, let me also show the results of regressions with each independent variable considered one at a time.


                  Code:
                  . reg DEP_VAR INDEP_VAR1,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(1, 56)          =      65.58
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.4902
                                                                  Root MSE          =     30.899
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR1 |   .0402113   .0049654     8.10   0.000     .0302644    .0501582
                         _cons |   45.16507   7.688899     5.87   0.000     29.76235    60.56778
                  ------------------------------------------------------------------------------
                  
                  . reg DEP_VAR INDEP_VAR3,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(1, 56)          =      58.79
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.4697
                                                                  Root MSE          =     31.515
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR3 |   .0025483   .0003323     7.67   0.000     .0018826    .0032141
                         _cons |   47.77402   7.595271     6.29   0.000     32.55886    62.98918
                  ------------------------------------------------------------------------------

                  The coefficients obtained from the individual regressions indicate that each independent variable is positively and significantly associated with the dependent variable. However, when both variables are included in the full model, the coefficient for INDEP_VAR3 becomes negative, and neither coefficient is statistically significant. In light of these findings, could you help me examine whether the standard errors, particularly in the full model (reg DEP_VAR INDEP_VAR1 INDEP_VAR3, vce(r)), serve as a red flag indicating potential multicollinearity? I understand your time constraints, but any guidance would be greatly appreciated.

                  Edit:
                  I read Richard Williams's tutorial on multicollinearity (https://www3.nd.edu/~rwilliam/stats2/l11.pdf). As a prima facie check (though he does not strongly recommend it), he suggests: "Ergo, examining the tolerances or VIFs is probably superior to examining the bivariate correlations. Indeed, you may want to actually regress each X on all of the other X's, to help you pinpoint where the problem is. Look at the correlations of the estimated coefficients (not the variables). High correlations between pairs of coefficients indicate possible collinearity problems. In Stata you get it by running the vce, corr command after a regression."
                  Continuing with the example (a short sketch of the auxiliary-regression check follows the output below):
                  Code:
                  reg DEP_VAR INDEP_VAR1 INDEP_VAR3,vce(r)
                  
                  Linear regression                               Number of obs     =         58
                                                                  F(2, 55)          =      41.24
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.5519
                                                                  Root MSE          =     29.231
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                       DEP_VAR | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
                  -------------+----------------------------------------------------------------
                    INDEP_VAR1 |   .2976248    .203238     1.46   0.149    -.1096733    .7049228
                    INDEP_VAR3 |  -.0166915   .0132443    -1.26   0.213    -.0432337    .0098507
                         _cons |   32.84352   10.56997     3.11   0.003     11.66082    54.02621
                  ------------------------------------------------------------------------------
                  
                  . vce, corr
                  
                  Correlation matrix of coefficients of regress model
                  
                          e(V) | INDEP_~1  INDEP_~3     _cons 
                  -------------+------------------------------
                    INDEP_VAR1 |   1.0000                     
                    INDEP_VAR3 |  -0.9997    1.0000           
                         _cons |  -0.7517    0.7369    1.0000 
                  
                  .
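
                  A minimal sketch of the auxiliary-regression check mentioned above: regress one regressor on the other and convert the resulting R-squared into a VIF (same example variables as before):
                  Code:
                  * Auxiliary regression: how much of INDEP_VAR1 is explained by INDEP_VAR3?
                  regress INDEP_VAR1 INDEP_VAR3
                  display "R-squared = " %6.4f e(r2) "   implied VIF = 1/(1 - R2) = " %8.2f 1/(1 - e(r2))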
                  Last edited by Neelakanda Krishna; 20 Nov 2023, 22:21.



                  • #10
                    The results you show in #9 are actually an excellent example of the principles involved. Assuming that one or both of INDEP_VAR1 and INDEP_VAR3 are the key variables whose effects need to be estimated, you can see that either variable alone is estimated with high precision: the standard error of each coefficient is about an order of magnitude smaller than the coefficient estimate. Looking, for example, at the analysis with INDEP_VAR3 alone, the confidence interval runs from 0.002 to 0.003. In almost any context you can imagine, there would be no real-world consequential difference if the true value of the effect were at either end of that interval. (In the rare situation where that is not the case, the analysis would be indeterminate and further investigation required.)

                    By contrast, when you put the two INDEP_VARs into the model together, the standard errors are almost equal to the estimates, and, consequently, the confidence intervals are sufficiently wide that one would probably draw entirely different conclusions and make different decisions depending on whether the true effects were at one end of the interval or the other. So this is a classic example where multicollinearity among the predictors (the presence of which is demonstrated by your correlation matrices and the VIF results) creates a serious problem and leaves the analysis inconclusive.

                    Your examples also illustrate the key point made by Arthur Goldberger in A Course in Econometrics, where he argues that the term multicollinearity should be abandoned in favor of the more appropriate "micronumerosity." With a sample size of 58, you are nowhere close to being able to distinguish the effects of two highly correlated predictors. A much larger sample would be required to do that.
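
                    A minimal sketch of one way to quantify the point above, comparing the ratio of the standard error to the magnitude of the coefficient in the joint model versus a single-regressor model (same example variables; the ratio is only a rough, illustrative gauge of precision):
                    Code:
                    * Ratio of std. err. to |coefficient| as a rough precision gauge.
                    regress DEP_VAR INDEP_VAR1 INDEP_VAR3, vce(robust)
                    display "joint model,  INDEP_VAR1: se/|b| = " %6.3f _se[INDEP_VAR1]/abs(_b[INDEP_VAR1])
                    regress DEP_VAR INDEP_VAR1, vce(robust)
                    display "single model, INDEP_VAR1: se/|b| = " %6.3f _se[INDEP_VAR1]/abs(_b[INDEP_VAR1])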



                    • #11
                      Dear Clyde Schechter, I want to express my sincere gratitude for your outstanding and insightful explanation of standard errors and confidence intervals in the context of multicollinearity. It has been immensely helpful, and I have benefited greatly from it. Thank you for your excellent guidance.

