
  • Check multicollinearity (panel data)

    Hi everyone. I am having difficulties working with panel data (I am also new to working with Stata as a statistical program).

    At the moment I have a model with 17 variables, but I am sure that some of them are highly correlated with each other. I wanted to reduce this number by checking for multicollinearity. Normally, without panel data but with just one observation per subject, I would check the variance inflation factors to see which variables are highly correlated.
    Can someone give me advice on which commands to use to check this?

    I was also wondering whether someone knows the panel-data alternative to the independent-samples t-test. My dependent variable is binary, and I want to check whether the variables discriminate enough between the 2 possible outcomes.

    Kind regards


    Last edited by Ophelie Anne; 01 May 2016, 17:42.

  • #2
    You can check some of the user-written Stata modules for estimating panel-data regressions that remedy multicollinearity by using ridge regression, without removing any of the independent variables:

    XTREGAM: Stata module to estimate Amemiya Random-Effects Panel Data: Ridge and Weighted Regression
    XTREGBEM: Stata module to estimate Between-Effects Panel Data: Ridge and Weighted Regression
    XTREGBN: Stata module to estimate Balestra-Nerlove Random-Effects Panel Data: Ridge and Weighted Regression
    XTREGFEM: Stata module to estimate Fixed-Effects Panel Data: Ridge and Weighted Regression
    XTREGMLE: Stata module to estimate Trevor Breusch MLE Random-Effects Panel Data: Ridge and Weighted Regression
    XTREGREM: Stata module to estimate Fuller-Battese GLS Random-Effects Panel Data: Ridge and Weighted Regression
    XTREGSAM: Stata module to estimate Swamy-Arora Random-Effects Panel Data: Ridge and Weighted Regression
    XTREGWEM: Stata module to estimate Within-Effects Panel Data: Ridge and Weighted Regression
    XTREGWHM: Stata module to estimate Wallace-Hussain Random-Effects Panel Data: Ridge and Weighted Regression

    All are available from the Statistical Software Components (SSC) archive, Boston College Department of Economics.
    Emad A. Shehata
    Professor (PhD Economics)
    Agricultural Research Center - Agricultural Economics Research Institute - Egypt
    Email: [email protected]
    IDEAS: http://ideas.repec.org/f/psh494.html
    EconPapers: http://econpapers.repec.org/RAS/psh494.htm
    Google Scholar: http://scholar.google.com/citations?...r=cOXvc94AAAAJ



    • #3
      So if you are new to Stata and working with panel data, you should at least peruse the entire -xt- manual. At a minimum, you need to thoroughly go over -xtset- and the basic -xt- regression models. Also, you should familiarize yourself with factor variable notation, which is particularly helpful when using categorical variables. See the corresponding section in the [U] manual, or just -help fvvarlist- for a quick overview.
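      A minimal sketch of those pieces, with made-up variable names (company_id, year, debt_ratio, industry, firm_age) standing in for your own:

      Code:
      * declare the panel structure: panel identifier and time variable
      xtset company_id year

      * a basic random-effects panel regression using factor-variable notation:
      * i.industry marks industry as categorical, c.firm_age as continuous
      xtreg debt_ratio i.industry c.firm_age, re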

      Your last sentence confuses me. Your dependent variable is binary, and you want to check whether the variables discriminate between the 2 possible outcomes. So, are both the independent and dependent variables dichotomous? If you have a continuous variable somewhere and you want to know whether it discriminates a dichotomous outcome in panel data, you will probably want to use -xtlogit- or -xtprobit-. If you have a continuous variable and you want to see whether its values differ according to a dichotomous predictor (a situation that is analogous to a t-test in independent-observations data), you can regress the continuous variable on the dichotomous predictor using -xtreg-.
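      For the binary-outcome case, a minimal sketch, again with hypothetical names (failed as the binary dependent variable, debt_ratio as a continuous predictor):

      Code:
      * random-effects panel logit for a binary outcome
      xtlogit failed debt_ratio, re

      * or its probit counterpart
      xtprobit failed debt_ratio, re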

      As for collinearity, my advice is that you should probably just not think about it. There are two different situations with (nearly) collinear predictors: one of them is simply not a problem, and the other can be, but there is nothing you can do about it.

      Situation one is where there is high correlation among a group of predictors not including the main predictor variable(s) about which you wish to reach conclusions. That is, there is multicollinearity among a bunch of covariates that are put in the model to control for their effects. The presence of the multicollinearity here has no impact on the estimation of the model coefficients for your main variables of interest, and it in no way interferes with adequately controlling for their effects. So this type of collinearity can simply be ignored.

      Situation two is where one of your main predictor variables, about which you are trying to draw conclusions, is involved in a near-linear relationship with other predictor variables. This can lead to inflation of the standard errors of all of the variables involved, and it can make the estimates of the key coefficients unstable (i.e. highly sensitive, for example, to the inclusion or exclusion of a small number of observations from the estimation sample). It is easy enough to tell whether this is happening or not. Just take a look at the standard errors and confidence intervals for your main predictor variables in your output. If they are narrow enough that your estimates are sufficiently precise for your purposes, then there is no problem. You report your findings and move on.

      If, on the other hand, they are too wide to provide useful estimation of the parameters of interest, then you have a problem. But it is not a problem you can solve with your existing data. Leaving out one or more of the variables will create the illusion of solving the problem, but it does so at the expense of adding omitted variable bias to your analysis. To get more precise estimates without biasing your analysis requires one of two things. You can get more data: in principle, collinearity's effects on the standard errors can be overcome by getting a large enough sample. But the amount of additional data needed is typically gigantic, and as a practical matter, you won't be able to do that. The other way to proceed is to go back to square one with a different design, one that gathers data in a way that breaks the collinearity among the variables. This generally involves some sort of matching, or oversampling observations that are discordant on the correlated variables.

      So, I think that at this stage you have to just proceed with your analysis and hope for the best. If you end up having the kind of collinearity that is problematic, you may then need to run some analyses to identify specifically which variables are involved so that you can plan the design of your next study. That is done by simply running your analysis with the OLS -regress- command instead of whatever other model you fitted, and then running -estat vif- after that. That will give you the variance inflation factors and point to the source of your problem. You can then try to design a new data collection that would artificially cause the offending variables to be uncorrelated (or only weakly correlated).
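      A minimal sketch of that diagnostic step, with x1-x17 standing in for the 17 predictors and failed for the binary outcome (hypothetical names):

      Code:
      * refit the model with plain OLS, purely for collinearity diagnostics
      regress failed x1-x17

      * variance inflation factors for the predictors in the last -regress- fit
      estat vif

      Note that -estat vif- is available after -regress- but not after the -xt- estimation commands, which is why the OLS refit is needed for this check.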

      Paul Allison has a very nice summary of multicollinearity at http://statisticalhorizons.com/multicollinearity.
      Last edited by Clyde Schechter; 01 May 2016, 19:49.



      • #4
        Thanks for your note, Prof. Clyde.

        The Stata panel-data modules that I listed in my post are valid only for models with continuous dependent variables, not for binary models, i.e. logit, probit, etc.
        Emad A. Shehata
        Professor (PhD Economics)
        Agricultural Research Center - Agricultural Economics Research Institute - Egypt
        Email: [email protected]
        IDEAS: http://ideas.repec.org/f/psh494.html
        EconPapers: http://econpapers.repec.org/RAS/psh494.htm
        Google Scholar: http://scholar.google.com/citations?...r=cOXvc94AAAAJ



        • #5
          HTML Code:
          . clear all
          . sysuse lmcol.dta , clear
          . lmcol y x1 x2 x3
          
          ==============================================================================
          * Ordinary Least Squares (OLS)
          ==============================================================================
            y = x1 + x2 + x3
          ------------------------------------------------------------------------------
            Sample Size       =          17
            Wald Test         =    253.9319   |   P-Value > Chi2(3)       =      0.0000
            F-Test            =     84.6440   |   P-Value > F(3 , 13)     =      0.0000
           (Buse 1973) R2     =      0.9513   |   Raw Moments R2          =      0.9986
           (Buse 1973) R2 Adj =      0.9401   |   Raw Moments R2 Adj      =      0.9983
            Root MSE (Sigma)  =      5.7724   |   Log Likelihood Function =    -51.6441
          ------------------------------------------------------------------------------
          - R2h= 0.9513   R2h Adj= 0.9401  F-Test =   84.64 P-Value > F(3 , 13)  0.0000
          - R2v= 0.9513   R2v Adj= 0.9401  F-Test =   84.64 P-Value > F(3 , 13)  0.0000
          ------------------------------------------------------------------------------
                     y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                    x1 |   1.060841   .2769969     3.83   0.002     .4624256    1.659257
                    x2 |  -1.397391   .2321721    -6.02   0.000    -1.898969    -.895814
                    x3 |  -.0034456   .0514889    -0.07   0.948    -.1146807    .1077894
                 _cons |   132.2612   36.46863     3.63   0.003     53.47554    211.0469
          ------------------------------------------------------------------------------
          
          ==============================================================================
          *** Multicollinearity Diagnostic Tests
          ==============================================================================
          
          * Correlation Matrix
          (obs=17)
          
                       |       x1       x2       x3
          -------------+---------------------------
                    x1 |   1.0000
                    x2 |   0.1788   1.0000
                    x3 |  -0.1832  -0.9296   1.0000
          
          * Multicollinearity Diagnostic Criteria
          +-------------------------------------------------------------------------------+
          |   Var |  Eigenval |  C_Number |   C_Index |       VIF |     1/VIF |   R2_xi,X |
          |-------+-----------+-----------+-----------+-----------+-----------+-----------|
          |    x1 |    1.9954 |    1.0000 |    1.0000 |    1.0353 |    0.9659 |    0.0341 |
          |    x2 |    0.9342 |    2.1361 |    1.4615 |    7.3632 |    0.1358 |    0.8642 |
          |    x3 |    0.0704 |   28.3396 |    5.3235 |    7.3753 |    0.1356 |    0.8644 |
          +-------------------------------------------------------------------------------+
          
          * Farrar-Glauber Multicollinearity Tests
            Ho: No Multicollinearity - Ha: Multicollinearity
          --------------------------------------------------
          
          * (1) Farrar-Glauber Multicollinearity Chi2-Test:
              Chi2 Test =   28.7675    P-Value > Chi2(3) 0.0000
          
          * (2) Farrar-Glauber Multicollinearity F-Test:
          +--------------------------------------------------------+
          |   Variable |   F_Test |      DF1 |      DF2 |  P_Value |
          |------------+----------+----------+----------+----------|
          |         x1 |    0.247 |   14.000 |    3.000 |    0.971 |
          |         x2 |   44.543 |   14.000 |    3.000 |    0.005 |
          |         x3 |   44.627 |   14.000 |    3.000 |    0.005 |
          +--------------------------------------------------------+
          
          * (3) Farrar-Glauber Multicollinearity t-Test:
          +-------------------------------------+
          | Variable |     x1 |     x2 |     x3 |
          |----------+--------+--------+--------|
          |       x1 |      . |        |        |
          |       x2 |  0.680 |      . |        |
          |       x3 | -0.697 | -9.435 |      . |
          +-------------------------------------+
          
          * Determinant of |X'X|:
            |X'X| = 0 Multicollinearity - |X'X| = 1 No Multicollinearity
            Determinant of |X'X|:      (0 < 0.1313 < 1)
          ---------------------------------------------------------------
          
          * Theil R2 Multicollinearity Effect:
            R2 = 0 No Multicollinearity - R2 = 1 Multicollinearity
              - Theil R2:              (0 < 0.7606 < 1)
          ---------------------------------------------------------------
          
          * Multicollinearity Range:
            Q = 0 No Multicollinearity - Q = 1 Multicollinearity
               - Gleason-Staelin Q0:   (0 < 0.5567 < 1)
              1- Heo Range  Q1:        (0 < 0.8356 < 1)
              2- Heo Range  Q2:        (0 < 0.8098 < 1)
              3- Heo Range  Q3:        (0 < 0.6377 < 1)
              4- Heo Range  Q4:        (0 < 0.5425 < 1)
              5- Heo Range  Q5:        (0 < 0.8880 < 1)
              6- Heo Range  Q6:        (0 < 0.5876 < 1)
          ------------------------------------------------------------------------------
          
          

          ----+ References +-------------------------------------------------------------------------------

          Belsley, D. (1991) "Conditioning Diagnostics, Collinearity and Weak Data in Regression", John Wiley & Sons, Inc., New York, USA.

          Belsley, D., E. Kuh, and R. Welsch (1980) "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity", John Wiley & Sons, Inc., New York, USA.

          Farrar, D. and Glauber, R. (1967) "Multicollinearity in Regression Analysis: The Problem Revisited", Review of Economics and Statistics, 49; 92-107.

          Greene, William (1993) "Econometric Analysis", 2nd ed., Macmillan Publishing Company Inc., New York, USA; 616-618.

          Greene, William (2007) "Econometric Analysis", 6th ed., Prentice-Hall, Upper Saddle River, NJ, USA; 387-388.

          Griffiths, William E., R. Carter Hill, and George G. Judge (1993) "Learning and Practicing Econometrics", John Wiley & Sons, Inc., New York, USA; 602-606.

          Gujarati, Damodar (1995) "Basic Econometrics", 3rd ed., McGraw-Hill, New York, USA.

          Judge, George, W. E. Griffiths, R. Carter Hill, Helmut Lutkepohl, and Tsoung-Chao Lee (1985) "The Theory and Practice of Econometrics", 2nd ed., John Wiley & Sons, Inc., New York, USA; 615.

          Judge, George, R. Carter Hill, William E. Griffiths, Helmut Lutkepohl, and Tsoung-Chao Lee (1988) "Introduction to the Theory and Practice of Econometrics", 2nd ed., John Wiley & Sons, Inc., New York, USA.

          Maddala, G. (1992) "Introduction to Econometrics", 2nd ed., Macmillan Publishing Company, New York, USA; 358-366.

          Marquardt, D. W. (1970) "Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation", Technometrics, 12; 591-612.

          Mitsaki, Evagelia (2011) "Ridge Regression Analysis of Collinear Data",
          http://www.stat-athens.aueb.gr/~jpan...i/chapter2.pdf

          Rencher, Alvin C. (1998) "Multivariate Statistical Inference and Applications", John Wiley & Sons, Inc., New York, USA; 21-22.

          Theil, Henri (1971) "Principles of Econometrics", John Wiley & Sons, Inc., New York, USA.



          Last edited by Emad Shehata; 01 May 2016, 20:46.
          Emad A. Shehata
          Professor (PhD Economics)
          Agricultural Research Center - Agricultural Economics Research Institute - Egypt
          Email: [email protected]
          IDEAS: http://ideas.repec.org/f/psh494.html
          EconPapers: http://econpapers.repec.org/RAS/psh494.htm
          Google Scholar: http://scholar.google.com/citations?...r=cOXvc94AAAAJ



          • #6
            Dear,

            Thank you both for taking the time to answer my questions; I really appreciate it. Sometimes it still sounds a bit difficult to understand for someone who isn't used to working with Stata, but I guess that's normal when you are familiar with neither Stata nor panel data.

            To first answer the question you addressed: I have 17 independent variables, measured on an interval/ratio scale, and 1 binary dependent variable. I'm setting up a failure prediction model where the outcome can be 1 = company failed or 0 = company is still active. So I wanted to check whether every variable discriminates enough between these two options.
            For example: you have a ratio that measures the amount of debt in a company. If a company failed, that amount will be high; if a company is active, it will not be as high. Without panel data, you just look at the two mean values of the ratio in these two groups. The null hypothesis (H0) is that the two values are equal, i.e. no difference in debt position. You do the independent-samples t-test and look at the p-value to see whether you can reject H0, so that you know this ratio discriminates between the two groups and will give you adequate information.
            So I'm looking for the command that can do this with panel data. Perhaps someone can help me with this in practical terms.
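            A minimal sketch of both versions, using the hypothetical names debt_ratio (the debt ratio) and failed (the binary outcome); the panel version is the -xtreg- analogue suggested in #3:

            Code:
            * without panel data: independent-samples t-test comparing the group means
            ttest debt_ratio, by(failed)

            * with panel data: regress the ratio on the group indicator
            xtreg debt_ratio i.failed, re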

            With respect to multicollinearity, thank you for all the information on this subject. But a 17-variable model is quite big, so I wanted to use this method to reduce the number of variables and see which ones are similar, because I'm sure that some of them will be.

            I see that Mr. Shehata generated in Stata exactly the kind of table I'm looking for. Could you maybe show me the command to create this (with panel data)?

            Kind regards,

            Ophélie



            • #7
              So, with regard to whether 17 variables is too many, it really depends on a) how relevant these particular variables are according to the science of your field, and b) how much data you have (both how many panels and the overall data size). Collinearity really has little to do with it, except that if your number of variables approaches the size of your data set then there will automatically be collinearity. But long before the variables-to-observations ratio is high enough to force collinearity, you may encounter issues of overfitting. There are various rules of thumb as to how many observations you need for each variable. A conservative estimate would be around 25; 50 or even 100 is clearly better. If your sample size is large enough to support all 17 variables in this way, and if you can make a non-laughable case that each of them could be relevant, then I would just leave them all in.

              Variables that are irrelevant should just be dropped without any regard to their statistics. Variables that are generally regarded as important in your field, even if they don't seem to show much association with your outcome in your data, should generally be retained for credibility. (E.g. in clinical studies, omitting age or sex would make your model implausible. I have no idea if there are analogous situations in business/finance/economics.) You pay a price in efficiency for that, but it doesn't bias anything. After that, selecting a subset of predictors is a very thorny issue, about which you will find little agreement in the literature. What most will agree on, though, is that automatic approaches that rely on screening with p-values are among the worst ways to do it. So I would say that your quest for the equivalent of a t-test for panel data is misguided. (Though, as I pointed out in my earlier response, the panel-data equivalent of a t-test is with -xtreg, fe- or -xtreg, re-.) Apart from the many problems that this misuse of p-values poses in this context, you have to also consider that you can have pairs (or larger sets) of variables that are jointly quite important but are not individually statistically significant.

              If you are going to screen variables individually, then the screening should examine both the extent to which they are associated with the outcome, and the extent to which they are also associated with your principal predictor(s) of interest. A variable that is not associated on both sides of the equation is not a confounder, and leaving it out will not cause omitted variable bias. Bear in mind that the associations we are talking about here are at the level of the sample, not the population. So p-values are irrelevant. Actually, they are worse than irrelevant: they are misleading. So, depending on the nature of the variables involved, you would base this kind of screening on non-inferential statistics such as relative risks or odds ratios, correlation coefficients, or effect size measures like Cohen's d, etc.
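              A minimal sketch of that kind of sample-level screening, with hypothetical names (x1 as a candidate covariate, failed as the outcome, debt_ratio as the main predictor of interest); -esize- requires Stata 13 or later:

              Code:
              * effect size (Cohen's d) for the covariate across the two outcome groups
              esize twosample x1, by(failed)

              * correlation of the covariate with the main predictor of interest
              correlate x1 debt_ratio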

              Another possibility to consider, if you are just looking to reduce dimensionality among the covariates, is to do a principal components analysis of those variables (not including your principal predictors of interest about which you are trying to draw conclusions), and then use the components, leaving out one or more of those with the smallest eigenvalues. You won't be able to say anything comprehensible about those original covariates from your analysis, but you will have done the job of trimming the variance they contribute to the outcome.
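              A minimal sketch of that approach, with x1-x10 standing in (hypothetically) for the covariates to be reduced; the choice of five retained components here is arbitrary:

              Code:
              * principal components of the covariates (not the predictors of interest)
              pca x1-x10

              * save the scores on the first five components as new variables
              predict pc1 pc2 pc3 pc4 pc5, score

              * then use the components in place of the original covariates, e.g.
              xtlogit failed debt_ratio pc1 pc2 pc3 pc4 pc5, re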

              Another possibility to consider, if your sample is large enough, is to divide your data (randomly selecting whole panels) into a learning set and a validation set. Explore many models on your learning set and identify a reasonably parsimonious one that fits the data reasonably well. Then test your model in the validation set. There are more advanced ways of doing this that go under the general heading of cross-validation and you can look into those if you like.
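              A minimal sketch of a panel-level random split, assuming company_id (hypothetical) identifies the panels and using an arbitrary 70/30 division:

              Code:
              * draw one random number per panel and copy it to all of that panel's rows
              set seed 12345
              bysort company_id: gen double u = runiform() if _n == 1
              bysort company_id (u): replace u = u[1]

              * whole panels go to the learning set (~70%) or the validation set (~30%)
              gen byte learning = (u < 0.7)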

              Anyway, these are just some general thoughts on a vexatious problem for which no entirely satisfactory solution is known.

