
  • Multicollinearity

    Hi,

    I wish to run a multiple linear regression with DV = abnormal returns,
    continuous IVs = ROA, board size, firm size, leverage, market-to-book, % independent directors,
    and dummy IVs: cash, cross-border.
    I wish to create an interaction term between ROA and % independent directors.
    My question is: because I have a multicollinearity problem, should I center (i.e., subtract the mean from) all continuous variables (DV + IVs), or just the variables included in the interaction term?
    Thank you in advance

  • #2
    In setting up interaction terms, the issue with centering has more to do with the interpretation of the results than with multicollinearity. If you have a model that includes ROA (whatever that is), % independent directors, and their interaction, then the coefficient of ROA represents the effect* of a unit increase in ROA on your DV conditional on % independent directors = 0. If firms rarely or never have 0% independent directors, this effectively makes the ROA coefficient meaningless, and it would make more sense to center % independent directors around some value (possibly the mean, but that is not the only reasonable choice) that occurs frequently in your data. Similarly, the coefficient of % independent directors will represent the effect of an increase of 1 percentage point in % independent directors on your DV conditional on ROA = 0. Once again, the question is whether 0 is a reasonable value of ROA: if it seldom or never occurs, you should center. If ROA = 0 is a realistic, reasonably common situation, then you can leave it alone. (In multi-level mixed-effects models there are additional implications of centering, but that doesn't seem to be in your plans, so I won't go into that.)
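    How the "main effect" coefficient depends on where the moderator is centered can be seen in a little toy calculation (pure Python; the coefficients are made up, with `x` standing in for ROA and `z` for % independent directors):

```python
# Toy model: y = b0 + bx*x + bz*z + bxz*x*z  (hypothetical coefficients)
def y(x, z, b0=1.0, bx=2.0, bz=3.0, bxz=0.5):
    return b0 + bx * x + bz * z + bxz * x * z

def effect_of_unit_x(z):
    """Change in y for a unit increase in x, holding z fixed."""
    return y(1, z) - y(0, z)  # algebraically: bx + bxz*z

# With z uncentered, the coefficient on x is the effect at z = 0:
print(effect_of_unit_x(0))      # 2.0, i.e. bx itself

# After centering z at, say, its mean of 0.4, the coefficient on x
# becomes the effect at z = 0.4 -- usually a more meaningful quantity:
print(effect_of_unit_x(0.4))    # 2.2, i.e. bx + bxz*0.4
```

    The fit of the model is identical either way; only which conditional effect the coefficient reports changes.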

    If you choose the most meaningful centering of your variables (which might be leaving them as is) and you then find problems with multicollinearity in your regression results (unusually high standard errors, high VIF), you can always deal with it by changing the centering and re-running: ordinary linear regression runs very quickly even on huge data sets. (Also, before re-centering, you should look at the correlation between ROA and % independent directors: if that is high, re-centering alone may not reduce the multicollinearity much.)
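    A quick illustration of why re-centering can help (pure Python, made-up data, with `x` and `z` standing in for the two components): when a variable sits far from zero, its raw product term is strongly correlated with the variable itself, while the product of the centered variables is not.

```python
# Made-up data: x trends upward far from zero; z alternates around its
# mean and is uncorrelated with x.
x = [10 + i for i in range(20)]
z = [4 if i % 2 == 0 else 6 for i in range(20)]

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

raw_product = [xi * zi for xi, zi in zip(x, z)]
xc = [xi - mean(x) for xi in x]
zc = [zi - mean(z) for zi in z]
centered_product = [xi * zi for xi, zi in zip(xc, zc)]

print(corr(x, raw_product))        # high (about 0.83 for these data)
print(corr(xc, centered_product))  # essentially 0
```

    If x and z were themselves highly correlated, that correlation would survive centering, which is why re-centering alone may not be enough in that case.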

    As for centering variables not involved in interaction terms, a similar, but typically less critical consideration applies. The constant term in the regression represents the expected value of the DV when all of the independent variables are zero. So it may make sense to have zero be a realistic, reasonable value for each of the independent variables. But this can sometimes be ignored because sometimes you are just not interested in the constant term.

    Centering variables that are not part of interaction terms will do nothing to change any multicollinearity relationships (though, as already noted, this is unlikely to be a problem, and if it is, the solution lies elsewhere).
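    That is because correlation is unaffected by shifting a variable by a constant: corr(x - c, w) = corr(x, w) for any c. A pure-Python check with made-up numbers:

```python
def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

x = [3.0, 5.0, 2.0, 8.0, 6.0]
w = [1.0, 4.0, 2.0, 7.0, 3.0]
x_centered = [xi - mean(x) for xi in x]  # subtracting the mean is just a constant shift

# The correlation with any other regressor is unchanged:
print(corr(x, w), corr(x_centered, w))
```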

    *I'm using the term effect, which ordinarily has causal connotations, here as a shorthand for "expected difference in mean outcome associated with unit difference in mean predictor" for convenience and brevity. My discussion is unchanged whether we are dealing with causal relationships or just associations.



    • #3
      I took Clyde's explanation as a great lecture on the matter, and a difficult matter at that, explained in a nutshell! I wonder if you, Clyde, could kindly go into the (intriguing to me) issue of centering variables in mixed models and panel data. Please.
      Best regards,

      Marcos



      • #4
        It's a long and complicated topic, and I think it requires more equations and graphs than are feasible to put into one of these posts. The ramifications of centering in multi-level models are far-reaching. To see just one example where centering has surprising effects, run this code:

        Code:
clear *
// CREATE A SIMULATED DATA SET FOR A
// RANDOM SLOPES MODEL
set seed 1234
// TOP LEVEL
set obs 100 // PANEL MEMBERS
gen int id = _n
gen u = rnormal(0, 1) // VARIATION IN INTERCEPT
gen v = rnormal(0, 0.5) // VARIATION IN SLOPE
// BOTTOM LEVEL
expand 20 // OBSERVATIONS PER PANEL MEMBER
by id, sort: gen x = _n // INDEPENDENT VARIABLE
gen e = rnormal(0, 0.25) // OBSERVATION LEVEL RESIDUAL
// LINEAR MODEL
gen y = 3 + (2+v)*x + u + e
// RECOVER THE MODEL
// EXAMINE THE RELATIONSHIP BETWEEN
// THE id: LEVEL SLOPE AND INTERCEPT
mixed y x || id: x, cov(unstructured)
predict uhat vhat, reffects
graph twoway scatter uhat vhat, name(uncentered, replace)
// DO IT AGAIN WITH X RE-CENTERED AT GRAND MEAN
summ x
gen x_c = x - `r(mean)'
mixed y x_c || id: x_c, cov(unstructured)
predict uhat_c vhat_c, reffects
graph twoway scatter uhat_c vhat_c, name(centered, replace)
        Notice how the covariance between the random intercept and the random slope changes dramatically with the centering. In some multilevel studies, one of the key questions is "is the base level of y associated with the rate of change in y?"; you can see that the substantive answer depends on the choice of centering (equivalently, on the choice of the meaning of "base").
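        The surprise has a simple algebraic core. Each panel member's line is y = a + b*x; re-centering x at a constant c turns that member's intercept into a + b*c, so Cov(intercept, slope) becomes Cov(a, b) + c*Var(b). A pure-Python check with made-up per-member intercepts and slopes:

```python
# Made-up per-panel-member intercepts (a) and slopes (b)
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]

def mean(v):
    return sum(v) / len(v)

def cov(u, w):
    mu, mw = mean(u), mean(w)
    return sum((ui - mu) * (wi - mw) for ui, wi in zip(u, w)) / len(u)

def shifted_intercepts(c):
    """Intercepts when x is measured from a new origin c."""
    return [ai + bi * c for ai, bi in zip(a, b)]

print(cov(shifted_intercepts(0), b))   # 0.75  (original origin)
print(cov(shifted_intercepts(10), b))  # 13.25 (same data, x re-centered at 10)
```

        So the intercept-slope covariance (and even its sign) is a property of where x is centered, not of the data alone, which is exactly what the two scatter plots above show.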




        • #5
          Dear Clyde, thank you very much for this example and the crystal-clear explanation!
          Best regards,

          Marcos
