
  • Multicollinearity

    Hi,

    I wish to run a multiple linear regression with DV = abnormal returns,
    continuous IVs = ROA, board size, firm size, leverage, market-to-book, % independent directors,
    and dummy IVs: cash, cross-border.
    I wish to create an interaction term between ROA and % independent directors.
    My question is: because I have a multicollinearity problem, should I center (i.e., subtract the mean from) all continuous variables (DV + IVs), or just the variables included in the interaction term?
    Thank you in advance

  • #2
    In setting up interaction terms, the issue with centering has more to do with the interpretation of the results than with multicollinearity. If you have a model that includes ROA (whatever that is), % independent directors, and their interaction, then the coefficient of ROA represents the effect* of a unit increase in ROA on your DV conditional on % independent directors = 0. If firms rarely or never have 0% independent directors, this effectively makes the ROA coefficient meaningless, and it would make more sense to center % independent directors around some value (possibly the mean, but that is not the only reasonable choice) that occurs frequently in your data. Similarly, the coefficient of % independent directors will represent the effect of an increase of 1 percentage point in % independent directors on your DV conditional on ROA = 0. Once again, the question is whether 0 is a reasonable value of ROA: if it seldom or never occurs, you should center. If ROA = 0 is a realistic, reasonably common situation, then you can leave it alone. (In multi-level mixed-effects models there are additional implications of centering, but that doesn't seem to be in your plans, so I won't go into that.)
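    How the "main effect" coefficient depends on where the moderator is centered can be seen in a little toy calculation (pure Python; the coefficients are made up, with `x` standing in for ROA and `z` for % independent directors):

```python
# Toy model: y = b0 + bx*x + bz*z + bxz*x*z  (hypothetical coefficients)
def y(x, z, b0=1.0, bx=2.0, bz=3.0, bxz=0.5):
    return b0 + bx * x + bz * z + bxz * x * z

def effect_of_unit_x(z):
    """Change in y for a unit increase in x, holding z fixed."""
    return y(1, z) - y(0, z)  # algebraically: bx + bxz*z

# With z uncentered, the coefficient on x is the effect at z = 0:
print(effect_of_unit_x(0))      # 2.0, i.e. bx itself

# After centering z at, say, its mean of 0.4, the coefficient on x
# becomes the effect at z = 0.4 -- usually a more meaningful quantity:
print(effect_of_unit_x(0.4))    # 2.2, i.e. bx + bxz*0.4
```

    The fit of the model is identical either way; only which conditional effect the coefficient reports changes.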

    If you choose the most meaningful centering of your variables (which might be leaving them as is) and you then find problems with multicollinearity in your regression results (unusually high standard errors, high VIF), you can always deal with it by changing the centering and re-running: ordinary linear regression runs very quickly even on huge data sets. (Also, before re-centering, you should look at the correlation between ROA and % independent directors: if that is high, re-centering alone may not reduce the multicollinearity much.)
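    A quick illustration of why re-centering can help (pure Python, made-up data, with `x` and `z` standing in for the two components): when a variable sits far from zero, its raw product term is strongly correlated with the variable itself, while the product of the centered variables is not.

```python
# Made-up data: x trends upward far from zero; z alternates around its
# mean and is uncorrelated with x.
x = [10 + i for i in range(20)]
z = [4 if i % 2 == 0 else 6 for i in range(20)]

def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

raw_product = [xi * zi for xi, zi in zip(x, z)]
xc = [xi - mean(x) for xi in x]
zc = [zi - mean(z) for zi in z]
centered_product = [xi * zi for xi, zi in zip(xc, zc)]

print(corr(x, raw_product))        # high (about 0.83 for these data)
print(corr(xc, centered_product))  # essentially 0
```

    If x and z were themselves highly correlated, that correlation would survive centering, which is why re-centering alone may not be enough in that case.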

    As for centering variables not involved in interaction terms, a similar, but typically less critical consideration applies. The constant term in the regression represents the expected value of the DV when all of the independent variables are zero. So it may make sense to have zero be a realistic, reasonable value for each of the independent variables. But this can sometimes be ignored because sometimes you are just not interested in the constant term.

    Centering variables that are not part of interaction terms will do nothing to change any multicollinearity relationships (though, as already noted, this is unlikely to be a problem, and if it is, the solution lies elsewhere).
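    That is because correlation is unaffected by shifting a variable by a constant: corr(x - c, w) = corr(x, w) for any c. A pure-Python check with made-up numbers:

```python
def mean(v):
    return sum(v) / len(v)

def corr(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

x = [3.0, 5.0, 2.0, 8.0, 6.0]
w = [1.0, 4.0, 2.0, 7.0, 3.0]
x_centered = [xi - mean(x) for xi in x]  # subtracting the mean is just a constant shift

# The correlation with any other regressor is unchanged:
print(corr(x, w), corr(x_centered, w))
```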

    *I'm using the term effect, which ordinarily has causal connotations, here as a shorthand for "expected difference in mean outcome associated with unit difference in mean predictor" for convenience and brevity. My discussion is unchanged whether we are dealing with causal relationships or just associations.



    • #3
      I took Clyde's explanation as a great lecture on the matter, and a difficult matter at that, explained in a nutshell! I wonder if you, Clyde, could kindly go into the (intriguing to me) issue of centering variables in mixed models and panel data. Please.
      Best regards,

      Marcos



      • #4
        It's a long and complicated topic, and I think it requires more equations and graphs than are feasible to put into one of these posts. The ramifications of centering in multi-level models are far-reaching. To see just one example where centering has surprising effects, run this code:

        Code:
clear *
// CREATE A SIMULATED DATA SET FOR A
// RANDOM SLOPES MODEL
set seed 1234
// TOP LEVEL
set obs 100 // PANEL MEMBERS
gen int id = _n
gen u = rnormal(0, 1) // VARIATION IN INTERCEPT
gen v = rnormal(0, 0.5) // VARIATION IN SLOPE
// BOTTOM LEVEL
expand 20 // OBSERVATIONS PER PANEL MEMBER
by id, sort: gen x = _n // INDEPENDENT VARIABLE
gen e = rnormal(0, 0.25) // OBSERVATION LEVEL RESIDUAL
// LINEAR MODEL
gen y = 3 + (2+v)*x + u + e
// RECOVER THE MODEL
// EXAMINE THE RELATIONSHIP BETWEEN
// THE id: LEVEL SLOPE AND INTERCEPT
mixed y x || id: x, cov(unstructured)
predict uhat vhat, reffects
graph twoway scatter uhat vhat, name(uncentered, replace)
// DO IT AGAIN WITH X RE-CENTERED AT GRAND MEAN
summ x
gen x_c = x - `r(mean)'
mixed y x_c || id: x_c, cov(unstructured)
predict uhat_c vhat_c, reffects
graph twoway scatter uhat_c vhat_c, name(centered, replace)
        Notice how the covariance between the random intercept and the random slope changes dramatically with the centering. In some multilevel studies, one of the key questions is "is the base level of y associated with the rate of change in y?"; you can see that the substantive answer depends on the choice of centering (equivalently, on the choice of the meaning of "base").
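        The surprise has a simple algebraic core. Each panel member's line is y = a + b*x; re-centering x at a constant c turns that member's intercept into a + b*c, so Cov(intercept, slope) becomes Cov(a, b) + c*Var(b). A pure-Python check with made-up per-member intercepts and slopes:

```python
# Made-up per-panel-member intercepts (a) and slopes (b)
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 1.0, 4.0, 3.0]

def mean(v):
    return sum(v) / len(v)

def cov(u, w):
    mu, mw = mean(u), mean(w)
    return sum((ui - mu) * (wi - mw) for ui, wi in zip(u, w)) / len(u)

def shifted_intercepts(c):
    """Intercepts when x is measured from a new origin c."""
    return [ai + bi * c for ai, bi in zip(a, b)]

print(cov(shifted_intercepts(0), b))   # 0.75  (original origin)
print(cov(shifted_intercepts(10), b))  # 13.25 (same data, x re-centered at 10)
```

        So the intercept-slope covariance (and even its sign) is a property of where x is centered, not of the data alone, which is exactly what the two scatter plots above show.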




        • #5
          Dear Clyde, thank you very much for this example and the crystal-clear explanation!
          Best regards,

          Marcos
