
  • -collin- in a probit model with interaction terms?

    Dear listers,

    I am working on a paper in economics (industrial organization) whose empirical strategy relies on a probit model.

    My model is

    P(y = 1 | X) = Φ(β1·x1 + β2·x2 + β3·x1·x2 + β4·x3 + β5·x1·x3 + β6·x4 + β7·x1·x4 + ψ + u)


    For information:
    x1 is binary
    x2, x3 and x4 are continuous.
    x2 is approximately normally distributed, as it was log-transformed, whereas x3 and x4 are left-skewed.
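
    In Stata factor-variable notation, a model of this form can be written roughly as below (a sketch: c1 and c2 stand in for the controls in ψ, and the ## operator expands each pair into main effects plus their interaction):

    Code:
    probit y i.x1##c.x2 i.x1##c.x3 i.x1##c.x4 c1 c2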

    I was worried about potential collinearity arising from the (relatively) strong correlation between some of the variables - for example, between x2 and one of the controls included in ψ (let's call it c1).

    Code:
                 |   y       x1        x2      c1        x3         x4
    -------------+------------------------------------------------------
             y   |   1.0000 
                 |
                 |
             x1  |   0.4694   1.0000 
                 |   0.0000
                 |
             x2  |   0.4252   0.7958   1.0000 
                 |   0.0000   0.0000
                 |
              c1 |   0.1646   0.2233   0.4109   1.0000 
                 |   0.0000   0.0000   0.0000
                 |
              x3 |   0.1820   0.1661   0.2712   0.2887   1.0000 
                 |   0.0000   0.0000   0.0000   0.0000
                 |
              x4 |  -0.1412  -0.0592  -0.0333  -0.0135   0.0469   1.0000 
                 |   0.0000   0.0771   0.3203   0.7204   0.1615
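
    For reference, a matrix like the one above, with p-values beneath each correlation, is what -pwcorr- with the sig option produces; presumably something like:

    Code:
    pwcorr y x1 x2 c1 x3 x4, sig
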
    I was then unsure whether I should run a collinearity test or just ignore the issue, since collinearity would arise anyway due to the presence of three interaction terms.
    I had the impression that I should have a look at it anyway, so I ran -collin- and, of course, obtained high VIFs for some interaction terms and their respective main variables (particularly x2 and x1, which are highly correlated).
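
    Since -collin- (from SSC) takes a plain variable list rather than factor-variable notation, the interaction terms have to exist as generated variables; a sketch of how that might look (names illustrative):

    Code:
    * build the interaction terms as ordinary variables
    generate x2x1 = x2*x1
    generate x1x3 = x1*x3
    generate x1x4 = x1*x4
    collin x2 x1 x2x1 x3 x1x3 x4 x1x4 c1 c2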

    Code:
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
          x2     16.68    4.08    0.0600      0.9400
          x1     58.58    7.65    0.0171      0.9829
       x2*x1     30.47    5.52    0.0328      0.9672
          x3      1.40    1.18    0.7132      0.2868
       x1*x3      1.38    1.18    0.7238      0.2762
          x4      1.56    1.25    0.6404      0.3596
       x1*x4      2.53    1.59    0.3946      0.6054
          c1      1.33    1.15    0.7514      0.2486
          c2      1.03    1.01    0.9708      0.0292
    ----------------------------------------------------
      Mean VIF     12.77

    What I thought I could do:
    1) Just ignore it.
    2) Center the continuous variables at zero and rerun the test.
    3) Run the test without the interaction terms (with the continuous variables centered at zero), my idea being to check whether the collinearity diagnostics would still point to something worrisome if there were no interaction terms.
    I did 2 and 3.
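
    For step 2, the centering can be sketched roughly as follows (variable names illustrative; the interactions are rebuilt from the centered continuous terms):

    Code:
    * center each continuous variable at its sample mean
    foreach v of varlist x2 x3 x4 {
        summarize `v', meanonly
        generate `v'_c = `v' - r(mean)
    }
    * rebuild the interactions from the centered variables
    generate x2x1_c = x2_c*x1
    generate x1x3_c = x1*x3_c
    generate x1x4_c = x1*x4_c
    collin x2_c x1 x2x1_c x3_c x1x3_c x4_c x1x4_c c1 c2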


    Code:
                           SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
            x2     16.68    4.08    0.0600      0.9400
            x1      3.42    1.85    0.2923      0.7077
         x2*x1     11.74    3.43    0.0852      0.9148
            x3      1.40    1.18    0.7132      0.2868
         x1*x3      1.38    1.18    0.7238      0.2762
            x4      1.56    1.25    0.6404      0.3596
         x1*x4      2.53    1.59    0.3946      0.6054
            c1      1.33    1.15    0.7514      0.2486
            c2      1.03    1.01    0.9708      0.0292
    ----------------------------------------------------
      Mean VIF      4.56
    
    
    Code:
                           SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
            x2      2.43    1.56    0.4110      0.5890
            x1      2.25    1.50    0.4448      0.5552
            x3      1.11    1.05    0.9013      0.0987
            x4      1.02    1.01    0.9835      0.0165
            c1      1.22    1.10    0.8207      0.1793
            c2      1.03    1.01    0.9739      0.0261
    ----------------------------------------------------
      Mean VIF      1.51


    My question is: is this approach correct? Should I do something else? Or just accept the fact that interaction terms will make collinearity arise and ignore everything?

    Thank you in advance for your help.

    Best,

    Jo

  • #2
    Collinearity is one of the most overrated problems in statistics. It isn't worth putting this kind of effort into detecting it. Actually, it isn't worth putting any effort into detecting it. Here's why.

    If the variables involved in near-collinear relationships are not the focus of the research question, and are included in the model only because they need to be adjusted for, the adjustments still work perfectly well: there is no adverse effect on the estimates for the other variables.

    If one of the variables you really want a good estimate for is involved in the near-collinear relationship, then you do have a problem. But it is also a problem you can't do anything about. How would it be a problem? You would see a large standard error for the coefficient of that variable, implying that you have only an imprecise estimate of the coefficient you are interested in. You don't need a VIF or a correlation matrix to tell you this; you just have to look at your regression output. If you have a satisfactorily small standard error for the variables of interest, you are in good shape and there is nothing more to say. If you don't, then your study will not achieve its goals. Is there anything you can do to fix it? No, not with the existing data. Your only way out is either to gather a much larger data set that will give you an adequate standard error for your estimates, or to scrap the study and start over with a different design in which the nature of the sampling breaks the collinear relationship. Those are the only options. There are no fixes within the data.

    The situation in models with interaction terms is an even more compelling case for ignoring multicollinearity. It's inevitable that x1 and x2 will both correlate with x1*x2. Centering can help a little, but it doesn't eliminate the problem. Centering is usually a good idea for other reasons anyway. For example, without centering, the coefficient of x2 is a measure of the association of x2 with the outcome conditional on x1 = 0. But if 0 is not within the range of observed values of x1 (as is often the case with real-world data, and sometimes it isn't even a possible value), then the coefficient of x2 is meaningless. By contrast, if you center x1, the coefficient of x2 is a measure of the association of x2 with the outcome conditional on x1 = the mean value of x1 (or the median, or whatever center point you choose). Well, that's an interesting statistic! The same applies to marginal effects. If you center x1, calculating the marginal effect of x2 at x1_centered = 0 gives you a good, efficient estimate of that marginal effect. If you don't center x1 and 0 is not in its range, you won't bother calculating the marginal effect of x2 at x1 = 0. But when you try to calculate the marginal effect of x2 at x1 = the mean of observed x1, your estimate will be less efficient. So centering variables in interaction terms is really recommended regardless of collinearity issues.

    So center your continuous interaction variables and look at your regression output. Either you have usable results or you don't. If you do, then proceed. If not, start planning a bigger data collection or a new design.
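
    In Stata, this advice might translate into something like the following (a sketch, assuming centered variables named as in the earlier step; variable names are illustrative, and factor-variable notation lets -margins- account for the interactions automatically):

    Code:
    probit y i.x1##c.x2_c i.x1##c.x3_c i.x1##c.x4_c c1 c2
    * average marginal effect of each covariate on Pr(y = 1)
    margins, dydx(*)
    * marginal effect of x2 with the other covariates held at their means
    margins, dydx(x2_c) atmeans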



    • #3
      Thank you so much for such a complete answer, Clyde Schechter. I really appreciate it! Have a great day!



      • #4
        Paul Allison has a blog entry on this. Clyde has already covered many of the main points.

        http://statisticalhorizons.com/multicollinearity

        Page 4 of my own handout has suggestions for dealing with collinearity:

        http://www3.nd.edu/~rwilliam/stats2/l11.pdf
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://academicweb.nd.edu/~rwilliam/



        • #5
          Thank you very much for the references, Richard Williams.
          I forgot to mention that I had already gone through some references (as recommended in the FAQ), including your material, but I was still not confident that my approach was OK.
          Now I am much more confident in what I will present. Thank you very much again!



          • #6
            Originally posted by Guest

            Hi Joe,

            I am working on a similar model with interactions. The -collin- command with the -perturb- prefix gives me the error "no interaction allowed". Would you mind sharing your code for how you introduced the interaction terms?

            Thank you in advance for your help.

            Best

            Jeanette



            • #7
              Jo [not "Joe"] has evidently unregistered, which is why their posts appear as "Guest". That doesn't stop others answering, but the probability of a reply from the OP appears negligible.

