
  • -collin- in a probit model with interaction terms?

    Dear listers,

    I am working on a paper in economics (industrial organization) whose empirical strategy relies on a probit model.

    My model is

    P(y = 1 | X) = Φ(β1·x1 + β2·x2 + β3·x1·x2 + β4·x3 + β5·x1·x3 + β6·x4 + β7·x1·x4 + ψ + u)


    For information:
    x1 is binary
    x2, x3 and x4 are continuous.
    x2 is approximately normally distributed, as it was log-transformed, whereas x3 and x4 are left-skewed.
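
    In Stata factor-variable notation, a model of this form can be written roughly as below (a sketch: c1 and c2 stand in for the controls in ψ, and the ## operator expands each pair into main effects plus their interaction):

    Code:
    probit y i.x1##c.x2 i.x1##c.x3 i.x1##c.x4 c1 c2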

    I was worried about potential collinearity arising from the (relatively) strong correlation between some of the variables - for example, between x2 and one of the controls included in ψ (let's call it c1).

    Code:
                 |   y       x1        x2      c1        x3         x4
    -------------+------------------------------------------------------
             y   |   1.0000 
                 |
                 |
             x1  |   0.4694   1.0000 
                 |   0.0000
                 |
             x2  |   0.4252   0.7958   1.0000 
                 |   0.0000   0.0000
                 |
              c1 |   0.1646   0.2233   0.4109   1.0000 
                 |   0.0000   0.0000   0.0000
                 |
              x3 |   0.1820   0.1661   0.2712   0.2887   1.0000 
                 |   0.0000   0.0000   0.0000   0.0000
                 |
              x4 |  -0.1412  -0.0592  -0.0333  -0.0135   0.0469   1.0000 
                 |   0.0000   0.0771   0.3203   0.7204   0.1615
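
    For reference, a matrix like the one above, with p-values beneath each correlation, is what -pwcorr- with the sig option produces; presumably something like:

    Code:
    pwcorr y x1 x2 c1 x3 x4, sig
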
    I was then unsure whether I should run a collinearity test or just ignore the issue, since collinearity would arise anyway due to the presence of three interaction terms.
    I had the impression that I should have a look at it anyway, so I ran -collin- and, of course, obtained high VIFs for some interaction terms and their respective main variables (particularly x2 and x1, which are highly correlated).
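
    Since -collin- (from SSC) takes a plain variable list rather than factor-variable notation, the interaction terms have to exist as generated variables; a sketch of how that might look (names illustrative):

    Code:
    * build the interaction terms as ordinary variables
    generate x2x1 = x2*x1
    generate x1x3 = x1*x3
    generate x1x4 = x1*x4
    collin x2 x1 x2x1 x3 x1x3 x4 x1x4 c1 c2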

    Code:
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
          x2     16.68    4.08    0.0600      0.9400
          x1     58.58    7.65    0.0171      0.9829
       x2*x1     30.47    5.52    0.0328      0.9672
          x3      1.40    1.18    0.7132      0.2868
       x1*x3      1.38    1.18    0.7238      0.2762
          x4      1.56    1.25    0.6404      0.3596
       x1*x4      2.53    1.59    0.3946      0.6054
          c1      1.33    1.15    0.7514      0.2486
          c2      1.03    1.01    0.9708      0.0292
    ----------------------------------------------------
      Mean VIF     12.77

    What I thought I could do:
    1) Just ignore it.
    2) Center the continuous variables at zero and rerun the test.
    3) Run the test without the interaction terms (with the continuous variables centered at zero), my idea being to check whether the collinearity diagnostics would still point to something worrisome if there were no interaction terms.
    I did 2 and 3.
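
    For step 2, the centering can be sketched roughly as follows (variable names illustrative; the interactions are rebuilt from the centered continuous terms):

    Code:
    * center each continuous variable at its sample mean
    foreach v of varlist x2 x3 x4 {
        summarize `v', meanonly
        generate `v'_c = `v' - r(mean)
    }
    * rebuild the interactions from the centered variables
    generate x2x1_c = x2_c*x1
    generate x1x3_c = x1*x3_c
    generate x1x4_c = x1*x4_c
    collin x2_c x1 x2x1_c x3_c x1x3_c x4_c x1x4_c c1 c2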


    Code:
                           SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
            x2     16.68    4.08    0.0600      0.9400
            x1      3.42    1.85    0.2923      0.7077
         x2*x1     11.74    3.43    0.0852      0.9148
            x3      1.40    1.18    0.7132      0.2868
         x1*x3      1.38    1.18    0.7238      0.2762
            x4      1.56    1.25    0.6404      0.3596
         x1*x4      2.53    1.59    0.3946      0.6054
            c1      1.33    1.15    0.7514      0.2486
            c2      1.03    1.01    0.9708      0.0292
    ----------------------------------------------------
      Mean VIF      4.56
    
    
    Code:
                           SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
            x2      2.43    1.56    0.4110      0.5890
            x1      2.25    1.50    0.4448      0.5552
            x3      1.11    1.05    0.9013      0.0987
            x4      1.02    1.01    0.9835      0.0165
            c1      1.22    1.10    0.8207      0.1793
            c2      1.03    1.01    0.9739      0.0261
    ----------------------------------------------------
      Mean VIF      1.51


    My question is: is this approach correct? Should I do something else? Or just accept the fact that interaction terms will make collinearity arise and ignore everything?

    Thank you in advance for your help.

    Best,

    Jo

  • #2
    Collinearity is one of the most overrated problems in statistics. It isn't worth putting this kind of effort into detecting it. Actually, it isn't worth putting any effort into detecting it. Here's why.

    If the variables involved in near-collinear relationships are not the focus of the research question, and are included in the model only because they need to be adjusted for, the adjustments still work perfectly well: there is no adverse effect on the estimates for the other variables.

    If one of the variables you really want a good estimate for is involved in the near-collinear relationship, then you do have a problem. But it is also a problem you can't do anything about. How would it be a problem? You would see a large standard error for the coefficient of that variable, implying that you have only an imprecise estimate of the coefficient you are interested in. You don't need a VIF or a correlation matrix to tell you this; you just have to look at your regression output. If you have a satisfactorily small standard error for the variables of interest, you are in good shape and there is nothing more to say. If you don't, then your study will not achieve its goals. Is there anything you can do to fix it? No, not with the existing data. Your only way out is either to gather a much larger data set that will give you an adequate standard error for your estimates, or to scrap the study and start over with a different design in which the nature of the sampling breaks the collinear relationship. Those are the only options. There are no fixes within the data.

    The situation in models with interaction terms is an even more compelling case for ignoring multicollinearity. It's inevitable that x1 and x2 will both correlate with x1*x2. Centering can help a little, but it doesn't eliminate the problem. Centering is usually a good idea for other reasons anyway. For example, without centering, the coefficient of x2 is a measure of the association of x2 with the outcome conditional on x1 = 0. But if 0 is not within the range of observed values of x1 (as is often the case with real-world data, and sometimes it isn't even a possible value), then the coefficient of x2 is meaningless. By contrast, if you center x1, the coefficient of x2 is a measure of the association of x2 with the outcome conditional on x1 = the mean value of x1 (or the median, or whatever center point you choose). Well, that's an interesting statistic! The same applies to marginal effects. If you center x1, calculating the marginal effect of x2 at x1_centered = 0 gives you a good, efficient estimate of that marginal effect. If you don't center x1 and 0 is not in its range, you won't bother calculating the marginal effect of x2 at x1 = 0. But when you try to calculate the marginal effect of x2 at x1 = the mean of observed x1, your estimate will be less efficient. So centering variables in interaction terms is really recommended regardless of collinearity issues.

    So center your continuous interaction variables and look at your regression output. Either you have usable results or you don't. If you do, then proceed. If not, start planning a bigger data collection or a new design.
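
    In Stata, this advice might translate into something like the following (a sketch, assuming centered variables named as in the earlier step; variable names are illustrative, and factor-variable notation lets -margins- account for the interactions automatically):

    Code:
    probit y i.x1##c.x2_c i.x1##c.x3_c i.x1##c.x4_c c1 c2
    * average marginal effect of each covariate on Pr(y = 1)
    margins, dydx(*)
    * marginal effect of x2 with the other covariates held at their means
    margins, dydx(x2_c) atmeans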



    • #3
      Thank you so much for such a complete answer, Clyde Schechter. I really appreciate it! Have a great day!



      • #4
        Paul Allison has a blog entry on this. Clyde has already covered many of the main points.

        http://statisticalhorizons.com/multicollinearity

        Page 4 of my own handout has suggestions for dealing with collinearity:

        http://www3.nd.edu/~rwilliam/stats2/l11.pdf
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://academicweb.nd.edu/~rwilliam/



        • #5
          Thank you very much for the references, Richard Williams.
          I forgot to mention that I had already gone through some references (as recommended in the FAQ), including your material, but I was still not confident that my approach was OK.
          Now I am much more confident in what I will present. Thank you very much again!



          • #6
            Originally posted by Guest

            Hi Joe,

            I am working on a similar model with interactions. The -collin- command with the -perturb- prefix gives me the error "no interaction allowed". Would you mind sharing your code for how you introduced the interaction terms?

            Thank you in advance for your help.

            Best

            Jeanette



            • #7
              Jo [not "Joe"] has evidently unregistered, which is why their posts appear as "Guest". That doesn't stop others answering, but the probability of a reply from the OP appears negligible.

