  • Multicollinearity in binary logistic regression

    Dear Statalist Forum,

    I'm running a binary logistic regression (the independent variables are dichotomous and continuous) and want to test the independent variables for multicollinearity. Given that I cannot use VIF, I have read that the collin command is useful for logistic regression. When I type collin followed by all the independent variables, I get very low VIFs (maximum 2.45). Is this sufficient to show that multicollinearity is very low in my model?

    Thank you very much in advance!
    Maria

  • #2
    I think even people who believe in looking at VIF would agree that 2.45 is sufficiently low.

    That said, VIF is a waste of time. In fact, worrying about multicollinearity is almost always a waste of time. It is the most overrated "problem" in statistics, in my opinion. There are basically two different situations with multicollinearity:

    1. There is some multicollinearity among variables that have been included, not because they are of interest in their own right, but because you want to adjust for their effects. Crucially, the key variables you are concerned about are not involved. In this case, it doesn't matter how collinear those variables are. Including them will adequately adjust for their effects, regardless of the collinearity, and the collinearity does not in any way adversely affect your estimates for the uninvolved variables that you actually care about. So this kind of collinearity is completely irrelevant.

    2. There is multicollinearity that does involve one or more of the variables you are actually interested in. This may, indeed, be a problem. But if it is a problem, it is one that, for practical purposes, has no solution. To tell whether it is a problem, all you have to do is look at the standard errors (or, equivalently, the 95% CIs) of the coefficient estimate(s) for the variable(s) of interest. If the standard error is small enough (or the CI is narrow enough) that you have a sufficiently precise estimate of the effects of your key variables for the purposes at hand, then there is no problem. After all, the only thing collinearity does is inflate the standard errors of the involved variables: it makes their effect estimates less precise. But if your results are precise enough for your purposes, then there is nothing more to say.

    On the other hand, if you are left with a gaping confidence interval (large standard error) and your estimate(s) of your key effect(s) are so imprecise that they are not useful, then you have a problem. Unfortunately, there is no practical way to solve that problem. You cannot simply omit the variables that are collinear, because you will likely be left with severe omitted variable bias. You could solve the problem with a larger sample, but in this circumstance the required sample size is typically much, much larger than the sample you have--and presumably if it were easy/affordable to get more data, you would have done so in the first place. So usually enlarging the sample is not feasible. The other approach is to scrap everything and start over with a new study design that breaks the collinearity among these variables--typically one involving matching or stratified sampling. But that is an entirely new study.

    So, bottom line: forget about VIF. Just look at the standard errors (confidence intervals) for the key variables in your study. If you have adequate precision, you're fine, end of story. If you don't, you're sunk, end of story.
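
    For instance, here is a minimal sketch of what that looks like in practice, using purely hypothetical variable names (outcome y, key predictor xkey, adjustment variables c1-c3):
    Code:
    * hypothetical variables: y (outcome), xkey (key predictor), c1-c3 (adjustments)
    logit y xkey c1 c2 c3
    * replay in the odds-ratio metric, which is often easier to judge for precision
    logit, or
    The standard error and 95% CI reported for xkey are all you need to look at; any collinearity among c1-c3 is beside the point.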
    Last edited by Clyde Schechter; 22 Jun 2017, 09:23.



    • #3
      Maria: I agree 100% with Clyde, whose arguments are compelling. If you are interested in additional reading on this topic, see this piece on Art Goldberger and his ideas on multicollinearity and "micronumerosity."

      http://davegiles.blogspot.com/2011/0...umerosity.html



      • #4
        Paul Allison has a good blog entry on this. But like Clyde, I would be even less concerned than Allison is:

        https://statisticalhorizons.com/multicollinearity

        Some more thoughts are at

        http://www3.nd.edu/~rwilliam/stats2/l11.pdf
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        Stata Version: 17.0 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          Thank you so much! That was all I was looking for! I just have one question left: How exactly should I look at the standard errors? What exactly do you mean by "adequate precision"? Is there an exact value for interpretation?
          Regards, Maria



          • #6
            The exact value for interpretation depends on your research goals. Here's how I would look at it. You are running these analyses for some reason. You want to estimate some effect(s), and somebody might take certain actions based on the results. (It might be some immediate action, or it might be something as remote as planning a different study in the future, or something in between.) The results of your study are there to guide those actions. Examine the confidence intervals and ask yourself: would it make any practical difference in the real world if the true value were at the lower end of the confidence interval rather than the upper end? Would anybody do anything differently? If not, then you have adequate precision. If people might act differently depending on where in the interval the truth lies, then precision is insufficient.

            Now, sometimes we do analyses for purely theoretical reasons and we are basically just curious about the magnitude of some effect(s), with no actions contingent on them. In that case, any degree of precision is acceptable: you just report the result with your confidence interval and say, in effect, "this is what we know, with this degree of uncertainty, based on this study."
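
            To make that concrete with purely hypothetical numbers: if the 95% CI for the odds ratio on your key variable runs from 1.05 to 1.15, the conclusion is the same at either end of the interval (a modest increase in the odds), so the precision is adequate. If instead it runs from 0.60 to 2.50, the data are consistent with the variable being protective, irrelevant, or harmful--anyone acting on the result might act differently depending on where in that interval the truth lies, so the precision is insufficient.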



            • #7
              Dear Statalist users,
              I am regressing a binary variable on a set of continuous variables using a logit model. I realised that 2 of my main independent variables are correlated (0.5 correlation). When one is used alone, it has the expected sign. However, when I add the other variable, the sign on the first one changes.
              I do not want to drop any of my variables.
              Is there a simple way to solve this?
              Thank you



              • #8
                What is the command for checking multicollinearity in binary logistic regression?



                • #9
                  Hello Kensley Ndovi. You could use Phil Ender's collin package.
                  Code:
                  net describe collin, from(https://stats.oarc.ucla.edu/stat/stata/ado/analysis)
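                  * -net describe- only displays the package description; to install it, run:
                  * net install collin, from(https://stats.oarc.ucla.edu/stat/stata/ado/analysis)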
                  But be careful to use only the estimation sample. E.g.,
                  Code:
                  logit foreign weight mpg price
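                  * e(sample) flags the observations used by the preceding -logit-,
                  * so -collin- runs on the same estimation sample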
                  collin weight mpg price if e(sample)
                  Output from -collin-:
                  Code:
                  collin weight mpg price if e(sample)
                  (obs=74)
                  
                    Collinearity Diagnostics
                  
                                          SQRT                   R-
                    Variable      VIF     VIF    Tolerance    Squared
                  ----------------------------------------------------
                      weight      3.17    1.78    0.3155      0.6845
                         mpg      2.88    1.70    0.3469      0.6531
                       price      1.42    1.19    0.7066      0.2934
                  ----------------------------------------------------
                    Mean VIF      2.49
                  
                                             Cond
                          Eigenval          Index
                  ---------------------------------
                      1     3.7380          1.0000
                      2     0.1988          4.3362
                      3     0.0589          7.9693
                      4     0.0043         29.4270
                  ---------------------------------
                   Condition Number        29.4270
                   Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
                   Det(correlation matrix)    0.2462
                  --
                  Bruce Weaver
                  Email: [email protected]
                  Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                  Version: Stata/MP 18.0 (Windows)



                  • #10
                    Kensley: Why are you checking for multicollinearity? Are the estimates you care about too imprecise to be useful?

