  • How to estimate multicollinearity among factor variables after logit regression?

    Hello,

    I estimated the VIF after logit regression using the -collin- command. However, this command does not allow factor variables, and my model includes two factor variables. How do I assess the VIF of these two variables? Is it okay if factor variables are left out when estimating multicollinearity?
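
    A minimal sketch of the setup being described (variable names here are hypothetical):

    Code:
    * hypothetical names; -collin- is user-written (see -search collin-)
    logit outcome i.factor1 i.factor2 control1
    collin i.factor1 i.factor2 control1   // rejected: -collin- does not accept factor-variable notation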

    Thank you

  • #2
    If you create the actual dummy indicators by hand and put them into the regression, then collin should work:

    Code:
    * assuming Stata's nlsw88 practice dataset (2,246 obs, matching the output below)
    sysuse nlsw88, clear
    tab married, gen(m)    // dummies m1, m2 for the levels of married
    tab race, gen(r)       // dummies r1, r2, r3 for the levels of race
    logit union m2 r2 r3   // base levels m1 and r1 omitted
    collin m2 r2 r3
    Results:

    Code:
    . collin m2 r2 r3
    (obs=2,246)
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
            m2      1.05    1.02    0.9548      0.0452
            r2      1.05    1.03    0.9510      0.0490
            r3      1.00    1.00    0.9959      0.0041
    ----------------------------------------------------
      Mean VIF      1.03
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     2.1154          1.0000
        2     1.0009          1.4538
        3     0.7204          1.7136
        4     0.1633          3.5988
    ---------------------------------
     Condition Number         3.5988
     Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
     Det(correlation matrix)    0.9509
    However, it's actually not necessary. Collinearity is a property of the predictors alone, so the collinearity assessment from regress works just as well. Just change logit to regress:

    Code:
    reg union i.married i.race, base   // same predictors in a linear model
    estat vif                          // VIF depends only on the predictors, not the outcome model
    The results are similar to those above:

    Code:
    . estat vif
    
        Variable |       VIF       1/VIF  
    -------------+----------------------
       1.married |      1.04    0.957121
            race |
              2  |      1.05    0.952794
              3  |      1.00    0.995290
    -------------+----------------------
        Mean VIF |      1.03
    Is it okay if factor variables are left out when estimating multicollinearity?
    This depends. Collinearity can be an issue, but more often than not its detrimental effects are exaggerated. The levels within a single categorical variable tend to be collinear with one another; that is simply a consequence of indicator coding. Subgroups can be aggregated to reduce it, but that does not always make sense. Collinearity between two different categorical variables can happen as well (e.g., three income levels and ever having received government subsistence during childhood), and those cases can be more conceptually interesting. In any case, a high VIF inflates the standard error and, in turn, the p-value. If the variables with high VIF are not the main independent variables of interest, or the p-value is already below the designated threshold, VIF investigation can perhaps be assigned a lower priority.
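
    As a rough illustration of that last point (simulated data, all names hypothetical): the standard error of a coefficient scales with the square root of its VIF, so a VIF has to get quite large before it does real damage.

    Code:
    * simulated data; all names are hypothetical
    clear
    set seed 12345
    set obs 1000
    gen x1 = rnormal()
    gen x2 = 0.8*x1 + 0.6*rnormal()   // corr(x1, x2) is about 0.8
    gen y  = 0.5*x1 + 0.5*x2 + rnormal()
    regress y x1 x2
    estat vif                         // VIF about 2.8; SEs inflated by sqrt(2.8), about 1.7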

    • #3
      If the variables with high VIF are not the main independent variables of interest, or the p-value is already below the designated threshold, VIF investigation can perhaps be assigned a lower priority.
      I would take it a step further. For a lengthier, and very entertaining, elaboration, see Arthur Goldberger's textbook A Course in Econometrics, which has a chapter devoted to demolishing the whole issue of multicollinearity.

      In a nutshell, as Ken Chui pointed out, if the variables involved are not the main independent variable of interest, or if they are but the confidence intervals around them are already sufficiently narrow that your results are useful, then quantifying or otherwise exploring the multicollinearity serves no purpose. The research goals have been achieved without any further information about the multicollinearity. If, on the other hand, the multicollinearity involves the main predictor variable and the confidence interval around it is too wide, then you have a multicollinearity problem. But there is nothing you can do about it. As Goldberger so effectively points out, multicollinearity is misnamed: it should be called micronumerosity. Because what it really tells you is that, for variables as strongly correlated with each other as the ones you are analyzing, your data set is too small. The only solutions are to get a larger data set (usually a much larger one is needed), or to use a different study design altogether that breaks the multicollinearity through carefully stratified or matched sampling. Either way, your present study is simply inconclusive and you are back to the drawing board.
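
      A rough sketch of that point with simulated data (all names hypothetical): the VIF depends only on the correlation among the predictors, so it barely moves as the sample grows, while the confidence intervals narrow.

      Code:
      * simulated data; names are hypothetical
      clear
      set seed 2023
      set obs 100
      gen x1 = rnormal()
      gen x2 = 0.95*x1 + sqrt(1 - 0.95^2)*rnormal()   // heavily collinear pair
      gen y  = x1 + x2 + rnormal()
      regress y x1 x2   // wide confidence intervals
      estat vif         // VIF around 10

      * the same data-generating process with 100 times the sample
      clear
      set seed 2023
      set obs 10000
      gen x1 = rnormal()
      gen x2 = 0.95*x1 + sqrt(1 - 0.95^2)*rnormal()
      gen y  = x1 + x2 + rnormal()
      regress y x1 x2   // same coefficients, much narrower intervals
      estat vif         // VIF essentially unchanged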

      So don't waste your time even calculating VIF in the first place. If your confidence intervals for the main predictor variable(s) are satisfactory, there is no issue. If they aren't, you are stuck any way you look at it. OK, in the last situation, calculating VIF might be one way to identify which variables are causing the problem, and that, in turn, might guide the design of your next study. But that's the best it can do for you. The present study is simply inadequate to the task and is not salvageable.

      • #4
        Thank you both
        Clyde Schechter: "If your confidence intervals for the main predictor variable(s) are satisfactory, there is no issue." What would be an acceptable range for a confidence interval to deem it satisfactory? In my case, I have cross-sectional data with groups, and my key explanatory variable varies at the group level. When I estimate the regression with group dummies, the VIF of my explanatory variable is hugely inflated (around 28,000), and the confidence interval is correspondingly wide (-273 to -30). However, when I estimate the regression without group dummies, the VIF of my key explanatory variable is only 4. What should I do in this case?
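
        For concreteness, a sketch of the mechanics at work (simulated data, hypothetical names): a regressor that varies almost entirely at the group level is nearly collinear with a full set of group dummies, which is what drives the enormous VIF.

        Code:
        * simulated data; names are hypothetical
        clear
        set seed 1
        set obs 2000
        gen group = ceil(_n/100)           // 20 groups of 100 observations
        gen x = group + rnormal(0, 0.05)   // x varies almost only across groups
        gen y = 0.5*x + rnormal()
        regress y x i.group
        estat vif    // the VIF for x explodes: x is nearly collinear with the dummies
        regress y x
        estat vif    // without group dummies, the VIF is unremarkable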

        Last edited by Laiy Kho; 29 Jan 2023, 02:07.

        • #5
          Originally posted by Laiy Kho
          What would be an acceptable range for confidence interval for one to deem it to be satisfactory?
          The interval is satisfactory if you call it satisfactory.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------

          • #6
            Let me elaborate on Maarten Buis' response. It is indeed satisfactory if you call it satisfactory. When would you call it satisfactory? To use your own example, where the confidence interval runs from -273 to -30, the question becomes: does it matter in the real world whether the result is really -273 or really -30? Would you, or any reasonable person, do anything differently under those different circumstances? If the answer is no, then the confidence interval is acceptable. If the answer is yes, then the conclusion is that the data and model do not provide enough information to pin down that variable's effect precisely enough for practical purposes.
