  • How to estimate multicollinearity among factor variables after logit regression?

    Hello,

    I estimated the VIF after logit regression using the -collin- command. However, this command does not allow factor variables, and my model includes two factor variables. How do I assess the VIF of these two variables? Is it okay if factor variables are left out when estimating multicollinearity?
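
    A minimal sketch of the setup being described (variable names here are hypothetical):

    Code:
    * hypothetical names; -collin- is user-written (see -search collin-)
    logit outcome i.factor1 i.factor2 control1
    collin i.factor1 i.factor2 control1   // rejected: -collin- does not accept factor-variable notation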

    Thank you

  • #2
    If you create the actual dummy indicators by hand and put them into the regression, then collin should work:

    Code:
    * assuming Stata's nlsw88 practice dataset (2,246 obs, matching the output below)
    sysuse nlsw88, clear
    tab married, gen(m)    // dummies m1, m2 for the levels of married
    tab race, gen(r)       // dummies r1, r2, r3 for the levels of race
    logit union m2 r2 r3   // base levels m1 and r1 omitted
    collin m2 r2 r3
    Results:

    Code:
    . collin m2 r2 r3
    (obs=2,246)
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
            m2      1.05    1.02    0.9548      0.0452
            r2      1.05    1.03    0.9510      0.0490
            r3      1.00    1.00    0.9959      0.0041
    ----------------------------------------------------
      Mean VIF      1.03
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     2.1154          1.0000
        2     1.0009          1.4538
        3     0.7204          1.7136
        4     0.1633          3.5988
    ---------------------------------
     Condition Number         3.5988
     Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
     Det(correlation matrix)    0.9509
    However, it's actually not necessary. Collinearity is a property of the predictors alone, so the collinearity assessment from regress works just as well. Just change logit to regress:

    Code:
    reg union i.married i.race, base   // same predictors in a linear model
    estat vif                          // VIF depends only on the predictors, not the outcome model
    The results are similar to those above:

    Code:
    . estat vif
    
        Variable |       VIF       1/VIF  
    -------------+----------------------
       1.married |      1.04    0.957121
            race |
              2  |      1.05    0.952794
              3  |      1.00    0.995290
    -------------+----------------------
        Mean VIF |      1.03
    Is it okay if factor variables are left out when estimating multicollinearity?
    This depends. Collinearity can be an issue, but more often than not its detrimental effects are exaggerated. The levels within a single categorical variable tend to be collinear with one another; that is simply a consequence of indicator coding. Subgroups can be aggregated to reduce it, but that does not always make sense. Collinearity between two different categorical variables can happen as well (e.g., three income levels and ever having received government subsistence during childhood), and those cases can be more conceptually interesting. In any case, a high VIF inflates the standard error and, in turn, the p-value. If the variables with high VIF are not the main independent variables of interest, or the p-value is already below the designated threshold, VIF investigation can perhaps be assigned a lower priority.
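
    As a rough illustration of that last point (simulated data, all names hypothetical): the standard error of a coefficient scales with the square root of its VIF, so a VIF has to get quite large before it does real damage.

    Code:
    * simulated data; all names are hypothetical
    clear
    set seed 12345
    set obs 1000
    gen x1 = rnormal()
    gen x2 = 0.8*x1 + 0.6*rnormal()   // corr(x1, x2) is about 0.8
    gen y  = 0.5*x1 + 0.5*x2 + rnormal()
    regress y x1 x2
    estat vif                         // VIF about 2.8; SEs inflated by sqrt(2.8), about 1.7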

    • #3
      If the variables with high VIF are not the main independent variables of interest, or the p-value is already below the designated threshold, VIF investigation can perhaps be assigned a lower priority.
      I would take it a step further. For a lengthier, and very entertaining, elaboration, see Arthur Goldberger's textbook A Course in Econometrics, which has a chapter devoted to demolishing the whole issue of multicollinearity.

      In a nutshell, as Ken Chui pointed out, if the variables involved are not the main independent variable of interest, or if they are but the confidence intervals around them are already sufficiently narrow that your results are useful, then quantifying or otherwise exploring the multicollinearity serves no purpose. The research goals have been achieved without any further information about the multicollinearity. If, on the other hand, the multicollinearity involves the main predictor variable and the confidence interval around it is too wide, then you have a multicollinearity problem. But there is nothing you can do about it. As Goldberger so effectively points out, multicollinearity is misnamed: it should be called micronumerosity. Because what it really tells you is that, for variables as strongly correlated with each other as the ones you are analyzing, your data set is too small. The only solutions are to get a larger data set (usually a much larger one is needed), or to use a different study design altogether that breaks the multicollinearity through carefully stratified or matched sampling. Either way, your present study is simply inconclusive and you are back to the drawing board.
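
      A rough sketch of that point with simulated data (all names hypothetical): the VIF depends only on the correlation among the predictors, so it barely moves as the sample grows, while the confidence intervals narrow.

      Code:
      * simulated data; names are hypothetical
      clear
      set seed 2023
      set obs 100
      gen x1 = rnormal()
      gen x2 = 0.95*x1 + sqrt(1 - 0.95^2)*rnormal()   // heavily collinear pair
      gen y  = x1 + x2 + rnormal()
      regress y x1 x2   // wide confidence intervals
      estat vif         // VIF around 10

      * the same data-generating process with 100 times the sample
      clear
      set seed 2023
      set obs 10000
      gen x1 = rnormal()
      gen x2 = 0.95*x1 + sqrt(1 - 0.95^2)*rnormal()
      gen y  = x1 + x2 + rnormal()
      regress y x1 x2   // same coefficients, much narrower intervals
      estat vif         // VIF essentially unchanged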

      So don't waste your time even calculating VIF in the first place. If your confidence intervals for the main predictor variable(s) are satisfactory, there is no issue. If they aren't, you are stuck any way you look at it. OK, in the last situation, calculating VIF might be one way to identify which variables are causing the problem, and that, in turn, might guide the design of your next study. But that's the best it can do for you. The present study is simply inadequate to the task and is not salvageable.

      • #4
        Thank you both
        Clyde Schechter: "If your confidence intervals for the main predictor variable(s) are satisfactory, there is no issue." What would be an acceptable range for a confidence interval to deem it satisfactory? In my case, I have cross-sectional data with groups, and my key explanatory variable varies at the group level. When I estimate the regression with group dummies, the VIF of my explanatory variable is hugely inflated (around 28,000), and the confidence interval is correspondingly wide (-273 to -30). However, when I estimate the regression without group dummies, the VIF of my key explanatory variable is only 4. What should I do in this case?
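
        For concreteness, a sketch of the mechanics at work (simulated data, hypothetical names): a regressor that varies almost entirely at the group level is nearly collinear with a full set of group dummies, which is what drives the enormous VIF.

        Code:
        * simulated data; names are hypothetical
        clear
        set seed 1
        set obs 2000
        gen group = ceil(_n/100)           // 20 groups of 100 observations
        gen x = group + rnormal(0, 0.05)   // x varies almost only across groups
        gen y = 0.5*x + rnormal()
        regress y x i.group
        estat vif    // the VIF for x explodes: x is nearly collinear with the dummies
        regress y x
        estat vif    // without group dummies, the VIF is unremarkable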

        Last edited by Laiy Kho; 29 Jan 2023, 02:07.

        • #5
          Originally posted by Laiy Kho
          What would be an acceptable range for confidence interval for one to deem it to be satisfactory?
          The interval is satisfactory if you call it satisfactory.
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------

          • #6
            Let me elaborate on Maarten Buis' response. It is indeed satisfactory if you call it satisfactory. When would you call it satisfactory? To use your own example, where the confidence interval runs from -273 to -30, the question becomes: does it matter in the real world whether the result is really -273 or really -30? Would you, or any reasonable person, do anything differently under those different circumstances? If the answer is no, then the confidence interval is acceptable. If the answer is yes, then the conclusion is that the data and model do not provide enough information to pin down that variable's effect precisely enough for practical purposes.
