
  • Multi-collinearity test for independent categorical variables in logistic regression

    Dear esteemed members,

    I have a dataset where I am assessing predictors of uptake of optimal doses of fansidar in pregnant women as a chemo-preventive therapy for malaria in sub-Saharan Africa. I have 24 explanatory variables and all are categorical.

    I want to test for multicollinearity but when I use vif command it says "not appropriate after regress, nocons; use option uncentered to get uncentered VIFs". Please help.


  • #2
    Hello Steven,

    Welcome to the Stata Forum / Statalist.

    Please present command and output, as recommended in the FAQ.

    That said, it seems you are using the -regress- command where you should use -logistic-, since your model is a logistic regression.

    Besides, you used the option "nocons" for a linear regression, but it seems you need a logistic regression.

    In short, you may wish to type:

    Code:
    . help logistic
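
    Something along these lines, for instance (a minimal sketch, where y stands for a binary outcome and x1, x2 for categorical predictors, not your actual variables):

    Code:
    logistic y i.x1 i.x2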
    Best regards,

    Marcos



    • #3
      Thank you Marcos. Here is the command:
      xi:logistic uptaksp i.zone i.hiedulev i.agecat i.occup2 i.marital3 i.religion3 ///
      i.tribe3 i.firstpreg i.parity i.firstvist2 i.evheardsp2 i.novistanc3 i.gravida4 ///
      i.tknowleg2 i.dknowleg3 i.dot2 i.permhf2 i.moneytrt i.distance i.transport ///
      i.escort2 i.nofem2 i.nohthprsn2 i.nodrug2

      output omitted.

      vif
      Output: not appropriate after regress, nocons; use option uncentered to get uncentered VIFs r(301);



      • #4
        Steven:
        as Marcos pointed out, you are using a -regress postestimation- command which is not supported after -logistic- (please, see -help logistic postestimation-).
        Besides, -xi- in your code is redundant, as -fvvarlist- notation takes care of all the matter.
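        For instance, your model in #3 can be specified directly with factor-variable notation (a sketch that simply drops -xi- from your command):
        Code:
        logistic uptaksp i.zone i.hiedulev i.agecat i.occup2 i.marital3 i.religion3 ///
            i.tribe3 i.firstpreg i.parity i.firstvist2 i.evheardsp2 i.novistanc3 i.gravida4 ///
            i.tknowleg2 i.dknowleg3 i.dot2 i.permhf2 i.moneytrt i.distance i.transport ///
            i.escort2 i.nofem2 i.nohthprsn2 i.nodrug2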
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thanks Carlo. I will check the help option.



          • #6
            The -vif- command is designed to run only after -regress- without the -nocons- option. It isn't really clear why it was written that way, but that is how it is. But multicollinearity is purely an independent variable phenomenon and is independent of the particular regression model being used. So the way to get this is to run -regress- with your variables (don't use -nocons-) and then run -vif-. The regression results themselves should be ignored if what you really need is a logistic model. But the "variance inflation factors" are the same no matter what regression you are running (though, clearly in the case of a logistic model there is no "variance inflation" to talk about).
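
            A sketch of that workaround, with placeholder names (y for the binary outcome, x1 to x3 for the categorical predictors); substitute your own outcome and predictor list:
            Code:
            * run -regress- with the constant (that is, without the nocons option), purely to obtain the VIFs;
            * the regression coefficients themselves can be ignored if a logistic model is what you actually need
            regress y i.x1 i.x2 i.x3
            vif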

            All of that said, the best thing to do about multicollinearity is not to test for it. Multicollinearity is a highly overrated statistical "problem." There are two types of multicollinearity. One of them is simply not a problem at all, and the other is a problem that usually can't be solved. So neither one is worth looking for.

            If multicollinearity arises among variables whose coefficients do not need to be estimated precisely (that is, they are covariates whose effects are being adjusted for but are not of interest in their own right), then the multicollinearity is not a problem at all. You still get adjustment for all those effects, and in fact, the adjustment you get by including them all is better than it would be if you dropped one or more of the variables to make them less multicollinear! So you can just ignore this.

            You may have a problem, however, if the multicollinear variables include a predictor whose effect is of actual importance to the research goals and must be estimated with reasonable precision. In the presence of multicollinearity, the standard error of that coefficient will be increased (that is precisely the variance that gets inflated), so your estimate will have decreased precision. So then you have to look at that standard error and decide whether the precision you have is adequate for your purposes, or not. If it's adequate, then the multicollinearity hasn't really harmed you and you can, again, ignore it.

            This brings us to the case where you do have a problem: the resulting standard error is large enough that you don't have an estimate of the effect that is precise enough to achieve your research goals. But then, there isn't actually anything you can do about it. Eliminating some of the other variables with which it is multicollinear will improve (reduce) your standard error, but this is accomplished at the cost of introducing omitted variable (confounding) bias into your estimate. (If the variables you drop to reduce multicollinearity are not actually associated with the outcome variable, then you don't introduce omitted variable bias, but in that case there was no good reason to include them in the model in the first place.) The only truly effective solutions to the problem are either to get a much larger data set (which is usually infeasible), or to scrap the data you have and do a new study with a design such as matched pairs or stratified sampling that adjusts for these confounders in ways that do not require including them in the regression analysis.

            So my advice is to just skip this. Don't include covariates in your model if they aren't really needed to reduce confounding. After running the regression, examine the standard errors of the effects that you need precise estimates of. If they're good, then you're fine and happy. If they're not, then you really aren't going to be able to achieve your goals with the data in hand anyway.
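
            As a rough sketch of that check (placeholder names again: x1 is the predictor whose effect actually matters, x2 and x3 are covariates included only to reduce confounding):
            Code:
            logistic y i.x1 i.x2 i.x3
            * look at the standard errors / confidence intervals reported for the i.x1 terms;
            * if they are precise enough for your purposes, any multicollinearity has not harmed you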



            • #7
              Thank you Clyde for your valuable comments and suggestions. The effects of most of my covariates are being adjusted for, hence as per your suggestion I will not worry about multicollinearity.



              • #8
                Dear Clyde,

                May I take advantage of this interesting discussion to ask a related question for my own research?
                I am running a Poisson regression with some geographical and socio-economic (SE) predictors.
                All variables are categorical.
                I was asked to check for multicollinearity between the SE variables; I tried different commands, but I don't know how to interpret the results:
                1. First I tried -collin-. But Stata does not accept it when I declare my variables as categorical (with "i.") in the -collin- command. My question is: may I use -collin- with categorical variables that are not declared as categorical? Is the interpretation of the resulting VIF still correct?
                Code:
                collin education housingm empl4m if sex==1 & natbe==1
                (obs=6,483,813)
                
                  Collinearity Diagnostics
                
                                        SQRT                   R-
                  Variable      VIF     VIF    Tolerance    Squared
                ----------------------------------------------------
                 education      1.28    1.13    0.7839      0.2161
                  housingm      1.04    1.02    0.9573      0.0427
                    empl4m      1.26    1.12    0.7933      0.2067
                ----------------------------------------------------
                  Mean VIF      1.19
                
                                           Cond
                        Eigenval          Index
                ---------------------------------
                    1     3.3677          1.0000
                    2     0.3593          3.0617
                    3     0.2023          4.0800
                    4     0.0707          6.8998
                ---------------------------------
                 Condition Number         6.8998
                 Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
                 Det(correlation matrix)    0.7646
                2. Second I tried -coldiag2-. But I don't know how to interpret the output.
                Code:
                coldiag2 education housingm empl4m if sex==1 & natbe==1

                
                Condition number using scaled variables =         6.90
                
                Condition Indexes and Variance-Decomposition Proportions
                
                condition
                   index     _cons education  housingm    empl4m                                                                                                              
                1  1.00      0.01      0.02      0.01      0.03
                2  3.06      0.06      0.02      0.07      0.65
                3  4.08      0.02      0.95      0.04      0.33
                4  6.90      0.91      0.01      0.87      0.00
                
                

                So, could you please help me answer the question: is there multicollinearity among those 3 variables?

                Thanks a lot! Françoise


                • #9
                  -collin- is not part of official Stata. It is a user-written command. Since I consistently advocate against testing for collinearity, it probably won't surprise you to hear that I have not invested the time to download and learn about a user-written command that does that. But if you insist on doing it, you can do this:

                  Code:
                  regress whatever housingm empl4m education if sex == 1 & natbe == 1
                  vif
                  The VIFs will be valid. Useless, in my opinion, but valid.

