
  • Multi-collinearity test for independent categorical variables in logistic regression

    Dear esteemed members,

    I have a dataset where I am assessing predictors of uptake of optimal doses of fansidar in pregnant women as a chemo-preventive therapy for malaria in sub-Saharan Africa. I have 24 explanatory variables and all are categorical.

    I want to test for multicollinearity but when I use vif command it says "not appropriate after regress, nocons; use option uncentered to get uncentered VIFs". Please help.


  • #2
    Hello Steven,

    Welcome to the Stata Forum / Statalist.

    Please present command and output, as recommended in the FAQ.

    That said, it seems you are using the -regress- command where you should use -logistic-, since your model is a logistic regression.

    Besides, you used the option "nocons" for a linear regression, but it seems you need a logistic regression.

    In short, you may wish to type:

    Code:
    . help logistic
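
    Something along these lines, for instance (a minimal sketch, where y stands for a binary outcome and x1, x2 for categorical predictors, not your actual variables):

    Code:
    logistic y i.x1 i.x2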
    Best regards,

    Marcos



    • #3
      Thank you Marcos. Here is the command:
      xi:logistic uptaksp i.zone i.hiedulev i.agecat i.occup2 i.marital3 i.religion3 ///
      i.tribe3 i.firstpreg i.parity i.firstvist2 i.evheardsp2 i.novistanc3 i.gravida4 ///
      i.tknowleg2 i.dknowleg3 i.dot2 i.permhf2 i.moneytrt i.distance i.transport ///
      i.escort2 i.nofem2 i.nohthprsn2 i.nodrug2

      output omitted.

      vif
      Output: not appropriate after regress, nocons; use option uncentered to get uncentered VIFs r(301);



      • #4
        Steven:
        as Marcos pointed out, you are using a -regress postestimation- command which is not supported after -logistic- (please, see -help logistic postestimation-).
        Besides, -xi- in your code is redundant, as -fvvarlist- notation takes care of all the matter.
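        For instance, your model in #3 can be specified directly with factor-variable notation (a sketch that simply drops -xi- from your command):
        Code:
        logistic uptaksp i.zone i.hiedulev i.agecat i.occup2 i.marital3 i.religion3 ///
            i.tribe3 i.firstpreg i.parity i.firstvist2 i.evheardsp2 i.novistanc3 i.gravida4 ///
            i.tknowleg2 i.dknowleg3 i.dot2 i.permhf2 i.moneytrt i.distance i.transport ///
            i.escort2 i.nofem2 i.nohthprsn2 i.nodrug2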
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thanks Carlo. I will check the help option.



          • #6
            The -vif- command is designed to run only after -regress- without the -nocons- option. It isn't really clear why it was written that way, but that is how it is. But multicollinearity is purely an independent variable phenomenon and is independent of the particular regression model being used. So the way to get this is to run -regress- with your variables (don't use -nocons-) and then run -vif-. The regression results themselves should be ignored if what you really need is a logistic model. But the "variance inflation factors" are the same no matter what regression you are running (though, clearly in the case of a logistic model there is no "variance inflation" to talk about).
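
            A sketch of that workaround, with placeholder names (y for the binary outcome, x1 to x3 for the categorical predictors); substitute your own outcome and predictor list:
            Code:
            * run -regress- with the constant (that is, without the nocons option), purely to obtain the VIFs;
            * the regression coefficients themselves can be ignored if a logistic model is what you actually need
            regress y i.x1 i.x2 i.x3
            vif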

            All of that said, the best thing to do about multicollinearity is not to test for it. Multicollinearity is a highly overrated statistical "problem." There are two types of multicollinearity. One of them is simply not a problem at all, and the other is a problem that usually can't be solved. So neither one is worth looking for.

            If multicollinearity arises among variables whose coefficients do not need to be estimated precisely (that is, they are covariates whose effects are being adjusted for but are not of interest in their own right), then the multicollinearity is not a problem at all. You still get adjustment for all those effects, and in fact, the adjustment you get by including them all is better than it would be if you dropped one or more of the variables to make them less multicollinear! So you can just ignore this.

            You may have a problem, however, if the multicollinear variables include a predictor whose effect is of actual importance to the research goals and must be estimated with reasonable precision. In the presence of multicollinearity, the standard error of that coefficient will be increased (that is precisely the variance that gets inflated), so your estimate will have decreased precision. So then you have to look at that standard error and decide whether the precision you have is adequate for your purposes, or not. If it's adequate, then the multicollinearity hasn't really harmed you and you can, again, ignore it.

            This brings us to the case where you do have a problem: the resulting standard error is large enough that you don't have an estimate of the effect that is precise enough to achieve your research goals. But then, there isn't actually anything you can do about it. Eliminating some of the other variables with which it is multicollinear will improve (reduce) your standard error, but this is accomplished at the cost of introducing omitted variable (confounding) bias into your estimate. (If the variables you drop to reduce multicollinearity are not actually associated with the outcome variable, then you don't introduce omitted variable bias, but in that case there was no good reason to include them in the model in the first place.) The only truly effective solutions to the problem are either to get a much larger data set (which is usually infeasible), or to scrap the data you have and do a new study with a design such as matched pairs or stratified sampling that adjusts for these confounders in ways that do not require including them in the regression analysis.

            So my advice is to just skip this. Don't include covariates in your model if they aren't really needed to reduce confounding. After running the regression, examine the standard errors of the effects that you need precise estimates of. If they're good, then you're fine and happy. If they're not, then you really aren't going to be able to achieve your goals with the data in hand anyway.
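
            As a rough sketch of that check (placeholder names again: x1 is the predictor whose effect actually matters, x2 and x3 are covariates included only to reduce confounding):
            Code:
            logistic y i.x1 i.x2 i.x3
            * look at the standard errors / confidence intervals reported for the i.x1 terms;
            * if they are precise enough for your purposes, any multicollinearity has not harmed you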



            • #7
              Thank you Clyde for your valuable comments and suggestions. The effects of most of my covariates are being adjusted for, hence as per your suggestion I will not worry about multicollinearity.



              • #8
                Dear Clyde,

                May I take advantage of this interesting discussion to ask a related question for my own research?
                I am running a Poisson regression with some geographical and socio-economic (SE) predictors.
                All variables are categorical.
                I was asked to check for multicollinearity between the SE variables; I tried different commands, but I don't know how to interpret the results:
                1. First I tried -collin-. But Stata does not accept it when I declare my variables as categorical (with "i.") in the -collin- command. My question is: may I use -collin- with categorical variables that are not declared as categorical? Is the interpretation of the resulting VIF still correct?
                Code:
                collin education housingm empl4m if sex==1 & natbe==1
                (obs=6,483,813)
                
                  Collinearity Diagnostics
                
                                        SQRT                   R-
                  Variable      VIF     VIF    Tolerance    Squared
                ----------------------------------------------------
                 education      1.28    1.13    0.7839      0.2161
                  housingm      1.04    1.02    0.9573      0.0427
                    empl4m      1.26    1.12    0.7933      0.2067
                ----------------------------------------------------
                  Mean VIF      1.19
                
                                           Cond
                        Eigenval          Index
                ---------------------------------
                    1     3.3677          1.0000
                    2     0.3593          3.0617
                    3     0.2023          4.0800
                    4     0.0707          6.8998
                ---------------------------------
                 Condition Number         6.8998
                 Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
                 Det(correlation matrix)    0.7646
                2. Second I tried -coldiag2-. But I don't know how to interpret the output.
                Code:
                coldiag2 education housingm empl4m if sex==1 & natbe==1

                
                Condition number using scaled variables =         6.90
                
                Condition Indexes and Variance-Decomposition Proportions
                
                condition
                   index     _cons education  housingm    empl4m                                                                                                              
                1  1.00      0.01      0.02      0.01      0.03
                2  3.06      0.06      0.02      0.07      0.65
                3  4.08      0.02      0.95      0.04      0.33
                4  6.90      0.91      0.01      0.87      0.00
                
                

                So, could you please help me answer the question: is there multicollinearity among those 3 variables?

                Thanks a lot! Françoise


                • #9
                  -collin- is not part of official Stata. It is a user-written command. Since I consistently advocate against testing for collinearity, it probably won't surprise you to hear that I have not invested the time to download and learn about a user-written command that does that. But if you insist on doing it, you can do this:

                  Code:
                  regress whatever housingm empl4m education if sex == 1 & natbe == 1
                  vif
                  The VIFs will be valid. Useless, in my opinion, but valid.

