  • Understanding high VIF

    Hello,

    I am having a problem with a variable showing consistently high VIF scores.

    I am conducting a regression looking at predictors of dyslexia. I have one variable (parents' highest NVQ) which always shows high VIF scores, even when entered alone into the model. From my knowledge of VIF, I cannot understand why one variable alone would show high VIFs. The variable has 5 categories, with the number of dyslexics in each cell ranging from 7 to 124. Would anyone be able to shed light on why this variable is causing these issues, and suggest any potential solutions?

    Thanks in advance.

  • #2
    When you have a multi-level categorical variable with an unbalanced distribution, you will see a high VIF. Think of it this way. Look at the level that represents only 7 people. Suppose it represented 0 people: it would be a constant, and would be dropped because of complete collinearity with the constant term in the model. Well, with only 7 people it is nearly a constant, so it will correlate highly with the model's constant term.
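
    To see the mechanism concretely, here is a minimal sketch, not your actual data or code: it assumes pandas and statsmodels are available, and the variable names and exact cell sizes are hypothetical, chosen only to echo the 7-to-124 range you describe. With the 7-person level as the omitted reference category, the remaining dummies come out with noticeably inflated VIFs, even though nothing else is in the model.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        # Hypothetical 5-level variable with very unbalanced cells (7 to 124 people),
        # echoing the situation described in the question.
        counts = {"level1": 7, "level2": 30, "level3": 50, "level4": 90, "level5": 124}
        nvq = pd.Series(np.repeat(list(counts), list(counts.values())))

        # Dummy-code it, using the tiny 7-person level as the omitted reference
        # category, and add the model's constant term.
        X = sm.add_constant(pd.get_dummies(nvq, dtype=float).drop(columns="level1"))

        # VIF of each dummy against the constant and the other dummies: the dummies
        # are nearly collinear with the constant because almost everyone falls into
        # one of them, so the VIFs come out well above the usual cutoffs.
        for i, name in enumerate(X.columns):
            if name != "const":
                print(name, round(variance_inflation_factor(X.values, i), 1))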

    So what to do? Probably nothing. If NVQ is just included in the model to adjust for possible confounding effects, then it doesn't matter at all, and the lesson learned is that you shouldn't have wasted your time calculating VIF. If NVQ is a key variable in the model, one whose effects are the subject of the investigation, then you need to look at the confidence intervals around those NVQ coefficients. If they are narrow enough that you can draw satisfactory conclusions with this level of uncertainty in those coefficients, then there is no problem and you should just ignore the VIF result.

    If the confidence intervals are so wide that your conclusions are impaired by the uncertainty in the NVQ effects, then you have a problem. Unfortunately, there may or may not be anything you can do about it. One solution would be to combine that small category with some other category of your NVQ variable, if it makes scientific sense to do that. (You don't explain what NVQ is, so I can't even conjecture about this.) You might need to do that with some other small categories. Another solution, which might be even better and worth doing even if there were no VIF issue, is to not use a categorized NVQ variable. It seems that NVQ is the "highest" value of something, which implies that there is an underlying numerical variable. If you actually have that underlying numerical continuous variable, it would be better to just use that variable instead of categorizing it. Categorizing variables just throws away information and, sometimes, introduces bias.
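
    Both of these options can be illustrated by extending the same hypothetical sketch as above (again assuming pandas/statsmodels, with made-up names and cell sizes): merging the 7-person level into a neighbouring category, or entering NVQ as a single numeric score, brings the VIFs back down.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm
        from statsmodels.stats.outliers_influence import variance_inflation_factor

        counts = {"level1": 7, "level2": 30, "level3": 50, "level4": 90, "level5": 124}
        nvq = pd.Series(np.repeat(list(counts), list(counts.values())))

        # Option 1: merge the 7-person level into a neighbouring category before
        # dummy-coding, so that no indicator is nearly constant.
        merged = nvq.replace({"level1": "level2"})
        Xm = sm.add_constant(pd.get_dummies(merged, dtype=float).drop(columns="level2"))
        for i, name in enumerate(Xm.columns):
            if name != "const":
                print(name, round(variance_inflation_factor(Xm.values, i), 1))

        # Option 2: enter NVQ as a single ordered numeric score (if the underlying
        # scale makes sense); entered alone alongside the constant, its VIF is 1.
        score = nvq.str.replace("level", "").astype(float).to_frame("nvq_score")
        Xs = sm.add_constant(score)
        print("nvq_score", round(variance_inflation_factor(Xs.values, 1), 1))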

    If you can't do either of these solutions, then I think the only other thing you can do is get more data, or scrap your existing study and gather a different sample which oversamples people in the small categories (and weights the analysis accordingly).

    Tip for the future: When evaluating your regression results, the useful diagnostic for multicollinearity is the standard errors (or confidence intervals) around the coefficients of the key variables in the model. If those are too wide, multicollinearity might be the cause, and in that situation VIF might rule that in or out (as opposed to other problems with the model or data), and might suggest what other variables are involved. But if the key variables' confidence intervals are narrow enough for your purposes, then calculating VIF might raise your anxiety, but it won't provide any useful information, so don't do it.
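
    As a sketch of that workflow, using the same hypothetical data as above with a simulated outcome purely for illustration: fit the model and read the confidence intervals first, and only reach for VIF if they are too wide for your question.

        import numpy as np
        import pandas as pd
        import statsmodels.api as sm

        rng = np.random.default_rng(0)
        counts = {"level1": 7, "level2": 30, "level3": 50, "level4": 90, "level5": 124}
        nvq = pd.Series(np.repeat(list(counts), list(counts.values())))
        X = sm.add_constant(pd.get_dummies(nvq, dtype=float).drop(columns="level1"))

        # Simulated outcome, purely for illustration.
        y = 0.5 * X["level5"] + rng.normal(size=len(X))

        # Look at the confidence intervals on the key coefficients first; compute
        # VIF only if these are too wide to answer your question.
        fit = sm.OLS(y, X).fit()
        print(fit.conf_int())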

    • #3
      Thank you very much for your advice. NVQ refers to parents' education level. Following your advice, I have entered this as a continuous variable, and this lowers the collinearity significantly! Thanks so much for clarifying this; anxiety levels lowered!
