  • correlation test

    Dear all,

    May I ask for some advice, please?

    How can we test the correlation between a continuous and a discrete (binary) independent variable?
    Can we use the command "pwcorr x1 x2 x3" in Stata?
    If the correlation is statistically significant, can we still use those variables in the regression model?
    Is it better to use a pairwise correlation or a VIF test to show whether our regression model has a multicollinearity problem?

    Thanks
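As an aside on the first question: the correlation between a continuous variable and a 0/1 binary variable is the point-biserial correlation, which is numerically just the Pearson correlation computed with the binary variable coded 0/1 (so pwcorr's arithmetic applies to it). A minimal Python sketch with made-up illustrative data (variable names and values are hypothetical, not from the thread):

```python
import math
import statistics

def pearson(a, b):
    # Plain Pearson correlation; with b coded 0/1 this is the
    # point-biserial correlation of a continuous and a binary variable.
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) *
                           sum((y - mb) ** 2 for y in b))

# continuous measurement x, binary group indicator g (0/1) -- made-up data
x = [2.1, 3.4, 1.8, 4.0, 3.6, 2.9, 4.4, 3.1]
g = [0,   1,   0,   1,   1,   0,   1,   0]
print(f"point-biserial r = {pearson(x, g):.3f}")
```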

  • #2
    If your concern about correlation among these variables relates to multicollinearity in a regression model, then running a correlation among them is unlikely to give you much, if any, useful information. Even VIF has only a limited role to play.

    Multicollinearity is a phenomenon that gets far more attention than it deserves. To understand why, you have to understand how multicollinearity actually affects a regression. There are three important facts:
    1. It does not introduce bias at all.
    2. It only affects the coefficients and standard errors of the variables that are part of the collinearity; nothing happens to the other variables in the model.
    3. Its effect on an involved variable is to increase the standard error of that variable's coefficient. The increased standard error means that the confidence interval around the coefficient estimate is widened. In simple words: it makes the estimates for the affected variables less precise.
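Fact 3 can be quantified. In the two-predictor case, a predictor with pairwise correlation r to the other predictor has VIF = 1 / (1 - r^2), and its coefficient's standard error is multiplied by sqrt(VIF) relative to an uncorrelated design. A small Python sketch of that textbook formula (not Stata output):

```python
import math

def se_inflation(r):
    # Two-predictor case: VIF = 1 / (1 - r^2); the coefficient's standard
    # error is multiplied by sqrt(VIF) relative to an uncorrelated design.
    vif = 1.0 / (1.0 - r * r)
    return math.sqrt(vif)

for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  ->  SE multiplied by {se_inflation(r):6.2f}")
```

Note that even r = 0.5 inflates the standard error by only about 15%, which is one reason moderate correlations are rarely worth worrying about.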

    So multicollinearity is only a problem if 1) a key variable, for which precise coefficient estimates are necessary to answer the research question(s), is involved in the multicollinearity, and 2) the confidence interval is, as a result, so wide that your estimates are too imprecise to support an answer to the research question(s).

    VIF can play a limited role in diagnosing multicollinearity. If, after you run your regression, a key variable (not just a "control variable") has a confidence interval so wide that your ability to answer your research question is hindered, then multicollinearity might be a reason for that. (Or it might be due to other things: small sample size, noisy outcome variable, unmodeled interactions, etc.) If it isn't obvious whether the key variable is actually involved in a multicollinearity, then VIF can tell you that. (A correlation matrix will not tell you that, because you can have a bunch of variables that are highly multicollinear even though their individual pairwise correlation coefficients are unimpressive.)
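The parenthetical point, that a correlation matrix can miss multicollinearity, is easy to demonstrate. In this Python sketch (simulated data, illustrative only), a fifth variable is constructed as the exact sum of four independent ones: every pairwise correlation is a modest 0.5, yet the fifth variable is perfectly collinear with the others, so its VIF is infinite.

```python
import math
import random
import statistics

random.seed(1)
n = 5000
# four mutually independent standard-normal predictors
xs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(4)]
# a fifth predictor that is an exact linear combination of the others
x5 = [sum(col) for col in zip(*xs)]

def corr(a, b):
    # sample Pearson correlation
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - ma) ** 2 for ai in a) *
                           sum((bi - mb) ** 2 for bi in b))

for j, xj in enumerate(xs, start=1):
    print(f"corr(x5, x{j}) = {corr(x5, xj):.3f}")  # each about 0.5
# Yet regressing x5 on x1..x4 gives R^2 = 1 exactly (x5 is their sum),
# so VIF(x5) = 1 / (1 - R^2) is infinite despite the modest pairwise
# correlations above.
```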

    The bad news, if you find that you do have a multicollinearity problem, is that there is usually no feasible solution. There are only two ways to fix it: either get a larger data set (usually a much larger sample is needed, perhaps an order of magnitude or more larger), or start over with a different sampling design that breaks the multicollinearity. There is no way to solve the problem within the data set that has it. (No, you cannot simply omit the variables that the key variable is collinear with, because then you introduce omitted variable bias.)
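The "much larger sample" remark can be made precise. Since the standard error scales like sqrt(VIF / n), recovering the precision of an uncorrelated design requires multiplying n by the VIF itself. A quick sketch of that arithmetic (textbook two-predictor case, not a general rule for every design):

```python
def required_n_multiplier(r):
    # Two-predictor case: VIF = 1 / (1 - r^2). Since SE ~ sqrt(VIF / n),
    # n must grow by a factor of VIF to restore the uncorrelated-design SE.
    return 1.0 / (1.0 - r * r)

for r in (0.5, 0.9, 0.99):
    print(f"r = {r:4.2f}  ->  need {required_n_multiplier(r):6.1f}x the sample")
```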

    But if your key variable's coefficient estimates are sufficiently precise to allow you to answer your research question, then there is no multicollinearity problem, even if there is some multicollinearity in the data. In that case, you should not waste any time assessing multicollinearity. Accept success and move on.



    • #3
      Hello Rissa avita, and welcome to Statalist. Adding to Clyde's comments, here are a couple of resources you may find helpful. The second one, which relates to Clyde's comment about increasing sample size, could be described as statistical satire. Cheers,
      Bruce
      --
      Bruce Weaver
      Email: [email protected]
      Version: Stata/MP 18.5 (Windows)

