
  • #16
    Clyde Schechter
    Makes sense. For example, when collecting gender data, one should ideally design the sampling so that it captures self-reported gender identities other than cis man and cis woman. Otherwise sex and self-reported gender identity will be perfectly correlated.

    A) Is it good practice to run a correlation analysis before the regression and to report the correlation coefficients as well? Most of the information is available from the regression coefficients, but correlation coefficients can indicate the direction of an association, help explain large standard errors for some variables, and point to possible multicollinearity.

    B) Is Pearson correlation a good choice when both variables are binary?
    Last edited by sandeep kaur; 21 Jul 2022, 19:25.


    • #17
      A) When you are doing exploratory data analysis, particularly if the variables involved are unfamiliar or are scaled in non-intuitive ways, then examining the correlations first may prove helpful. Similarly, looking at correlations between variables may be helpful in understanding the results of a regression when there are unexpectedly large standard errors--multicollinearity is a common cause for this. In general, though, it would be unusual to publish both the correlations and the regression results.

      B) Yes.
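
A minimal sketch of why the answer to B is yes, written in Python rather than Stata and using made-up data: for two 0/1 variables, the ordinary Pearson formula gives the same number as the phi coefficient computed from the 2x2 table. The function names `pearson` and `phi` are illustrative, not from any library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def phi(x, y):
    """Phi coefficient from the 2x2 table of two 0/1 variables."""
    pairs = list(zip(x, y))
    n11 = pairs.count((1, 1))
    n10 = pairs.count((1, 0))
    n01 = pairs.count((0, 1))
    n00 = pairs.count((0, 0))
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

# Made-up binary data, for illustration only
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]

r = pearson(x, y)
p = phi(x, y)
print(r, p)  # the two values agree
```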


      • #18
        Clyde Schechter Thanks for clarifying concepts.


        • #19
          Clyde Schechter
          Can there be multicollinearity when the variables used in a regression model are categorical? I used some categorical variables to check for confounding in a logistic regression model, but the Stata output said that some of the variables were omitted because of multicollinearity. Aren't collinearity and multicollinearity only relevant for continuous variables?


          • #20
            No, colinearity and multicolinearity can occur with any kind of variable. A categorical variable, actually, in regressions, is just a series of numeric variables coded 0/1 ("dummy" variables). [Yes, there are other ways of representing categorical variables in regression, but they are much less commonly used, and all of them are numeric anyway.]
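
A minimal sketch of the dummy-variable representation, in Python rather than Stata. The variable `smoking` and its levels are hypothetical. It also illustrates a standard fact that follows from the representation: if every level gets its own indicator and the model has a constant term, the indicators and the constant are exactly colinear, which is why one level is left out as the reference.

```python
# Hypothetical three-level categorical variable, for illustration only
levels = ["never", "former", "current"]
smoking = ["never", "current", "former", "never", "current", "former"]

# One 0/1 indicator ("dummy") variable per level
dummies = {lev: [1 if v == lev else 0 for v in smoking] for lev in levels}
constant = [1] * len(smoking)  # the regression's constant term

n = len(smoking)

# With all three indicators, 1*never + 1*former + 1*current + (-1)*constant = 0
# for every observation: an exact colinear relationship.
all_levels = [sum(dummies[lev][i] for lev in levels) - constant[i] for i in range(n)]

# Dropping one level (here "never") breaks that particular relationship.
one_dropped = [sum(dummies[lev][i] for lev in levels[1:]) - constant[i] for i in range(n)]

print(all_levels)   # all zeros
print(one_dropped)  # not all zeros
```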

            Colinearity is a concept from linear algebra. If x1, x2, x3, ..., xn are variables, then they are colinear if there are constants c1, c2, c3, ..., cn such that c1*x1 + c2*x2 + c3*x3 + ... + cn*xn = 0. Whether the x's are only zeroes and ones, or range widely over the number line makes no difference. For example, in the post that began this thread, you had a sex variable and a gender variable. Sex was represented as 1 for male and 0 for female (or maybe it was the other way around, I don't recall), and gender had more categories than that, but they were not actually instantiated in your data. So, within the scope of your data, gender was coded 1 for cis male and 0 for cis female. Then the equation 1*sex + (-1)*gender = 0 establishes the colinearity of those two categorical variables in that particular data corpus.
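
The sex and gender example can be checked numerically. This is a minimal sketch in Python with made-up data, assuming, as described above, that only the two cis categories occur, so that the two variables carry identical 0/1 codes.

```python
# Made-up data: 1 = male / cis male, 0 = female / cis female
sex = [1, 0, 0, 1, 1, 0]
gender = list(sex)  # identical codes because no other gender category was observed

# Constants c1 = 1 and c2 = -1 give zero for every observation,
# so the two variables are colinear in this data set.
with_minus_one = [1 * s + (-1) * g for s, g in zip(sex, gender)]

# Constants c1 = 1 and c2 = +1 do not: the result is 2 for males, 0 for females.
with_plus_one = [1 * s + 1 * g for s, g in zip(sex, gender)]

print(with_minus_one)
print(with_plus_one)
```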

            Multicolinearity is a related, but distinct concept. A formal definition of it is complicated, and I don't want to go further into it than saying that variables have a multicolinearity relationship if they are "nearly" colinear, and I am not going to be specific about the meaning of "nearly." It is not true that Stata omits variables because of multicolinearity. It only omits variables when there is an exact colinear relationship: and then it omits the minimum number needed to disrupt the colinearity. When multicolinearity is present, the involved variables remain in the model. The problem that multicolinearity may present is that the resulting regression results may be too inexact to be useful for the purposes at hand. Whether that occurs depends on just how "nearly" colinear the variables are, and also on what the purposes at hand are--sometimes an imprecise estimate for these variables is good enough to meet the research goals, and sometimes it isn't.

            Important: when the multicolinearity leaves behind results that are sufficiently precise to meet the research goals, it isn't a problem, and there is no reason to do anything about it. The good news is that this is the case most of the time. Multicolinearity is fairly common, multicolinearity problems are uncommon. When multicolinearity does lead to results that are too inexact to meet the research goals, then the solution is to gather more data (which increases the precision of the regression) or use a different sampling design that eliminates the multicolinearity itself.
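
A rough sketch of how precision depends on both the degree of near-colinearity and the amount of data. It assumes the simplest case, an ordinary linear regression with a constant and two predictors, where the textbook standard error of the first coefficient is sigma / sqrt((n - 1) * s1^2 * (1 - r^2)). The function name `se_b1` and all input values are made up for illustration; logistic regression follows the same qualitative pattern but not this exact formula.

```python
from math import sqrt

def se_b1(sigma, n, s1, r):
    """Standard error of the coefficient on x1 in a linear regression
    with a constant and two predictors x1 and x2.
    sigma: residual standard deviation
    n:     sample size
    s1:    sample standard deviation of x1
    r:     correlation between x1 and x2
    """
    return sigma / sqrt((n - 1) * s1 ** 2 * (1 - r ** 2))

# Made-up inputs, for illustration only
baseline        = se_b1(sigma=1.0, n=100,  s1=0.5, r=0.0)
nearly_colinear = se_b1(sigma=1.0, n=100,  s1=0.5, r=0.95)
more_data       = se_b1(sigma=1.0, n=1000, s1=0.5, r=0.95)
less_correlated = se_b1(sigma=1.0, n=100,  s1=0.5, r=0.30)

print(baseline, nearly_colinear, more_data, less_correlated)
```

Under these assumptions, a high correlation between the predictors inflates the standard error, and either remedy mentioned above (more data, or a design with less correlated predictors) brings it back down.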


            • #21
              Clyde Schechter

              Thanks for your reply.

              A) the equation 1*sex + (-1)*gender = 0

              In the above equation, why is the constant for gender -1 and not +1?

              B) Most of the examples I have seen work with linear regression models and continuous variables.
              I am a bit confused about correlation, collinearity, and multicollinearity. How should I approach them for model building? In my study:
              the main predictor, sex, is binary
              the other predictors are binary or categorical
              the outcome is binary


              • #22
                A) Consider a male: sex = 1 and gender = 1. Then 1*sex + 1*gender would be 2, not 0. But 1*sex + (-1)*gender = 0, as required for colinearity.

                B) Dichotomous (binary) variables take on the values 0 and 1. Categorical variables with more than 2 levels (polytomous variables) are just a series of dichotomous variables: as many as there are levels, minus 1. The correlation between any two variables is calculated by a mathematical formula that I won't reproduce here because it's typographically difficult, but I'm sure you have seen it in your readings. It can be applied to any pair of variables: just plug the values of the variables into the formula and calculate it. It is not hard to prove, with a little bit of algebra, that if two variables are colinear, then the correlation between them is always either 1 or -1. The formula for correlation can be applied to all numeric variables. Whether they are categorical or continuous makes no difference at all.

                When more than two variables are involved in a colinearity relationship, then it is a bit harder to say things about their correlations. What you can say is that if you make a matrix out of their pairwise correlations, then the determinant of that matrix is 0.
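
A minimal sketch of the determinant statement, in Python with made-up data. Here x3 is constructed as x1 + x2, so the three variables are exactly colinear. None of the pairwise correlations is 1 or -1, yet the determinant of the correlation matrix is zero up to floating-point rounding. The helper names `pearson` and `det3` are illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Made-up data: two binary variables and their sum
x1 = [1, 0, 1, 0, 1, 1, 0, 0]
x2 = [0, 0, 1, 1, 1, 0, 1, 0]
x3 = [a + b for a, b in zip(x1, x2)]  # so 1*x1 + 1*x2 + (-1)*x3 = 0

cols = [x1, x2, x3]
corr = [[pearson(u, v) for v in cols] for u in cols]
d = det3(corr)

for row in corr:
    print(row)
print(d)  # zero up to rounding error
```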

                Multicolinearity, as I indicated a few posts back, is when a set of variables is "nearly" colinear, where, again, I want to be vague about what "nearly" actually means.

                To truly understand these things requires a course in linear algebra. And I think it is fair to say that to truly understand regression also requires a course in linear algebra. But even without that background, it is possible to use regression productively. The basic principles needed for that purpose are the ones I outlined in #20.


                • #23
                  Note: In the definition of colinearity given in #20, I should have added that among the constants c1, c2, ..., cn, at least one of them is not zero. Without that, any set of variables would be colinear: just take all the c's to be 0.


                  • #24
                    Clyde Schechter

                    Thanks for explaining it in detail. It's making sense to me now.
                    To summarize:

                    1. When two predictors are linearly related, it's collinearity, and it can be a problem during modelling.
                    2. Collinearity is not about continuous or binary/categorical variables. It's about numeric values. Any type of variable can be collinear.
                    3. If two variables are colinear, then the correlation between them is always either 1 or -1, and vice versa.
                    So one can either calculate the correlation coefficient first or prove collinearity directly.
                    4. Multicollinearity is when more than two variables are involved in a collinearity relationship, but it is hard to say anything about correlation in case of multicollinearity.

                    I will review the concept of linear algebra again


                    • #25
                      Clyde Schechter

                      I was wondering if you can recommend articles with discussion on:
                      1. correlation/collinearity and multicollinearity among binary/categorical predictor variables
                      2. Pearson correlation for binary and categorical predictor variables
                      3. VIF to check for multicollinearity among a mix of binary and categorical variables


                      • #26
                        "Multicollinearity is when more than two variables are involved in a collinearity relationship."
                        No. The word "multicolinearity" is an unfortunate term because its meaning is not what its morpheme composition suggests. The "multi" in multicolinearity has nothing to do with how many variables are involved. You can have just two variables in a multicolinearity relationship. You can have many variables in a colinearity relationship. Multicolinearity means that you have 2 or more variables that are "nearly" in a colinear relationship.

                        Re #25: I think you can find all of these in a basic regression textbook. I don't have a specific one I recommend: I haven't taught that course in many years now. As for VIF, while I am not a huge fan of it, it can be useful for identifying which variables in a model are involved in multicolinearity when it isn't otherwise obvious. The Stata command -estat vif- does not care about the distinctions among dichotomous, polytomous, and continuous variables. The calculations are the same in all those cases. The main thing you need to know about -estat vif- is that it can only be invoked after -regress-, not after other regression commands. There is no good reason for this, so I don't know why Stata implemented it that way. In any case, if you need to use it after some other kind of regression command, you can just rerun the same model using -regress- instead of the other regression command and then run -estat vif-. This works because VIF is based exclusively on the correlations among the predictor variables in the model and has nothing to do with any other aspects of the model or the data.
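
A minimal sketch of the last point, in Python rather than Stata and with made-up data. It covers only the special case of a model with exactly two predictors, where the VIF of each predictor is 1 / (1 - r^2) and r is their Pearson correlation. With more predictors, r^2 is replaced by the R-squared from regressing each predictor on all the others. No outcome variable appears anywhere in the calculation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson(x1, x2)
    return 1 / (1 - r ** 2)

# Made-up binary predictors, for illustration only
x1 = [1, 1, 1, 0, 0, 0, 1, 0]
x2 = [1, 1, 0, 0, 0, 1, 1, 0]

vif = vif_two_predictors(x1, x2)
print(vif)
```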


                        • #27
                          Clyde Schechter

                          Thanks for the help, I really appreciate it.


                          • #28
                            Last edited by sandeep kaur; 05 Aug 2022, 23:07.


                            • #29
                              Last edited by sandeep kaur; 05 Aug 2022, 23:06.


                              • #30
                                Clyde Schechter

                                Dear prof.

                                There are many approaches to purposeful selection of covariates when building a logistic regression model, which adds to the confusion.

                                Which one should I choose?
                                A) Univariable analysis, followed by a stepwise approach, then interaction terms and confounding.
                                B) Choose one of univariable analysis and stepwise selection, then look for effect modification and confounding.
                                C) Avoid univariable analysis. Select covariates in a stepwise manner, then look for effect modification and confounding.
                                D) As I understand it, an interaction term is the same as effect modification. But in some places it is recommended to check for effect modification, then confounding, and then meaningful interactions.