
  • what to do if confounder is collinear with exposure?

    Hi,

    I am running a logistic regression with fibrosis as the outcome and sex as the exposure, and I want to check for confounding by gender.
    On running
    logistic fibrosis sex gender

    Stata reports that sex and gender are collinear, and it drops the variable gender.

    1) Is there a way to fix this? I need to keep the variable gender for my work.

    Thanks for your help.

  • #2
    So, I assume you are using sex as a biological, dichotomous variable, and gender as a self-identified polytomous variable. The reality is that the vast majority of biological males and females will self-identify as cis-men and cis-women. There is no way to analyze around this. This is just an extreme example of the phenomenon often referred to as multicollinearity. The only hope of distinguishing these is either to get a very large population sample, one that will include large numbers of people who do not self-identify that way, or to target recruitment in your study so as to heavily over-sample those whose self-identified gender does not match their biological sex (and then you will need to weight your analysis to compensate for the over-sampling).

    This is generically true when you have two strongly correlated explanatory variables in a regression model. The only ways around it are non-simple-random sampling that includes an ample number of exceptional cases, or sticking with random sampling but greatly increasing the sample size. See Arthur Goldberger's A Course in Econometrics, which devotes an entire chapter to this and calls it by a better name: micronumerosity. Or, for the tl;dr, try Bryan Caplan's post at https://www.econlib.org/archives/200...ollineari.html.
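
    A minimal sketch of the weighted analysis described above, assuming an oversampled design in which a hypothetical variable pw holds the sampling weights:

    * pw holds inverse-probability-of-selection weights (hypothetical) that
    * compensate for over-recruiting subjects whose gender does not match
    * their biological sex
    logistic fibrosis i.sex i.gender [pweight=pw]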



    • #3
      As Clyde said, the only solution to the "problem" of multicollinearity is to collect more data that are not collinear.

      But you might also want to ask yourself whether you are posing a meaningless question when you seek the separate effects of "two variables" which are in fact the same variable.



      • #4
        @Clyde & Joro: Thanks for providing an answer. That's true: all females (sex) in the dataset identified as women (gender), and all males (sex) identified as men (gender).
        Now I understand from your explanation why they are so highly correlated: every value of the sex variable corresponds to the same value of the gender variable.
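
        A quick way to see this overlap directly in the data (using the variable names from the original post):

        * cross-tabulate the two variables: with perfect overlap, every
        * observation falls on the diagonal of the table
        tabulate sex gender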

        The problem is that recruitment is already completed and the sample size is small, but this is hypothesis-generating work. As I understand it, as Joro mentioned, instead of looking at the separate effects of the two variables, I should just look at one.

        I wanted to look at interaction terms between sex and self-identified gender, but I do have gender-role-related variables, which could be used in interaction terms instead.



        • #5
          Clyde Schechter: Thanks for your help. As a beginner in data analysis, I am confused about a few topics.

          1) Power/sample size calculation: I have mostly seen calculations for two independent samples, with the means and effect size provided. But there are many other scenarios that can create confusion, for example, one sample with one binary outcome. There we have neither a mean nor an effect size.

          2) Confirming the distribution of the sample: For research studies, should one check the distribution or rely on the central limit theorem? A normal distribution is usually assumed, but it is often not clarified which variable's mean and s.d. this refers to. Should one check both the outcome and the main predictor variable?



          • #6
            1) Stata's -power- command cannot do power calculations for logistic regression, as far as I know. You will have to find some other software for that. I believe GPower is downloadable for free, and it handles that.
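
            For the simpler one-sample binary-outcome scenario raised in #5, though, the built-in -power oneproportion- does apply. A minimal sketch, with purely illustrative proportions:

            * test of a single proportion (the 0.30 null and 0.45 alternative
            * are illustrative values, not from this thread)
            power oneproportion 0.30 0.45, power(0.8)   // required sample size
            power oneproportion 0.30 0.45, n(80)        // power at a fixed n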

            2) The distributions of right-hand-side variables in regressions are irrelevant, and you should not concern yourself with them. Left-hand-side variables sometimes matter, but this is often over-emphasized. For example, I have seen many people waste a lot of time struggling to find or create normality in the distribution of the dependent variable of a linear regression. But this only matters if the sample size is small: the central limit theorem handles it in most real-world situations. For a logistic regression, the outcome variable is necessarily a Bernoulli variable--that's the only thing -logistic- will accept. So there is nothing to check there. And, as I said before, the distributions of the explanatory variables don't matter.
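
            One quick check that is still worth doing is confirming that the outcome is coded as you intend, since -logistic- treats 0 as failure and any other nonmissing value as success:

            * inspect the outcome coding before fitting the model
            tabulate fibrosis, missing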

            Added: Let me be clear--the sample size you would need in order to analyze separate effects of biological sex and psychosocial gender has nothing to do with a power calculation. It has to do with finding a sample large enough that there are ample numbers of people in your study who do not endorse the predominant pattern of male sex/cis male gender and female sex/cis female gender. That size is determined by the prevalence of such people in the population you draw on (if you are doing a random population sample), or by your ability to target such people in your recruitment and thereby oversample them in sufficient numbers.
            Last edited by Clyde Schechter; 18 Jul 2022, 21:56.



            • #7
              Clyde Schechter: Thank you so much for the clarification; it helps me understand the concepts in detail. As I understand it:

              1. Sample size requirement:
              Ideally this should be calculated before the project, with a desired power of 0.8 or whatever is conventional in your field. But in some situations it is not feasible to collect/recruit a sample large enough to achieve the desired power.

              2. If the sample size is not large enough:
              It is crucial to calculate the power, as the study can be underpowered relative to the desired level. Here power is calculated with the sample size actually recruited/collected during the research project, and the formula/method for calculating power should be chosen on the basis of the statistical tests used for data analysis.



              • #8
                1. The number .8 is a convention in many areas. But just like the .05 significance level, it is really just an arbitrary number. If you think about it, a power of 0.8 corresponds to a type II error rate of 0.20. Why is a type II error rate of 0.20 acceptable, while the type I error rate is expected to be more stringently controlled, to .05 or under? The unstated implication is that a Type I error is four times as bad as a Type II error. But where does that come from? So, yes, to get papers published you will generally be expected to use the .05 significance level and choose a sample that will support 80% power (for some reasonable choice of minimum detectable effect size). So do what you have to do. But don't deceive yourself into believing that these numbers have any real meaning or justification. If these things were done thoughtfully, each study would have its own designated significance level and power based on a conscious appraisal of the relative importance in the real world of Type I and Type II errors.

                2. This is correct. Everyone recognizes that a study that does not produce statistically significant results must be interpreted in light of its power: perhaps it is just that the sample was too small or the data too noisy. But what is less commonly recognized is that you really need to know about power to interpret a statistically significant result. From a Bayesian perspective, if a study has low power, then the probability that a statistically significant result will actually be a Type I error is high. So one should be much less impressed by a statistically significant result from a low-powered study than from a high-powered one. Be that as it may, it is the case that in this situation, you calculate the power using the sample size actually achieved, and the formula for the power will depend on the particular statistical test used. The results will also depend on the effect size you need to detect for a "positive" result to be meaningful in the real world. What you should not do in calculating post-hoc statistical power is use the actually observed effect size in the formula for statistical power: you should use the smallest effect size that would be meaningful in the real world.
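
                In Stata, such a post-hoc power calculation might look like the sketch below, with purely illustrative numbers: the achieved n is fixed, and the 0.35 stands in for the smallest proportion that would be meaningful in the real world, not the observed one.

                * power at the achieved sample size, evaluated at the smallest
                * real-world-meaningful effect (all values illustrative)
                power twoproportions 0.20 0.35, n(120)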



                • #9
                  Clyde Schechter: Thanks for introducing me to the concept of post hoc power.
                  I found this paper on post hoc power: https://stat.uiowa.edu/sites/stat.ui...hrep/tr378.pdf

                  1.
                  From a Bayesian perspective, if a study has low power, then the probability that a statistically significant result will actually be a Type I error is high.
                  ----Does it also mean a high Type II error rate due to low power?

                  2. If there are two different subgroups (for example, two types of diseases): Is it recommended to calculate post hoc power separately for each group or for the total sample size?

                  3. Effect size: Is there any article/reading on effect size? I understand it from the point of view of a mean difference between two samples/groups.
                  - Correlation coefficients and regression coefficients all depict effect sizes.
                  - What is the best way to decide on the smallest effect size meaningful in the real world? Values can vary across the literature, and sometimes there are not enough studies out there.
                  Last edited by sandeep kaur; 21 Jul 2022, 13:38.



                  • #10
                    1. Yes. Indeed, the definition of power is 1 minus the probability of a type II error. What is less frequently realized is that, although power does not affect the probability of getting a type I error overall, it is the case that in low-power studies, the probability that a statistically significant finding will be a type I error is also increased.

                    2. I'm not sure what you mean by the question. If the goal of the study is to compare the two groups to see the extent to which they differ, then that is a single hypothesis test and you do a single power calculation for the entire study. If you have two different groups each being compared to some other thing(s), then you have two hypothesis tests and each gets a power analysis based on its own sample.

                    3. The smallest meaningful effect size is a judgment call. It is sometimes worth reviewing the literature to see what others have done in that regard, but it is a rare article where they explain or justify how they arrived at it. And often they have stood things on their heads, starting with a desired level of power and an achievable sample size given whatever constraints they work under, and calculated the minimum detectable effect size from that (just as if they were doing a post hoc power analysis). But that's not legitimate. To choose the smallest meaningful effect size requires understanding the consequences of coming out with a non-significant difference when a difference of that size really exists. This judgment call draws on life experience and familiarity with the subject matter. For example, if you were designing a clinical trial of a drug to increase the distance that people can walk before getting claudication, an improvement of one meter would clearly be trivial and of no importance. A distance of 10 meters might meaningfully improve a person's mobility within his/her own home. The minimum meaningful effect size is then somewhere between those values. My own instincts about it would put it at somewhere around 5 meters. Your mileage may vary.
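
                    To make that concrete: if 5 meters is the smallest meaningful improvement, the corresponding sample-size calculation might look like this (a sketch; the 15-meter standard deviation is an assumed, illustrative value):

                    * n per group to detect a 5-meter improvement with 80% power
                    * (the sd of 15 meters is assumed purely for illustration)
                    power twomeans 0 5, sd(15) power(0.8)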



                    • #11
                      Clyde Schechter
                      Thanks for taking the time to explain it in detail.

                      No. 2: There are two groups, one with viral hepatitis and the other with alcoholic liver disease. Each group will be analysed, and comparisons within each group will be made by sex. It makes sense to do a power analysis for each sample.



                      • #12
                        Yes, that's right. You have two different hypothesis tests, one for each disease, so two separate power analyses.
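
                        For example (a sketch, with illustrative proportions and sample sizes):

                        * one power calculation per disease group
                        power twoproportions 0.30 0.50, n(60)   // viral hepatitis, by sex
                        power twoproportions 0.25 0.45, n(55)   // alcoholic liver disease, by sex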



                        • #13
                          Perfect. Thank you so much.

                          Regards
                          Sandeep



                          • #14
                            Clyde Schechter: I have some queries about univariable vs multivariable analysis.

                            1. I recognize that the terms univariable and multivariable are used interchangeably with univariate and multivariate, respectively, in the literature. As I understand it:

                            If it depends on the number of predictors, the terms univariable and multivariable are used:

                            Univariable: one predictor and one outcome

                            Multivariable: 2 or more predictors and one outcome


                            If it depends on the number of outcomes, the terms univariate and multivariate are used.


                            2. Univariable analysis is analysis with one predictor and one outcome.
                            - Do the results obtained from the regression (OR, HR, etc.) depict crude values?
                            - If two variables (predictor & outcome) are strongly correlated, is it still necessary to do a univariable analysis for that predictor and outcome? Or should one do the univariable analysis regardless, to find the significant relationships, as this gives direction about which variables to include in the multivariable model?

                            3. If two variables (both predictors) are strongly correlated, there is a high chance of collinearity, and one of them would be dropped from the model.



                            • #15
                              1. This is correct, although you will find that the terms are not always used carefully.

                              2. In a univariable analysis, regression results represent crude associations, as you say. Univariable regression analysis provides more information than just a correlation coefficient. And it is not the direction that constitutes the additional information: correlations, too, can be positive or negative (and they will always have the same sign as the regression coefficient). Rather, the regression analysis specifies the magnitude of the relationship in terms that reflect the scaling (units of measurement) of the variables and enables one to estimate predicted values of the outcome variable. From a correlation alone, you cannot calculate an expected value of one variable from knowing the other.
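
                              A minimal sketch, using the variable names from the original post:

                              * a univariable fit gives a crude OR and predicted
                              * probabilities, which a correlation alone cannot
                              logistic fibrosis i.sex
                              predict phat, pr    // predicted probability of fibrosis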

                              3. If two predictors are perfectly correlated (correlation coefficient = 1 or -1), then the variables are collinear and one of them will be automatically omitted from the model. If there is a high correlation, but not a perfect one, neither variable will be automatically dropped from the model. (Well, there may be small rounding errors in these calculations that lead Stata to think that a correlation that is very, very close to 1 is actually 1 even though it isn't exactly.) Both variables will be retained. If that correlation is sufficiently high, this may lead to problems such as very large standard errors for the variables, and extreme sensitivity of those variables' coefficients to tiny changes in the data. If these problems interfere with achieving the research goals (usually they do not), then re-doing the model with one of the variables left out might be a good solution, although it is often better to calculate a new variable that is some combination of the original two and replace both of the original variables with that one. However, the best solution is to gather more data so that the correlation, though it persists, no longer causes problems. This usually requires a substantially larger data set than the original, and might require using a different sampling technique altogether so as to break the correlation.
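
                              A quick way to investigate this in practice (a sketch; x1, x2, and y are hypothetical variable names):

                              * inspect the correlation between the two predictors
                              correlate x1 x2
                              * variance inflation factors are available after -regress-;
                              * a linear fit can serve as a rough collinearity diagnostic
                              * even when the final model is logistic
                              regress y x1 x2
                              estat vif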

