
  • Help With Logistic Regressions

    Hi,

    So I have this issue with binary logistic regressions, where if I add multiple independent variables the p-values change. For example, say A is my dependent variable and B, C, and D are my independent variables. If I run just A vs B, I get one p-value for B, but if I run A vs B and C, I get a different p-value for B. If I run A vs B, C, and D, both B and C now have different p-values from the previous test.

    Shouldn't the independent variables be considered independent of each other, and not affect each other's p-values?

    Thanks

  • #2
    Shouldn't the independent variables be considered independent of each other, and not affect each other's p-values?
    No, not at all! Unless the predictor variables (a less confusing term than "independent" variables) are statistically independent of each other, which rarely happens in practice, they will affect each other's contributions to any multivariable analysis.

    When you run -logistic A B- you are asking for the crude (overall, unconditional) association between B and A. When you run -logistic A B C- you are looking for the association between B and A adjusted for the effects of C. And in that same model, what you get for C is the association between A and C adjusted for the effects of B.
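
    To see this concretely, here is a minimal simulated sketch (none of it from your data; all names and numbers are invented for illustration) in which B and C are correlated and both affect A. Running it shows the estimate and p-value for B shifting between the two models:

    Code:
    * Simulated illustration; all variables and coefficients are hypothetical:
    clear
    set obs 5000
    set seed 2016
    generate C = rbinomial(1, 0.5)
    generate B = rbinomial(1, 0.3 + 0.4*C)           // B is correlated with C
    generate A = rbinomial(1, invlogit(-2 + B + C))  // both B and C affect A
    logistic A B      // crude association of B with A
    logistic A B C    // association of B with A, adjusted for C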



    • #3
      Thanks for the quick response. So how exactly does it adjust for the effects of C? Maybe this will be hard to explain without a specific example, but what does a p-value from -logistic A B- say compared to the p-value from -logistic A B C-?

      For example, if I want to compare patients' diagnoses with mortality, what would the difference be between -logistic Mortality RespiratoryDx- and -logistic Mortality RespiratoryDx NeurologicDx CardiacDx-, and which would give me the p-value I'm really looking for?
      Last edited by Ryan Aubrey; 19 Feb 2016, 12:41.



      • #4
        A satisfactory response to your question would, I think, be too lengthy for this forum. Let me suggest you read the Simpson's Paradox page on Wikipedia. While the examples there do not use logistic regression (because they are simple examples that can be readily analyzed with cross-tabs), they show how the unadjusted and adjusted associations among variables can be radically different, even going in opposite directions. This is just the starkest example of how multivariable adjustments work. After that, I would suggest you read a standard textbook on logistic regression for more details on how this plays out with logistic regression models.

        By the way, do not obsess about p-values. And, in particular, do not focus on significant p-value vs. non-significant p-value. The important result in these analyses is the measure of association, typically an odds ratio. Its magnitude is what you should focus on. The associated p-value is supplementary information that tells you something about whether your sample size and data precision are adequate to draw sharp conclusions from that odds ratio. (And the confidence interval, some would say, is a better source of information still.)
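
        If you would like to see the paradox inside Stata itself, here is a toy cross-tab version (all cell counts are invented for demonstration). Within each severity stratum the treated patients fare better, yet in the aggregate they appear to fare worse, because treatment is concentrated among the severe cases:

        Code:
        * Toy Simpson's paradox; the cell counts are invented:
        clear
        input byte treated byte severe byte died int n
        0 0 0 270
        0 0 1  30
        1 0 0  19
        1 0 1   1
        0 1 0   4
        0 1 1  16
        1 1 0 120
        1 1 1 180
        end
        expand n
        tabulate treated died, row                  // crude: treated look worse
        by severe, sort: tabulate treated died, row // stratified: treated look better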



        • #5
          For example, if I want to compare patients' diagnoses with mortality, what would the difference be between -logistic Mortality RespiratoryDx- and -logistic Mortality RespiratoryDx NeurologicDx CardiacDx-, and which would give me the p-value I'm really looking for?
          So this was edited into #3 after I posted my reply in #4. While I stand by what I said there, I can give you a quick explanation of the difference in this concrete example.

          I will assume that Mortality, RespiratoryDx, NeurologicDx, and CardiacDx are all dichotomous variables for simplicity. If they are multi-level categories, similar considerations apply but are more complicated to spell out in detail.

          So if you do -logistic Mortality RespiratoryDx-, you will get an estimate of the ratio of the odds of dying among those with a respiratory diagnosis to the odds of dying among those without one. Plain and simple.

          But, patients with respiratory diagnoses are different from those without in many ways besides having a respiratory diagnosis. In particular, those with a respiratory diagnosis are more likely than others to be smokers. And the list of other, potentially fatal, diseases, especially cardiac, but also including neurological, that are more common among smokers than nonsmokers is even longer than the lengthy explanation I declined to provide in #4 ;-). There is also the question of age: those with respiratory diseases are likely to be older than those without, and age is also a risk factor for most cardiac and neurologic diseases. One could go on and on: there is a lengthy list of reasons why those who have respiratory disease are also more likely to have neurologic and cardiac diagnoses than those without respiratory disease.

          So one might wonder about the meaningfulness of the results of -logistic Mortality RespiratoryDx-. How much of the resulting association (odds ratio) comes from respiratory disease per se, and how much is due to these collateral attributes of people with respiratory disease? That is where multivariable analysis comes in. When you run -logistic Mortality RespiratoryDx NeurologicDx CardiacDx-, the result you get for RespiratoryDx now gives you an estimate of what the ratio of the odds of dying if you have respiratory disease to the odds of dying if you do not have respiratory disease would be if people with respiratory disease were just as likely as those without it to have cardiac and neurologic diseases. This is sometimes referred to as the "independent effect" of respiratory disease, but it is most properly called the effect of respiratory disease adjusted for cardiac and neurologic disease.
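
          Here is a small simulated sketch of that contrast (only the variable names come from your question; the prevalences and coefficients are invented). Because the cardiac and neurologic diagnoses are made more common among those with RespiratoryDx, the crude odds ratio for RespiratoryDx comes out larger than the adjusted one:

          Code:
          * Simulated sketch; prevalences and coefficients are invented:
          clear
          set obs 10000
          set seed 54321
          generate RespiratoryDx = rbinomial(1, 0.3)
          generate CardiacDx     = rbinomial(1, 0.2 + 0.3*RespiratoryDx)
          generate NeurologicDx  = rbinomial(1, 0.1 + 0.2*RespiratoryDx)
          generate Mortality     = rbinomial(1, invlogit(-3 + 0.7*RespiratoryDx ///
                                   + 1.0*CardiacDx + 0.8*NeurologicDx))
          logistic Mortality RespiratoryDx                          // crude OR
          logistic Mortality RespiratoryDx NeurologicDx CardiacDx   // adjusted OR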

          As for which is correct, they are both correct answers, but they answer different questions. If, for example, you are running an intensive care unit and need mortality estimates so that you can plan your future purchases of equipment used specifically to treat respiratory diseases, then the unadjusted estimate is probably most useful to you. If, on the other hand, you are interested in gaining a scientific understanding of the impact of respiratory disease on mortality, then the adjusted estimate is better (and, in fact, you would probably want to adjust for many other things besides just cardiac and neurologic conditions). Anyway, which of these analyses (if either) is appropriate for your question depends on your question.

          And again, please put p-values out of your mind until you are otherwise done. They may well be the least important statistics you will generate in the course of your work. Don't even look at them until you have settled on and run the best feasible model for your question and your data. Then the p-values will provide some ancillary information about the adequacy of your data set (size and precision) for answering the question you have used it for.



          • #6
            Thank you. That's pretty much exactly the answer I've been trying to figure out the last few days.



            • #7
              Quoted from #2:
              Unless the predictor variables (a less confusing term than "independent" variables) are statistically independent of each other ...
              Perhaps one should add that the "unless ..." does not hold for nonlinear probability models such as logistic regression models: even if two predictor variables are independent of each other, adding one of them to a model that already includes the other will change the effect of the other, provided both predictors affect the outcome variable. This is nicely discussed in an article by Mood (2010), and a (better) solution (the so-called KHB method) has been developed by Karlson, Holm, and Breen; see Kohler, Karlson, and Holm (2011). Not being aware of this might mislead you into wrongly interpreting a change in the coefficient of a predictor variable from model 1 to model 2 as a suppression effect (or, if the predictor variables are correlated, into wrongly concluding that there is no mediation). The references are:

              Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.

              Kohler, U., Karlson, K. B., and Holm, A. (2011). Comparing coefficients of nested nonlinear probability models. The Stata Journal, 11(3), 420-438.
              Below is a hopefully not too lengthy example that demonstrates this using artificial data, comparing OLS regression models with logistic regression models. Additionally, it shows how to use the KHB method. To understand what is going on, I recommend studying the results of the example in combination with the articles mentioned above:

               Code:
               /* Example investigating a non-existing indirect effect (or confounding)
                  using the KHB method: */
               
               * Install khb.ado if necessary:
               cap which khb
               if _rc ssc install khb
               
               * ==========================================================
               * Generate data for the example:
               
               /* model: y = b0 + b1*x1 + b2*x2
               
                  with        y = dichotomous
                       r(x1,x2) = 0
               */
               
               clear
               set obs 4000
               * Two balanced dummies, orthogonal by construction:
               generate x2 = ceil(_n/2000) - 1
               bysort x2: generate x1 = ceil(_n/1000) - 1
               tab x1 x2
               
               set seed 12345
               generate y = runiform() < invlogit(-4 + 4*x1 + 2*x2)
               
               corr
               
               * ----------------------------------------------------------
               * a) OLS regression: the coefficient of x1 is unchanged by
               *    adding the orthogonal x2:
               
               regr y x1
               estimates store total
               regr y x2
               estimates store modx2
               regr y x1 x2
               estimates store direct
               estimates tab total modx2 direct
               
               khb regr y x1 || x2, s v d
               
               * ----------------------------------------------------------
               * b) Logistic regression: the coefficient of x1 changes when
               *    x2 is added, even though r(x1,x2) = 0:
               
               logit y x1, or
               estimates store total
               logit y x2, or
               estimates store modx2
               logit y x1 x2, or
               estimates store direct
               estimates tab total modx2 direct, eform
               
               khb logit y x1 || x2, s v d or
               
               * ==========================================================




              • #8
                Dirk Enzmann raises a good point, and I was overly influenced by the way things play out in linear regression when I wrote that.



                • #9
                  As I recognized somewhat late, there is an interesting working paper (and a presentation) by Maarten Buis addressing this issue.

