
  • Help With Logistic Regressions

    Hi,

    So I have this issue with binary logistic regressions, where if I add multiple independent variables the p-values change. For example, say A is my dependent variable and B, C, and D are my independent variables. If I run just A vs B, I get one p-value for B, but if I run A vs B and C, I get a different p-value for B. If I run A vs B, C, and D, both B and C now have different p-values from the previous test.

    Shouldn't the independent variables be considered independent of each other, and not affect each other's p-values?

    Thanks

  • #2
    Shouldn't the independent variables be considered independent of each other, and not affect each other's p-values?
    No, not at all! Unless the predictor variables (a less confusing term than "independent" variables) are statistically independent of each other, which rarely happens in practice, they will affect each other's contributions to any multivariable analysis.

    When you run -logistic A B- you are asking for the crude (overall, unconditional) association between B and A. When you run -logistic A B C- you are looking for the association between B and A adjusted for the effects of C. And in that same model, what you get for C is the association between A and C adjusted for the effects of B.
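
    To see this concretely, here is a minimal simulated sketch (none of it from your data; all names and numbers are invented for illustration) in which B and C are correlated and both affect A. Running it shows the estimate and p-value for B shifting between the two models:

    Code:
    * Simulated illustration; all variables and coefficients are hypothetical:
    clear
    set obs 5000
    set seed 2016
    generate C = rbinomial(1, 0.5)
    generate B = rbinomial(1, 0.3 + 0.4*C)           // B is correlated with C
    generate A = rbinomial(1, invlogit(-2 + B + C))  // both B and C affect A
    logistic A B      // crude association of B with A
    logistic A B C    // association of B with A, adjusted for C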



    • #3
      Thanks for the quick response. So how exactly does it adjust for the effects of C? Maybe this will be hard to explain without a specific example, but what does a p-value from -logistic A B- say compared to the p-value from -logistic A B C-?

      For example, if I want to compare patients' diagnoses with mortality, what would the difference be between -logistic Mortality RespiratoryDx- and -logistic Mortality RespiratoryDx NeurologicDx CardiacDx-, and which would give me the p-value I'm really looking for?
      Last edited by Ryan Aubrey; 19 Feb 2016, 12:41.



      • #4
        A satisfactory response to your question would, I think, be too lengthy for this forum. Let me suggest you read the Simpson's Paradox page on Wikipedia. While the examples there do not use logistic regression (because they are simple examples that can be readily analyzed with cross-tabs), they show how the unadjusted and adjusted associations among variables can be radically different, even going in opposite directions. This is just the starkest example of how multivariable adjustments work. After that, I would suggest you read a standard textbook on logistic regression for more details on how this plays out with logistic regression models.

        By the way, do not obsess about p-values. And, in particular, do not focus on significant p-value vs. non-significant p-value. The important result in these analyses is the measure of association, typically an odds ratio. Its magnitude is what you should focus on. The associated p-value is supplementary information that tells you something about whether your sample size and data precision are adequate to draw sharp conclusions from that odds ratio. (And the confidence interval, some would say, is a better source of information still.)
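
        If you would like to see the paradox inside Stata itself, here is a toy cross-tab version (all cell counts are invented for demonstration). Within each severity stratum the treated patients fare better, yet in the aggregate they appear to fare worse, because treatment is concentrated among the severe cases:

        Code:
        * Toy Simpson's paradox; the cell counts are invented:
        clear
        input byte treated byte severe byte died int n
        0 0 0 270
        0 0 1  30
        1 0 0  19
        1 0 1   1
        0 1 0   4
        0 1 1  16
        1 1 0 120
        1 1 1 180
        end
        expand n
        tabulate treated died, row                  // crude: treated look worse
        by severe, sort: tabulate treated died, row // stratified: treated look better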



        • #5
          For example, if I want to compare patients' diagnoses with mortality, what would the difference be between -logistic Mortality RespiratoryDx- and -logistic Mortality RespiratoryDx NeurologicDx CardiacDx-, and which would give me the p-value I'm really looking for?
          So this was edited into #3 after I posted my reply in #4. While I stand by what I said there, I can give you a quick explanation of the difference in this concrete example.

          I will assume that Mortality, RespiratoryDx, NeurologicDx, and CardiacDx are all dichotomous variables for simplicity. If they are multi-level categories, similar considerations apply but are more complicated to spell out in detail.

          So if you do -logistic Mortality RespiratoryDx-, you will get an estimate of the ratio of the odds of dying among those with a respiratory diagnosis to the odds of dying among those without one. Plain and simple.

          But, patients with respiratory diagnoses are different from those without in many ways besides having a respiratory diagnosis. In particular, those with a respiratory diagnosis are more likely than others to be smokers. And the list of other, potentially fatal, diseases, especially cardiac, but also including neurological, that are more common among smokers than nonsmokers is even longer than the lengthy explanation I declined to provide in #4 ;-). There is also the question of age: those with respiratory diseases are likely to be older than those without, and age is also a risk factor for most cardiac and neurologic diseases. One could go on and on: there is a lengthy list of reasons why those who have respiratory disease are also more likely to have neurologic and cardiac diagnoses than those without respiratory disease.

          So one might wonder about the meaningfulness of the results of -logistic Mortality RespiratoryDx-. How much of the resulting association (odds ratio) comes from respiratory disease per se, and how much is due to these collateral attributes of people with respiratory disease? That is where multivariable analysis comes in. When you run -logistic Mortality RespiratoryDx NeurologicDx CardiacDx-, the result you get for RespiratoryDx now gives you an estimate of what the ratio of the odds of dying if you have respiratory disease to the odds of dying if you do not have respiratory disease would be if people with respiratory disease were just as likely as those without it to have cardiac and neurologic diseases. This is sometimes referred to as the "independent effect" of respiratory disease, but it is most properly called the effect of respiratory disease adjusted for cardiac and neurologic disease.
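
          Here is a small simulated sketch of that contrast (only the variable names come from your question; the prevalences and coefficients are invented). Because the cardiac and neurologic diagnoses are made more common among those with RespiratoryDx, the crude odds ratio for RespiratoryDx comes out larger than the adjusted one:

          Code:
          * Simulated sketch; prevalences and coefficients are invented:
          clear
          set obs 10000
          set seed 54321
          generate RespiratoryDx = rbinomial(1, 0.3)
          generate CardiacDx     = rbinomial(1, 0.2 + 0.3*RespiratoryDx)
          generate NeurologicDx  = rbinomial(1, 0.1 + 0.2*RespiratoryDx)
          generate Mortality     = rbinomial(1, invlogit(-3 + 0.7*RespiratoryDx ///
                                   + 1.0*CardiacDx + 0.8*NeurologicDx))
          logistic Mortality RespiratoryDx                          // crude OR
          logistic Mortality RespiratoryDx NeurologicDx CardiacDx   // adjusted OR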

          As for which is correct, they are both correct answers, but they answer different questions. If, for example, you are running an intensive care unit and need mortality estimates so that you can plan your future purchases of equipment used specifically to treat respiratory diseases, then the unadjusted estimate is probably most useful to you. If, on the other hand, you are interested in gaining a scientific understanding of the impact of respiratory disease on mortality, then the adjusted estimate is better (and, in fact, you would probably want to adjust for many other things besides just cardiac and neurologic conditions). Anyway, which of these analyses (if either) is appropriate for your question depends on your question.

          And again, please put p-values out of your mind until you are otherwise done. They may well be the least important statistics you will generate in the course of your work. Don't even look at them until you have settled on and run the best feasible model for your question and your data. Then the p-values will provide some ancillary information about the adequacy of your data set (size and precision) for answering the question you have used it for.



          • #6
            Thank you. That's pretty much exactly the answer I've been trying to figure out the last few days.



            • #7
              Quoted from #2:
              Unless the predictor variables (a less confusing term than "independent" variables) are statistically independent of each other ...
              Perhaps one should add that the "unless ..." does not hold for nonlinear probability models such as logistic regression models: even if two predictor variables are independent of each other, adding one of them to a model that already includes the other will change the effect of the other, provided both predictors affect the outcome variable. This is nicely discussed in an article by Mood (2010), and a (better) solution (the so-called KHB method) has been developed by Karlson, Holm, and Breen; see Kohler, Karlson, and Holm (2011). Not being aware of this might mislead you into wrongly interpreting a change in the coefficient of a predictor variable from model 1 to model 2 as a suppression effect (or, if the predictor variables are correlated, into wrongly concluding that there is no mediation). The references are:

              Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67-82.

              Kohler, U., Karlson, K. B., and Holm, A. (2011). Comparing coefficients of nested nonlinear probability models. The Stata Journal, 11(3), 420-438.
              Below is a hopefully not too lengthy example that demonstrates this using artificial data, comparing OLS regression models with logistic regression models. Additionally, it shows how to use the KHB method. To understand what is going on, I recommend studying the results of the example in combination with the articles mentioned above:

               Code:
               /* Example investigating a non-existing indirect effect (or confounding)
                  using the KHB method: */
               
               * Install khb.ado if necessary:
               cap which khb
               if _rc ssc install khb
               
               * ==========================================================
               * Generate data for the example:
               
               /* model: y = b0 + b1*x1 + b2*x2
               
                  with        y = dichotomous
                       r(x1,x2) = 0
               */
               
               clear
               set obs 4000
               * Two balanced dummies, orthogonal by construction:
               generate x2 = ceil(_n/2000) - 1
               bysort x2: generate x1 = ceil(_n/1000) - 1
               tab x1 x2
               
               set seed 12345
               generate y = runiform() < invlogit(-4 + 4*x1 + 2*x2)
               
               corr
               
               * ----------------------------------------------------------
               * a) OLS regression: the coefficient of x1 is unchanged by
               *    adding the orthogonal x2:
               
               regr y x1
               estimates store total
               regr y x2
               estimates store modx2
               regr y x1 x2
               estimates store direct
               estimates tab total modx2 direct
               
               khb regr y x1 || x2, s v d
               
               * ----------------------------------------------------------
               * b) Logistic regression: the coefficient of x1 changes when
               *    x2 is added, even though r(x1,x2) = 0:
               
               logit y x1, or
               estimates store total
               logit y x2, or
               estimates store modx2
               logit y x1 x2, or
               estimates store direct
               estimates tab total modx2 direct, eform
               
               khb logit y x1 || x2, s v d or
               
               * ==========================================================




              • #8
                Dirk Enzmann raises a good point, and I was overly influenced by the way things play out in linear regression when I wrote that.



                • #9
                  As I recognized somewhat late, there is an interesting working paper (and a presentation) by Maarten Buis addressing this issue.

