
  • Very different p-values using testnl vs test after margins

    I'm working with survey data on medication use among people with a disease. The question is whether people who know they have the disease are less likely to use medications that are contraindicated. We have a binary variable for whether people told the interviewer they had the disease (disease_aware), a variable for their disease severity (stage: mild, moderate, severe), and a binary variable for whether they take a specific contraindicated medication (med_use).

    We use logistic regression to calculate predictive margins for taking the drug, then use -test- to determine whether, within each disease level, people who know they have the disease are less likely to take the drug.

    We also want to present risk ratios with confidence intervals to describe how much less likely drug use is with disease awareness. We use -nlcom- to calculate those risk ratios. The problem is that we get very different answers from -test- and -nlcom- (which we then verified with -testnl- to get the p-value). With -test-, the p-value for margin1-margin2=0 is 0.0002; with -testnl-, the p-value for margin1/margin2=1 is 0.1815. See below for code and results.

    I know that the two methods often don't give exactly the same p-value, but I have never seen a case where the results are so wildly different. Am I missing some basic problem with the way I've structured the analysis? Thank you for any ideas you might have on why these results might be so different.



    Code:
     
    svy, subpop(if stage!=0): logistic med_use i.stage##i.disease_aware other_variables
    
    margins disease_aware, subpop(if stage!=0) over(stage) post
    
    test _b[1.stage#0.disease_aware] = _b[1.stage#1.disease_aware]
    
    testnl _b[1.stage#0.disease_aware]/_b[1.stage#1.disease_aware] = 1

    Results for -test- and -testnl- (I tried -testnl- on the test for margin1-margin2=0 to see whether the difference could be explained by the F statistic reported by -test- vs. the chi2 reported by -testnl-, but the p-values were similar using -test- and -testnl-):

    Code:
     
    test _b[1.stage#0.disease_aware] = _b[1.stage#1.disease_aware]
     ( 1)  1.stage#0bn.disease_aware - 1.stage#1.disease_aware = 0
    
           F(  1,    47) =   16.27
                Prob > F =    0.0002
    
    
    
    testnl _b[1.stage#0.disease_aware]/_b[1.stage#1.disease_aware] = 1
    
      (1)  _b[1.stage#0.disease_aware]/_b[1.stage#1.disease_aware] = 1
    
                   chi2(1) =        1.79
               Prob > chi2 =        0.1815
    
    
     testnl (_b[1.stage#0.disease_aware]-_b[1.stage#1.disease_aware] = 0)
    
      (1)  _b[1.stage#0.disease_aware]-_b[1.stage#1.disease_aware] = 0
    
                   chi2(1) =       16.27
               Prob > chi2 =        0.0001

  • #2
    You are comparing apples and oranges here. Actually, it's more like comparing apples and gorillas.

    The calculations you are doing after -margins- are tests of the equality of probabilities. The ones that you do with -test- directly after the logistic regression are tests of regression coefficients or, equivalently, of odds ratios.

    You have to bear in mind that the relationship between odds ratios and probability differences is highly nonlinear and extremely sensitive to the base probability. If you start off with a base probability near 50%, an odds ratio even just slightly different from 1 will correspond to a large difference in probability. But if you start out with a base probability close to 1, even an astronomical odds ratio will correspond to a trivial change in probability.
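
    To make the nonlinearity concrete, here is a quick back-of-the-envelope illustration (the base probabilities and odds ratios are arbitrary example values, not taken from your data):

    Code:
    * a modest odds ratio at a base probability of 50% shifts the probability a lot
    display 1.5*(.5/(1-.5)) / (1 + 1.5*(.5/(1-.5))) - .5     // about .10
    * a huge odds ratio at a base probability of 98% barely moves it
    display 10*(.98/(1-.98)) / (1 + 10*(.98/(1-.98))) - .98  // about .018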

    So, in short, there is no reason to expect any particular correspondence between probability differences and odds ratios. They can be similar, or they can be radically different.

    Comment


    • #3
      Thank you!
      Both tests were performed after -margins-. Neither was after the logistic regression.

      Edited to further clarify the order of the analysis: run the logistic regression, calculate predictive margins and post them to allow testing, test margin1-margin2=0, then test margin1/margin2=1.

      The two null hypotheses are algebraically equivalent. I know that the two tests will generally not come out exactly the same, but I have never seen the results diverge so much.
      Last edited by Molly Jeffery; 03 May 2019, 12:18.

      Comment


      • #4
        OK. Sorry I misunderstood your original question.

        Again, the baseline probabilities make a big difference here. If you are starting from a hefty baseline probability and add a few percentage points to it, the difference is appreciable, and in a reasonably large sample could well be "statistically significant." But the probability ratio will still be close to 1 and might not be statistically significant. Comparing p-values across different tests is a hazardous undertaking in the best of circumstances: p-values are very sensitive to minor changes in all sorts of things.

        (This same issue comes up, usually in the opposite direction, when somebody headlines that exposure Y is associated with a doubling of the risk of disease X, but when you look at the actual risks, it's a difference of 1 in a million vs 2 in a million--so the difference is negligible. You have the same thing here, just in the opposite direction.)

        I don't see anything to worry about here.

        Aside: Are you aware that by using the -over()- option in your -margins- command the probabilities you get for the different stages are not comparably adjusted for the distributions of other variables? Each of the probabilities you get is adjusted only to the distribution of disease awareness and other variables in its own stage--and if those distributions differ, the results are really not comparable (whether by differences or ratios) to each other. To get probabilities that are adjusted to the same distribution of everything else, the command should be:

        Code:
        margins disease_aware#stage, subpop(if stage!=0) post

        Comment


        • #5
          Thank you for taking another look. I have pasted below the actual calculated margins. The ratio is around 4, and the difference is 8 percentage points on a baseline of 11%. So the difference is quite large, whether you measure a difference or a ratio. In cases where I have seen the issue you describe (the risk difference is small, but the risk ratio is large), the interpretation of the two numbers might be quite different, but I have never seen a large difference in statistical significance (0.0002 vs. 0.18).

          Thank you for pointing out the effect of the "over" specification. I was aware of that effect. We are stratifying on disease stage in this study to determine the impact of disease awareness within each disease severity stratum. People in different disease stages have very different values for the covariates and the number of people in each disease stage varies substantially. If we calculated AMEs over the covariate distribution of the entire population, the results would be dominated by the covariate distribution of the least sick people and would not be clinically relevant or accurate for the sicker people. (Note, I am showing only the most severe disease stage. There are two other disease stages.)

          Code:
                                 |     Margin   Std. Err.      t    P>|t|     [95% Conf. Interval]
          -----------------------+----------------------------------------------------------------
          stage#disease_aware    |
              severe#unaware     |    .107299   .0153276     7.00   0.000     .0764638    .1381341
              severe#aware       |    .022342   .0123942     1.80   0.078     -.002592    .0472759

          Comment


          • #6
            So think about sampling variation on these estimates and look at the confidence intervals here as a guide to that. The one for the aware group straddles 0, and really doesn't stray very far from it. The difference between unaware and aware then is going to vary somewhere between about 7 percent and about 14 percent.

            Now look at the ratios. Since 0 actually lies inside the denominator's confidence interval, you will find ratios that are very large, "nearing infinity." You will also find some that are pretty small, and some that are even large negatives (think about .1381341/-.002592 = about -53). So the confidence interval for the ratio is going to range from something like -50 to nearly +infinity, which means your p-value is going to be pretty lackluster.

            I realize that subtracting and dividing upper and lower confidence limits does not give you the confidence limits of the difference and the ratio. But they do give you a general feel for orders of magnitude. Looking at these numbers, I don't find your results surprising. If you want some additional confirmation of this explanation, run -lincom- and -nlcom- instead of -test- and -testnl-, and I think you'll see that the confidence interval for the ratio is extremely wide.
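
            For example, after the same -margins, post- step, something along these lines (just a sketch; the factor-variable levels are assumptions carried over from the code in #1 and should match whatever coding your data actually use):

            Code:
            * difference of the posted margins (unaware minus aware) within one stage
            lincom 1.stage#0.disease_aware - 1.stage#1.disease_aware
            * ratio of the posted margins (unaware / aware) within the same stage
            nlcom _b[1.stage#0.disease_aware] / _b[1.stage#1.disease_aware]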

            Comment


            • #7
              I may be being exceptionally dense, but I still don't see this. The negative CI limit on severe#aware is essentially 0 (and should be bounded by 0, since it's a probability). So using the CI endpoints to roughly estimate, the ratios would range over roughly 7.6/4.7 = 1.6 to 13.8/0 = infinity. That range doesn't include 1. I see that 13.8/-0.26 is actually a large negative number, but that result is impossible. Something can't be so unlikely that it is actually negatively likely.

              The lincom and nlcom results are pasted below for interest.

              Code:
              lincom (difference of margins, unaware minus aware):
              
                           |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                       (1) |    .084957   .0210645     4.03   0.000     .0425808    .1273332
              
              nlcom (ratio of margins, unaware / aware):
              
                           |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                     _nl_1 |   4.802575    2.84596     1.69   0.092    -.7754041    10.38055

              Comment


              • #8
                The kind of confidence intervals that are calculated by -margins- do include impossible values: they are based on the "delta method" and can extend below 0, even though the predicted probabilities themselves do not go below 0. The fact that the coefficient for the aware group is so close to zero means that small sampling errors in the coefficient will translate to large sampling errors in the ratio. (Think about the difference between 0.02 and 0.002; it's quite small. But if you look at 1/.02 and 1/0.002, that's 50 vs 500!) And indeed the results you show confirm that the confidence interval for the ratio goes all the way from -.78 to 10.38, with 1 fitting quite comfortably inside there. It's not as wide a CI as I had sketched out in #6, but the principle was still correct: the sampling error around a denominator near zero makes for large degrees of uncertainty in the ratio.

                Comment


                • #9
                  I've given the matter more thought. The confidence intervals around the aware probability do extend into negative territory. While the delta method confidence intervals do give pretty accurate coverage probability, it can be troublesome that they include impossible values.

                  So I ran a little simulation of aware and unaware probabilities using beta distributions that have approximately the mean values you got as the predictive margins, and standard deviations approximately equal to the standard errors shown in your margins output, and then calculated the difference and ratio. In a simulation of 1,000 replications, the minimum value of the ratio is about 1.39 and the max is about 85. The 2.5th percentile, which corresponds roughly to the lower bound of a 95% CI, is about 1.8:

                  Code:
                  clear *
                  set seed 1234
                  set obs 1000
                  // beta(45,376): mean ~.107, sd ~.015, matching the unaware margin and SE in #5
                  // beta(3,133):  mean ~.022, sd ~.012, matching the aware margin and SE in #5
                  gen p_unaware = rbeta(45, 376)
                  gen p_aware = rbeta(3, 133)
                  
                  histogram p_unaware, bin(100) name(unaware, replace)
                  histogram p_aware, bin(100) name(aware, replace)
                  
                  gen ratio = p_unaware/p_aware
                  histogram ratio, bin(100) name(ratio, replace)
                  gen diff = p_unaware - p_aware
                  histogram diff, bin(100) name(diff, replace)
                  
                  summ ratio diff
                  centile ratio, centile(2.5)
                  So I think this is a situation where the use of the delta method to calculate standard errors is leading us astray: denominator values that are very close to zero, or even negative, throw the ratio off considerably. You should not rely on -test- and -testnl- here, but instead do bootstrap sampling of the ratio. (May as well bootstrap the difference too while you're at it; that way they are both calculated the same way.)
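
                  A minimal sketch of the kind of bootstrap I have in mind, ignoring the survey design for simplicity and reusing the placeholder variable names from #1 (the factor-variable levels, rep count, and seed are assumptions; with the real survey data you would need a resampling scheme that respects the design):

                  Code:
                  capture program drop boot_margins
                  program define boot_margins, rclass
                      // refit the model and the margins in each bootstrap sample
                      logistic med_use i.stage##i.disease_aware other_variables if stage != 0
                      margins disease_aware, over(stage) post
                      // ratio and difference of unaware vs aware within one stage
                      return scalar ratio = _b[1.stage#0.disease_aware] / _b[1.stage#1.disease_aware]
                      return scalar diff  = _b[1.stage#0.disease_aware] - _b[1.stage#1.disease_aware]
                  end
                  
                  bootstrap ratio = r(ratio) diff = r(diff), reps(1000) seed(1234): boot_margins
                  estat bootstrap, percentile bc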

                  That said, if you want to evaluate the effect of awareness for decision-making purposes, the risk difference is a more meaningful statistic than the risk ratio. The ratio gives an exaggerated impression of the effect.

                  Comment


                  • #10
                    Thank you so much for your help. I hadn't considered bootstrapping the confidence intervals, but that makes perfect sense.
                    I will post the result when I have it, just to close the loop.

                    Thank you again!

                    Comment


                    • #11
                      I've been playing around with this. In my first effort, several of the bootstrap replications failed--up to one-third of them for the rarest outcomes--because the resampling happened to draw 0 cases of the rare outcome. I thought of two different ways to deal with this.

                      The first author on the paper wants to be able to say: "the group that didn't know their disease state was 4.8 times more likely to use inappropriate meds (95% CI: X to Y)". To allow this, I could replace the missing margin with the actual rate of use for the cases. This isn't ideal, because it changes the variance of that ratio.

                      Alternatively, I could calculate the inverted ratio and substitute 0 for the missing margin and the first author can say "the group that knew their disease state was 79% less likely to use the inappropriate med". This seems less hinky to me, though it may be similarly flawed.
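
                      For reference, the "79% less likely" figure corresponds to one minus the inverted ratio of the point estimates shown in #5:

                      Code:
                      * 1 - (aware margin / unaware margin) = 1 - 1/4.8, about 0.79
                      display 1 - .022342/.107299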

                      At any rate, for the example I used above with a point estimate of 4.80, the bias-corrected bootstrapped CI (reps=100) was 2.40 to 33.79, vs. -0.78 to 10.38 using the delta method. (This particular outcome did not have the problem of a large number of failed replications, so it is unaffected by the issue of replacing unestimated margins.)

                      Thank you again for your help!

                      Comment


                      • #12
                        Is there a reason why you don't just use
                        Code:
                        svy, subpop(if stage!=0): glm med_use i.stage##i.disease_aware other_variables, family(binomial) link(log) eform
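
                        If you go this route, a within-stage risk ratio for awareness can still be pulled out of the model despite the interaction, by combining the awareness main effect with the relevant interaction term. A sketch (the stage level shown is a placeholder; use whatever level coding your data actually have):

                        Code:
                        * risk ratio for aware vs unaware within, say, stage level 2 under the log link
                        lincom 1.disease_aware + 2.stage#1.disease_aware, eform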

                        Comment


                        • #13
                          Joseph Coveney's suggestion in #12 is excellent; I'm sorry I didn't propose it myself. It gives you a direct estimate of the risk ratio.

                          That said, it sounds to me a bit like the first author on your paper is trying to write a sensational-sounding conclusion that, though not false, creates an impression of greater importance than the finding warrants. The bottom line here is that you have found, approximately, 10.7% vs 2.2% use of an inappropriate medication. Is this difference clinically important? How bad is this inappropriate medication? If it has serious toxicity in appreciable numbers of people, then, yes, this is hugely important. If it is merely a waste of money, and not terribly expensive at that, then who cares? Why go through contortions trying to come out with a large-sounding ratio? Because the risk difference of 8.5 percentage points somehow seems less gratifying, less "sexy"? In any event, you have the ratio. Worse, it appears you are actually doing the contortions because you are concerned about whether the ratio is "statistically significant."

                          The American Statistical Association has recommended that the concept of statistical significance be abandoned. You can read their position paper at https://www.tandfonline.com/doi/full...5.2019.1583913.

                          So I urge you to spend your energy on more useful things. Forget about whether the ratio is statistically significant. An honest scientific report of your results will include the probability of taking the inappropriate medication in both the unaware and aware groups. For most purposes, and certainly for any kind of decision making (e.g., should we mount a trial of an intervention to increase awareness as a follow-up to this study, or even go directly to implementing such an intervention without further study), the risk difference is a more useful and illuminating statistic than the risk ratio anyway. If you want to report the risk ratio as well, go ahead. But the focus of your efforts should be on how clinically important this effect is, whether presented as a difference or a ratio.

                          Comment


                          • #14
                            We can't report the risk ratio directly from the model results because of the interaction.

                            > Worse, it appears you are actually doing the contortions because you are concerned about whether the ratio is "statistically significant."

                            This is not true; we have the statistical significance from the risk difference. I am going through the contortions because otherwise we would be presenting a confidence interval for a risk ratio that included a negative number. That being said, I've advised that we forget about the CIs for risk ratios, present the risk differences with confidence intervals, and describe that effect as being 4.8 times higher without a CI. Given the audience, on reflection, I think the bootstrapped CIs with the additional explanation of the rare outcomes are going to be more trouble to describe than they're worth.

                            Yes, I've read the ASA paper. I agree with it for the most part. The vast majority of people reviewing for the journals we'll submit to (and reading the articles) are MDs who had 1 stats course in medical school. They want p-values and confidence intervals.

                            Comment


                            • #15
                              I've advised that we forget about the CIs for risk ratios, present the risk differences with confidence intervals, and describe that effect as being 4.8 times higher without a CI. Given the audience, on reflection, I think the bootstrapped CIs with the additional explanation of the rare outcomes are going to be more trouble to describe than they're worth.
                              I agree with this 100%.

                              The vast majority of people reviewing for the journals we'll submit to (and reading the articles) are MDs who had 1 stats course in medical school. They want p-values and confidence intervals.
                              Well, that's the world I live and publish in, too. But if we just allow them to linger in ignorance, the situation will never improve. It is perfectly OK to push back against reviewers when asked to do things that are scientifically inappropriate. The ASA paper makes it easier to do that, as you can cite something that they probably will have some respect for.

                              And, by the way, the ASA position paper does not say to stop using p-values and confidence intervals. Those are separate issues. The ASA is only recommending that we stop dichotomizing p-values into "significant" and "non-significant" around a cutoff. In fact, they affirmatively recommend reporting actual p-values and confidence intervals.

                              Comment
