Significant coefficient in logistic regression, but overlapping margins

Felix Scholl

Join Date: Aug 2020

Posts: 33
#1

20 Sep 2020, 12:52

Dear members,

I am running a logistic regression of a binary outcome on a set of independent variables. The coefficient of my key independent variable y is significant at the 5% level (p=0.018). To assess the substantive effect of the variable, I run margins after the estimation to calculate predicted probabilities at different levels of the independent variable (using the at() option of margins) and then plot them with marginsplot. The problem is that none of the margins are statistically different from each other: the 95% confidence intervals for the predicted probabilities always overlap. I could not find a single pair of margins that is statistically different from each other.

Is there any statistical explanation for this? Common sense tells me that there cannot be a significant effect when all the predictions of the outcome overlap. Any idea what the reason could be?

I clustered standard errors in the logistic regression and used the vce(unconditional) and asobserved options of margins, because I want to make inferences from a survey sample to the general population (as suggested in the Stata manual). I calculated the margins over the observed range of the independent variable y, from the minimum to the maximum value.

Thanks for your help.

Best regards,
Felix

Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#2

20 Sep 2020, 13:17

This is a good example of the problems that arise from using the concept of statistical significance, and one of the reasons why the American Statistical Association has recommended it be abandoned. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and
https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr.

By recasting the continuous effect estimates (and even the continuous p-values) into a false dichotomy of significant vs non-significant, the illusion is created that these correspond to "effect" and "no effect," which intuition suggests should be consistently arrived at when different measures are used. That, of course, is all wrong.

There are several factors at play here. First of all, even in simple things like t-tests, looking at the difference between two means is not the same as looking at whether the confidence intervals of those means overlap. When the confidence intervals do not overlap, the difference is always statistically significant, but the reverse is not always true. Your situation is this in a different metric: logistic regression coefficient (a single measure of difference) vs overlap of confidence intervals of two measures of the corresponding levels. See https://journals.sagepub.com/doi/pdf...10581001900316 for a full explanation. In any case, it is perfectly possible for a data set to provide a very precise estimate of the difference between two things while providing only vague estimates of the two things themselves. That is what you are seeing. (And you might see it in your probability estimates too if you ran -margins- again with the -pwcompare- option.)
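For readers who want to try the -pwcompare- suggestion, a minimal and untested sketch follows. It assumes the poster's data are in memory and borrows the names problem, medex7d_print, `controlvars', and the cluster variable date from the commands posted later in this thread. The three at() values are arbitrary illustrations, not values anyone here recommended.

```stata
* Assumes the poster's data are in memory and that the local macro
* controlvars is defined in the same do-file run (names taken from the thread)
quietly logit problem medex7d_print `controlvars', vce(cluster date)

* Predicted probabilities at three illustrative values of medex7d_print,
* followed by a test of every pairwise difference between them
margins, at(medex7d_print=(0 7 14)) vce(unconditional) pwcompare(effects)
```

The pairwise table reports each difference in predicted probability with its own standard error and confidence interval, and that is the quantity to inspect rather than the overlap of the individual intervals.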

The fact that you are working in a logistic model adds another complication. A logistic regression coefficient is the logarithm of the (adjusted) odds ratio. But when the baseline probabilities are high, the odds ratios greatly exaggerate the effects compared to what is seen when looking at the corresponding probabilities. A very large odds ratio can correspond to a very tiny difference in probability between groups. For example, if one group has an outcome probability of .97, an odds ratio of 4 (which is huge for an odds ratio between two groups) means that the other group has a probability of about .992. That's a difference of less than 0.03, which in many contexts would be meaninglessly small. Since you don't show your actual outputs, I can't say whether something like this is happening in your situation.
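The arithmetic in this example can be checked with a few -display- commands. This is only a sketch: the scalar names p0, oddsratio, odds1, and p1 are made up for illustration, and the inputs are the .97 baseline and the odds ratio of 4 from the paragraph above.

```stata
* Hypothetical scalar names; run with no same-named variables in memory
scalar p0 = 0.97                            // baseline probability
scalar oddsratio = 4                        // odds ratio between the groups

* Convert the baseline to odds, apply the odds ratio, convert back
scalar odds1 = oddsratio * p0 / (1 - p0)
scalar p1 = odds1 / (1 + odds1)

display "implied probability = " %6.4f p1
display "difference          = " %6.4f (p1 - p0)

* The same calculation done on the logit scale, as a cross-check
display "cross-check         = " %6.4f invlogit(logit(p0) + ln(oddsratio))
```

The implied probability comes out at about 0.992, a change of roughly 0.022 from the baseline.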

So what you need to do is carefully review what your research goal was in the first place. Why did you gather the data? What question are you trying to answer, and how will you put your results to use? Which is more important: the odds ratio, which is more a theoretical measure of the strength of association between y and the outcome, or the actual probabilities? Will the results be used for decision making? If so, the probabilities will be more useful than the odds ratio.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2169
#3

20 Sep 2020, 13:19

Without showing your output you’re much less likely to get a helpful response. I have some ideas, but I can’t afford to be a detective.
Bruce Weaver

Join Date: May 2014

Posts: 1133
#4

20 Sep 2020, 13:40

Felix Scholl, here is another article on overlapping confidence intervals that you might find helpful.
https://www.cmaj.ca/content/166/1/65.long

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)

Felix Scholl

Join Date: Aug 2020

Posts: 33
#5

20 Sep 2020, 14:22

@all Thanks a lot, this is very helpful. I attach the output below, if that is of any help.

@Clyde Schechter: I assume the same is true if the baseline probability is very low, right? This would rather be the problem in my case, since the sample probability is only 0.035.

Here is the relevant output from the logistic and margins command (I only show three variables here, but there are many more in the model).

Code:

Iteration 0:   log pseudolikelihood = -2695.6077  
Iteration 1:   log pseudolikelihood = -2623.2003  
Iteration 2:   log pseudolikelihood = -2616.6522  
Iteration 3:   log pseudolikelihood = -2616.6386  
Iteration 4:   log pseudolikelihood = -2616.6386  

Logistic regression                             Number of obs     =     17,571
                                                Wald chi2(36)     =     207.25
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -2616.6386               Pseudo R2         =     0.0293

                                   (Std. Err. adjusted for 198 clusters in int_date)
----------------------------------------------------------------------------------------
                       |               Robust
               problem | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------------------+----------------------------------------------------------------
         medex7d_print |   1.070994   .0311648     2.36   0.018     1.011621    1.133852
         conversations |   .8283887     .12231    -1.28   0.202     .6202344    1.106401
              interest |   2.035035   .3810907     3.79   0.000     1.409846     2.93746


Predictive margins                              Number of obs     =     17,571

Expression   : Pr(problem), predict()

1._at        : medex7d_pr~t    =           0

2._at        : medex7d_pr~t    =           1

3._at        : medex7d_pr~t    =           2

4._at        : medex7d_pr~t    =           3

5._at        : medex7d_pr~t    =           4

6._at        : medex7d_pr~t    =           5

7._at        : medex7d_pr~t    =           6

8._at        : medex7d_pr~t    =           7

9._at        : medex7d_pr~t    =           8

10._at       : medex7d_pr~t    =           9

11._at       : medex7d_pr~t    =          10

12._at       : medex7d_pr~t    =          11

13._at       : medex7d_pr~t    =          12

14._at       : medex7d_pr~t    =          13

15._at       : medex7d_pr~t    =          14

16._at       : medex7d_pr~t    =          15

17._at       : medex7d_pr~t    =          16

18._at       : medex7d_pr~t    =          17

19._at       : medex7d_pr~t    =          18

20._at       : medex7d_pr~t    =          19

21._at       : medex7d_pr~t    =          20

22._at       : medex7d_pr~t    =          21

23._at       : medex7d_pr~t    =          22

24._at       : medex7d_pr~t    =          23

25._at       : medex7d_pr~t    =          24

26._at       : medex7d_pr~t    =          25

27._at       : medex7d_pr~t    =          26

28._at       : medex7d_pr~t    =          27

29._at       : medex7d_pr~t    =          28

30._at       : medex7d_pr~t    =          28

                         (Std. Err. adjusted for 198 clusters in pre_intdatum)
------------------------------------------------------------------------------
             |            Unconditional
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   .0333932   .0021651    15.42   0.000     .0291497    .0376368
          2  |   .0356589   .0020639    17.28   0.000     .0316137     .039704
          3  |   .0380708   .0024198    15.73   0.000      .033328    .0428135
          4  |   .0406374   .0032113    12.65   0.000     .0343434    .0469313
          5  |   .0433675   .0043217    10.03   0.000     .0348971    .0518379
          6  |   .0462702   .0056787     8.15   0.000     .0351402    .0574002
          7  |    .049355   .0072534     6.80   0.000     .0351385    .0635715
          8  |   .0526317   .0090391     5.82   0.000     .0349154    .0703481
          9  |   .0561104   .0110388     5.08   0.000     .0344747    .0777462
         10  |   .0598015   .0132608     4.51   0.000     .0338108    .0857923
         11  |   .0637157    .015716     4.05   0.000      .032913    .0945185
         12  |   .0678639   .0184168     3.68   0.000     .0317677    .1039601
         13  |   .0722572   .0213767     3.38   0.001     .0303596    .1141548
         14  |   .0769069   .0246097     3.13   0.002     .0286727    .1251411
         15  |   .0818245   .0281301     2.91   0.004     .0266905    .1369585
         16  |   .0870214   .0319521     2.72   0.006     .0243964    .1496464
         17  |   .0925091   .0360898     2.56   0.010     .0217744    .1632438
         18  |    .098299   .0405567     2.42   0.015     .0188093    .1777888
         19  |   .1044025   .0453659     2.30   0.021      .015487    .1933181
         20  |   .1108307   .0505293     2.19   0.028     .0117951    .2098663
         21  |   .1175943   .0560578     2.10   0.036      .007723    .2274655
         22  |   .1247036   .0619609     2.01   0.044     .0032625    .2461446
         23  |   .1321686   .0682463     1.94   0.053    -.0015916    .2659288
         24  |   .1399985   .0749197     1.87   0.062    -.0068415    .2868385
         25  |   .1482018   .0819848     1.81   0.071    -.0124855    .3088892
         26  |   .1567863   .0894425     1.75   0.080    -.0185178    .3320904
         27  |   .1657586   .0972909     1.70   0.088    -.0249281    .3564453
         28  |   .1751243    .105525     1.66   0.097    -.0317009    .3819495
         29  |   .1848878   .1141364     1.62   0.105    -.0388155    .4085911
         30  |   .1848878   .1141364     1.62   0.105    -.0388155    .4085911
------------------------------------------------------------------------------

  Variables that uniquely identify margins: medex7d_print _atopt
  Multiple at() options specified:
      _atoption=1: medex7d_print==(0(1)28)
      _atoption=2: medex7d_print==28


Clyde Schechter

Join Date: Apr 2014

Posts: 30111
#6

20 Sep 2020, 14:40

"I assume the same is true if the baseline probability is very low, right? This would rather be the problem in my case, since the sample probability is only 0.035."

Yes, at either end, a large odds ratio can correspond to a small change in probability.
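A rough way to see this with the odds ratio from the output above (1.070994 for medex7d_print) is to apply it at a low, a middling, and a high baseline. The sketch below is an approximation under stated assumptions: 0.0334 is the predictive margin at medex7d_print = 0 from the posted table, 0.5 and 0.97 are hypothetical baselines added for contrast, and applying an odds ratio to an averaged probability ignores the variation in the other covariates.

```stata
* Odds ratio for medex7d_print, copied from the logistic output in this thread
local oddsratio = 1.070994

* 0.0334 comes from the posted margins table; 0.5 and 0.97 are hypothetical
foreach p0 of numlist 0.0334 0.5 0.97 {
    * Shift the baseline by ln(odds ratio) on the logit scale
    local p1 = invlogit(logit(`p0') + ln(`oddsratio'))
    display "baseline = " %6.4f `p0' "   implied = " %6.4f `p1' "   change = " %6.4f (`p1' - `p0')
}
```

The same odds ratio moves the probability by about 0.002 at a baseline of 0.0334, by about 0.017 at 0.5, and by about 0.002 at 0.97.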
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2169
#7

20 Sep 2020, 18:52

Clyde has answered your question. Once I saw the issue I thought of Clyde’s example of comparing means. It’s not hard to find an example where the means are statistically different with p < 0.05 and yet the 95% CIs overlap.
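One such example can be sketched with made-up summary statistics (two groups of 100, means of 10 and 11.3, a common standard deviation of 4) and Stata's immediate command -ttesti-. The numbers and the scalar names are hypothetical and chosen only for illustration.

```stata
* Hypothetical summary statistics, not data from this thread
* ttesti syntax: #obs1 #mean1 #sd1 #obs2 #mean2 #sd2
ttesti 100 10 4 100 11.3 4
scalar pdiff = r(p)                         // two-sided p-value for the difference
display "two-sided p for the difference = " %6.4f pdiff

* 95% confidence limits of each mean taken separately (99 df per group)
scalar halfwidth = invttail(99, 0.025) * 4 / sqrt(100)
scalar upper1 = 10 + halfwidth              // upper limit, group 1
scalar lower2 = 11.3 - halfwidth            // lower limit, group 2
display "upper limit, group 1 = " %7.3f upper1
display "lower limit, group 2 = " %7.3f lower2
```

Here the two-sided p-value for the difference is roughly 0.02, yet the separate 95% intervals (about 9.21 to 10.79 and 10.51 to 12.09) overlap.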

Out of curiosity, did you compute the average marginal effect on the probability?

Felix Scholl

Join Date: Aug 2020

Posts: 33
#8

22 Sep 2020, 02:58

@Jeff Wooldridge: When I calculate average marginal effects, I get these results:

Code:

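* Fit the logit with standard errors clustered on date (output suppressed)
quietly logit problem medex7d_print `controlvars', or vce(cluster date)

* Average marginal effect of medex7d_print on Pr(problem)
margins, post dydx(medex7d_print) vce(unconditional)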

Average marginal effects                        Number of obs     =     17,571

Expression   : Pr(problem), predict()
dy/dx w.r.t. : medex7d_print

                          (Std. Err. adjusted for 198 clusters in date)
-------------------------------------------------------------------------------
              |            Unconditional
              |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
medex7d_print |   .0023257   .0009817     2.37   0.018     .0004016    .0042499
-------------------------------------------------------------------------------

Hence, the p-value is identical to the p-value of the coefficient in the regression.


Stella Lartey

Join Date: Oct 2020

Posts: 4
#9

15 Oct 2020, 11:04

Dear Members,
I have two regression equations. After each regression, I run margins and marginsplot. I would now like to combine the margins estimates from the two regressions in a single figure.
Does anyone have an idea of how I could combine the two plots/graphs?
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#10

15 Oct 2020, 11:55

This seems to be a different query. Please start a new thread.

Best regards,

Marcos
Stella Lartey

Join Date: Oct 2020

Posts: 4
#11

29 Oct 2020, 07:52

Sorry Marcos, I'm new here and did not know this. Thank you.
Bruce Weaver

Join Date: May 2014

Posts: 1133
#12

29 Oct 2020, 09:28

For those who are interested, Stella's new thread can be seen here:
https://www.statalist.org/forums/for...sion-equations

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)