logistic model - area under the curve, and c statistic

Murph Ngo

Join Date: Apr 2017

Posts: 11
#1

logistic model - area under the curve, and c statistic

08 May 2017, 05:47

Hello all,

I have a query about the area under the curve (lroc command) and the C statistic (calculated using the user-written hl program, https://www.sealedenvelope.com/stata/hl/).

I have created a logistic model (picture 1).

Using the 'lroc' command I get an area under the curve value of 0.6869 (picture 2).

Using the user-written hl program ( https://www.sealedenvelope.com/stata/hl/ ) I get what I believe is a C statistic of 0.5697 (picture 3).

From what I have been reading, I believe that for logistic models both these values should be equal?

#1. Am I correct with this belief?

#2. And if so, could anyone please advise me why my two values are different?

Thank you for your help.

PICTURE 1

PICTURE 2

PICTURE 3
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

08 May 2017, 09:35

Yes, the area under the ROC curve and the C-statistic are the same thing. I am not familiar with the user-written program you are referring to, so I cannot comment why it gives a different result. The official Stata -lroc- program has been around for a very long time, so it would be surprising if it had an uncorrected error. I would be more inclined to believe the results of -lroc-. You might want to find the author of the user-written program and contact him/her about this.
1 like
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

08 May 2017, 09:52

Hello Murph,

I suspect Clyde couldn't open the pictures. Actually, they are very small on my screen and I needed to enlarge them to get a clear view.

That is surely why he did not point out that the value 0.5697 is in fact the p-value of the Hosmer-Lemeshow test, not a C statistic.

Indeed, the p-value in this case, being > 0.05, is good news in terms of calibration of the model.

Last edited by Marcos Almeida; 08 May 2017, 09:57.

Best regards,

Marcos
1 like
Comment
Murph Ngo

Join Date: Apr 2017

Posts: 11
#4

08 May 2017, 18:52

Thank you for your advice.

Apologies for the sizing of the pictures.

On the user-written program's website, they say that the program's output is the C statistic (P value 0.5697). [PICTURE 1]

When I use "estat gof, group(10)" -> I get a P value of 0.3765. [PICTURE 2]

I notice that the Hosmer-Lemeshow chi2 value is the same for both methods (8.61); however, the user-written program uses 10 degrees of freedom, whilst estat gof uses 8, which would explain the different P values.

Perhaps I have misunderstood the explanation provided by the author of hl. I now realise this is not how a c-statistic would be calculated. Apologies for my confusion. I will ask them for clarification.

Thank you

[PICTURE 1]

[PICTURE 2]

Last edited by Murph Ngo; 08 May 2017, 19:22.
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

09 May 2017, 07:07

Thank you for presenting larger images. I gather the issue with the values is now clarified. If in doubt, I'd stick to the -estat gof- results (and its degrees of freedom).

Last edited by Marcos Almeida; 09 May 2017, 07:10.

Best regards,

Marcos
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#6

09 May 2017, 12:36

Coming back to this with the benefit of the readable graphics, a quick summary.

1. If you want the C-statistic, that is what -lroc- gives you.

2. If you want the Hosmer-Lemeshow goodness-of-fit test, -estat gof- does that.

3. If you are doing the Hosmer-Lemeshow test on the same data to which the logistic model was fit, the correct df is 8.

4. If you are applying the test to a different, non-overlapping sample then the correct df is 10. You can get that by specifying the -outsample- option in the -estat gof- command.
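
The four points above can be sketched in a few lines. This is only an illustration under stated assumptions: the lbw dataset, the covariates, the seed, and the random split variable train are all arbitrary choices made here, not anything from the original question.

Code:

webuse lbw, clear

// Hypothetical random split into a fitting half and a validation half
set seed 12345
generate byte train = runiform() < 0.5

// Fit the model on the fitting half only
logit low age lwt smoke ht ui if train

// Point 1: C-statistic (area under the ROC curve) for the estimation sample
lroc, nograph

// Points 2 and 3: Hosmer-Lemeshow test on the estimation sample, 10 groups, df = 10 - 2 = 8
estat gof, group(10)

// Point 4: the same test on the non-overlapping validation half, df = 10
estat gof if !train, group(10) outsample

With 10 groups, the two -estat gof- calls should report 8 and 10 degrees of freedom respectively, which is the same 8 versus 10 difference discussed above.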
1 like
Comment
Ginny Han

Join Date: Jul 2018

Posts: 22
#7

17 Dec 2018, 00:40

Hi,
I have a follow-up question regarding the C-statistic. I've been using the -lroc- command following -logit- to calculate it. However, -lroc- reports the area under the ROC curve only as a point estimate. Is there a command or method in Stata that gives both the point estimate and the 95% confidence interval of the C-statistic?
I did not think the CI was necessary until I saw that several articles report the C-statistic with its 95% confidence interval:
Moore, B.J., et al., Identifying Increased Risk of Readmission and In-hospital Mortality Using Hospital Administrative Data: The AHRQ Elixhauser Comorbidity Index. Med Care, 2017. 55(7): p. 698-705.
Walraven, C.V., et al., A Modification of the Elixhauser Comorbidity Measures into a Point System for Hospital Death Using Administrative Data. Medical Care, 2009. 47(6): p. 626-633
These articles used SAS (the %ROC macro from Gonen).
Can Stata calculate the C-statistic and its 95% confidence interval? If yes, how?

Any suggestions or comments are welcome. Thanks very much.

Ginny
Comment

Joseph Coveney

Join Date: Apr 2014
Posts: 4433

17 Dec 2018, 00:57

Originally posted by Ginny Han

Can [Stata] calculate C-statistics and its 95% confidence intervals? If yes how to do that?

Code:

sysuse auto

// One classification variable
roctab foreign gear_ratio

// Multiple classification variables in concert
quietly logit foreign c.(gear_ratio displacement), nolog
predict double xb, xb
roctab foreign xb

help roc
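
As a hedged follow-up sketch: -roctab- reports the area together with a standard error and a confidence interval, and the options below switch between interval methods. The variable name phat is an arbitrary choice made here; as far as I know, the default standard error is the DeLong one with an asymptotic normal interval.

Code:

sysuse auto, clear
quietly logit foreign c.(gear_ratio displacement)
predict double phat, pr

// Point estimate only
lroc, nograph

// Same area, plus standard error and asymptotic normal 95% CI
roctab foreign phat

// Alternative standard errors and intervals
roctab foreign phat, bamber
roctab foreign phat, hanley
roctab foreign phat, binomial

// A different confidence level
roctab foreign phat, level(90)

Using the predicted probability instead of the linear predictor xb does not change the area, because one is a monotone transformation of the other.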

Comment

Ginny Han

Join Date: Jul 2018

Posts: 22
#9

17 Dec 2018, 01:22

Thank you very much, Mr. Coveney! Works perfectly.
Comment
Tom Hsiung

Join Date: Sep 2017

Posts: 153
#10

24 May 2024, 02:37

Originally posted by Joseph Coveney

Code:

sysuse auto

// One classification variable
roctab foreign gear_ratio

// Multiple classification variables in concert
quietly logit foreign c.(gear_ratio displacement), nolog
predict double xb, xb
roctab foreign xb

help roc

Hi, Joseph

Wow, this is amazing. Could you tell me the math behind this estimation, especially the confidence interval of the C-statistic? So far, I have seen each study produce only a single sample value of the C-statistic. How, then, do we estimate its confidence interval? Thanks.
Comment
ericmelse

Join Date: May 2014

Posts: 436
#11

08 Oct 2024, 11:08

For those who are interested and not aware of this paper, it is available Open Access: Carrington, A. M., Fieguth, P. W., Qazi, H., Holzinger, A., Chen, H. H., Mayr, F., & Manuel, D. G. (2020). A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms. BMC Medical Informatics and Decision Making, 20, 1-12. https://doi.org/10.1186/s12911-019-1014-6

http://publicationslist.org/eric.melse
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#12

08 Oct 2024, 11:17

This paper would also be of interest (and shameless self-promotion). It extends the DeLong method to improve estimation and allows for missing data. I have written a Stata program for it and will be releasing it when I can find some time in the coming weeks.

Zou L, Choi YH, Guizzetti L, Shu D, Zou J, Zou G. Extending the DeLong algorithm for comparing areas under correlated receiver operating characteristic curves with missing data. Stat Med. 2024 Sep 20;43(21):4148-4162. doi: 10.1002/sim.10172. Epub 2024 Jul 16. PMID: 39013403.
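
For context, the standard (unextended) DeLong estimate for a single AUC can be sketched as below. The notation is shorthand introduced here, not taken from the paper: $X_1,\dots,X_m$ are the scores of the $m$ positive outcomes and $Y_1,\dots,Y_n$ those of the $n$ negative outcomes. As far as I know, this is also the default standard error behind the interval that -roctab- reports.

\[
\psi(x,y) =
\begin{cases}
1 & \text{if } x > y,\\
\tfrac{1}{2} & \text{if } x = y,\\
0 & \text{if } x < y,
\end{cases}
\qquad
\hat\theta = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\psi(X_i,Y_j)
\]

\[
V_{10}(X_i) = \frac{1}{n}\sum_{j=1}^{n}\psi(X_i,Y_j),
\qquad
V_{01}(Y_j) = \frac{1}{m}\sum_{i=1}^{m}\psi(X_i,Y_j)
\]

\[
S_{10} = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(V_{10}(X_i)-\hat\theta\bigr)^2,
\qquad
S_{01} = \frac{1}{n-1}\sum_{j=1}^{n}\bigl(V_{01}(Y_j)-\hat\theta\bigr)^2
\]

\[
\widehat{\operatorname{Var}}(\hat\theta) = \frac{S_{10}}{m} + \frac{S_{01}}{n},
\qquad
\hat\theta \pm z_{1-\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat\theta)}
\]

Here $\hat\theta$ is the sample C-statistic (the Mann-Whitney proportion of concordant positive-negative pairs, counting ties as one half), so a single sample yields both a point estimate and, through the variability of the per-observation components $V_{10}$ and $V_{01}$, an asymptotic normal confidence interval.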
2 likes
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#13

08 Oct 2024, 11:35

There is a reason for differences in the c-statistic/AUC where one would expect them to be identical regardless of the method used. As already pointed out, the terms AUC and c-statistic (among many others) refer to precisely the same quantity. However, they can be estimated differently, under either parametric or non-parametric assumptions. In my experience non-parametric methods are more common, but parametric models do exist (notably in the diagnostic test meta-analysis space, as far as I am aware).

Here's a toy example illustrating how different estimates of the AUC can be obtained when one models the same predictor differently.

Code:

sysuse auto, clear
xtile price_group = price, nq(5) // I create a variable that could be modelled as a linear covariate or factor variable

qui logit foreign price_group
estat auc

qui logit foreign i.price_group
estat auc

roctab foreign price_group

qui ranksum price_group, by(foreign) porder
di 1 - r(porder)
Comment
