Calculating ROC curve areas: problems with using predicted values from logit

Joe Canner

Join Date: Mar 2014

Posts: 580
#1

28 Jan 2015, 13:34

As outlined in a Stata Journal article from 2002 by Mario Cleves, one can compute ROC curve areas using lroc or roctab, as follows:

Code:

logit refvar classvars
lroc
predict p
roctab refvar p

Of course, this is the only way to proceed if you have multiple classification variables. If you have a single classification variable, you can use roctab alone and get the same answer:

Code:

roctab refvar classvar

However, in cases where the prediction model is not that good (e.g., with a negative regression coefficient), the two methods give different answers. The reason for this seems to be that roctab is expecting a classification variable for which increasing values indicate increased risk of the outcome of interest. However, a negative regression coefficient will yield predicted values where increasing values indicate decreased risk, which results in a different answer.

My questions are these:
Can anyone confirm that using logit (or probit) before roctab can give different results than using roctab alone?

Is this documented anywhere?

If it is not documented, should it be? Or is it something that I should have known about ahead of time?

Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#2

28 Jan 2015, 19:04

I have run across the phenomenon where lroc after a logistic regression model using the classification variable gives a different result from roctab on the original classification variable, and it's for the reason that you mention, which is documented in the entry for roctab of the user's manual. But I don't recall ever getting different results using roctab on the predictions. Also, any discrepancy always went away when I reversed the sign of the original classification variable before using roctab.

I cannot confirm that you can get a different result using predictions after fitting a logistic regression model, even with a poor predictor, with or without guaranteeing that the regression coefficient is negative.

Code:

version 13.1

clear *
set more off
set seed `=date("2015-01-29", "YMD")'
quietly set obs 100
generate byte response = runiform() > 0.5
quietly generate double predictor = .

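program define rocem, rclass
    version 13.1
    // optionally-off form: specifying -noreverse- fills local macro reverse
    syntax , [noREVerse]

    tempvar xb
    tempname area
    quietly {
        // draw a fresh predictor that is unrelated to the response
        replace predictor = runiform()
        
        // unless -noreverse- is given, force a negative association
        if "`reverse'" == "" {
            correlate response predictor
            if r(rho) > 0 replace predictor = -predictor
        }
        logit response c.predictor
        predict double `xb', xb
        // save the lroc area before roctab overwrites r()
        lroc, nograph
        scalar define `area' = r(area)
        // area from roctab on the linear predictions
        roctab response `xb'
    }
    return scalar roctab = r(area)
    return scalar lroc = `area'
end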

program define signflip
    version 13.1
    syntax , [flip]

    local noreverse = cond(("`flip'" == ""), "", "noreverse")
    
    tempname file_handle
    tempfile tmpfil0
    postfile `file_handle' double(lroc roctab) using `tmpfil0'

    forvalues rep = 1/500 {
        rocem , `noreverse'
        post `file_handle' (r(lroc)) (r(roctab))
    }
    postclose `file_handle'
    
    preserve
    use `tmpfil0', clear
    graph twoway scatter roctab lroc, mcolor(black) msize(vsmall) ///
        ylabel( , angle(horizontal) nogrid)

    generate double delta = lroc - roctab
    summarize delta
end

pause on
signflip
pause

signflip , flip

exit

Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#3

28 Jan 2015, 22:00

I had deleted my first post to this thread, which illustrated the phenomenon and quoted the user's manual. Here's the illustration.

Code:

version 13.1

clear *
set more off
sysuse auto
logit foreign c.displacement, nolog
lroc, nograph
roctab foreign displacement
quietly replace displacement = -displacement
roctab foreign displacement
exit
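The same illustration can be extended to put the three areas side by side. This is only a sketch: the scalar names (a_raw, a_neg, a_lroc) and the copy variable negdisp are made up here for illustration and are not part of the original post. Because reversing the sign of a classification variable reverses its ranking, the two roctab areas should sum to 1, and the lroc area should match the one from the sign-reversed variable.

```stata
sysuse auto, clear

* Area using displacement as recorded
quietly roctab foreign displacement
scalar a_raw = r(area)

* Area after reversing the sign (negdisp is a hypothetical helper variable)
generate double negdisp = -displacement
quietly roctab foreign negdisp
scalar a_neg = r(area)

* Area reported by lroc after the logistic model
quietly logit foreign c.displacement
quietly lroc, nograph
scalar a_lroc = r(area)

display "roctab, as recorded:   " %6.4f a_raw
display "roctab, sign reversed: " %6.4f a_neg
display "lroc after logit:      " %6.4f a_lroc
display "recorded + reversed:   " %6.4f (a_raw + a_neg)
```

Working on a copy rather than replacing displacement in place keeps the dataset usable for further comparisons.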

The Description section of the entry in the user's manual for roctab says, "The rating or outcome of the diagnostic test or test modality is recorded in classvar, which must be at least ordinal, with higher values indicating higher risk."

Joe Canner

Join Date: Mar 2014

Posts: 580
#4

29 Jan 2015, 07:25

Joseph,

Thanks for weighing in on this. My situation is similar to your example above (auto data set), except that in my case the regression coefficient, while negative, is not significantly different from 0. In other words, changing the sign of the classification variable makes the logit/lroc and roctab answers the same, but it doesn't make it any better of a predictor.

Moreover, even though changing the sign makes the lroc and roctab results the same, in my case they are both wrong. My classification variable takes values 0 through 10, with higher values meaning higher risk of disease. In most cases, it is a reasonably good predictor, but in one particular case it is not. Simply changing the sign is not an appropriate solution because the classification variable generally works well the way it was designed. Instead, I want the ROC curve to be an accurate representation of the classification variable as it was originally designed, so my solution is to use the roctab answer and ignore the logit/lroc answer.

So, I guess the take-home message here is to pay attention to the requirement that the classification variable be positively correlated with the outcome variable and not to blindly use roctab (or roccomp) on the predicted probabilities without paying attention to whether that requirement is satisfied.

Incidentally, using roctab on the predicted probabilities from the regression is generally the same as using roctab on the original classification variable because the predicted probabilities are perfectly correlated with the classification variable and thus result in the same ROC curve.
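One way to check this on a given dataset is to compute both areas and look at the sign of the coefficient at the same time. The sketch below is an assumption-laden example, not something from the thread's own data: it uses the auto dataset with foreign as the outcome, and the scalar names (raw_*, pred_*, b_*) are invented for illustration. It compares roctab on each raw variable with roctab on the predicted probabilities from a one-variable logit.

```stata
sysuse auto, clear

foreach v of varlist mpg displacement {
    * Area using the variable as recorded
    quietly roctab foreign `v'
    scalar raw_`v' = r(area)

    * Fit the one-variable model and keep the coefficient
    quietly logit foreign c.`v'
    scalar b_`v' = _b[`v']

    * Area using the predicted probabilities
    tempvar p
    quietly predict double `p', pr
    quietly roctab foreign `p'
    scalar pred_`v' = r(area)

    display "`v': b = " %8.4f b_`v' ///
        "  raw = " %6.4f raw_`v' "  predicted = " %6.4f pred_`v'
}
```

When the coefficient is positive the predicted probabilities rank observations in the same order as the raw variable, so the two areas should agree. When it is negative the ranking is reversed, and the two areas should instead sum to 1.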

Regards,
Joe

Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#5

29 Jan 2015, 15:49

Just a couple of points.

If the regression coefficient is negative (regardless of whether it is significantly different from zero), then take the value from logit/lroc (or from roctab on the predictions if you wish) and subtract it from 0.5. Alternatively, to use the original classification variable, change its sign, run roctab, and then subtract the result from 0.5.

Using roctab on the predicted probabilities (or linear predictions) is never the same as using roctab on the original classification variable when the sign of the regression coefficient is negative.

Joseph Coveney

Join Date: Apr 2014

Posts: 4410
#6

29 Jan 2015, 17:11

Should have said "subtract 0.5 from the result", sorry.