
  • Cross validation of a logistic regression

I'm creating a model capable of predicting an outcome.
    What I have done so far:
    1. Randomly divided my sample into two groups.
    2. Ran logistic regression for my outcome and predictive variables in group 1.

What I need to do now, cross-validation:
1. Assign scores in group 2 (which I haven't been able to do): using the logistic coefficients estimated in group 1, compute a score for each patient in group 2 based on that patient's own covariate values.
2. Predict the outcome in group 2: use those scores to predict the outcome in group 2 (cross-validation).

    My final results should have a predicted vs observed outcome in group 2.

    Any guidance would be of help.
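A minimal Stata sketch of the workflow described above, assuming hypothetical variables outcome, x1, and x2: fit the model in group 1, then let -predict- apply those coefficients to every observation, including group 2.

Code:
* randomly split the sample into two groups (hypothetical variables)
set seed 2024
gen group = 1 + (runiform() > 0.5)

* estimate the logistic regression in group 1 only
logit outcome x1 x2 if group==1

* -predict- applies the group 1 coefficients to ALL observations,
* so group 2 gets out-of-sample predicted probabilities
predict phat

* predicted vs observed outcome in group 2
gen predicted = phat >= 0.5 if !missing(phat)
tab outcome predicted if group==2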

  • #2
-predict- (or -predict, asif-; see the help file) will get you the predictions.

The question is how to assess the fit. You're not going to get zeros and ones; ROC analysis might work, but I suspect there are other approaches.



    • #3
      Code:
      * fit the model in one subsample (smokers)
      webuse lbw, clear
      logit low age lwt i.race ptl ht ui if smoke==1

      * predicted probabilities for all observations,
      * including the held-out subsample (smoke==0)
      predict yfit
      g yfit_d = yfit >= 0.5 if !missing(yfit)

      * in-sample classification table and ROC curve
      estat classification
      lroc

      * ROC analysis of the continuous predictor in each subsample
      rocfit low yfit if smoke==1, cont(10)
      rocfit low yfit if smoke==0, cont(10)

      * predicted vs observed outcome in each subsample
      tab2 low yfit_d if smoke==1
      tab2 low yfit_d if smoke==0

      See https://www.statalist.org/forums/forum/general-stata-discussion/general/1327821-interpretating-classification-of-logistic-model



      • #4
The ROC area is a measure of model discrimination, but it does not directly say much about the fit of the model. For this purpose I think it is better to use a calibration statistic, such as the Hosmer-Lemeshow statistic (-estat gof, group(10)- after a logistic regression). Do note that when you are doing this for the validation sample, which is out-of-sample for the model coefficients from the learning sample, the df for the statistic is 10, not 8.
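As a sketch of that step (hypothetical variables outcome, x1, x2, and group; check the -estat gof- help file, which as I recall documents an -outsample- option that makes exactly this df adjustment):

Code:
* fit on the learning sample
logit outcome x1 x2 if group==1

* Hosmer-Lemeshow statistic on the validation sample;
* -outsample- tells Stata the data are out of sample, so the
* statistic is referred to chi2 with 10 df rather than 8
estat gof if group==2, group(10) table outsample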

        If you prefer something a bit newer and fancier than the Hosmer-Lemeshow statistic, there is the -calibrationbelt- procedure, available from http://www.stata-journal.com/software/sj17-4. Read the help file. It explains the way to use it specifically for developmental and validation data sets.



        • #5
Unless you have a sizable N, at least in the tens of thousands, this is unlikely to be a good idea, as you will have power issues and possibly bias issues. In some circumstances, switching to bootstrap or n-fold (e.g., n=10) cross-validation may help, but more info is needed to judge that.
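A hand-rolled version of the n-fold idea mentioned above might look like the following sketch (hypothetical variables outcome, x1, x2); each fold is predicted from a model fit on the other nine:

Code:
set seed 12345
* assign each observation to one of 10 folds at random
gen fold = ceil(10*runiform())
gen double phat = .
forvalues k = 1/10 {
    * fit on everything except fold `k'
    quietly logit outcome x1 x2 if fold != `k'
    * predict only for the held-out fold
    quietly predict double pk if fold == `k'
    quietly replace phat = pk if fold == `k'
    drop pk
}
* phat now holds out-of-sample predicted probabilities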
