
  • Cross validation of a logistic regression

I'm creating a model capable of predicting an outcome.
    What I have done so far:
    1. Randomly divided my sample into two groups.
    2. Ran logistic regression for my outcome and predictive variables in group 1.

What I need to do now, cross-validation:
1. Assign scores in group 2 (which I haven't been able to do): using the logistic coefficients estimated in group 1, compute a score for each patient in group 2 based on that patient's own covariate values.
2. Predict the outcome in group 2: use those scores to predict the outcome in group 2 (cross-validation).

    My final results should have a predicted vs observed outcome in group 2.

    Any guidance would be of help.
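A minimal Stata sketch of the workflow described above, assuming hypothetical variables outcome, x1, and x2: fit the model in group 1, then let -predict- apply those coefficients to every observation, including group 2.

Code:
* randomly split the sample into two groups (hypothetical variables)
set seed 2024
gen group = 1 + (runiform() > 0.5)

* estimate the logistic regression in group 1 only
logit outcome x1 x2 if group==1

* -predict- applies the group 1 coefficients to ALL observations,
* so group 2 gets out-of-sample predicted probabilities
predict phat

* predicted vs observed outcome in group 2
gen predicted = phat >= 0.5 if !missing(phat)
tab outcome predicted if group==2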

  • #2
-predict- (or -predict, asif-; see the help file) will get you the predictions.

The question is how to assess the fit. You're not going to get zeros and ones; ROC analysis might work, but I suspect there are other approaches.



    • #3
      Code:
      * fit the model in one subsample (smokers)
      webuse lbw, clear
      logit low age lwt i.race ptl ht ui if smoke==1

      * predicted probabilities for all observations,
      * including the held-out subsample (smoke==0)
      predict yfit
      g yfit_d = yfit >= 0.5 if !missing(yfit)

      * in-sample classification table and ROC curve
      estat classification
      lroc

      * ROC analysis of the continuous predictor in each subsample
      rocfit low yfit if smoke==1, cont(10)
      rocfit low yfit if smoke==0, cont(10)

      * predicted vs observed outcome in each subsample
      tab2 low yfit_d if smoke==1
      tab2 low yfit_d if smoke==0

      See https://www.statalist.org/forums/forum/general-stata-discussion/general/1327821-interpretating-classification-of-logistic-model



      • #4
The ROC area is a measure of model discrimination, but it does not directly say much about the fit of the model. For this purpose I think it is better to use a calibration statistic, such as the Hosmer-Lemeshow statistic (-estat gof, group(10)- after a logistic regression). Do note that when you are doing this for the validation sample, which is out-of-sample for the model coefficients from the learning sample, the df for the statistic is 10, not 8.
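As a sketch of that step (hypothetical variables outcome, x1, x2, and group; check the -estat gof- help file, which as I recall documents an -outsample- option that makes exactly this df adjustment):

Code:
* fit on the learning sample
logit outcome x1 x2 if group==1

* Hosmer-Lemeshow statistic on the validation sample;
* -outsample- tells Stata the data are out of sample, so the
* statistic is referred to chi2 with 10 df rather than 8
estat gof if group==2, group(10) table outsample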

        If you prefer something a bit newer and fancier than the Hosmer-Lemeshow statistic, there is the -calibrationbelt- procedure, available from http://www.stata-journal.com/software/sj17-4. Read the help file. It explains the way to use it specifically for developmental and validation data sets.



        • #5
Unless you have a sizable N, at least in the tens of thousands, this is unlikely to be a good idea, as you will have power issues and possibly bias issues. In some circumstances, switching to bootstrap or n-fold (e.g., n=10) cross-validation may help, but more info is needed to judge that.
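A hand-rolled version of the n-fold idea mentioned above might look like the following sketch (hypothetical variables outcome, x1, x2); each fold is predicted from a model fit on the other nine:

Code:
set seed 12345
* assign each observation to one of 10 folds at random
gen fold = ceil(10*runiform())
gen double phat = .
forvalues k = 1/10 {
    * fit on everything except fold `k'
    quietly logit outcome x1 x2 if fold != `k'
    * predict only for the held-out fold
    quietly predict double pk if fold == `k'
    quietly replace phat = pk if fold == `k'
    drop pk
}
* phat now holds out-of-sample predicted probabilities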
