
  • Validating logit model on another data frame

    Hello, I have 2 data frames, one is a training set, one is a validation set.

    I want to create the logit model from the training set, and then run it against the validation set to get the classification results.

    I ran:
    Code:
    frame change beer_training
    logit BeerPreference Gender Married Income Age
    estat classification
    Now I want to apply that model to beer_validation and run estat classification.

    I am super new to Stata, so please be gentle. 😂

  • #2
    I think you have just made things more complicated by having the two data sets in different frames. As far as I can tell, the way Stata does out-of-sample prediction calculations (which is what you need to see how the model works in the validation sample) does not work across different frames. It probably has to do with some difficulties defining e(sample) in a frame where the regression was not carried out--that's just my speculation.

    Be that as it may, the simplest approach is to append the two data sets together, with an indicator variable distinguishing the training and validation data sets. Then you run the regression restricted to the training data set and get your -estat classification-. Then rerun -estat classification- in the other data set. Actually, of all the statistics one can use for this, I think -estat classification-, especially with the default cutoff of 0.5 (which is almost never a useful cutoff value), is the least useful. I would strongly recommend looking at the ROC area and the Hosmer-Lemeshow statistics instead.

    Here's an example of how to do this. I've done it by taking the auto.dta and randomly splitting it into two halves. In your case, I imagine, it will instead be a matter of appending two data sets together. Anyway:
    Code:
    clear*
    
    sysuse auto
    
    set seed 1234
    label define dataset    0    "training"    1    "validation"
    gen byte dataset:dataset = runiformint(0, 1)
    
    logit foreign price mpg if dataset == "training":dataset
    lroc if e(sample), nograph
    estat gof if e(sample), group(10) table
    estat classification if e(sample)
    
    
    lroc if !e(sample), nograph
    estat gof if !e(sample), group(10) table outsample
    estat classification if !e(sample)
    By the way, in interpreting your results, bear in mind that cross validation by a split data set, like I have shown here, is a much weaker form of validation than cross validation between two data sets collected independently of each other. So titrate your enthusiasm for whatever you find accordingly when you write up your results.
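
    If the two data sets already live in separate frames, as in the original question, the append step might be sketched as follows. This is only an illustration under stated assumptions: the frame names training and validation, the variable holdout, and the cutoff of 0.3 are all hypothetical, and auto.dta is split artificially just so the example runs on its own.

    ```stata
    * Sketch only: frame names, -holdout-, and the 0.3 cutoff are hypothetical
    clear all
    sysuse auto
    set seed 1234
    gen byte holdout = runiformint(0, 1)

    * Mimic the original situation: two data sets in two separate frames
    frame put if holdout == 0, into(training)
    frame put if holdout == 1, into(validation)

    * Save the validation frame to a temporary file so it can be appended
    tempfile valid
    frame validation: save `valid'

    * Stack both sets in the training frame; generate() records the source
    frame change training
    append using `valid', generate(dataset)   // 0 = training, 1 = validation

    * Fit on the training rows only, then assess each part separately
    logit foreign price mpg if dataset == 0
    lroc if dataset == 0, nograph
    lroc if dataset == 1, nograph
    estat classification if dataset == 1, cutoff(0.3)
    ```

    Because -append- changes the frame it runs in, you may prefer to work on a copy (for example via -frame copy-) if the training frame has to stay untouched.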



    • #3
      From a purely technical perspective, and assuming your validation dataset is in the frame default, you can do something like

      Code:
      frame change beer_training
      logit BeerPreference Gender Married Income Age
      estat classification
      
      frame change default
      estimates esample :
      estat classification
      The line

      Code:
      estimates esample :
      marks all observations as the estimation sample.
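
      An alternative that does not touch e(sample) at all is to compute predicted probabilities in the validation frame and tabulate them by hand. The sketch below assumes that -predict- can use the stored estimation results after a frame change, in the same way the code above relies on them. The frame name validation and the variables holdout, phat, and predicted are hypothetical, and auto.dta stands in for the real data.

      ```stata
      * Sketch only: frame and variable names are hypothetical
      clear all
      sysuse auto
      set seed 1234
      gen byte holdout = runiformint(0, 1)

      * Put the holdout rows into their own frame, then fit on the remainder
      frame put if holdout == 1, into(validation)
      logit foreign price mpg if holdout == 0

      * Switch to the validation frame and score it with the fitted model
      frame change validation
      predict double phat                 // predicted Pr(foreign = 1)

      * Hand-made classification table at a 0.5 cutoff
      gen byte predicted = phat >= 0.5 if !missing(phat)
      tabulate predicted foreign

      * Area under the ROC curve in the validation rows
      roctab foreign phat
      ```

      Changing 0.5 in the -gen- line gives the table for any other cutoff.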
