  • How to (manually?) get classification table for LPM and Cloglog?

    Dear Stata experts,

    I want to evaluate the fit of two different models, a linear probability model and a complementary log log.
    Unfortunately, estat classification does not support these two specifications.
    However, since we can obtain predictions, I thought about manually getting a classification table.

    As a rookie in empirical work, I am now stuck with this simplistic code (using the default cutoff of 0.5).


    Code:
     
    predict phatLPM
    gen correct=0
    replace correct=1 if phatLPM>=0.5 & EVENT==1
    replace correct=1 if phatLPM<0.5 & EVENT==0
    tab correct

    That way, I seem to get the "Correctly classified" figure at least. But still, I'd highly appreciate any help toward a better approach.

    Thanks a lot!
    Jo

  • #2
    Well, probability correctly classified is not a very useful statistic, because it depends on the overall probability of EVENT and only indirectly reflects the properties of the test. The more useful statistics in a classification scheme are the sensitivity and specificity. You can get those with:

    Code:
    predict phatLPM
    gen pos_test = (phatLPM >= 0.5) if !missing(phatLPM)
    tab pos_test EVENT, col
    (The sensitivity and specificity will be the column percentages in the main diagonal of the table.)

    If you want the positive and negative predictive values as well, just add the -row- option to the -tab- command.
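
    For example:
    Code:
    tab pos_test EVENT, col row
    (The row percentages in the main diagonal will then be the negative and positive predictive values.)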

    Finally, I'll add that using a 0.5 predicted probability as the cutoff for defining a positive test is arbitrary and often far from the best choice.




    • #3
      Thank you very much for your quick and very helpful answer, Clyde!

      I have read (in a textbook I unfortunately don't have at hand right now) that the 'correctly classified' figure is used quite often in the social sciences.
      But is there an option to include this statistic as well? I managed to calculate it from your nice table (the one with the row option), but it would be great if it were given directly.

      Concerning the cutoff:
      I was planning to adjust it such that the number of events predicted equals the actual number of events in the sample. What do you think about this approach?

      And one more question:
      Is it appropriate to use the classification scheme for (cross-) validation? (I am concerned about overfitting)


      Thanks again and best regards
      Jo



      • #4
        What are you using before predict? I can't see your command for either model. Certainly, regress has no idea that predicted values being above or below 0.5 has any definite meaning.



        • #5
          My models are:

          Code:
          reg Event regressor1 regressor2 regressor3 regressor4
          cloglog Event regressor1 regressor2 regressor3 regressor4
          Regressors 1 to 3 are dummies; only no. 4 is continuous.



          • #6
            Thanks; that's an uncertainty resolved, but my comment remains.

            regress has no sense of, or special switches for, your using it for a linear probability model (LPM). Conversely, it's my wild guess that people wanting the LPM have found their needs satisfied by regress.

            With cloglog it seems more a matter of fact rather than of principle that what you want doesn't appear to be implemented in official Stata.

            Clyde's work-around looks fine to me. It could be a programming project for someone to wrap it up in a program and to generalise it to allow different cut-offs, but that would be just another command to worry about.
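
            For illustration only, here is a minimal sketch of such a wrapper with a cutoff option (the program name classtab and its options are made up for this sketch, not an existing command):
            Code:
            capture program drop classtab
            program define classtab
                version 14
                // classification table of a binary outcome against a predicted
                // probability, with a user-chosen cutoff (default 0.5)
                syntax varname(numeric) [if] [in], Phat(varname numeric) [CUToff(real 0.5)]
                marksample touse
                tempvar pos
                gen byte `pos' = (`phat' >= `cutoff') if `touse' & !missing(`phat')
                tab `pos' `varlist' if `touse', col row
            end
            Usage would then be something like classtab EVENT, phat(phatLPM) cutoff(0.3).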



            • #7
              You can get correctly classified with
              Code:
              gen correct = (pos_test == EVENT) if !missing(pos_test)
              tab correct
              Concerning the cutoff:
              I was planning to adjust it such that the number of events predicted equals the actual number of events in the sample. What do you think about this approach?
              Well, it has the virtue that the use of that cut-off point results in a somewhat realistic prediction scenario, at least relative to your study sample. Of course, these study samples are often not representative of the populations to which the prediction algorithm is intended to apply: if EVENTs are rare, samples are often taken to deliberately over-represent them, which improves the discrimination of the prediction algorithm. But you would not then want to apply a cutoff derived from the sample probability of events to a population in which that probability is actually much lower.
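
              As an illustration of that approach (using the variable names from the earlier posts; with ties in the predicted probabilities the counts will only match approximately), you can set the cutoff at the (100 - p)th percentile of the predicted probabilities, where p is the sample percentage of EVENTs:
              Code:
              quietly summarize EVENT if !missing(phatLPM)
              local q = 100 * (1 - r(mean))
              _pctile phatLPM if !missing(phatLPM), p(`q')
              local cut = r(r1)
              gen byte pos_test2 = (phatLPM >= `cut') if !missing(phatLPM)
              tab pos_test2 EVENT, col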

              I usually study a range of cutoff values, and I also typically calculate the area under the ROC curve (see -help roctab-) to get a sense of the power of discrimination of the prediction algorithm that is independent of any particular choice of cutoff. When it comes time to settle on a particular cutoff, I rely on decision theory. That means that I select a cutoff that maximizes some utility or objective function. That function, in turn, reflects the frequency of EVENTs in the population to which I want to apply the prediction, and the quantification of the harms associated with failing to predict an EVENT that does occur and of the harms associated with falsely predicting an EVENT that does not occur. You will find other approaches to cutoff selection in the literature, but in my (in this case not at all humble) opinion cutoffs chosen by other methods are just nonsense.
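
              To compute the area under the ROC curve mentioned above, for instance (again with the variable names from the earlier posts):
              Code:
              roctab EVENT phatLPM            // area under the ROC curve
              roctab EVENT phatLPM, detail    // sensitivity and specificity at each cutpoint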

              Is it appropriate to use the classification scheme for (cross-) validation? (I am concerned about overfitting)
              I don't know what you mean by that. First of all, there are several different approaches to cross-validation. Second, I'm not sure what you have in mind with the term classification "scheme." I would think that demonstrating that your model cross-validates well by comparing in-sample to out-of-sample predictions for calculating sensitivity and specificity would be difficult, because these depend on a cutoff, and a cutoff that is "optimal" in a training sample might not be "optimal" for a validation sample. Similarly, assessing out-of-sample performance on positive and negative predictive values would, I think, founder on both the cutoff issue and the possibility of appreciably different prevalence of EVENTs in the training and validation samples. As for using the percent correctly classified, it founders on both of those issues and, further, on its failure to distinguish the almost invariably quite distinct probabilities of correct classification in EVENT and non-EVENT cases, which, in turn, typically have very different consequences in the world. (Really, percent correctly classified is lame. I think that it is included in some studies because anybody can grasp what it means, or at least think they grasp what it means. But it is such an oversimplification that it is truly useless for any serious purpose.)



              • #8
                First of all, thank you very much for this thorough answer and for taking the time, Clyde!

                In my analysis I look at unemployment, by the way: Event==1 if someone is unemployed. So the harms of a false prediction are not as severe as they may be in, e.g., medical research.

                Fortunately, events occur in my sample at roughly the same rate as in the population. So after your helpful hints, I guess I'll go ahead with the cutoff that matches the number of predicted events to the actual number of events.

                However, what causes me quite a headache is my small sample size (130) and the low share of events (around 20%).
                In light of this, I'm well aware that my analysis cannot necessarily be generalized. And yet I'd like to provide a check of whether some things may still be deduced from the results. That's why I thought of cross-validation. I have to admit, though, that even after reading up on it, I still don't understand how to cross-validate at a practical level, i.e. which results to compare.
                That's why I'd really appreciate any help on that.


                On a final note, I understood your points concerning the percent correctly classified. The question I ask myself, however, is whether there is an option to include it in the same table you proposed in #2.
                Also, I don't understand why, with your code from #7 as well as my code from #1, I get 2 or 3 more values of 'correct' than the number of observations the regression is run on. Both 'correct' variables include cases where one of the variables used in my model is missing in the sample, so in the end I need to clean up with
                if !missing(variable)



                • #9
                  Concerning the three extra observations, just add -if e(sample)- to the -predict- command. That will suppress any out-of-sample predictions.
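
                  For example, rerunning the prediction restricted to the estimation sample:
                  Code:
                  capture drop phatLPM
                  predict phatLPM if e(sample)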

                  I agree that with 130 cases and 26 events your model is likely to lack precision. With four predictors, there is also a good chance that it will be overfit and will not perform well in cross validation.

                  As I said earlier, there are many different approaches to cross validation. But the way they are all typically used is that you collect the model regression coefficients in all of the cross validation samples in a data set and then you compare the original model's coefficients to the empirical distribution of the coefficients in the cross validation samples. The exact method of comparing the coefficients is another source of variation in methods. I can't recall seeing any studies where the comparison of the original model and the cross-validation models is based on comparing the sensitivity and specificity instead of the coefficients. My hunch is that because the sensitivity and specificity are cutoff dependent, this approach would prove difficult, as the "best" cutoff could differ from sample to sample. And, in any case, you could always force the results to look good by choosing an unrealistic and extreme cutoff (so that sensitivity will be close to 1 and specificity close to 0, or perhaps the other way around, in all samples.) So I don't think that's viable.
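
                  Purely as an illustration of that workflow, here is a minimal 5-fold sketch for the LPM from #5 (the fold construction, seed, and the file name cv_coefs are arbitrary choices for this sketch; save your data first, because the last lines load the collected coefficients):
                  Code:
                  set seed 12345
                  gen double u = runiform()
                  sort u
                  gen int fold = ceil(5 * _n / _N)
                  tempname cv
                  postfile `cv' fold b1 b2 b3 b4 using cv_coefs, replace
                  forvalues k = 1/5 {
                      quietly reg Event regressor1 regressor2 regressor3 regressor4 if fold != `k'
                      post `cv' (`k') (_b[regressor1]) (_b[regressor2]) (_b[regressor3]) (_b[regressor4])
                  }
                  postclose `cv'
                  use cv_coefs, clear
                  summarize b1 b2 b3 b4    // empirical distribution of the fold coefficients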

                  You might want to use Stata's -search- command, or do a Google search to see if there are any Stata programs out there that might handle the cross validation process for you. I don't know if there are, but looking for them won't take long, and if you find one it could save you a lot of time and trouble.



                  • #10
                    Ok, thanks a lot Clyde!

                    I was also worried about the cutoff for the different cross-validation samples. The rash idea of doing cross-validation with a classification table was only due to the fact that I did not understand how to do cross-validation concretely, and the classification table on its own was something that seemed vivid to me. So I will do it the 'normal' way.

                    As a (last) supplement to the percent correctly classified:
                    I found the following in Wooldridge's 'Introductory Econometrics' (2016):
                    "The percent correctly predicted is a widely used goodness-of-fit-measure for binary dependent variables" (p. 227) and "Although the percentage correctly predicted is useful as a goodness-of-fit-measure, it can be misleading. In particular, it is possible to get rather high percentages correctly predicted even when the least likely outcome is very poorly predicted. [...] Therefore, it makes sense to also compute the percentage correctly predicted for each of the outcomes." (p. 530).
                    Wooldridge was my first and perhaps formative statistics textbook; this may explain my wish to get this value for my models.



                    • #11
                      Wooldridge was my first and perhaps formative statistics textbook; this may explain my wish to get this value for my models.
                      Yes, but read carefully what he says. He says that the percentage correctly predicted for each of the outcomes should be used. That's just a different way of saying to use the sensitivity and specificity, not the (total) percent correctly predicted which, he notes, "can be misleading." The spirit of his words is less vehement than mine, but the content and conclusions are identical.
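
                      For what it's worth, with the variables from the earlier code the outcome-specific percentages are easy to compute directly:
                      Code:
                      bysort EVENT: summarize correct
                      The mean of correct within EVENT == 0 is the percentage of non-events correctly predicted (the specificity), and within EVENT == 1 it is the percentage of events correctly predicted (the sensitivity).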

                      Anyway, thanks for a nice discussion about this. Good luck. Hope to "see" you here again soon.



                      • #12
                        Well, my take-away from Wooldridge was "widely used" and "useful", and I thought that adjusting the cutoff to accommodate both sensitivity and specificity does the job.
                        But maybe you see only what you want to see ...

                        Thank YOU, Clyde!
                        See you!

