  • Unbalanced data - probit

    Good Morning.

    I'm running a probit model to analyze the probability of a price ending in 9 cents (=1) or some other digit (=0).
    (For the thesis I have other similar models but with these everything went ok).

    The problem in this case is that the dummy nineCent == 1 only happens on about 20% of the data.

    tab nineCent

       nineCent |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |     46,638       80.40       80.40
              1 |     11,368       19.60      100.00
    ------------+-----------------------------------
          Total |     58,006      100.00


    probit nineCent i.SUSTAINABLE i.EXCLUSIVE i.NEW i.DESIGNER 0.GENDER regPrices

    Iteration 0: log likelihood = -28700.114
    Iteration 1: log likelihood = -26323.003
    Iteration 2: log likelihood = -26073.216
    Iteration 3: log likelihood = -26045.151
    Iteration 4: log likelihood = -26042.071
    Iteration 5: log likelihood = -26041.931
    Iteration 6: log likelihood = -26041.931

    Probit regression                               Number of obs     =     58,006
                                                    LR chi2(6)        =    5316.37
                                                    Prob > chi2       =     0.0000
    Log likelihood = -26041.931                     Pseudo R2         =     0.0926

    -------------------------------------------------------------------------------
         nineCent |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
    1.SUSTAINABLE |  -.0544931   .0189034    -2.88   0.004    -.0915432   -.0174431
      1.EXCLUSIVE |  -.3375287   .1162686    -2.90   0.004     -.565411   -.1096463
            1.NEW |   .0295299   .0167944     1.76   0.079    -.0033865    .0624462
       1.DESIGNER |  -1.560002   .3078543    -5.07   0.000    -2.163385   -.9566186
         0.GENDER |   .1262113   .0128617     9.81   0.000     .1010027    .1514198
        regPrices |  -.0087595   .0001489   -58.81   0.000    -.0090515   -.0084676
            _cons |  -.1118127   .0164357    -6.80   0.000    -.1440262   -.0795993
    -------------------------------------------------------------------------------
    Note: 267 failures and 0 successes completely determined.


    NOTE: I use "0.GENDER" because, although GENDER has four classes (female, male, all, kids), for interpretation I'm only interested in females (coded as 0) versus the others.


    What happens is that "estat classification" suggests the model does not fit well: it reports 80.4% correctly classified, but only because the model always predicts 0 and the 0s make up 80% of the data (I hope I'm not misinterpreting!), so sensitivity is 0%.


    estat classification

    Probit model for nineCent

                  -------- True --------
    Classified |         D            ~D  |      Total
    -----------+--------------------------+-----------
         +     |         0             0  |          0
         -     |     11368         46638  |      58006
    -----------+--------------------------+-----------
       Total   |     11368         46638  |      58006

    Classified + if predicted Pr(D) >= .5
    True D defined as nineCent != 0
    --------------------------------------------------
    Sensitivity                     Pr( +| D)    0.00%
    Specificity                     Pr( -|~D)  100.00%
    Positive predictive value       Pr( D| +)       .%
    Negative predictive value       Pr(~D| -)   80.40%
    --------------------------------------------------
    False + rate for true ~D        Pr( +|~D)    0.00%
    False - rate for true D         Pr( -| D)  100.00%
    False + rate for classified +   Pr(~D| +)       .%
    False - rate for classified -   Pr( D| -)   19.60%
    --------------------------------------------------
    Correctly classified                        80.40%
    --------------------------------------------------


    After spending this week searching, I thought that a good solution would be undersampling: randomly drawing 11,368 observations from the nineCent == 0 group so that the "total sample" would then be 50% 0s and 50% 1s. However, I don't know how to do it...
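
    For reference, the undersampling step described above could be sketched in Stata roughly as follows. This is a minimal sketch rather than a recommendation: the seed value and the local macro name n1 are arbitrary choices for illustration, and the variable names are taken from the probit command above.

    ```stata
    * Sketch only: random undersampling of the nineCent == 0 group.
    * The seed value is arbitrary; it just makes the draw reproducible.
    set seed 12345
    preserve

    * Number of 1s (11,368 in the tabulation above)
    count if nineCent == 1
    local n1 = r(N)

    * Keep a random n1 of the 0s; observations that fail the -if-
    * condition (the 1s) are all kept
    sample `n1' if nineCent == 0, count

    * Check the 50/50 split, then refit on the balanced subsample
    tabulate nineCent
    probit nineCent i.SUSTAINABLE i.EXCLUSIVE i.NEW i.DESIGNER 0.GENDER regPrices

    restore
    ```

    Because the subsample is 50% 1s by construction, the constant and the predicted probabilities from this fit no longer reflect the 19.6% share of 1s in the full data.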

    On the other hand, I've also read that one possible solution is to lower the cutoff value in "estat classification". In this case, after running "lsens", the suggested value is about 0.2, and rerunning the command with that cutoff gives a much better fit. I just don't know if this is the best solution, because as I understand it this only changes the fit reported by "estat classification" and not the probit model itself.
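
    The point that the cutoff enters only after estimation can be checked by hand. The sketch below assumes the probit above has just been run; phat and class20 are illustrative variable names that do not appear in the original model.

    ```stata
    * Plot sensitivity and specificity against the probability cutoff
    lsens

    * Fitted probabilities Pr(nineCent = 1); these do not depend on any cutoff
    predict double phat if e(sample), pr

    * Apply the 0.2 cutoff by hand, using >= as estat classification does
    generate byte class20 = phat >= 0.2 if !missing(phat)

    * Cross-tabulate the hand-made classification against the outcome
    tabulate class20 nineCent
    ```

    The cell counts from the cross-tabulation should match those of "estat classification, cutoff(0.2)", while the coefficients and fitted probabilities stay exactly as estimated.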

    Hence, can someone help me with this topic?

    Thank you so much in advance!!


    estat classification, cutoff(0.2)

    Probit model for nineCent

                  -------- True --------
    Classified |         D            ~D  |      Total
    -----------+--------------------------+-----------
         +     |      7969         19368  |      27337
         -     |      3399         27270  |      30669
    -----------+--------------------------+-----------
       Total   |     11368         46638  |      58006

    Classified + if predicted Pr(D) >= .2
    True D defined as nineCent != 0
    --------------------------------------------------
    Sensitivity                     Pr( +| D)   70.10%
    Specificity                     Pr( -|~D)   58.47%
    Positive predictive value       Pr( D| +)   29.15%
    Negative predictive value       Pr(~D| -)   88.92%
    --------------------------------------------------
    False + rate for true ~D        Pr( +|~D)   41.53%
    False - rate for true D         Pr( -| D)   29.90%
    False + rate for classified +   Pr(~D| +)   70.85%
    False - rate for classified -   Pr( D| -)   11.08%
    --------------------------------------------------
    Correctly classified                        60.75%
    --------------------------------------------------
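
    The summary rates in this table follow directly from the four cell counts. As a quick check, the arithmetic can be reproduced with display, using only numbers taken from the cutoff(0.2) table; the results should agree with the percentages reported there.

    ```stata
    * Sensitivity: correctly classified 1s / all actual 1s
    display %6.2f 100 * 7969 / 11368

    * Specificity: correctly classified 0s / all actual 0s
    display %6.2f 100 * 27270 / 46638

    * Correctly classified: (true positives + true negatives) / all observations
    display %6.2f 100 * (7969 + 27270) / 58006
    ```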

  • #2
    What happens is that "estat classification" suggests the model does not fit well: it reports 80.4% correctly classified, but only because the model always predicts 0 and the 0s make up 80% of the data (I hope I'm not misinterpreting!), so sensitivity is 0%.

    It's the opposite. The "fit" according to the hit ratio is very good at 80% as the scale is 0 "Poor", 100 "Perfect".


    The problem in this case is that the dummy nineCent == 1 only happens on about 20% of the data.

    tab nineCent

       nineCent |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |     46,638       80.40       80.40
              1 |     11,368       19.60      100.00
    ------------+-----------------------------------
          Total |     58,006      100.00

    20% on a sample size of 50,000+ is not small on any objective measure. The event cannot even be classified as rare. It's a 1 in 5 chance.


    On the other hand, I've also read that one possible solution is to lower the cutoff value in "estat classification". In this case, after running "lsens", the suggested value is about 0.2.

    If you are worried that 80% is too good a fit but your model has low explanatory power, a popular choice as you allude is to set the cutoff at the mean value of the outcome in the sample.

    Code:
    sum nineCent if e(sample)
    estat classification, cutoff(`r(mean)')
    The intuition is summarized below from my lecture notes.

    [Image: Capture.PNG]


    Last edited by Andrew Musau; 19 May 2022, 05:58.



    • #3
      Thank you Professor!! Andrew Musau

      It's the opposite. The "fit" according to the hit ratio is very good at 80% as the scale is 0 "Poor", 100 "Perfect".
      Can you give me a reference to cite in my thesis for a threshold value of that count R2 (the "correctly classified" figure from "estat classification")? It makes sense to me that the threshold is 50%, since above 50% the model "hits more than it fails" and predicts better than chance. However, I did not find any reference to cite for that.

      If you are worried that 80% is too good a fit but your model has low explanatory power, a popular choice as you allude is to set the cutoff at the mean value of the outcome in the sample.
      Thank you for your suggestion! It worked well. Is there any reference that I could cite to justify this solution?


      Thank you once again!

      Best Regards,
      Mariana Goncalves



      • #4
        There is a discussion of the hit ratio (named count R-squared) in Long and Freese's textbook. The default calculation utilizes a cutoff of 0.5, so you do not have to justify that. For a general concept of the hit ratio, see the definition of the performance metric "accuracy" in the article by Fawcett giving an introduction to ROC analysis.
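
        As a rough illustration of what the hit ratio measures, it can be computed by hand from the fitted probabilities. The sketch assumes the probit from the first post has just been run; phat05 and hit are illustrative variable names, not part of the original analysis.

        ```stata
        * Fitted probabilities Pr(nineCent = 1)
        predict double phat05 if e(sample), pr

        * hit = 1 when the 0.5-cutoff classification equals the observed outcome
        generate byte hit = (phat05 >= 0.5) == nineCent if e(sample)

        * The mean of hit is the share correctly classified (the hit ratio)
        summarize hit

        * For comparison: the share of the larger outcome category, i.e. the
        * hit ratio obtained by always predicting that category
        summarize nineCent if e(sample)
        display max(r(mean), 1 - r(mean))
        ```

        In the output from the first post the two numbers coincide at 80.40%, because at the 0.5 cutoff every observation is classified as 0.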


        References
        Fawcett, Tom (2006). An Introduction to ROC Analysis. Pattern Recognition Letters. 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010.
        Long, J. Scott and Jeremy Freese (2014). Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press.



        • #5
          Thank you, Andrew Musau, once again!!

          Best regards
