Unbalanced data - probit

Mariana Goncalves

Join Date: May 2022

Posts: 3
#1

19 May 2022, 03:47

Good Morning.

I'm running a probit model to analyze the probability that a price ends in 9 cents (=1) rather than in some other digit (=0).
(For the thesis I have other similar models, but with those everything went OK.)

The problem in this case is that the dummy nineCent equals 1 in only about 20% of the data.

tab nineCent

   nineCent |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     46,638       80.40       80.40
          1 |     11,368       19.60      100.00
------------+-----------------------------------
      Total |     58,006      100.00

probit nineCent i.SUSTAINABLE i.EXCLUSIVE i.NEW i.DESIGNER 0.GENDER regPrices

Iteration 0: log likelihood = -28700.114
Iteration 1: log likelihood = -26323.003
Iteration 2: log likelihood = -26073.216
Iteration 3: log likelihood = -26045.151
Iteration 4: log likelihood = -26042.071
Iteration 5: log likelihood = -26041.931
Iteration 6: log likelihood = -26041.931

Probit regression                               Number of obs     =     58,006
                                                LR chi2(6)        =    5316.37
                                                Prob > chi2       =     0.0000
Log likelihood = -26041.931                     Pseudo R2         =     0.0926

-------------------------------------------------------------------------------
     nineCent |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
1.SUSTAINABLE |  -.0544931   .0189034    -2.88   0.004    -.0915432   -.0174431
  1.EXCLUSIVE |  -.3375287   .1162686    -2.90   0.004     -.565411   -.1096463
        1.NEW |   .0295299   .0167944     1.76   0.079    -.0033865    .0624462
   1.DESIGNER |  -1.560002   .3078543    -5.07   0.000    -2.163385   -.9566186
     0.GENDER |   .1262113   .0128617     9.81   0.000     .1010027    .1514198
    regPrices |  -.0087595   .0001489   -58.81   0.000    -.0090515   -.0084676
        _cons |  -.1118127   .0164357    -6.80   0.000    -.1440262   -.0795993
-------------------------------------------------------------------------------
Note: 267 failures and 0 successes completely determined.

NOTE: I used "0.GENDER" because, although GENDER has 4 classes (female, male, all, kids), for interpretation I'm only interested in females (coded as 0) versus the others.

What happens is that when I run "estat classification", the fit of the model does not look good: it reports an overall fit of 80.4%, but only because the model always predicts 0 and the 0s represent 80% of the data (I hope I'm not misinterpreting this!), so sensitivity = 0%.

estat classification

Probit model for nineCent

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |         0             0  |          0
     -     |     11368         46638  |      58006
-----------+--------------------------+-----------
   Total   |     11368         46638  |      58006

Classified + if predicted Pr(D) >= .5
True D defined as nineCent != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)    0.00%
Specificity                     Pr( -|~D)  100.00%
Positive predictive value       Pr( D| +)       .%
Negative predictive value       Pr(~D| -)   80.40%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    0.00%
False - rate for true D         Pr( -| D)  100.00%
False + rate for classified +   Pr(~D| +)       .%
False - rate for classified -   Pr( D| -)   19.60%
--------------------------------------------------
Correctly classified                        80.40%
--------------------------------------------------

After spending this week searching, I thought that a good solution would be undersampling: randomly draw 11,368 observations from the nineCent == 0 group so that the total sample would then be 50% 0s and 50% 1s. However, I don't know how to do it...

On the other hand, I've also read that one possible solution is to lower the cutoff value in "estat classification". In this case, after running "lsens", the value pointed to is about 0.2, and rerunning the command with that cutoff gives a much better fit. I just don't know whether this is the best solution, because as I understand it this only changes the fit reported by "estat classification" and not the probit model itself.

Hence, can someone help me with this topic?

Thank you so much in advance!!

estat classification, cutoff(0.2)

Probit model for nineCent

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      7969         19368  |      27337
     -     |      3399         27270  |      30669
-----------+--------------------------+-----------
   Total   |     11368         46638  |      58006

Classified + if predicted Pr(D) >= .2
True D defined as nineCent != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   70.10%
Specificity                     Pr( -|~D)   58.47%
Positive predictive value       Pr( D| +)   29.15%
Negative predictive value       Pr(~D| -)   88.92%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   41.53%
False - rate for true D         Pr( -| D)   29.90%
False + rate for classified +   Pr(~D| +)   70.85%
False - rate for classified -   Pr( D| -)   11.08%
--------------------------------------------------
Correctly classified                        60.75%
--------------------------------------------------
Tags: classification, probit, sampling, sensitivity, unbalanced
Andrew Musau

Join Date: Oct 2014

Posts: 10482
#2

19 May 2022, 05:18

What happens is that when I run "estat classification", the fit of the model does not look good: it reports an overall fit of 80.4%, but only because the model always predicts 0 and the 0s represent 80% of the data (I hope I'm not misinterpreting this!), so sensitivity = 0%.

It's the opposite. The "fit" according to the hit ratio is very good at 80%, since the scale runs from 0 ("poor") to 100 ("perfect").
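To see why no observation is classified as positive at the default cutoff, note that Pr(D) >= .5 requires the linear index to be non-negative. A rough check, run after the probit in #1 and assuming those estimates are still the active results (pr_hat is a hypothetical variable name, not something from the original post):

```stata
* Fitted probabilities for the estimation sample (pr_hat is a made-up name)
predict double pr_hat if e(sample), pr
summarize pr_hat

* Most favourable index before the price term: female (0.GENDER) and new item,
* with the dummies that have negative coefficients left at 0
display _b[_cons] + _b[0.GENDER] + _b[1.NEW]

* Price above which even that most favourable case falls below Pr(D) = .5
display (_b[_cons] + _b[0.GENDER] + _b[1.NEW]) / abs(_b[regPrices])
```

With the coefficients reported in #1 the break-even price works out to roughly 5, so an all-negative classification at .5 is what one would expect if essentially every item costs more than that. The maximum of pr_hat from summarize shows directly how far the predictions stay below .5.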

The problem in this case is that the dummy nineCent equals 1 in only about 20% of the data.

tab nineCent

   nineCent |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     46,638       80.40       80.40
          1 |     11,368       19.60      100.00
------------+-----------------------------------
      Total |     58,006      100.00

20% on a sample size of 50,000+ is not small on any objective measure. The event cannot even be classified as rare. It's a 1 in 5 chance.
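For completeness, the random undersampling asked about in #1 can be sketched as below, although with a 20% event share it is probably unnecessary. This is a minimal sketch rather than a recommendation: the seed value is arbitrary, and selecting observations on the outcome typically distorts the intercept and therefore the predicted probabilities, so the results would not describe the full data.

```stata
preserve
* Arbitrary seed, only so that the draw is reproducible
set seed 12345
* Keep 11,368 randomly chosen observations with nineCent == 0;
* observations not meeting the if condition (nineCent == 1) are all kept
sample 11368 if nineCent == 0, count
tab nineCent
probit nineCent i.SUSTAINABLE i.EXCLUSIVE i.NEW i.DESIGNER 0.GENDER regPrices
estat classification
* Return to the full 58,006 observations
restore
```

The preserve/restore pair keeps the original data intact, so the balanced subsample exists only while the comparison is run.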

On the other hand, I've also read that one possible solution is to lower the cutoff value in "estat classification". In this case, after running "lsens", the value pointed to is about 0.2.

If you are worried that 80% is too good a fit while your model has low explanatory power, a popular choice, as you allude to, is to set the cutoff at the sample mean of the outcome.

Code:

sum nineCent if e(sample)
estat classification, cutoff(`r(mean)')
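Note that the cutoff only changes how the fitted probabilities are turned into 0/1 predictions; the probit coefficients are not re-estimated. A sketch that reproduces the classification by hand, assuming the probit from #1 is the active estimation result (phat, yhat and the local cut are hypothetical names):

```stata
* Fitted probabilities from the unchanged probit model
predict double phat if e(sample), pr

* Sample share of nineCent == 1, used as the cutoff
summarize nineCent if e(sample)
local cut = r(mean)

* Classify as 1 when the fitted probability reaches the cutoff
generate byte yhat = phat >= `cut' if e(sample)
tabulate yhat nineCent

* The same cross-tabulation as reported by Stata
estat classification, cutoff(`cut')

* Cutoff-free summaries: sensitivity/specificity by cutoff, and the ROC curve
lsens
lroc
```

The hand-made table from tabulate should agree with the one from estat classification, which makes it clear that only the classification rule moves while the model stays the same.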

The intuition is summarized below from my lecture notes.

Last edited by Andrew Musau; 19 May 2022, 05:58.
Mariana Goncalves

Join Date: May 2022

Posts: 3
#3

20 May 2022, 03:16

Thank you, Professor Andrew Musau!!

It's the opposite. The "fit" according to the hit ratio is very good at 80%, since the scale runs from 0 ("poor") to 100 ("perfect").

Can you give me a reference to cite in my thesis regarding a threshold value for the count R2 (the "Correctly classified" figure from "estat classification")? It makes sense to me that the threshold is 50%, since above 50% the model "hits more than it fails" and predicts better than chance. However, I did not find any reference to cite for that.

If you are worried that 80% is too good a fit while your model has low explanatory power, a popular choice, as you allude to, is to set the cutoff at the sample mean of the outcome.

Thank you for your suggestion! It worked well. Is there a reference that I could cite to justify this solution?

Thank you once again!

Best Regards,
Mariana Goncalves
Andrew Musau

Join Date: Oct 2014

Posts: 10482
#4

20 May 2022, 06:37

There is a discussion of the hit ratio (named count R-squared) in Long and Freese's textbook. The default calculation utilizes a cutoff of 0.5, so you do not have to justify that. For a general concept of the hit ratio, see the definition of the performance metric "accuracy" in the article by Fawcett giving an introduction to ROC analysis.
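As a worked illustration, the count R-squared is the share of correct predictions, (true positives + true negatives) / N, and it can be recomputed from the two classification tables in #1. The adjusted version in the last line is the form usually attributed to Long and Freese; treat the exact formula as an assumption and check it against the book before citing it.

```stata
* Cutoff .5: 0 true positives, 46,638 true negatives, N = 58,006
* Displays 80.40
display %6.2f 100 * (0 + 46638) / 58006

* Cutoff .2: 7,969 true positives, 27,270 true negatives
* Displays 60.75
display %6.2f 100 * (7969 + 27270) / 58006

* Adjusted count R2 at cutoff .5: correct predictions beyond always
* predicting the most frequent outcome (the 46,638 zeros)
* Displays 0.00
display %6.2f (0 + 46638 - 46638) / (58006 - 46638)
```

At the default cutoff the adjusted value is 0, because the 80.40% hit ratio is exactly what predicting 0 for every observation would achieve.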

References
Fawcett, Tom (2006). An Introduction to ROC Analysis. Pattern Recognition Letters 27 (8): 861–874. doi:10.1016/j.patrec.2005.10.010.
Long, J. Scott, and Jeremy Freese (2014). Regression Models for Categorical Dependent Variables Using Stata. 3rd ed. College Station, TX: Stata Press.
Mariana Goncalves

Join Date: May 2022

Posts: 3
#5

20 May 2022, 07:28

Thank you, Andrew Musau, once again!!

Best regards