Good Morning.
I'm running a probit model to analyze the probability of a price ending in 9 cents (=1) or some other digit (=0).
(For my thesis I have other, similar models, and those all worked fine.)
The problem in this case is that nineCent == 1 occurs in only about 20% of the observations.
tab nineCent
   nineCent |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     46,638       80.40       80.40
          1 |     11,368       19.60      100.00
------------+-----------------------------------
      Total |     58,006      100.00
probit nineCent i.SUSTAINABLE i.EXCLUSIVE i.NEW i.DESIGNER 0.GENDER regPrices
Iteration 0: log likelihood = -28700.114
Iteration 1: log likelihood = -26323.003
Iteration 2: log likelihood = -26073.216
Iteration 3: log likelihood = -26045.151
Iteration 4: log likelihood = -26042.071
Iteration 5: log likelihood = -26041.931
Iteration 6: log likelihood = -26041.931
Probit regression                               Number of obs     =     58,006
                                                LR chi2(6)        =    5316.37
                                                Prob > chi2       =     0.0000
Log likelihood = -26041.931                     Pseudo R2         =     0.0926
-------------------------------------------------------------------------------
     nineCent |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
1.SUSTAINABLE |  -.0544931   .0189034    -2.88   0.004    -.0915432   -.0174431
  1.EXCLUSIVE |  -.3375287   .1162686    -2.90   0.004     -.565411   -.1096463
        1.NEW |   .0295299   .0167944     1.76   0.079    -.0033865    .0624462
   1.DESIGNER |  -1.560002   .3078543    -5.07   0.000    -2.163385   -.9566186
     0.GENDER |   .1262113   .0128617     9.81   0.000     .1010027    .1514198
    regPrices |  -.0087595   .0001489   -58.81   0.000    -.0090515   -.0084676
        _cons |  -.1118127   .0164357    -6.80   0.000    -.1440262   -.0795993
-------------------------------------------------------------------------------
Note: 267 failures and 0 successes completely determined.
NOTE: I use "0.GENDER" because, although GENDER has four classes (female, male, all, kids), for interpretation I'm only interested in females (coded as 0) versus the others.
When I run "estat classification", the model does not seem to fit well: it reports 80.4% correctly classified, but only because the model always predicts 0 and the 0s make up 80% of the data (I hope I'm not misinterpreting this!), so sensitivity is 0%.
estat classification
Probit model for nineCent
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 0 0 | 0
- | 11368 46638 | 58006
-----------+--------------------------+-----------
Total | 11368 46638 | 58006
Classified + if predicted Pr(D) >= .5
True D defined as nineCent != 0
--------------------------------------------------
Sensitivity Pr( +| D) 0.00%
Specificity Pr( -|~D) 100.00%
Positive predictive value Pr( D| +) .%
Negative predictive value Pr(~D| -) 80.40%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 0.00%
False - rate for true D Pr( -| D) 100.00%
False + rate for classified + Pr(~D| +) .%
False - rate for classified - Pr( D| -) 19.60%
--------------------------------------------------
Correctly classified 80.40%
--------------------------------------------------
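To check that I am reading this table correctly, here is the arithmetic as I understand it, a small sketch that uses only the counts shown in the table above. With the default 0.5 cutoff no observation is classified as positive, so the share correctly classified is simply the share of 0s.

```stata
* Counts copied from the classification table above (cutoff = 0.5)
* Correctly classified = (true positives + true negatives) / total
display "Correctly classified (%): " %6.2f (100*(0+46638)/58006)
* Sensitivity = true positives / all actual 1s
display "Sensitivity (%):          " %6.2f (100*0/11368)
* Specificity = true negatives / all actual 0s
display "Specificity (%):          " %6.2f (100*46638/46638)
```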
After spending this week searching, I thought that a good solution might be undersampling: randomly draw 11,368 observations from the nineCent == 0 group so that the resulting sample is 50% 0s and 50% 1s. However, I don't know how to do this properly...
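A minimal sketch of what I have in mind, assuming "sample" with the "count" option is an appropriate tool here. The seed value is arbitrary, and I have not confirmed that this is the right approach. As far as I understand, "sample" keeps every observation that does not satisfy the "if" condition, so all the 1s would remain:

```stata
* Keep a copy of the full data so it can be restored afterwards
preserve
* Arbitrary seed, only so the random draw is reproducible
set seed 12345
* Keep a random 11,368 of the 0s; observations with nineCent == 1 are untouched
sample 11368 if nineCent == 0, count
* Check the balance: each group should now have 11,368 observations
tabulate nineCent
* Re-fit the same model on the balanced sample and look at the classification
probit nineCent i.SUSTAINABLE i.EXCLUSIVE i.NEW i.DESIGNER 0.GENDER regPrices
estat classification
* Return to the full 58,006 observations
restore
```

One thing that worries me about this route is that it discards about 35,000 observations, and the predicted probabilities from the balanced sample would no longer reflect the 19.6% share of 1s in the full data.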
On the other hand, I've also read that one possible solution is to lower the cutoff value in "estat classification". In this case, after running "lsens", the plot points to a cutoff of about 0.2, and running the command again with that cutoff gives what looks like a much better fit. I just don't know whether this is the best solution, because as far as I can tell it only changes the classification reported by "estat classification" and not the probit model itself.
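To make my doubt concrete: as far as I can tell, the cutoff is applied only to the predicted probabilities after estimation, so the same table could be rebuilt by hand without re-running probit. A sketch, where phat and class20 are variable names I made up for illustration:

```stata
* Run after the probit command above; the coefficients are not re-estimated
predict double phat, pr
* Classify as positive when the predicted probability is at least 0.2
generate byte class20 = (phat >= 0.2) if !missing(phat)
* Cross-tabulate the hand-made classification against the outcome
tabulate class20 nineCent
* Sensitivity/specificity across all cutoffs, and the ROC curve
lsens
lroc
```

If that is right, the tabulate counts should match "estat classification, cutoff(0.2)", while the coefficient table stays exactly as shown above.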
Hence, can someone help me with this topic?
Thank you so much in advance!!
estat classification, cutoff(0.2)
Probit model for nineCent
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 7969 19368 | 27337
- | 3399 27270 | 30669
-----------+--------------------------+-----------
Total | 11368 46638 | 58006
Classified + if predicted Pr(D) >= .2
True D defined as nineCent != 0
--------------------------------------------------
Sensitivity Pr( +| D) 70.10%
Specificity Pr( -|~D) 58.47%
Positive predictive value Pr( D| +) 29.15%
Negative predictive value Pr(~D| -) 88.92%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 41.53%
False - rate for true D Pr( -| D) 29.90%
False + rate for classified + Pr(~D| +) 70.85%
False - rate for classified - Pr( D| -) 11.08%
--------------------------------------------------
Correctly classified 60.75%
--------------------------------------------------
