
  • GOF of Logit Model: Pearson's chi2, Hosmer and Lemeshow's test

    Hi everyone,

    I am using a logit model (shown below) to investigate the impact of borrowers' minority status on the probability of loan approval, but both Pearson's chi-squared test and the Hosmer–Lemeshow (HL) test indicate poor goodness of fit (GOF).

    So I have the following questions:

    1) Is the poor GOF caused by the large sample size (2,491,476 observations)? I think my model already includes a rich set of controls in appropriate forms, because I followed the controls used in recent studies.

    2) Despite the poor GOF from the Pearson and HL tests, the model's "percent correctly predicted" is around 87%, which is very high. Can I regard the model as highly predictive despite the poor Pearson and HL results?

    Thanks!
    Lei


    The following are the test results.

    I used Pearson's chi-squared test to examine the GOF of the model and got:
    Number of observations = 2,491,476
    Number of covariate patterns = 1,636,678
    Pearson chi2(1636649) = 2.48e+06
    Prob > chi2 = 0.0000

    which indicates poor GOF for the model.

    In addition, I used the HL test to examine the GOF and got:
    Number of observations = 2,491,476
    Number of groups = 10
    Hosmer–Lemeshow chi2(8) = 260.64
    Prob > chi2 = 0.0000

    which also indicates poor GOF. But looking at the table below, the observed and expected cell frequencies in each group are in very good agreement, so I would think the model's GOF should be good.

    Table collapsed on quantiles of estimated probabilities
    +-----------------------------------------------------------------+
    | Group |   Prob |  Obs_1 |    Exp_1 |  Obs_0 |    Exp_0 |  Total |
    |-------+--------+--------+----------+--------+----------+--------|
    |     1 | 0.6193 |  77335 |  77379.4 | 171813 | 171768.6 | 249148 |
    |     2 | 0.7471 | 170945 | 172131.2 |  78204 |  77017.8 | 249149 |
    |     3 | 0.8465 | 200800 | 198841.8 |  48346 |  50304.2 | 249146 |
    |     4 | 0.8855 | 216880 | 216712.1 |  32268 |  32435.9 | 249148 |
    |     5 | 0.9037 | 223495 | 223061.9 |  25652 |  26085.1 | 249147 |
    |-------+--------+--------+----------+--------+----------+--------|
    |     6 | 0.9166 | 227089 | 226821.4 |  22059 |  22326.6 | 249148 |
    |     7 | 0.9275 | 229835 | 229762.4 |  19314 |  19386.6 | 249149 |
    |     8 | 0.9378 | 232215 | 232373.8 |  16932 |  16773.2 | 249147 |
    |     9 | 0.9492 | 234556 | 235019.0 |  14591 |  14128.0 | 249147 |
    |    10 | 0.9900 | 237511 | 238557.9 |  11636 |  10589.1 | 249147 |
    +-----------------------------------------------------------------+


    The following is the logit model, with the approval decision as the outcome variable and a set of explanatory variables that are either dummy or continuous; there are no interaction or squared terms:

    logit approval income_w dti20 dti20_30 dti30_36 dti36_49 dti50_60 fico680_699 fico700_719 fico720_739 ltv80 ltv80_85 ltv85_90 ltv90_95 origination_2019 refinance minority female age62 lender_top100 shadowbank fintech aus tract_minority_population_percen tract_owner_occupied_units tract_one_to_four_family_homes tract_median_age_of_housing_unit cra fhfa_index

    Here is the sample data. I divided it into two parts because of the limit on the number of variables in dataex:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float approval long income_w float(dti20 dti20_30 dti30_36 dti36_49 dti50_60 fico680_699 fico700_719 fico720_739 ltv80 ltv80_85 ltv85_90 ltv90_95 origination_2019 refinance)
    1 208 0 0 1 0 0 1 0 0 0 0 0 0 1 0
    1 190 0 0 0 1 0 1 0 0 0 0 0 0 1 0
    1 132 0 0 0 1 0 1 0 0 0 0 0 0 1 0
    1 127 0 0 0 1 0 1 0 0 0 0 0 0 1 0
    1 171 0 0 0 1 0 1 0 0 0 0 0 0 0 0
    1 125 0 0 0 1 0 1 0 0 0 0 0 0 1 0
    1 152 0 0 0 1 0 1 0 0 0 0 0 0 0 0
    1 150 0 0 0 1 0 0 1 0 0 0 0 0 1 0
    1 208 0 0 0 1 0 0 1 0 0 0 0 0 1 0
    1 208 0 0 0 1 0 1 0 0 0 0 0 0 1 0
    end
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(minority female age62 lender_top100 shadowbank fintech aus tract_minority_population_percen) int(tract_owner_occupied_units tract_one_to_four_family_homes) byte tract_median_age_of_housing_unit float cra double fhfa_index
    0 0 0 0 0 0 1 46.07 13975 15386  8 0  4.47
    0 0 0 0 0 0 1 46.07 13975 15386  8 0  4.47
    0 0 0 0 0 0 1 46.07 13975 15386  8 0  4.47
    0 0 0 0 0 0 1 46.07 13975 15386  8 0  4.47
    0 0 0 0 0 0 1 46.07 13975 15386  8 0  5.11
    0 0 0 0 0 0 1 46.07 13975 15386  8 0  4.47
    0 0 0 0 0 0 1 46.07 13975 15386  8 0  5.11
    0 0 0 0 0 0 1 11.43  6612  7636 12 0 11.99
    0 0 0 0 0 0 1  3.55  6004  6742 12 0  5.76
    0 1 0 0 0 0 1 34.96  6938  8788 13 0  6.11
    end





  • #2
    re: #1 - yes, the problem is "caused" by the large N; one way to look at this that might help is to use a calibration plot (lowess depvar pred_value - substitute your real variable names and note that pred_value refers to the predicted probabilities from the model)
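
    for example, something along these lines (a minimal sketch, assuming the model from #1 has just been fit; phat is a placeholder name):

    Code:
    * predicted probabilities from the model currently in memory
    predict phat, pr
    * smoothed observed outcome vs. predicted probability; a well-calibrated
    * model should track the 45-degree line
    lowess approval phat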

    re: #2 - "pure" accuracy is not necessarily a good criterion, as it may hide issues, especially if the proportion of 1's for your outcome variable is not close to .5; you can get a start on a better breakdown by using -estat classification- after estimating your logistic regression, but for a better answer you need to specify more precisely what you want and whether the "costs" of each type of error differ, and if so, how and by how much

    also, please show your results in CODE blocks (as you did for your dataex example) and please show the exact command you used to obtain your estimates (as requested in the FAQ)



    • #3
      Hi Rich, thanks for the suggestions; I am now looking into them.

      And thanks for the reminder; I have now shown the command and results in code blocks.

      Unfortunately, I tried, but my laptop has too little computing power for the lowess command; it had not generated a graph after 30 minutes.
      Last edited by Lei Jin; 26 Nov 2021, 19:59.



      • #4
        Code:
        ************************ final model *************************************
        
        quietly logit approval income_w dti20 dti20_30 dti30_36 dti36_49 dti50_60  fico680_699 fico700_719 fico720_739 ltv80 ltv80_85 ltv85_90 ltv90_95  origination_2019  refinance minority female age62 lender_top100 shadowbank fintech aus tract_minority_population_percen tract_owner_occupied_units tract_one_to_four_family_homes tract_median_age_of_housing_unit cra fhfa_index
        
        *pearson's chi2, hosmer and lemeshow's test, and percent correctly predicted
        
        estat gof
        lfit, group(10) table
        estat classification



        • #5
          Code:
          . estat gof
          
          Goodness-of-fit test after logistic model
          Variable: approval
          
                Number of observations = 2,491,476
          Number of covariate patterns = 1,636,678
                 Pearson chi2(1636649) =  2.48e+06
                           Prob > chi2 =    0.0000
          
          . lfit, group(10) table
          note: obs collapsed on 10 quantiles of estimated probabilities.
          
          Goodness-of-fit test after logistic model
          Variable: approval
          
            Table collapsed on quantiles of estimated probabilities
            +-----------------------------------------------------------------+
            | Group |   Prob |  Obs_1 |    Exp_1 |  Obs_0 |    Exp_0 |  Total |
            |-------+--------+--------+----------+--------+----------+--------|
            |     1 | 0.6193 |  77335 |  77379.4 | 171813 | 171768.6 | 249148 |
            |     2 | 0.7471 | 170945 | 172131.2 |  78204 |  77017.8 | 249149 |
            |     3 | 0.8465 | 200800 | 198841.8 |  48346 |  50304.2 | 249146 |
            |     4 | 0.8855 | 216880 | 216712.1 |  32268 |  32435.9 | 249148 |
            |     5 | 0.9037 | 223495 | 223061.9 |  25652 |  26085.1 | 249147 |
            |-------+--------+--------+----------+--------+----------+--------|
            |     6 | 0.9166 | 227089 | 226821.4 |  22059 |  22326.6 | 249148 |
            |     7 | 0.9275 | 229835 | 229762.4 |  19314 |  19386.6 | 249149 |
            |     8 | 0.9378 | 232215 | 232373.8 |  16932 |  16773.2 | 249147 |
            |     9 | 0.9492 | 234556 | 235019.0 |  14591 |  14128.0 | 249147 |
            |    10 | 0.9900 | 237511 | 238557.9 |  11636 |  10589.1 | 249147 |
            +-----------------------------------------------------------------+
          
           Number of observations = 2,491,476
                 Number of groups =        10
          Hosmer–Lemeshow chi2(8) =    260.64
                      Prob > chi2 =    0.0000
          
          . estat classification
          
          Logistic model for approval
          
                        -------- True --------
          Classified |         D            ~D  |      Total
          -----------+--------------------------+-----------
               +     |   2018323        302736  |    2321059
               -     |     32338        138079  |     170417
          -----------+--------------------------+-----------
             Total   |   2050661        440815  |    2491476
          
          Classified + if predicted Pr(D) >= .5
          True D defined as approval != 0
          --------------------------------------------------
          Sensitivity                     Pr( +| D)   98.42%
          Specificity                     Pr( -|~D)   31.32%
          Positive predictive value       Pr( D| +)   86.96%
          Negative predictive value       Pr(~D| -)   81.02%
          --------------------------------------------------
          False + rate for true ~D        Pr( +|~D)   68.68%
          False - rate for true D         Pr( -| D)    1.58%
          False + rate for classified +   Pr(~D| +)   13.04%
          False - rate for classified -   Pr( D| -)   18.98%
          --------------------------------------------------
          Correctly classified                        86.55%
          --------------------------------------------------



          • #6
            first, for a data set of that size, I would expect -lowess- to take at least a couple of hours regardless of how fast your machine is; however, you can often get what you need by taking a random sample of the data and using lowess on that; see
            Code:
            help sample
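
            for example, something like this (a sketch; the 1% sampling rate and the name phat are placeholders):
            Code:
            predict phat, pr
            preserve
            set seed 12345     // for reproducibility
            sample 1           // keep a 1% random sample (roughly 25,000 obs)
            lowess approval phat
            restore
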
            given the imbalance of your +/- counts, I doubt very much that using the default cutoff of .5 is particularly useful; while there are ad hoc methods of choosing a cutoff (e.g., to match the data or based on the constant from a logistic model), it is clearly best to use substantive knowledge for this

            there is literature on how many groups to use for an H-L test, as the power grows very fast with N; however, even this literature is not very helpful for an N of almost 2.5 million

            you might want to use -brier- but carefully read the documentation to see which, if any, of the versions of calibration in that are useful in your situation; see
            Code:
            h brier
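
            a minimal sketch of its use (again assuming predicted probabilities in a placeholder variable phat):
            Code:
            predict phat, pr
            brier approval phat, group(10)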



            • #7
              Thank you for the new suggestions, Rich; I think I need to spend a bit of time looking into them.



              • #8
                A philosophical point: I think there is a tendency to fetishize goodness-of-fit tests. They work very poorly in the context of very large data sets, where tiny departures from perfect fit become "highly statistically significant." In fact, even something as trivial as the fit of a logistic model to data that is truly generated by a probit model, the two models being nearly identical except in the far tails, can appear extremely "significant" under any goodness-of-fit test when you have a sample size even in the tens of thousands, let alone the millions.

                You yourself, looking at the table of observed and expected values you showed in #1, expressed surprise because the fit looks good. Believe your eyes, not the "statistical significance" tests. Look again at that table and ask yourself: is there any conceivable real-world, practical situation in which this degree of fit would be insufficient? I don't think there is.

                Ignore the test statistic: just look at how close the observed and expected values are, and rely on practical judgment about whether the fit is good enough for the purposes at hand.



                • #9
                  Thank you, Clyde. Before you posted #8, I had already read your old postings concerning GOF in samples with large N, and those postings actually solved my problem.

                  So I will quickly summarize what I did to test the GOF of a logit model with large N:

                  Part 1: test the calibration

                  1) I first use the HL test - estat gof, group(10) table - to generate the HL table and visually inspect the agreement between the observed and expected frequencies. If they are close enough, we can be confident about the GOF of the model. Nevertheless, "how close is enough" depends on the research topic.

                  2) use - calibrationbelt - to generate the confidence belt and compare it with the diagonal line. If the belt overlaps the diagonal appropriately, the model is well calibrated. Once again, how much overlap is "appropriate" depends on the research topic. In my case, I got a belt that almost coincided with the diagonal line, which I regarded as strong evidence of good calibration (a sketch of this step follows at the end of Part 1).

                  3) use - estat class - to generate a classification table, which gives the "overall" percent correctly predicted along with the percent correctly predicted for outcome 0 and for outcome 1. To get a better trade-off between the outcome-0 and outcome-1 percentages, we need to change the default cutoff point in - estat class - to a more meaningful value. The new cutoff can be found through - lsens -, which plots both sensitivity and specificity versus the probability cutoff c; the cutoff at the point where the sensitivity and specificity curves cross is often the one we need. In my case, after resetting the cutoff point, I got sensitivity and specificity both around 70%, so I am satisfied with the model's calibration (a sketch of this step also follows below).
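
                  A sketch of step 2): note that - calibrationbelt - is a community-contributed command (Nattino et al., Stata Journal), so check its help file for the exact installation route and options; the devel() value below is an assumption:
                  Code:
                  * locate and install the command:
                  search calibrationbelt
                  * belt for a model evaluated on its own development data:
                  predict phat, pr
                  calibrationbelt approval phat, devel("internal")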
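
                  And a sketch of step 3) (the 0.85 cutoff is purely illustrative; read the actual crossing point off your own - lsens - graph):
                  Code:
                  * sensitivity and specificity plotted against the probability cutoff
                  lsens
                  * redo the classification table at the cutoff where the curves cross
                  estat classification, cutoff(0.85)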

                  Part 2: test the discrimination

                  1) use - lroc - to generate the ROC curve, that is, a graph of sensitivity versus one minus specificity as the cutoff c is varied, and to calculate the area under it (the AUC). A model with no predictive power would give a 45-degree line; the greater the predictive power, the more bowed the curve. An AUC above 0.70 is a crude indicator of adequate discrimination. In my case, I got AUC = 0.79, so I concluded the model discriminates well.
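
                  For reference (- lroc - leaves the area in r(area)):
                  Code:
                  lroc                      // ROC curve and the area under it
                  display "AUC = " r(area)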

                  Part 3: test the multicollinearity

                  1) use - collin - to get the variance inflation factors (VIFs) of the covariates. As a rule of thumb, a tolerance of 0.1 or less (equivalently, a VIF of 10 or greater) is a cause for concern. In my case, most of the variables have VIFs well below 10, except for two variables that are marginal, with VIF = 11.
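
                  A sketch of that step; - collin - is community-contributed, and the variable list below is abbreviated (pass the full covariate list from #4):
                  Code:
                  * locate and install the command:
                  findit collin
                  * VIFs for the covariates (abbreviated list)
                  collin income_w dti20 minority female age62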

                  Putting this together, I regard my model's GOF as sound.


                  In addition:

                  The above procedure is basically summarized from Clyde's postings; the links are below, and I think they can be pretty helpful:

                  1. Hosmer Lemeshow test for large data:
                  https://www.statalist.org/forums/for...for-large-data

                  2. How to interpret goodness of fit for multivariate logistic regression model:
                  https://www.statalist.org/forums/for...gression-model
                  Last edited by Lei Jin; 29 Nov 2021, 04:54.



                  • #10
                    my "3-part" procedure is slightly different - collinearity is rarely an issue of concern, and with a data set of your size I would say it is not an issue at all; however, discrimination (your Part 2) and calibration (your Part 1) are not quite sufficient, as I think some form of validation is also desirable; if you have, or have access to, external data, then external validation is best; otherwise, I suggest using the bootstrap for internal validation; see, e.g.,

                    Harrell, FE, Jr. (2015), Regression modeling strategies, second edition, Springer

                    Steyerberg, EW (2019), Clinical Prediction models, second edition, Springer
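
                    to make that concrete, here is a minimal sketch of Harrell-style bootstrap internal validation of the AUC; the program name, seed, replication count, and abbreviated covariate list are all placeholders, -simulate- replaces the data in memory with its results (so save the dataset first), and with ~2.5 million observations you may want to run this on a subsample:
                    Code:
                    capture program drop optimism
                    program define optimism, rclass
                        preserve
                        bsample                          // resample observations with replacement
                        * refit the model on the bootstrap sample (use the full covariate list from #4)
                        quietly logit approval income_w dti20 minority female age62
                        quietly predict phat_b, pr
                        quietly roctab approval phat_b   // apparent AUC in the bootstrap sample
                        local auc_app = r(area)
                        restore                          // original data back; bootstrap estimates stay active
                        quietly predict phat_o, pr       // bootstrap model scored on the original data
                        quietly roctab approval phat_o   // test AUC on the original data
                        return scalar optim = `auc_app' - r(area)
                        drop phat_o
                    end
                    simulate optim=r(optim), reps(100) seed(2021): optimism
                    summarize optim   // mean = estimated optimism; subtract it from the apparent AUC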



                    • #11
                      Hey Rich, thanks for the new materials; I will report back if I make any progress.

