  • Time-varying logistic regression postestimation problems

    Hello!

    I'm writing my master's thesis on the influence of macroeconomic conditions on company bankruptcies in Europe. The dependent variable equals 1 if the company went bankrupt and 0 otherwise. The dataset consists of about 60,000 firm-year observations (panel data).
    I've estimated a logit model with 6 variables, all of them with p-values below 0.05, and a pseudo-R2 of around 0.40. When I run a ROC analysis, I get an AUC of 0.98, which seems good.
    On the other hand, when I perform the Hosmer-Lemeshow test, the model does not seem to fit.
    Am I misinterpreting the statistics? Is this a valid model to use?
    I'm sorry for this question; I'm a Stata newbie.

    The outputs are below:

    Code:
    . logit status V1 V4 V16 V19 V23 V24
    
    Iteration 0:   log likelihood = -2697.1256  
    Iteration 1:   log likelihood =  -2567.349  
    Iteration 2:   log likelihood = -2058.2462  (backed up)
    Iteration 3:   log likelihood = -1741.7657  
    Iteration 4:   log likelihood = -1669.0762  
    Iteration 5:   log likelihood = -1637.6074  
    Iteration 6:   log likelihood = -1632.0543  
    Iteration 7:   log likelihood = -1631.2841  
    Iteration 8:   log likelihood =  -1623.069  
    Iteration 9:   log likelihood = -1620.5694  
    Iteration 10:  log likelihood =  -1620.115  
    Iteration 11:  log likelihood = -1620.1138  
    Iteration 12:  log likelihood = -1620.1138  
    
    Logistic regression                             Number of obs     =     52,876
                                                    LR chi2(6)        =    2154.02
                                                    Prob > chi2       =     0.0000
    Log likelihood = -1620.1138                     Pseudo R2         =     0.3993
    
    ------------------------------------------------------------------------------
          status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              V1 |   .0862743   .0220024     3.92   0.000     .0431504    .1293982
              V4 |  -5.499801   .1735809   -31.68   0.000    -5.840013   -5.159589
             V16 |   .1206167   .0415667     2.90   0.004     .0391474     .202086
             V19 |  -.6504074   .1139776    -5.71   0.000    -.8737993   -.4270155
             V23 |   1.833306   .2800535     6.55   0.000     1.284411    2.382201
             V24 |   3.021288   .3109562     9.72   0.000     2.411826    3.630751
           _cons |  -5.772843   .5186897   -11.13   0.000    -6.789457    -4.75623
    ------------------------------------------------------------------------------
    Note: 86 failures and 0 successes completely determined.
    Hosmer-Lemeshow test:

    Code:
    Logistic model for status, goodness-of-fit test
    
      (Table collapsed on quantiles of estimated probabilities)
    
           number of observations =     52876
                 number of groups =        10
          Hosmer-Lemeshow chi2(8) =        88.71
                      Prob > chi2 =         0.0000
    Best regards,
    Rodrigo

  • #2
    You should ignore the p-value of the Hosmer-Lemeshow procedure when you have a very large sample like this one. The H-L chi square is based on an approximation to the chi square distribution in the first place. On top of that, it is unlikely that in the real world the true data-generating process is an exact match to any logistic regression model. At a sample size of 52,876, the H-L test is far too sensitive and will detect trivially small discrepancies of this nature. I recommend re-running -estat gof- specifying the -table- option. Then just eyeball-compare or, better still, graph the observed and expected numbers in the calibration table you get. Forget about the p-value here.
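
    For example (just a sketch; group(10) matches the grouping shown in your output above):

    Code:
    estat gof, table group(10)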

    • #3
      Thanks for the answer, Clyde.
      When you say graph, do you mean the sensitivity and specificity graph? Or is it something else?
      I've run the classification table with a cutoff value of 0.5, and the observed and expected values were not very promising. Should I use the cutoff point that maximizes both sensitivity and specificity?

      Code:
      Logistic model for status
      
                    -------- True --------
      Classified |         D            ~D  |      Total
      -----------+--------------------------+-----------
           +     |        43            69  |        112
           -     |       429         52335  |      52764
      -----------+--------------------------+-----------
         Total   |       472         52404  |      52876
      
      Classified + if predicted Pr(D) >= .5
      True D defined as status != 0
      --------------------------------------------------
      Sensitivity                     Pr( +| D)    9.11%
      Specificity                     Pr( -|~D)   99.87%
      Positive predictive value       Pr( D| +)   38.39%
      Negative predictive value       Pr(~D| -)   99.19%
      --------------------------------------------------
      False + rate for true ~D        Pr( +|~D)    0.13%
      False - rate for true D         Pr( -| D)   90.89%
      False + rate for classified +   Pr(~D| +)   61.61%
      False - rate for classified -   Pr( D| -)    0.81%
      --------------------------------------------------
      Correctly classified                        99.06%
      --------------------------------------------------

      • #4
        If you look at #2 in https://www.statalist.org/forums/for...libration-plot you will see an example of how to set up a related graphical approach, with a citation discussing the use of lowess for this purpose.
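
        For instance, something along these lines should work (just a sketch; the variable name phat is illustrative):

        Code:
        predict phat if e(sample), pr
        twoway (lowess status phat) (function y = x, range(0 1)), ytitle("Observed proportion (lowess)") xtitle("Predicted probability")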

        • #5
          Neither of these. When you run -estat gof, table- you will get something that looks like this:

          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          .
          . logit foreign mpg headroom trunk
          
          Iteration 0:   log likelihood =  -45.03321  
          Iteration 1:   log likelihood = -38.121878  
          Iteration 2:   log likelihood = -37.866633  
          Iteration 3:   log likelihood = -37.866189  
          Iteration 4:   log likelihood = -37.866189  
          
          Logistic regression                             Number of obs     =         74
                                                          LR chi2(3)        =      14.33
                                                          Prob > chi2       =     0.0025
          Log likelihood = -37.866189                     Pseudo R2         =     0.1591
          
          ------------------------------------------------------------------------------
               foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                   mpg |   .1070817   .0578698     1.85   0.064     -.006341    .2205044
              headroom |  -.2653362   .4593021    -0.58   0.563    -1.165552    .6348795
                 trunk |  -.0963487    .100015    -0.96   0.335    -.2923745    .0996772
                 _cons |  -1.222328   2.131202    -0.57   0.566    -5.399408    2.954751
          ------------------------------------------------------------------------------
          
          .
          . estat gof, table group(10)
          
          Logistic model for foreign, goodness-of-fit test
          
            (Table collapsed on quantiles of estimated probabilities)
            +--------------------------------------------------------+
            | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
            |-------+--------+-------+-------+-------+-------+-------|
            |     1 | 0.0883 |     0 |   0.6 |     8 |   7.4 |     8 |
            |     2 | 0.1094 |     0 |   0.7 |     7 |   6.3 |     7 |
            |     3 | 0.1399 |     1 |   1.0 |     7 |   7.0 |     8 |
            |     4 | 0.1956 |     2 |   1.2 |     5 |   5.8 |     7 |
            |     5 | 0.2225 |     1 |   1.5 |     6 |   5.5 |     7 |
            |-------+--------+-------+-------+-------+-------+-------|
            |     6 | 0.3129 |     3 |   2.2 |     5 |   5.8 |     8 |
            |     7 | 0.3995 |     4 |   2.6 |     3 |   4.4 |     7 |
            |     8 | 0.5004 |     5 |   3.6 |     3 |   4.4 |     8 |
            |     9 | 0.6068 |     2 |   4.0 |     5 |   3.0 |     7 |
            |    10 | 0.7728 |     4 |   4.8 |     3 |   2.2 |     7 |
            +--------------------------------------------------------+
          
                 number of observations =        74
                       number of groups =        10
                Hosmer-Lemeshow chi2(8) =         7.58
                            Prob > chi2 =         0.4754
          It is that last table I'm referring to. Transfer Obs_1 and Exp_1 to a data set and then do a scatterplot. It's a good way to get a visual impression of model calibration. And if the calibration is good over some ranges of predicted outcome and poor over other ranges, you'll see that and it will perhaps suggest how to improve the model.

          Actually, with a sample as large as yours, I would probably want to look at a more fine-grained view of calibration. So I'd be more likely to use -estat gof, table group(50)- or an even higher number of groups. With so much data, you can well afford to chop it up into small groups here.
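
          One way to build that observed-versus-expected plot by hand (a sketch only; the groups are recomputed from the predicted probabilities rather than copied from the -estat gof- table, and the variable names are illustrative):

          Code:
          predict phat if e(sample), pr                  // predicted probabilities
          xtile grp = phat if e(sample), nq(50)          // 50 groups of roughly equal size
          preserve
          collapse (sum) obs_1=status exp_1=phat if e(sample), by(grp)
          twoway (scatter obs_1 exp_1) (line exp_1 exp_1, sort), xtitle("Expected events") ytitle("Observed events") legend(off)
          restore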

          I wouldn't pay very much attention to the classification table generated with the default cutoff of 0.5. If you want to generate a classification table, you need to invest some time exploring different cutoffs to find the one(s) that are most useful.
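
          If you do explore cutoffs, -lsens- and -estat classification- make that easy (the cutoff below is purely illustrative):

          Code:
          lsens                                // sensitivity and specificity plotted against the cutoff
          estat classification, cutoff(.01)    // re-tabulate the classification at a lower cutoff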

          • #6
            OK, thank you, I will try to do that. I was worried about that and the "low" pseudo-R2.

            Best regards,
            Rodrigo

            • #7
              Pseudo-R2 measures for logistic regression are not particularly useful. You certainly should not interpret them as if they were analogous to R2 following OLS linear regression. Calibration a la Hosmer-Lemeshow (but ignoring the p-value in very large samples) and discrimination a la the ROC curve are the most important measures to look at for most purposes.
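
              Both are quick to check after -logit- (a sketch; the number of groups is illustrative):

              Code:
              lroc                          // ROC curve and the area under it (discrimination)
              estat gof, table group(20)    // calibration table (ignore the p-value at this sample size)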

              • #8
                Right, now it makes more sense when I try to estimate models with other variables.

                Thanks once again, Clyde.

                • #9
                  Nothing in your code acknowledges the longitudinal nature of the observations. You need a discrete-time survival analysis, not a single logistic regression, and therefore the H-L test is not relevant. See the chapter "Discrete" in the [ST] manual and Lessons 3 and 6 on Stephen Jenkins's fine web page Survival Analysis with Stata, which includes a book draft. As Stephen stated in another post, you'll need to assume that time-varying covariates are constant within a year. If you have endogenous predictors, you run the risk of reverse causation; see Goodliffe, Jay, "The Hazards of Time-Varying Covariates".
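
                  A minimal sketch of what the discrete-time set-up might look like with your data (it assumes each firm is at risk from its first year in the panel until its first default, that firms leave the risk set after defaulting, and that the variable names match your earlier -logit-; the duration terms are purely illustrative):

                  Code:
                  * Discrete-time hazard model on firm-year (person-period) data -- a sketch
                  bysort ID (year): gen duration = _n                    // years since the firm entered the panel
                  cloglog status V1 V4 V16 V19 V23 V24 i.duration, vce(cluster ID)
                  * a plain -logit- with the same duration terms is the logistic-hazard alternative
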
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2

                  • #10
                    Yikes! Steve Samuels is right. I did not notice that your data set is longitudinal.

                    • #11
                      Thanks, but what do you mean by "you'll need to assume that time-varying covariates are constant in a year"? My only variables that are constant within a year are the macroeconomic ones, because for each company the GDP growth takes a single value in each year.
                      Here is an example of the data. (Note: the status variable equals 1 when a company defaults, but in this example there are no 1's.)
                      The "V" variables (financial variables) are the ones with firm-specific values, and the "M" variables (macroeconomic) are the same for all companies within a country (I have companies from 12 countries).

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input double(ID year) str2 country float(V1 V2 V3 M1 M2 M3) double status
                       1 2011 "ES"            .          .          .          .   .      . 0
                       1 2012 "ES"    .02863554   .5933174 .025175553  -.9987649   3  96.94 0
                       1 2013 "ES"    .03829085   .9237983  .02728747 -2.9277506 2.4  99.31 0
                       1 2014 "ES"    .04135316   .9648329  .02840617  -1.705705 1.5 100.83 0
                       1 2015 "ES"    .02440547   .8870857 .031735305  1.3799973 -.2 100.63 0
                       1 2016 "ES"   .034617458   .9899329 .029933535  3.4322534 -.6    100 0
                       2 2011 "DE"            .          .          .          .   .      . 0
                       2 2012 "DE"    .03966032  1.4739264  .04795682    3.71753 2.5   95.5 0
                       2 2013 "DE"   .035081245  1.5049646  .04644886   .6859426 2.1   97.5 0
                       2 2014 "DE"    .03918796  1.5048126  .05984759   .5997086 1.6   99.1 0
                       2 2015 "DE"  -.004929194   1.452039  .05702682  1.9266936  .8   99.9 0
                       2 2016 "DE"    .04340171     1.5898  .05638583   1.504274  .1    100 0
                       3 2011 "FR"            .          .          .          .   .      . 0
                       3 2012 "FR"    .03934901  1.0926318  .16431323  2.1021771 2.3   96.2 0
                       3 2013 "FR"    .03923682  1.1469225   .1553398  .22519706 2.2  98.33 0
                       3 2014 "FR"    .04699147   1.200281  .14004377   .6115883   1  99.31 0
                       3 2015 "FR"   .034162756  1.1957985  .15219976   .9919784  .6  99.91 0
                       3 2016 "FR"    .03355124  1.1512209  .13675214   .9756857  .1    100 0
                       4 2011 "FR"            .          .          .          .   .      . 0
                       4 2012 "FR"     .0622653  1.5035735  .15396953  2.1021771 2.3   96.2 0
                       4 2013 "FR"    .05765766   1.565886  .15335463  .22519706 2.2  98.33 0
                       4 2014 "FR"    .06133559   1.553956  .15195884   .6115883   1  99.31 0
                       4 2015 "FR"    .04537582  1.4312435   .1303905   .9919784  .6  99.91 0
                       4 2016 "FR"    .04384987  1.4814814  .14096916   .9756857  .1    100 0
                       5 2011 "NL"            .          .          .          .   .      . 0
                       5 2012 "NL"   .021125983          .          .   1.664079 2.5  94.32 0
                       5 2013 "NL"    .01552168          .          . -1.0571523 2.8  96.99 0
                       5 2014 "NL"    .01091033  1.3244977          .  -.1211138 2.6  99.47 0
                       5 2015 "NL"  -.007593446          .          .   1.419018  .3  99.79 0
                       5 2016 "NL" -.0022035143   1.529794          .  2.2602112  .2    100 0
                       6 2011 "IT"            .          .          .          .   .      . 0
                       6 2012 "IT"   .010545292  .02750814   .2510228   .7200381 2.9   95.3 0
                       6 2013 "IT"    .02424003 .010180204  .11123391  -2.851726 3.3   98.4 0
                       6 2014 "IT"   .013097074   .3316552  .07372819  -1.748934 1.2   99.7 0
                       6 2015 "IT"   .031346668  .04007523  .04581777   .1933345  .2   99.9 0
                       6 2016 "IT"   .034093436   .6140327  .04672733   .8752443  .1    100 0
                       7 2011 "IT"            .          .          .          .   .      . 0
                       7 2012 "IT"   .035222474    .449058  .56900704   .7200381 2.9   95.3 0
                       7 2013 "IT"  -.009290695   .4738996   .3978082  -2.851726 3.3   98.4 0
                       7 2014 "IT"    .03664255   .5687359   .5134792  -1.748934 1.2   99.7 0
                       7 2015 "IT"    .04028174    .614176  .45914716   .1933345  .2   99.9 0
                       7 2016 "IT"   .035411738   .6293251   .4476604   .8752443  .1    100 0
                       8 2011 "NL"            .          .          .          .   .      . 0
                       8 2012 "NL"    .04100656          .          .   1.664079 2.5  94.32 0
                       8 2013 "NL"  -.036393993          .          . -1.0571523 2.8  96.99 0
                       8 2014 "NL"     .0392525          .          .  -.1211138 2.6  99.47 0
                       8 2015 "NL"     .0739008  1.9314884          .   1.419018  .3  99.79 0
                       8 2016 "NL"    .05183174          .          .  2.2602112  .2    100 0
                       9 2011 "ES"            .          .          .          .   .      . 0
                       9 2012 "ES"    .07523796   .3767411 .028908575  -.9987649   3  96.94 0
                       9 2013 "ES"    .05408113   .3301124  .02156817 -2.9277506 2.4  99.31 0
                       9 2014 "ES"    .07539804  .26931778 .017943764  -1.705705 1.5 100.83 0
                       9 2015 "ES"    .04315431  .26215613 .016879762  1.3799973 -.2 100.63 0
                       9 2016 "ES"    .04478771  .28575364  .03685077  3.4322534 -.6    100 0
                      10 2011 "ES"            .          .          .          .   .      . 0
                      10 2012 "ES"     .1186191   1.199332 .030447204  -.9987649   3  96.94 0
                      10 2013 "ES"     .1012679  1.2170254 .027773524 -2.9277506 2.4  99.31 0
                      10 2014 "ES"    .09839948  1.2186548   .0293712  -1.705705 1.5 100.83 0
                      10 2015 "ES"    .09633794  1.1899049 .022572866  1.3799973 -.2 100.63 0
                      10 2016 "ES"    .10917173  1.3461415  .02717553  3.4322534 -.6    100 0
                      11 2011 "AT"            .          .          .          .   .      . 0
                      11 2012 "AT"  -.028465595  1.0316819  .06561013  2.9350016 3.6  93.35 0
                      11 2013 "AT"   .016308697  1.1232989  .06168289   .6616123 2.6  95.75 0
                      11 2014 "AT"    .02310198  1.1731714  .06961647 .007276793 2.1  97.77 0
                      11 2015 "AT"    .00724687   1.402179  .07553788   .9232553 1.5   99.2 0
                      11 2016 "AT"   .018028235   1.362713 .070598714  1.0743308  .8    100 0
                      12 2011 "IT"            .          .          .          .   .      . 0
                      12 2012 "IT"    .05150793  .11172492   .4115693   .7200381 2.9   95.3 0
                      12 2013 "IT"    .05080076  .26522544   .3541926  -2.851726 3.3   98.4 0
                      12 2014 "IT"    .05088236 .012727932   .3242301  -1.748934 1.2   99.7 0
                      12 2015 "IT"     .0490761 .008688791   .3362733   .1933345  .2   99.9 0
                      12 2016 "IT"       .04421 .007728237   .3064964   .8752443  .1    100 0
                      13 2011 "ES"            .          .          .          .   .      . 0
                      13 2012 "ES"   .013215705   .2363464  .16169885  -.9987649   3  96.94 0
                      13 2013 "ES"  -.003061371   .2415128   .2863053 -2.9277506 2.4  99.31 0
                      13 2014 "ES"   .009039336    .232953  .45294535  -1.705705 1.5 100.83 0
                      13 2015 "ES"    .02673267  .24628834   .3787209  1.3799973 -.2 100.63 0
                      13 2016 "ES"   .017238198  .28199103   .2627524  3.4322534 -.6    100 0
                      14 2011 "BE"            .          .          .          .   .      . 0
                      14 2012 "BE"    .03853326   .8138638  .04201384  1.7985336 3.4  95.18 0
                      14 2013 "BE"    .03392126   .7911052  .03968284  .23439406 2.6  97.68 0
                      14 2014 "BE"  -.006871799   .5448933 .034353264   .2007261 1.2   98.9 0
                      14 2015 "BE"     .0522836   .8327809 .031290103  1.3516864  .5  99.38 0
                      14 2016 "BE"    .05320701   .7529722          1  1.4043202  .6    100 0
                      15 2011 "NL"            .          .          .          .   .      . 0
                      15 2012 "NL"    .07565893          .          .   1.664079 2.5  94.32 0
                      15 2013 "NL"      .068795          .          . -1.0571523 2.8  96.99 0
                      15 2014 "NL"    .08078537   .6173055          .  -.1211138 2.6  99.47 0
                      15 2015 "NL"    .10079005          .          .   1.419018  .3  99.79 0
                      15 2016 "NL"     .0786534          .          .  2.2602112  .2    100 0
                      16 2011 "NL"            .          .          .          .   .      . 0
                      16 2012 "NL"    .07576337   .5993228          .   1.664079 2.5  94.32 0
                      16 2013 "NL"    .06890423   .6301915          . -1.0571523 2.8  96.99 0
                      16 2014 "NL"     .1395717   .6173055          .  -.1211138 2.6  99.47 0
                      16 2015 "NL"    .10079005   .5356144          .   1.419018  .3  99.79 0
                      16 2016 "NL"    .07870578   .4905612          .  2.2602112  .2    100 0
                      17 2011 "IT"            .          .          .          .   .      . 0
                      17 2012 "IT"    .07304463   .8516167   .3182524   .7200381 2.9   95.3 0
                      17 2013 "IT"   .035730366   .7921458  .26150808  -2.851726 3.3   98.4 0
                      17 2014 "IT"  -.008525445   .6440989   .2937095  -1.748934 1.2   99.7 0
                      end
                      format %ty year
                      Thanks in advance

                      • #12
                        Stephen meant that even if a predictor changes during a year (and you know the changes), you cannot use that information. The model predicts the probability of failure in a year conditional on what is known at the start. One consequence: if a unique variable is an average of values during a year, you can't use it as a predictor for that year.
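
                        For example, one way to be strict about using only start-of-year information is to lag the firm-level predictors (just a sketch; whether a one-year lag is appropriate depends on when your balance-sheet figures are actually observed, and the variable names follow your original -logit-):

                        Code:
                        xtset ID year                                // declare the firm-year panel so lag operators work
                        bysort ID (year): gen duration = _n          // years since the firm entered the panel
                        cloglog status L.(V1 V4 V16 V19 V23 V24) i.duration, vce(cluster ID)
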
                        Last edited by Steve Samuels; 12 Jan 2018, 14:53.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2

                        • #13
                          On what is known at the start of that year? Or at the beginning of the observation period (in my case, 2007)? Sorry for the ignorance, but I'm really new to Stata.

                          • #14
                            Oh sorry, now I get it. That applies only to predictors that change within a year, right? My predictors take a single value per year because they come from income statements and balance sheets (only one value per year).

                            • #15
                              Correct. Be sure to read Goodliffe's article.
                              Steve Samuels
                              Statistical Consulting
                              [email protected]

                              Stata 14.2
