  • Time-varying logistic regression postestimation problems

    Hello!

    I'm writing my master's thesis on the influence of macroeconomic conditions on company bankruptcies in Europe. The dependent variable equals 1 if the company went bankrupt and 0 otherwise. The dataset consists of about 60,000 firm-year observations (panel data).
    I've estimated a logit model with 6 variables, all of them with p-values below 0.05, and a pseudo-R2 of around 0.40. When I run a ROC analysis, I get an AUC of 0.98, which seems good.
    On the other hand, when I perform the Hosmer-Lemeshow test, the model does not seem to fit.
    Am I misinterpreting the statistics? Is this a valid model to use?
    I'm sorry for this question; I'm a Stata newbie.

    The outputs are below:

    Code:
    . logit status V1 V4 V16 V19 V23 V24
    
    Iteration 0:   log likelihood = -2697.1256  
    Iteration 1:   log likelihood =  -2567.349  
    Iteration 2:   log likelihood = -2058.2462  (backed up)
    Iteration 3:   log likelihood = -1741.7657  
    Iteration 4:   log likelihood = -1669.0762  
    Iteration 5:   log likelihood = -1637.6074  
    Iteration 6:   log likelihood = -1632.0543  
    Iteration 7:   log likelihood = -1631.2841  
    Iteration 8:   log likelihood =  -1623.069  
    Iteration 9:   log likelihood = -1620.5694  
    Iteration 10:  log likelihood =  -1620.115  
    Iteration 11:  log likelihood = -1620.1138  
    Iteration 12:  log likelihood = -1620.1138  
    
    Logistic regression                             Number of obs     =     52,876
                                                    LR chi2(6)        =    2154.02
                                                    Prob > chi2       =     0.0000
    Log likelihood = -1620.1138                     Pseudo R2         =     0.3993
    
    ------------------------------------------------------------------------------
          status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              V1 |   .0862743   .0220024     3.92   0.000     .0431504    .1293982
              V4 |  -5.499801   .1735809   -31.68   0.000    -5.840013   -5.159589
             V16 |   .1206167   .0415667     2.90   0.004     .0391474     .202086
             V19 |  -.6504074   .1139776    -5.71   0.000    -.8737993   -.4270155
             V23 |   1.833306   .2800535     6.55   0.000     1.284411    2.382201
             V24 |   3.021288   .3109562     9.72   0.000     2.411826    3.630751
           _cons |  -5.772843   .5186897   -11.13   0.000    -6.789457    -4.75623
    ------------------------------------------------------------------------------
    Note: 86 failures and 0 successes completely determined.
    Hosmer-Lemeshow test:

    Code:
    Logistic model for status, goodness-of-fit test
    
      (Table collapsed on quantiles of estimated probabilities)
    
           number of observations =     52876
                 number of groups =        10
          Hosmer-Lemeshow chi2(8) =        88.71
                      Prob > chi2 =         0.0000
    Best regards,
    Rodrigo

  • #2
    You should ignore the p-value of the Hosmer-Lemeshow procedure when you have a very large sample like this one. The H-L chi square is based on an approximation to the chi square distribution in the first place. On top of that, it is unlikely that in the real world the true data-generating process is an exact match to any logistic regression model. At a sample size of 52,876, the H-L test is far too sensitive and will detect trivially small discrepancies of this nature. I recommend re-running -estat gof- specifying the -table- option. Then just eyeball-compare or, better still, graph the observed and expected numbers in the calibration table you get. Forget about the p-value here.
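
    For example (just a sketch; group(10) matches the grouping shown in your output above):

    Code:
    estat gof, table group(10)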

    • #3
      Thanks for the answer, Clyde.
      When you say graph, do you mean the sensitivity and specificity graph? Or is it something else?
      I've run the classification table with a cutoff value of 0.5, and the observed and expected values were not very promising. Should I use the cutoff point that maximizes both sensitivity and specificity?

      Code:
      Logistic model for status
      
                    -------- True --------
      Classified |         D            ~D  |      Total
      -----------+--------------------------+-----------
           +     |        43            69  |        112
           -     |       429         52335  |      52764
      -----------+--------------------------+-----------
         Total   |       472         52404  |      52876
      
      Classified + if predicted Pr(D) >= .5
      True D defined as status != 0
      --------------------------------------------------
      Sensitivity                     Pr( +| D)    9.11%
      Specificity                     Pr( -|~D)   99.87%
      Positive predictive value       Pr( D| +)   38.39%
      Negative predictive value       Pr(~D| -)   99.19%
      --------------------------------------------------
      False + rate for true ~D        Pr( +|~D)    0.13%
      False - rate for true D         Pr( -| D)   90.89%
      False + rate for classified +   Pr(~D| +)   61.61%
      False - rate for classified -   Pr( D| -)    0.81%
      --------------------------------------------------
      Correctly classified                        99.06%
      --------------------------------------------------

      • #4
        If you look at #2 in https://www.statalist.org/forums/for...libration-plot you will see an example of how to set up a related graphical approach, with a citation discussing the use of lowess for this purpose.
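
        For instance, something along these lines should work (just a sketch; the variable name phat is illustrative):

        Code:
        predict phat if e(sample), pr
        twoway (lowess status phat) (function y = x, range(0 1)), ytitle("Observed proportion (lowess)") xtitle("Predicted probability")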

        • #5
          Neither of these. When you run -estat gof, table- you will get something that looks like this:

          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          .
          . logit foreign mpg headroom trunk
          
          Iteration 0:   log likelihood =  -45.03321  
          Iteration 1:   log likelihood = -38.121878  
          Iteration 2:   log likelihood = -37.866633  
          Iteration 3:   log likelihood = -37.866189  
          Iteration 4:   log likelihood = -37.866189  
          
          Logistic regression                             Number of obs     =         74
                                                          LR chi2(3)        =      14.33
                                                          Prob > chi2       =     0.0025
          Log likelihood = -37.866189                     Pseudo R2         =     0.1591
          
          ------------------------------------------------------------------------------
               foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------+----------------------------------------------------------------
                   mpg |   .1070817   .0578698     1.85   0.064     -.006341    .2205044
              headroom |  -.2653362   .4593021    -0.58   0.563    -1.165552    .6348795
                 trunk |  -.0963487    .100015    -0.96   0.335    -.2923745    .0996772
                 _cons |  -1.222328   2.131202    -0.57   0.566    -5.399408    2.954751
          ------------------------------------------------------------------------------
          
          .
          . estat gof, table group(10)
          
          Logistic model for foreign, goodness-of-fit test
          
            (Table collapsed on quantiles of estimated probabilities)
            +--------------------------------------------------------+
            | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
            |-------+--------+-------+-------+-------+-------+-------|
            |     1 | 0.0883 |     0 |   0.6 |     8 |   7.4 |     8 |
            |     2 | 0.1094 |     0 |   0.7 |     7 |   6.3 |     7 |
            |     3 | 0.1399 |     1 |   1.0 |     7 |   7.0 |     8 |
            |     4 | 0.1956 |     2 |   1.2 |     5 |   5.8 |     7 |
            |     5 | 0.2225 |     1 |   1.5 |     6 |   5.5 |     7 |
            |-------+--------+-------+-------+-------+-------+-------|
            |     6 | 0.3129 |     3 |   2.2 |     5 |   5.8 |     8 |
            |     7 | 0.3995 |     4 |   2.6 |     3 |   4.4 |     7 |
            |     8 | 0.5004 |     5 |   3.6 |     3 |   4.4 |     8 |
            |     9 | 0.6068 |     2 |   4.0 |     5 |   3.0 |     7 |
            |    10 | 0.7728 |     4 |   4.8 |     3 |   2.2 |     7 |
            +--------------------------------------------------------+
          
                 number of observations =        74
                       number of groups =        10
                Hosmer-Lemeshow chi2(8) =         7.58
                            Prob > chi2 =         0.4754
          It is that last table I'm referring to. Transfer Obs_1 and Exp_1 to a data set and then do a scatterplot. It's a good way to get a visual impression of model calibration. And if the calibration is good over some ranges of predicted outcome and poor over other ranges, you'll see that and it will perhaps suggest how to improve the model.

          Actually, with a sample as large as yours, I would probably want to look at a more fine-grained view of calibration. So I'd be more likely to use -estat gof, table group(50)- or an even higher number of groups. With so much data, you can well afford to chop it up into small groups here.
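
          One way to build that observed-versus-expected plot by hand (a sketch only; the groups are recomputed from the predicted probabilities rather than copied from the -estat gof- table, and the variable names are illustrative):

          Code:
          predict phat if e(sample), pr                  // predicted probabilities
          xtile grp = phat if e(sample), nq(50)          // 50 groups of roughly equal size
          preserve
          collapse (sum) obs_1=status exp_1=phat if e(sample), by(grp)
          twoway (scatter obs_1 exp_1) (line exp_1 exp_1, sort), xtitle("Expected events") ytitle("Observed events") legend(off)
          restore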

          I wouldn't pay very much attention to the classification table generated with the default cutoff of 0.5. If you want to generate a classification table, you need to invest some time exploring different cutoffs to find the one(s) that are most useful.
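
          If you do explore cutoffs, -lsens- and -estat classification- make that easy (the cutoff below is purely illustrative):

          Code:
          lsens                                // sensitivity and specificity plotted against the cutoff
          estat classification, cutoff(.01)    // re-tabulate the classification at a lower cutoff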

          • #6
            OK, thank you, I will try to do that. I was worried about that and the "low" pseudo-R2.

            Best regards,
            Rodrigo

            • #7
              Pseudo-R2 measures for logistic regression are not particularly useful. You certainly should not interpret them as if they were analogous to R2 following OLS linear regression. Calibration a la Hosmer-Lemeshow (but ignoring the p-value in very large samples) and discrimination a la the ROC curve are the most important measures to look at for most purposes.
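
              Both are quick to check after -logit- (a sketch; the number of groups is illustrative):

              Code:
              lroc                          // ROC curve and the area under it (discrimination)
              estat gof, table group(20)    // calibration table (ignore the p-value at this sample size)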

              • #8
                Right, now it makes more sense when I try to estimate models with other variables.

                Thanks once again, Clyde.

                • #9
                  Nothing in your code acknowledges the longitudinal nature of the observations. You need a discrete-time survival analysis, not a single logistic regression, and therefore the H-L test is not relevant. See the chapter "Discrete" in the [ST] manual and Lessons 3 and 6 on Stephen Jenkins's fine web page Survival Analysis with Stata, which includes a book draft. As Stephen stated in another post, you'll need to assume that time-varying covariates are constant within a year. If you have endogenous predictors, you run the risk of reverse causation; see Goodliffe, Jay, "The Hazards of Time-Varying Covariates".
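
                  A minimal sketch of what the discrete-time set-up might look like with your data (it assumes each firm is at risk from its first year in the panel until its first default, that firms leave the risk set after defaulting, and that the variable names match your earlier -logit-; the duration terms are purely illustrative):

                  Code:
                  * Discrete-time hazard model on firm-year (person-period) data -- a sketch
                  bysort ID (year): gen duration = _n                    // years since the firm entered the panel
                  cloglog status V1 V4 V16 V19 V23 V24 i.duration, vce(cluster ID)
                  * a plain -logit- with the same duration terms is the logistic-hazard alternative
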
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2

                  • #10
                    Yikes! Steve Samuels is right. I did not notice that your data set is longitudinal.

                    • #11
                      Thanks, but what do you mean by "you'll need to assume that time-varying covariates are constant in a year"? My only variables that are constant within a year are the macroeconomic ones, because for each company the GDP growth takes a single value in each year.
                      Here is an example of the data. (Note: the status variable equals 1 when a company defaults, but in this example there are no 1's.)
                      The "V" variables (financial variables) are the ones with firm-specific values, and the "M" variables (macroeconomic) are the same for all companies within a country (I have companies from 12 countries).

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input double(ID year) str2 country float(V1 V2 V3 M1 M2 M3) double status
                       1 2011 "ES"            .          .          .          .   .      . 0
                       1 2012 "ES"    .02863554   .5933174 .025175553  -.9987649   3  96.94 0
                       1 2013 "ES"    .03829085   .9237983  .02728747 -2.9277506 2.4  99.31 0
                       1 2014 "ES"    .04135316   .9648329  .02840617  -1.705705 1.5 100.83 0
                       1 2015 "ES"    .02440547   .8870857 .031735305  1.3799973 -.2 100.63 0
                       1 2016 "ES"   .034617458   .9899329 .029933535  3.4322534 -.6    100 0
                       2 2011 "DE"            .          .          .          .   .      . 0
                       2 2012 "DE"    .03966032  1.4739264  .04795682    3.71753 2.5   95.5 0
                       2 2013 "DE"   .035081245  1.5049646  .04644886   .6859426 2.1   97.5 0
                       2 2014 "DE"    .03918796  1.5048126  .05984759   .5997086 1.6   99.1 0
                       2 2015 "DE"  -.004929194   1.452039  .05702682  1.9266936  .8   99.9 0
                       2 2016 "DE"    .04340171     1.5898  .05638583   1.504274  .1    100 0
                       3 2011 "FR"            .          .          .          .   .      . 0
                       3 2012 "FR"    .03934901  1.0926318  .16431323  2.1021771 2.3   96.2 0
                       3 2013 "FR"    .03923682  1.1469225   .1553398  .22519706 2.2  98.33 0
                       3 2014 "FR"    .04699147   1.200281  .14004377   .6115883   1  99.31 0
                       3 2015 "FR"   .034162756  1.1957985  .15219976   .9919784  .6  99.91 0
                       3 2016 "FR"    .03355124  1.1512209  .13675214   .9756857  .1    100 0
                       4 2011 "FR"            .          .          .          .   .      . 0
                       4 2012 "FR"     .0622653  1.5035735  .15396953  2.1021771 2.3   96.2 0
                       4 2013 "FR"    .05765766   1.565886  .15335463  .22519706 2.2  98.33 0
                       4 2014 "FR"    .06133559   1.553956  .15195884   .6115883   1  99.31 0
                       4 2015 "FR"    .04537582  1.4312435   .1303905   .9919784  .6  99.91 0
                       4 2016 "FR"    .04384987  1.4814814  .14096916   .9756857  .1    100 0
                       5 2011 "NL"            .          .          .          .   .      . 0
                       5 2012 "NL"   .021125983          .          .   1.664079 2.5  94.32 0
                       5 2013 "NL"    .01552168          .          . -1.0571523 2.8  96.99 0
                       5 2014 "NL"    .01091033  1.3244977          .  -.1211138 2.6  99.47 0
                       5 2015 "NL"  -.007593446          .          .   1.419018  .3  99.79 0
                       5 2016 "NL" -.0022035143   1.529794          .  2.2602112  .2    100 0
                       6 2011 "IT"            .          .          .          .   .      . 0
                       6 2012 "IT"   .010545292  .02750814   .2510228   .7200381 2.9   95.3 0
                       6 2013 "IT"    .02424003 .010180204  .11123391  -2.851726 3.3   98.4 0
                       6 2014 "IT"   .013097074   .3316552  .07372819  -1.748934 1.2   99.7 0
                       6 2015 "IT"   .031346668  .04007523  .04581777   .1933345  .2   99.9 0
                       6 2016 "IT"   .034093436   .6140327  .04672733   .8752443  .1    100 0
                       7 2011 "IT"            .          .          .          .   .      . 0
                       7 2012 "IT"   .035222474    .449058  .56900704   .7200381 2.9   95.3 0
                       7 2013 "IT"  -.009290695   .4738996   .3978082  -2.851726 3.3   98.4 0
                       7 2014 "IT"    .03664255   .5687359   .5134792  -1.748934 1.2   99.7 0
                       7 2015 "IT"    .04028174    .614176  .45914716   .1933345  .2   99.9 0
                       7 2016 "IT"   .035411738   .6293251   .4476604   .8752443  .1    100 0
                       8 2011 "NL"            .          .          .          .   .      . 0
                       8 2012 "NL"    .04100656          .          .   1.664079 2.5  94.32 0
                       8 2013 "NL"  -.036393993          .          . -1.0571523 2.8  96.99 0
                       8 2014 "NL"     .0392525          .          .  -.1211138 2.6  99.47 0
                       8 2015 "NL"     .0739008  1.9314884          .   1.419018  .3  99.79 0
                       8 2016 "NL"    .05183174          .          .  2.2602112  .2    100 0
                       9 2011 "ES"            .          .          .          .   .      . 0
                       9 2012 "ES"    .07523796   .3767411 .028908575  -.9987649   3  96.94 0
                       9 2013 "ES"    .05408113   .3301124  .02156817 -2.9277506 2.4  99.31 0
                       9 2014 "ES"    .07539804  .26931778 .017943764  -1.705705 1.5 100.83 0
                       9 2015 "ES"    .04315431  .26215613 .016879762  1.3799973 -.2 100.63 0
                       9 2016 "ES"    .04478771  .28575364  .03685077  3.4322534 -.6    100 0
                      10 2011 "ES"            .          .          .          .   .      . 0
                      10 2012 "ES"     .1186191   1.199332 .030447204  -.9987649   3  96.94 0
                      10 2013 "ES"     .1012679  1.2170254 .027773524 -2.9277506 2.4  99.31 0
                      10 2014 "ES"    .09839948  1.2186548   .0293712  -1.705705 1.5 100.83 0
                      10 2015 "ES"    .09633794  1.1899049 .022572866  1.3799973 -.2 100.63 0
                      10 2016 "ES"    .10917173  1.3461415  .02717553  3.4322534 -.6    100 0
                      11 2011 "AT"            .          .          .          .   .      . 0
                      11 2012 "AT"  -.028465595  1.0316819  .06561013  2.9350016 3.6  93.35 0
                      11 2013 "AT"   .016308697  1.1232989  .06168289   .6616123 2.6  95.75 0
                      11 2014 "AT"    .02310198  1.1731714  .06961647 .007276793 2.1  97.77 0
                      11 2015 "AT"    .00724687   1.402179  .07553788   .9232553 1.5   99.2 0
                      11 2016 "AT"   .018028235   1.362713 .070598714  1.0743308  .8    100 0
                      12 2011 "IT"            .          .          .          .   .      . 0
                      12 2012 "IT"    .05150793  .11172492   .4115693   .7200381 2.9   95.3 0
                      12 2013 "IT"    .05080076  .26522544   .3541926  -2.851726 3.3   98.4 0
                      12 2014 "IT"    .05088236 .012727932   .3242301  -1.748934 1.2   99.7 0
                      12 2015 "IT"     .0490761 .008688791   .3362733   .1933345  .2   99.9 0
                      12 2016 "IT"       .04421 .007728237   .3064964   .8752443  .1    100 0
                      13 2011 "ES"            .          .          .          .   .      . 0
                      13 2012 "ES"   .013215705   .2363464  .16169885  -.9987649   3  96.94 0
                      13 2013 "ES"  -.003061371   .2415128   .2863053 -2.9277506 2.4  99.31 0
                      13 2014 "ES"   .009039336    .232953  .45294535  -1.705705 1.5 100.83 0
                      13 2015 "ES"    .02673267  .24628834   .3787209  1.3799973 -.2 100.63 0
                      13 2016 "ES"   .017238198  .28199103   .2627524  3.4322534 -.6    100 0
                      14 2011 "BE"            .          .          .          .   .      . 0
                      14 2012 "BE"    .03853326   .8138638  .04201384  1.7985336 3.4  95.18 0
                      14 2013 "BE"    .03392126   .7911052  .03968284  .23439406 2.6  97.68 0
                      14 2014 "BE"  -.006871799   .5448933 .034353264   .2007261 1.2   98.9 0
                      14 2015 "BE"     .0522836   .8327809 .031290103  1.3516864  .5  99.38 0
                      14 2016 "BE"    .05320701   .7529722          1  1.4043202  .6    100 0
                      15 2011 "NL"            .          .          .          .   .      . 0
                      15 2012 "NL"    .07565893          .          .   1.664079 2.5  94.32 0
                      15 2013 "NL"      .068795          .          . -1.0571523 2.8  96.99 0
                      15 2014 "NL"    .08078537   .6173055          .  -.1211138 2.6  99.47 0
                      15 2015 "NL"    .10079005          .          .   1.419018  .3  99.79 0
                      15 2016 "NL"     .0786534          .          .  2.2602112  .2    100 0
                      16 2011 "NL"            .          .          .          .   .      . 0
                      16 2012 "NL"    .07576337   .5993228          .   1.664079 2.5  94.32 0
                      16 2013 "NL"    .06890423   .6301915          . -1.0571523 2.8  96.99 0
                      16 2014 "NL"     .1395717   .6173055          .  -.1211138 2.6  99.47 0
                      16 2015 "NL"    .10079005   .5356144          .   1.419018  .3  99.79 0
                      16 2016 "NL"    .07870578   .4905612          .  2.2602112  .2    100 0
                      17 2011 "IT"            .          .          .          .   .      . 0
                      17 2012 "IT"    .07304463   .8516167   .3182524   .7200381 2.9   95.3 0
                      17 2013 "IT"   .035730366   .7921458  .26150808  -2.851726 3.3   98.4 0
                      17 2014 "IT"  -.008525445   .6440989   .2937095  -1.748934 1.2   99.7 0
                      end
                      format %ty year
                      Thanks in advance

                      • #12
                        Stephen meant that even if a predictor changes during a year (and you know the changes), you cannot use that information. The model predicts the probability of failure in a year conditional on what is known at the start. One consequence: if a unique variable is an average of values during a year, you can't use it as a predictor for that year.
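
                        For example, one way to be strict about using only start-of-year information is to lag the firm-level predictors (just a sketch; whether a one-year lag is appropriate depends on when your balance-sheet figures are actually observed, and the variable names follow your original -logit-):

                        Code:
                        xtset ID year                                // declare the firm-year panel so lag operators work
                        bysort ID (year): gen duration = _n          // years since the firm entered the panel
                        cloglog status L.(V1 V4 V16 V19 V23 V24) i.duration, vce(cluster ID)
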
                        Last edited by Steve Samuels; 12 Jan 2018, 14:53.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2

                        • #13
                          On what is known at the start of that year? Or at the beginning of the observation period (in my case, 2007)? Sorry for the ignorance, but I'm really new to Stata.

                          • #14
                            Oh sorry, now I get it. That applies only to predictors that change within a year, right? My predictors take a single value per year because they come from income statements and balance sheets (only one value per year).

                            • #15
                              Correct. Be sure to read Goodliffe's article.
                              Steve Samuels
                              Statistical Consulting
                              [email protected]

                              Stata 14.2
