Binary Probit Regression with Panel Data

Katharina Maier

Join Date: May 2019
Posts: 29

01 Jul 2019, 12:38

Hi Statalisters,

I am a novice Stata user. I'm working with Stata 14 on Windows 7.

I'm working on a panel data set of all commercial banks in the U.S. for the period 1995 - 2018 (time variable), so I have data at the bank-year level. I created the ID variable from the variables bank name and cert. I have already calculated four bank risk proxies at the bank-year level: Z-Score, NPA (non-performing assets), LLP (loan loss provisions) and LLR (loan loss reserves).
Having calculated the risk proxy Z_score, I would like to run a binary probability model explaining the occurrence of a bank failure (Failure = 1, Active = 0) with the risk proxy (lagged by one year).
I used this command to code my binary outcome variable as 1 for "Failure" and 0 for "Active".

Code:

merge m:1 cert using `dataset1', assert(match master)

    Result                           # of obs.
    -----------------------------------------
    not matched                       674,977
        from master                   674,977  (_merge==1)
        from using                          0  (_merge==2)

    matched                            23,238  (_merge==3)
    -----------------------------------------

. gen byte status = (_merge == 3)

. label define status  0     "Active"    1    "Failed"

. label values status status

This is the dataset with 172 431 observations:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year status Z_score)
  7 1995 1 -1.4038005
 10 1995 0  -1.434213
 11 1995 0 -1.5771302
 14 1995 0  -1.758422
 16 1995 0 -1.4295077
 21 1995 0  -1.329172
 27 1995 0 -1.3730284
 32 1995 0 -1.3627455
 34 1995 0  -1.908463
 38 1995 0 -1.8048723
 41 1995 0 -1.5905398
 46 1995 0  -1.533159
 47 1995 0  -1.663955
 48 1995 0  -1.518417
 52 1995 0 -1.3701818
 53 1995 0  -1.485621
 56 1995 0   -1.63249
 59 1995 0  -1.476241
 76 1995 0 -1.3577138
 82 1995 0 -1.3661845
 84 1995 0 -1.3885205
 85 1995 0 -1.5949416
 87 1995 0 -2.0597448
 99 1995 0 -2.2821965
101 1995 0 -1.5937258
104 1995 0 -1.6237373
end
format %ty year

I have panel data, so I started with these commands to run the probit regression. [I forgot to add vce(cluster id), and I think cformat(%09.0g) pformat(%05.0g) sformat(%08.0g) is irrelevant.]

Code:

xtset id year, yearly
       panel variable:  id (unbalanced)
        time variable:  year, 1995 to 2018, but with gaps
                delta:  1 year

The binary probability model explains the occurrence of a bank failure (Failure = 1, Active = 0) with the Z_score (lagged by one year).

Code:

xtprobit status Z_score L.year, re

This is the regression result:

Code:

 Fitting comparison model:

Iteration 0:   log likelihood = -22279.067  
Iteration 1:   log likelihood = -20346.952  
Iteration 2:   log likelihood = -20275.616  
Iteration 3:   log likelihood = -20275.347  
Iteration 4:   log likelihood = -20275.347  

Fitting full model:

rho =  0.0     log likelihood = -20275.347
rho =  0.1     log likelihood = -14462.788
rho =  0.2     log likelihood = -12201.376
rho =  0.3     log likelihood = -10943.278
rho =  0.4     log likelihood = -10144.299
rho =  0.5     log likelihood = -9623.8447
rho =  0.6     log likelihood = -9287.5278
rho =  0.7     log likelihood = -9127.5396
rho =  0.8     log likelihood = -9205.7192

Iteration 0:   log likelihood = -9047.5173  
Iteration 1:   log likelihood = -7511.1642  
Iteration 2:   log likelihood = -4988.8547  
Iteration 3:   log likelihood = -4534.5219  
Iteration 4:   log likelihood = -3701.1825  
Iteration 5:   log likelihood =  -3659.677  (not concave)
Iteration 6:   log likelihood = -3595.4789  
Iteration 7:   log likelihood = -3595.4789  (backed up)
Iteration 8:   log likelihood = -3564.2206  
Iteration 9:   log likelihood = -3557.6508  
Iteration 10:  log likelihood = -3557.6302  
Iteration 11:  log likelihood = -3557.6302  

Random-effects probit regression                Number of obs     =    156,147
Group variable: id                              Number of groups  =     14,692

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =       10.6
                                                              max =         23

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(2)      =     111.93
Log likelihood  = -3557.6302                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
      status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     Z_score |   .8909937   .0877437    10.15   0.000     .7190193    1.062968
             |
        year |
         L1. |  -.0197096   .0049606    -3.97   0.000    -.0294322   -.0099869
             |
       _cons |   34.67511    9.94973     3.49   0.000     15.17399    54.17622
-------------+----------------------------------------------------------------
    /lnsig2u |   2.519077   .0220374                      2.475884    2.562269
-------------+----------------------------------------------------------------
     sigma_u |   3.523794   .0388276                      3.448509    3.600723
         rho |   .9254684   .0015201                      .9224338    .9283934
------------------------------------------------------------------------------
LR test of rho=0: chibar2(01) = 3.3e+04                Prob >= chibar2 = 0.000

Question 1: It took a long time to get the estimation results. I'm working with Stata 14 on Windows 7 and with 172 431 observations, but is there a way to run it more quickly?
Question 2: In my "Guiding Paper" they assess the Z-Score model by its pseudo R2. I know that the "normal" probit regression output shows the pseudo R2, and that there is a way to calculate the pseudo R2 for the xtprobit panel data probit regression. I know that the pseudo R2 is stored in e(r2_p), and I found this calculation: https://www.stata.com/support/faqs/s...ics/r-squared/

Code:

regress weight length
predict weightp if e(sample)
corr weight weightp if e(sample)
di r(rho)^2

Unfortunately, I can't work out how to calculate the pseudo R2 for my xtprobit case.

Concern: The regression results are far from those in my Guiding Paper, and I think I did something wrong ... Maybe with the binary dependent variable status?

Thank you very much for your support!

Last edited by Katharina Maier; 01 Jul 2019, 12:47.

Kye Lippold

Join Date: Jun 2019

Posts: 67
#2

01 Jul 2019, 14:24

A few suggestions:

1. I recommend you review -help tsvarlist- about lagged variables in Stata. In particular, when you include L.year in your model, you are entering in the lagged value of the year variable (so this will be the numbers 1995, 1996, etc). This is not the same as the lagged value of your risk score variable. You want to include L.Z_score instead.
In particular, if I understand your model correctly, you only care about the lagged risk score to predict bank failure (not the current risk score). That means you will want to run the following:

Code:

xtprobit status L.Z_score, re vce(cluster id)

2. xtprobit is generally pretty slow. But including the correct lagged variable could speed it up. If you want initial results very quickly (but with a linear probability model and fixed effects rather than a probit with random effects--so may not be the final results you want, but may be close and could be used to test different specifications), you can use the (very fast) reghdfe command:

Code:

ssc install reghdfe
reghdfe status L.Z_score, a(id) cluster(id)

3. The construction of your binary variable looks fine to me (assuming dataset1 is a list of failed banks, and "cert" uniquely identifies banks).

4. For the pseudo R-squared--the article you linked will not give you what you want (note point 6--a regular R-squared for an OLS model is completely different from a pseudo R-squared for a probit model). Try the procedure at this link instead, where you can compute the pseudo R-squared using the log likelihoods: https://www.stata.com/support/faqs/s...r2-for-probit/.
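
A minimal sketch of that log-likelihood calculation, shown for a pooled probit (an assumption made here for speed; the variable insample and the scalar names are made up for illustration). McFadden's pseudo R-squared is 1 minus the ratio of the full-model log likelihood to the constant-only log likelihood, with both models fit on the same observations:

Code:

* Full model: keep its log likelihood and estimation sample
probit status L.Z_score
scalar ll_full = e(ll)
generate byte insample = e(sample)

* Constant-only model on the same observations
probit status if insample
scalar ll_null = e(ll)

* McFadden's pseudo R-squared
display "Pseudo R2 = " 1 - ll_full/ll_null

After -probit- this should agree with the stored e(r2_p). -xtprobit, re- also stores e(ll), so the same two-step comparison can be tried there, though each fit may be slow.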

Hope that helps!

Katharina Maier

Join Date: May 2019
Posts: 29

02 Jul 2019, 09:27

Thank you for your support!

First I tried your recommended command:

Code:

xtprobit status L.Z_score, re vce(cluster id)

And the result is the following. [I interrupted the command after about 4 hours, and the fact that the iterations are "not concave" is also not a good sign.]

Code:

Fitting comparison model:

Iteration 0:   log pseudolikelihood = -22279.067  
Iteration 1:   log pseudolikelihood = -21891.452  
Iteration 2:   log pseudolikelihood = -21888.313  
Iteration 3:   log pseudolikelihood = -21888.313  

Fitting full model:

rho =  0.0     log pseudolikelihood = -21888.313
rho =  0.1     log pseudolikelihood = -15375.908
rho =  0.2     log pseudolikelihood = -12850.968
rho =  0.3     log pseudolikelihood = -11440.412
rho =  0.4     log pseudolikelihood = -10537.061
rho =  0.5     log pseudolikelihood = -9940.1426
rho =  0.6     log pseudolikelihood = -9550.6066
rho =  0.7     log pseudolikelihood = -9322.6123
rho =  0.8     log pseudolikelihood = -9353.2658

Iteration 0:   log pseudolikelihood = -9241.8189  
Iteration 1:   log pseudolikelihood = -9152.1174  
Iteration 2:   log pseudolikelihood = -8210.2593  (not concave)
Iteration 3:   log pseudolikelihood = -8206.3819  (not concave)
Iteration 4:   log pseudolikelihood = -8197.2562  (not concave)
Iteration 5:   log pseudolikelihood = -8189.6892  (not concave)
Iteration 6:   log pseudolikelihood = -8180.6843  (not concave)
Iteration 7:   log pseudolikelihood = -8159.8362  (not concave)
--Break--
r(1);

As my model is Pr(Failure = 1) = F(x'_{i,t-1} β) and my "Guiding Paper" says "we report marginal effects of probit regressions", I think I have to use the population-averaged (PA) model instead of the random-effects (RE) model. After the literature review I concluded that I would like a model that explains the occurrence of a bank failure for the entire "banking population" and not for a specific bank (e.g. Deutsche Bank).

So I ran the command with the standard errors corrected via vce(robust):

Code:

xtprobit status L.Z_score, pa vce(robust)

And the result was within 2 minutes:

Code:

Iteration 1: tolerance = .86103329
Iteration 2: tolerance = .503125
Iteration 3: tolerance = .29113738
Iteration 4: tolerance = .22620098
Iteration 5: tolerance = .14532103
Iteration 6: tolerance = .10679244
Iteration 7: tolerance = .06929776
Iteration 8: tolerance = .05079473
Iteration 9: tolerance = .03304497
Iteration 10: tolerance = .02339874
Iteration 11: tolerance = .01565962
Iteration 12: tolerance = .01090075
Iteration 13: tolerance = .0073888
Iteration 14: tolerance = .00510128
Iteration 15: tolerance = .00347811
Iteration 16: tolerance = .00239194
Iteration 17: tolerance = .00163531
Iteration 18: tolerance = .00112255
Iteration 19: tolerance = .00076844
Iteration 20: tolerance = .00052703
Iteration 21: tolerance = .000361
Iteration 22: tolerance = .00024749
Iteration 23: tolerance = .00016957
Iteration 24: tolerance = .00011623
Iteration 25: tolerance = .00007964
Iteration 26: tolerance = .00005459
Iteration 27: tolerance = .00003741
Iteration 28: tolerance = .00002564
Iteration 29: tolerance = .00001757
Iteration 30: tolerance = .00001204
Iteration 31: tolerance = 8.251e-06
Iteration 32: tolerance = 5.655e-06
Iteration 33: tolerance = 3.875e-06
Iteration 34: tolerance = 2.656e-06
Iteration 35: tolerance = 1.820e-06
Iteration 36: tolerance = 1.247e-06
Iteration 37: tolerance = 8.548e-07

GEE population-averaged model                   Number of obs     =    156,147
Group variable:                         id      Number of groups  =     14,692
Link:                               probit      Obs per group:
Family:                           binomial                    min =          1
Correlation:                  exchangeable                    avg =       10.6
                                                              max =         23
                                                Wald chi2(1)      =      24.18
Scale parameter:                         1      Prob > chi2       =     0.0000

                                     (Std. Err. adjusted for clustering on id)
------------------------------------------------------------------------------
             |             Semirobust
      status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     Z_score |
         L1. |   .0335066   .0068144     4.92   0.000     .0201506    .0468626
             |
       _cons |  -1.658208   .0214663   -77.25   0.000    -1.700282   -1.616135
------------------------------------------------------------------------------

Does this seem correct to you?
And how do I calculate the Pseudo R squared in this case?

Thank you very much!

Kye Lippold

Join Date: Jun 2019

Posts: 67
#4

03 Jul 2019, 16:38

Using a population-averaged versus random-effects model involves assumptions about the correlation across the observations, so you should confirm which assumptions are appropriate for your case. This article might be helpful: https://www.stata.com/support/faqs/s...tion-averaged/

But leaving aside the pa versus re model difference, your model might be failing to converge because the cluster variable is at the same level as your panel variable. As the Stata manual for xtprobit states,

The panel variable must be nested within the cluster variable because of the within-panel correlation that is generally induced by the random-effects transform when there is heteroskedasticity or within-panel serial correlation in the idiosyncratic errors.

So running

Code:

xtprobit status L.Z_score, re vce(robust)

may converge better, while still using random effects. Worth a try anyway.

For clarification: your guiding paper's statement that they report "marginal effects" is not necessarily related to whether you use a random effects or population averaged probit model. What the "marginal effects" probably means is that they get their estimates from the margins command after running their model.

Code:

xtprobit status L.Z_score, re vce(robust)
margins, dydx(L.Z_score)

i.e. you want to know how a marginal (roughly one-unit) increase in last year's risk score affects the *probability* of failure (which comes from the margins command), not just the coefficient on the lagged risk score in the probit model (which is not easily interpretable).

To get the Pseudo R-squared, just follow the directions at the link I posted above (https://www.stata.com/support/faqs/s...r2-for-probit/). Hint: you will need to run both your main (full) model and the (constant-only) model

Code:

xtprobit status , re vce(robust)

then compare the log likelihoods, as discussed at the link.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2289
#5

03 Jul 2019, 21:26

I have several comments.

1. First, using RE probit is not buying you much over just using pooled probit because you're assuming the heterogeneity is independent of the explanatory variable. Using "robust" or "vce(cluster id)" is admitting the model is wrong, and it has no known robustness properties. So you're admitting your estimators are inconsistent. I'm not opposed to that, but you should know that's what you're doing.
2. Given (1), I would use pooled probit -- which should be the same as the "PA" option in xtprobit. Cluster your standard errors because you are not modeling the serial correlation. Use the margins command to get the magnitude of the effect of Z_score.
3. The pooled probit will run much faster than RE probit.
4. Lagging Z_score is peculiar with RE probit. Any variable must be strictly exogenous in RE probit. Lagging does not make a variable strictly exogenous.
5. You need to control for time with so many periods. The most flexible way to do that is to use i.year, which puts in a full set of time effects.
6. If you want to allow Z_score to be correlated with heterogeneity, you can use a correlated random effects approach, which I discuss in a 2019 Journal of Econometrics paper ("Correlated Random Effects Models for Unbalanced Panels.")
7. Reliance on a pseudo R-squared is just not interesting. Obtain the average marginal effect and determine whether it is practically important. If you want something comparable to the R-squared for linear regression, then compute the square of the correlation between y and the fitted probit probabilities.
8. In addition to probit, you should use a linear model even though you have a binary response. Then you can easily use fixed effects, and even compare with the CRE model mentioned in point (6). The coefficient on Z_score (or its lag, if you insist) should be compared to the average marginal effect.
9. FYI: I talk about virtually all of these issues in a three-day summer school in early June at Michigan State University. It's called ESTIMATE.
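
Points 2, 5, 7 and 8 could be sketched as follows, assuming the data are -xtset id year- as earlier in the thread (phat is an illustrative variable name, not something from the original posts):

Code:

* Pooled probit with year effects and cluster-robust standard errors
probit status Z_score i.year, vce(cluster id)

* Squared correlation between y and the fitted probit probabilities
predict double phat if e(sample), pr
correlate status phat if e(sample)
display "Squared correlation = " r(rho)^2

* Average marginal effect of Z_score
margins, dydx(Z_score)

* Linear probability model with bank fixed effects, for comparison
xtreg status Z_score i.year, fe vce(cluster id)

Replace Z_score with L.Z_score in both models (and in -margins-) to keep the lag.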

Katharina Maier

Join Date: May 2019
Posts: 29

08 Jul 2019, 09:38

Thank you very much for your detailed answers, Kye and Jeff.

I asked my professor (who published my "Guiding Paper", "Bank Risk Proxies and the Crisis of 2007/09: A Comparison") for advice, and he answered:

- He did not use any FE, because using them in a probit/logit regression is always a bit "complicated".
- I should just estimate the model as they did and add a robustness check with year FE.

As my model is still Pr(Failure = 1) = F(x'_{i,t-1} β), I think I still have panel data, but I ran the pooled probit regression for the proxy NPA (non-performing assets):

Code:

probit status L.NPA, vce(cluster id)

Iteration 0:   log pseudolikelihood = -22279.067  
Iteration 1:   log pseudolikelihood =  -21892.33  
Iteration 2:   log pseudolikelihood = -21861.899  
Iteration 3:   log pseudolikelihood = -21861.858  
Iteration 4:   log pseudolikelihood = -21861.858  

Probit regression                               Number of obs     =    156,147
                                                Wald chi2(1)      =     294.85
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -21861.858               Pseudo R2         =     0.0187

                                (Std. Err. adjusted for 14,692 clusters in id)
------------------------------------------------------------------------------
             |               Robust
      status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         NPA |
         L1. |   6.843765   .3985632    17.17   0.000     6.062596    7.624934
             |
       _cons |  -1.945288   .0206756   -94.09   0.000    -1.985811   -1.904764
------------------------------------------------------------------------------

. margins, dydx(L.NPA)

Average marginal effects                        Number of obs     =    156,147
Model VCE    : Robust

Expression   : Pr(status), predict()
dy/dx w.r.t. : L.NPA

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         NPA |
         L1. |   .4862504   .0351063    13.85   0.000     .4174434    .5550574
------------------------------------------------------------------------------

. fitstat

Measures of Fit for probit of status

Log-Lik Intercept Only:   -22279.067     Log-Lik Full Model:       -21861.858
D(156145):                 43723.716     LR(1):                       834.417
                                         Prob > LR:                     0.000
McFadden's R2:                 0.019     McFadden's Adj R2:             0.019
Maximum Likelihood R2:         0.005     Cragg & Uhler's R2:            0.021
McKelvey and Zavoina's R2:     0.017     Efron's R2:                    0.009
Variance of y*:                1.017     Variance of error:             1.000
Count R2:                      0.968     Adj Count R2:                 -0.002
AIC:                           0.280     AIC*n:                     43727.716
BIC:                      -1.824e+06     BIC':                       -822.459

.

I still think I did something wrong, because the pseudo R-squared in my Guiding Paper is 0.4494. [As my prof wrote this paper, I would like to stay with the pseudo R-squared and the lagged proxies.] Besides this, the calculated average marginal effects do not match either ...
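
As a check on the arithmetic, the pseudo R2 and the LR statistic in the output above can be reproduced by hand from the two reported log likelihoods (a small sketch that uses only the numbers shown in this post):

Code:

* McFadden's R2 = 1 - ll_full/ll_intercept_only
display 1 - (-21861.858)/(-22279.067)

* LR(1) = 2*(ll_full - ll_intercept_only)
display 2*(-21861.858 - (-22279.067))

The first line gives about 0.0187 and the second about 834.42, in line with the probit and fitstat output. The reported pseudo R2 is therefore internally consistent, which suggests the gap to the Guiding Paper more likely lies in the data or the specification than in how the statistic is computed.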

I am very thankful for every support!

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#7

08 Jul 2019, 10:38

Katharina:
I'm not sure whether what follows will be helpful, but your regression code seems to lack -i.year- among your set of predictors (i.e., the "Year FE" recommended by your professor).

Kind regards,
Carlo
(Stata 19.0)

Katharina Maier

Join Date: May 2019
Posts: 29

09 Jul 2019, 03:15

Thank you, Carlo, for your advice. So I ran this:

Code:

xtset id year
       panel variable:  id (unbalanced)
        time variable:  year, 1995 to 2018, but with gaps
                delta:  1 unit

. probit status L.NPA i.year, vce(cluster id)

note: 2017.year != 0 predicts failure perfectly
      2017.year dropped and 4848 obs not used

note: 2018.year != 0 predicts failure perfectly
      2018.year dropped and 4636 obs not used

Iteration 0:   log pseudolikelihood = -21957.554  
Iteration 1:   log pseudolikelihood = -20335.612  
Iteration 2:   log pseudolikelihood = -20093.873  
Iteration 3:   log pseudolikelihood = -20081.849  
Iteration 4:   log pseudolikelihood = -20081.524  
Iteration 5:   log pseudolikelihood = -20081.523  

Probit regression                               Number of obs     =    146,663
                                                Wald chi2(21)     =     845.63
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -20081.523               Pseudo R2         =     0.0854

                                (Std. Err. adjusted for 14,508 clusters in id)
------------------------------------------------------------------------------
             |               Robust
      status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         NPA |
         L1. |   12.62209   .5712653    22.09   0.000     11.50243    13.74175
             |
        year |
       1997  |    .033304   .0107776     3.09   0.002     .0121804    .0544276
       1998  |   .0903989   .0155846     5.80   0.000     .0598536    .1209442
       1999  |   .1198117    .019252     6.22   0.000     .0820786    .1575449
       2000  |   .1746079   .0218685     7.98   0.000     .1317465    .2174693
       2001  |   .2197541   .0233866     9.40   0.000     .1739171    .2655911
       2002  |   .2116425   .0248355     8.52   0.000     .1629658    .2603192
       2003  |   .2298964    .026194     8.78   0.000      .178557    .2812357
       2004  |   .2783759   .0271742    10.24   0.000     .2251156    .3316363
       2005  |   .3184363   .0278422    11.44   0.000     .2638667     .373006
       2006  |   .3787239   .0284011    13.33   0.000     .3230588    .4343891
       2007  |   .4003768   .0287657    13.92   0.000     .3439971    .4567566
       2008  |   .3376161   .0289047    11.68   0.000     .2809639    .3942684
       2009  |   .0204806   .0331114     0.62   0.536    -.0444166    .0853777
       2010  |  -.4149216   .0443944    -9.35   0.000    -.5019331   -.3279101
       2011  |  -.8057231   .0615298   -13.09   0.000    -.9263194   -.6851269
       2012  |    -1.0068    .075852   -13.27   0.000    -1.155468   -.8581333
       2013  |  -1.088158   .0888461   -12.25   0.000    -1.262293    -.914023
       2014  |  -1.182951    .102993   -11.49   0.000    -1.384814   -.9810889
       2015  |    -1.2949   .1294782   -10.00   0.000    -1.548673   -1.041128
       2016  |  -1.424314   .1634085    -8.72   0.000    -1.744588   -1.104039
       2017  |          0  (empty)
       2018  |          0  (empty)
             |
       _cons |   -2.03892    .027685   -73.65   0.000    -2.093181   -1.984658
------------------------------------------------------------------------------

Code:

note: 2017.year != 0 predicts failure perfectly
      2017.year dropped and 4848 obs not used

note: 2018.year != 0 predicts failure perfectly
      2018.year dropped and 4636 obs not used

But I am a bit confused by these results.
And the pseudo R2 of 0.0854 is again very low.
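
A possible way to see what the two notes mean, assuming the variables are as in the -dataex- example earlier in the thread: Stata counts status == 0 as a "failure" of the outcome, so "2017.year != 0 predicts failure perfectly" says that every observation from 2017 has status == 0. A cross-tabulation shows this directly, and the second command checks whether status is constant within each bank (it stops with an error if it is not):

Code:

* Years with no status == 1 observations cannot identify a year effect
tabulate year status

* Is status time-invariant within bank?
bysort id (year): assert status == status[1]

If status never changes within a bank, it flags banks that fail at some point rather than the year in which they fail, which may be one reason the results differ from the Guiding Paper.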

So I tried the xtprobit version, and this is the output:

Code:

xtset id year
       panel variable:  id (unbalanced)
        time variable:  year, 1995 to 2018, but with gaps
                delta:  1 unit

. xtprobit status L.NPA i.year, re
note: 2017.year != 0 predicts failure perfectly
      2017.year dropped and 4848 obs not used

note: 2018.year != 0 predicts failure perfectly
      2018.year dropped and 4636 obs not used


Fitting comparison model:

Iteration 0:   log likelihood = -21957.554  
Iteration 1:   log likelihood = -20335.612  
Iteration 2:   log likelihood = -20093.873  
Iteration 3:   log likelihood = -20081.849  
Iteration 4:   log likelihood = -20081.524  
Iteration 5:   log likelihood = -20081.523  

Fitting full model:

rho =  0.0     log likelihood = -20081.523
rho =  0.1     log likelihood =  -14310.04
rho =  0.2     log likelihood = -12050.396
rho =  0.3     log likelihood = -10783.589
rho =  0.4     log likelihood = -9969.6013
rho =  0.5     log likelihood = -9436.7252
rho =  0.6     log likelihood = -9084.9945
rho =  0.7     log likelihood = -8913.9813
rho =  0.8     log likelihood = -8951.9779

Iteration 0:   log likelihood = -8833.5465  
Iteration 1:   log likelihood = -7301.6758  
Iteration 2:   log likelihood = -4813.7926  
Iteration 3:   log likelihood = -4457.4763  (not concave)
Iteration 4:   log likelihood = -4006.4765  
Iteration 5:   log likelihood = -3879.9508  
Iteration 6:   log likelihood = -3740.7192  
Iteration 7:   log likelihood = -3734.2257  
Iteration 8:   log likelihood = -3734.2257  (backed up)
Iteration 9:   log likelihood = -3727.4494  
Iteration 10:  log likelihood = -3727.4133  
Iteration 11:  log likelihood = -3727.4133  

Random-effects probit regression                Number of obs     =    146,663
Group variable: id                              Number of groups  =     14,508

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =       10.1
                                                              max =         21

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(21)     =     128.67
Log likelihood  = -3727.4133                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
      status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         NPA |
         L1. |   13.31323   1.423313     9.35   0.000     10.52358    16.10287
             |
        year |
       1997  |   .0975093   .1429227     0.68   0.495    -.1826141    .3776326
       1998  |   .1948864    .148809     1.31   0.190     -.096774    .4865468
       1999  |    .219814   .1537134     1.43   0.153    -.0814588    .5210868
       2000  |   .2990826   .1547355     1.93   0.053    -.0041934    .6023587
       2001  |   .3571102   .1538511     2.32   0.020     .0555676    .6586527
       2002  |   .3500996   .1551923     2.26   0.024     .0459284    .6542709
       2003  |   .3804756    .154545     2.46   0.014      .077573    .6833783
       2004  |   .4496399   .1528864     2.94   0.003     .1499881    .7492916
       2005  |   .4926398   .1524605     3.23   0.001     .1938227     .791457
       2006  |   .5684893   .1502201     3.78   0.000     .2740635    .8629152
       2007  |    .612208   .1486495     4.12   0.000     .3208603    .9035557
       2008  |    .544607   .1506098     3.62   0.000     .2494172    .8397968
       2009  |   .1756179    .168733     1.04   0.298    -.1550928    .5063285
       2010  |  -.3062257   .2126451    -1.44   0.150    -.7230024    .1105511
       2011  |  -.7539135   .2806535    -2.69   0.007    -1.303984   -.2038427
       2012  |  -.9800521   .3366247    -2.91   0.004    -1.639824   -.3202797
       2013  |  -1.250682   .4133143    -3.03   0.002    -2.060763   -.4406009
       2014  |  -1.601034   .5088541    -3.15   0.002    -2.598369   -.6036978
       2015  |  -2.196771    .541836    -4.05   0.000     -3.25875   -1.134792
       2016  |  -2.637606    .598203    -4.41   0.000    -3.810063    -1.46515
       2017  |          0  (empty)
       2018  |          0  (empty)
             |
       _cons |  -6.003151   .1282317   -46.81   0.000    -6.254481   -5.751822
-------------+----------------------------------------------------------------
    /lnsig2u |   2.327295    .025184                      2.277935    2.376654
-------------+----------------------------------------------------------------
     sigma_u |   3.201589   .0403144                      3.123542    3.281587
         rho |   .9111125   .0020396                      .9070331    .9150297
------------------------------------------------------------------------------
LR test of rho=0: chibar2(01) = 3.3e+04                Prob >= chibar2 = 0.000

With xtprobit I am able to calculate the pseudo R2 by running the constant-only model:

Code:

xtprobit status i.year, re
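
One caveat, sketched under the assumption that the comparison should use identical observations: L.NPA is missing in each bank's first year, so a reduced model run without it can use more observations than the full model, and the two log likelihoods are then not comparable. Restricting the reduced model to the full model's estimation sample avoids this (insample_re and the scalar names are illustrative):

Code:

* Full model: keep its log likelihood and estimation sample
xtprobit status L.NPA i.year, re
scalar ll_full = e(ll)
generate byte insample_re = e(sample)

* Reduced model (year effects only) on the same observations
xtprobit status i.year if insample_re, re
scalar ll_reduced = e(ll)

display "Pseudo R2 = " 1 - ll_full/ll_reduced

Both fits use the slow random-effects estimator, so this may take a while.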

Should I stay with the pooled probit regression -probit status L.NPA i.year, vce(cluster id)- or the -xtprobit status L.NPA i.year, re- ?

Thank you very much!

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#9

09 Jul 2019, 03:41

Katharina:
as per LR test outcome (please, see the -xtprobit- outcome table footnote), I would stay with -xtprobit-.
That said, I would investigate with your professor what makes your pseudo-R2 so different from hers/his.

Kind regards,
Carlo
(Stata 19.0)

Katharina Maier

Join Date: May 2019
Posts: 29

#10

09 Jul 2019, 05:51

Actually, I think so, too. May I ask why the i.year specification dropped the years 2017 and 2018, and what exactly -2018.year != 0 predicts failure perfectly- means?
This is the constant-only (reduced) model:

Code:

xtprobit status i.year, re
note: 2017.year != 0 predicts failure perfectly
      2017.year dropped and 4954 obs not used

note: 2018.year != 0 predicts failure perfectly
      2018.year dropped and 4751 obs not used


Fitting comparison model:

Iteration 0:   log likelihood = -24742.891  
Iteration 1:   log likelihood = -23830.706  
Iteration 2:   log likelihood = -23735.781  
Iteration 3:   log likelihood = -23731.451  
Iteration 4:   log likelihood = -23731.412  
Iteration 5:   log likelihood = -23731.412  

Fitting full model:

rho =  0.0     log likelihood = -23731.412
rho =  0.1     log likelihood = -16491.057
rho =  0.2     log likelihood = -13760.909
rho =  0.3     log likelihood = -12249.202
rho =  0.4     log likelihood = -11285.362
rho =  0.5     log likelihood = -10653.081
rho =  0.6     log likelihood =  -10237.52
rho =  0.7     log likelihood = -10002.933
rho =  0.8     log likelihood =  -10012.32

Iteration 0:   log likelihood = -9924.3625  
Iteration 1:   log likelihood = -8202.4154  
Iteration 2:   log likelihood = -5346.1539  
Iteration 3:   log likelihood = -5184.5764  (not concave)
Iteration 4:   log likelihood = -4453.7311  
Iteration 5:   log likelihood =  -4303.206  
Iteration 6:   log likelihood = -3813.4529  (not concave)
Iteration 7:   log likelihood = -3761.5152  (not concave)
Iteration 8:   log likelihood = -3761.5152  (not concave)
Iteration 9:   log likelihood = -3720.2809  
Iteration 10:  log likelihood = -3677.3811  
Iteration 11:  log likelihood = -3601.6063  
Iteration 12:  log likelihood = -3597.4424  
Iteration 13:  log likelihood = -3597.1533  
Iteration 14:  log likelihood = -3597.1509  
Iteration 15:  log likelihood = -3597.1509  

Random-effects probit regression                Number of obs     =    162,726
Group variable: id                              Number of groups  =     15,988

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =       10.2
                                                              max =         22

Integration method: mvaghermite                 Integration pts.  =         12

                                                Wald chi2(21)     =      55.39
Log likelihood  = -3597.1509                    Prob > chi2       =     0.0001

------------------------------------------------------------------------------
      status |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |
       1996  |   .0894077   .1416024     0.63   0.528    -.1881279    .3669432
       1997  |   .1388191   .1466484     0.95   0.344    -.1486065    .4262447
       1998  |   .1832042   .1488738     1.23   0.218    -.1085831    .4749915
       1999  |   .2206148   .1496072     1.47   0.140    -.0726098    .5138395
       2000  |   .2680973   .1492081     1.80   0.072    -.0243453    .5605398
       2001  |   .2992227   .1491867     2.01   0.045     .0068222    .5916232
       2002  |   .3236388   .1493597     2.17   0.030     .0308993    .6163784
       2003  |   .3589946   .1481358     2.42   0.015     .0686537    .6493355
       2004  |   .4015694    .146465     2.74   0.006     .1145032    .6886356
       2005  |   .4440397   .1443581     3.08   0.002      .161103    .7269764
       2006  |   .5024368   .1418899     3.54   0.000     .2243377    .7805359
       2007  |   .5198905   .1413706     3.68   0.000     .2428093    .7969717
       2008  |   .5155576   .1425503     3.62   0.000     .2361642     .794951
       2009  |   .3996658   .1521478     2.63   0.009     .1014616    .6978699
       2010  |   .2306244   .1720158     1.34   0.180    -.1065204    .5677692
       2011  |   .0320487   .2050353     0.16   0.876     -.369813    .4339105
       2012  |  -.1148814   .2393756    -0.48   0.631     -.584049    .3542862
       2013  |  -.3213102   .2941591    -1.09   0.275    -.8978513     .255231
       2014  |  -.5526248   .3606976    -1.53   0.125    -1.259579    .1543296
       2015  |  -.7842658   .4367953    -1.80   0.073    -1.640369    .0718372
       2016  |  -1.113078   .6059185    -1.84   0.066    -2.300657    .0745002
       2017  |          0  (empty)
       2018  |          0  (empty)
             |
       _cons |  -6.934767   .1032842   -67.14   0.000      -7.1372   -6.732334
-------------+----------------------------------------------------------------
    /lnsig2u |   2.705031    .020681                      2.664497    2.745565
-------------+----------------------------------------------------------------
     sigma_u |   3.867141   .0399882                      3.789554    3.946316
         rho |   .9373229    .001215                      .9348989    .9396624
------------------------------------------------------------------------------
LR test of rho=0: chibar2(01) = 4.0e+04                Prob >= chibar2 = 0.000

And if I calculate the pseudo R2 from the log likelihoods, as described in this link: https://www.stata.com/support/faqs/s...r2-for-probit/

Pseudo R2 = (3597.1509 - 3727.4133) / 3597.1509 = -0.03621

The result is even a negative pseudo R2.

Last edited by Katharina Maier; 09 Jul 2019, 05:59.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#11

09 Jul 2019, 07:09

Katharina:
as reported by Stata, in 2017 and 2018 there's no variation in terms of outcome: all fail. If there's no variation, there's nothing that -(xt)probit- can calculate.
I would also check whether your model is well specified (ie, whether all the predictors and interactions needed to give a fair and true view of the data generating process were actually plugged in).

Kind regards,
Carlo
(Stata 19.0)
Katharina Maier

Join Date: May 2019

Posts: 29
#12

09 Jul 2019, 09:11

as reported by Stata, in 2017 and 2018 there's no variation in terms of outcome: all fail. If there's no variation, there's nothing that -(xt)probit- can calculate.

Ah, true. You can see in my post #1 and in this post https://www.statalist.org/forums/for...ith-panel-data that I merged the "Bank Failure List" (which includes only the failed banks) with the complete dataset (which includes both active and failed banks).
In 2017, 6 banks failed out of 4954 banks (active and failed). Originally my dataset was quarterly; I assumed that Q4 adds up Q1, Q2 and Q3, so I kept only Q4 and dropped all other quarters [the suggestion from my Prof]. So if a bank failed in Q1, Q2 or Q3, its data were dropped as well, and that is why there are no "failed" banks in 2017.
Can I assume that Q4 = "the last recorded quarter" and simply change the date, e.g. from Q2 to Q4, so that Stata recognizes it? Sorry if this is confusing.

For 2018, no failed banks are listed in the Failed Bank List.

as reported by Stata, in 2017 and 2018 there's no variation in terms of outcome: all fail.

In my post #1 you can also see that I wanted status to be 1 for "FAILURE" and 0 for "ACTIVE". This also holds in my data example. But why is Stata saying that all banks failed? Maybe my data labeling is wrong and that's why my pseudo R2 is so low?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(year id status NPA)
1995   7 1   .002305296
1995  10 0    .00155195
1995  11 0   .002352379
1995  14 0   .006836672
1995  16 0  .0011487601
1995  21 0  .0043070414
1995  27 0   .003016669
1995  32 0    .02691917
1995  34 0  .0042499905
1995  38 0  .0019577951
1995  41 0   .011103634
1995  46 0   .005814799
1995  47 0  .0002461478
1995  48 0  .0038119294
1995  52 0   .005574361
1995  53 0  .0034799736
1995  56 0    .01513538
1995  59 0   .008440489
1995  76 0   .003782088
1995  82 0   .002611102
1995  84 0   .004086576
1995  85 0   .002612464
1995  87 0            0
1995  99 0   .018377377
1995 101 0  .0018665423
1995 104 0    .00557228
end
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2289
#13

09 Jul 2019, 11:33

You're getting a negative pseudo R^2 because the estimations use different sets of observations. By lagging the explanatory variable you are losing one complete time period. If you insist on computing the pseudo R^2 then the constant-only model should be estimated on the same set of data as the model with L.NPA. Make sure the same observations are used. If you have missing data on NPA, you should restrict the constant-only model to the same cases as used in the L.NPA regression.
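
A minimal sketch of the same-sample point, using simulated data rather than the bank panel from this thread (the variables x, y and e and all numbers are made up for illustration). It uses pooled -probit- so that the hand-computed value can be compared with the pseudo R2 that -probit- itself reports:

```stata
* Simulated toy panel: 200 units, 5 years (hypothetical, not the bank data)
clear
set seed 12345
set obs 200
gen long id = _n
expand 5
bysort id: gen int year = 2000 + _n
xtset id year

gen double x = rnormal()
gen double e = rnormal()
gen byte y = (e > 0.8)                              // first year: no lag available
replace y = (0.5*L.x + e > 0.8) if !missing(L.x)    // later years depend on lagged x

* Full model: L.x is missing in each unit's first year, so those obs are dropped
probit y L.x i.year
scalar ll_full  = e(ll)
scalar n_full   = e(N)
scalar r2_stata = e(r2_p)

* Constant-only model restricted to the estimation sample of the full model
probit y if e(sample)
scalar ll_0 = e(ll)
scalar n_0  = e(N)

scalar r2_hand = 1 - ll_full/ll_0
display "N full = " n_full "   N constant-only = " n_0
display "pseudo R2 by hand = " r2_hand "   reported by probit = " r2_stata
```

If the second -probit- were run without -if e(sample)-, it would also use the first-year observations, and the two log likelihoods would no longer be comparable.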

And it is far from clear that random effects probit using xtprobit is preferred to pooled probit. RE probit requires that there is no extra serial correlation beyond that accounted for by the unobserved effect. If there is extra serial correlation, RE probit is inconsistent. Pooled probit is consistent for any kind of serial correlation pattern. If you get acceptable results using pooled probit with clustered standard errors you should probably use that. It's what I recommend in my MIT Press book and the short courses I teach.

JW
2 likes
Kye Lippold

Join Date: Jun 2019

Posts: 67
#14

10 Jul 2019, 13:44

Katharina: There seem to be several data issues you should resolve. Your description of how you dealt with banks failing in different quarters sounds off. You probably want your "status" variable to tell you if a bank *ever* failed in a given year. As you mention, if you drop quarters 1-3 and keep only quarter 4, you are missing any banks that failed in quarters 1-3.

Furthermore, it sounds like you are missing the time dimension in your merge (I previously thought that your "cert" variable included time, but I see now that it probably doesn't). That means your status variable is telling you "did the bank ever fail from 1995-2018", not "did the bank fail this year". You likely intended the second definition.

Based on your posts, here is how I would proceed:
1. Get your bank failure list on an annual (rather than quarterly) basis (it isn't clear to me how that data is currently structured). You can use -collapse- to do this if needed.
2. Do any changes to your initial list of banks, including keeping only the 4th quarter (so that all your independent variables are measured as of that date).
3. Then merge in the bank failure list, including both bank ID and year in the match variables.

Code:

merge 1:1 cert year using `dataset1', assert(match master)

Note that I switch from a m:1 to a 1:1 merge, because you should only see each bank once per year in both the list of banks and the list of failed banks. (If you don't, something is off with the data structure).
4. Define status the way you did previously, based on the _merge variable.

If you set up the data this way, your status variable will mean "did this bank ever fail this year" (which I believe is what you want to be measuring), rather than "did this bank ever fail" or "did this bank ever fail in the 4th quarter" (which aren't very meaningful).
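
The steps above can be sketched on toy data as follows. This is only an illustration under assumptions: the layout of the failure list (one row per failure with a quarter variable), the variable names quarter and n_fail, and all values are hypothetical, since the actual structure of the failure list is not shown in this thread.

```stata
* Hypothetical failure list: one row per failure event, with the quarter of failure
clear
input long cert int year byte quarter
101 2009 2
102 2010 4
end
* Step 1: one row per bank-year (failed in any quarter of that year)
collapse (count) n_fail = quarter, by(cert year)
tempfile failures
save `failures'

* Step 2: hypothetical bank-year panel, one row per bank and year
clear
input long cert int year
101 2008
101 2009
102 2009
102 2010
103 2009
103 2010
end

* Step 3: merge on bank ID *and* year; assert() stops with an error if a
* failure row has no matching bank-year in the panel
merge 1:1 cert year using `failures', assert(match master)

* Step 4: status = 1 only in the year the bank failed
gen byte status = (_merge == 3)
label define status 0 "Active" 1 "Failed"
label values status status
drop _merge n_fail
list, sepby(cert)
```

Because of -assert(match master)-, a failed bank whose failure year has no row in the panel (for example, because its quarters were dropped) makes the merge stop with an error instead of passing silently.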

Regarding the "2017.year != 0 predicts failure perfectly" -- note that Stata has no idea that your variable reflects bank "failures", so don't be confused by the use of the word "failure" in the output. What Stata is telling you is that all observations with year = 2017 have status = 0 (in probit terms, a 0 is called a "failure", a 1 is a "success"). So you have zero failed banks in the year 2017, meaning that year's fixed effect can't be computed. When you fix your dependent variable (as above), this problem should go away. (2018 will still be dropped if no banks did indeed fail that year).
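
A quick way to see which years cause this message is to check, year by year, whether the outcome varies at all. The sketch below uses a tiny made-up dataset (the helper variables n_ones and tag are hypothetical names); on the real data only the last three commands would be needed:

```stata
* Toy data: 2017 and 2018 contain only status == 0
clear
input int year byte status
2016 0
2016 1
2016 0
2017 0
2017 0
2018 0
2018 0
end

tabulate year status                      // cross-tab of outcome by year
bysort year: egen n_ones = total(status)  // number of 1s in each year
egen byte tag = tag(year)                 // flag one observation per year
list year n_ones if tag & n_ones == 0     // years with no variation in status
```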

For the pooled versus RE probit: it sounds like you want to match your professor's results. So it seems to me that you should ask your professor if he ran RE probit or pooled probit. You have little chance of matching what he did if you use a different method. (Even if one or the other is a better method in general).

Jeff's response covers the reason for the negative pseudo R^2. Note that the link I posted about the method warns you about this:

Be careful when obtaining the log likelihood for the constant-only model that you fit the model on the same estimation subsample on which you fitted the full model. Remember, Stata drops observations in which variables have missing values and, in the constant-only model, you are not specifying those variables. Probably the safest thing to do is refit the full model and then fit the constant-only model if e(sample).

My previous description of this pseudo-R^2 method was a little vague (I wanted to give you a chance to work through it). But to be more explicit: in code, to use this method I would run

Code:

xtprobit status L.NPA i.year, re
xtprobit status if e(sample), re

i.e. the constant-only model should not have any independent variables in it, and should be estimated on the same sample as the full model. To echo others, I would only do this because you want a pseudo-R^2 statistic to match your professor (being aware that the statistic isn't meaningful).
1 like