
  • Multiple fixed effects for Binary outcomes model with Cross Sectional Data

    Hello Statalist,

    I'm currently analyzing cross-sectional data on individual loans originated by two banks in the country in 2017-2018 (each individual loan appears only once). My goal is to estimate the effect of a relief lending program on loan outcomes (overdue/default).
    Dependent variable Outcome: binary, equal to 1 if the loan is overdue/in default within 2 years of the origination date, and 0 otherwise.
    I would like to add bank fixed effects, time fixed effects, and zip-code fixed effects (note that the time variable is the quarter in which the loan was issued, so it repeats within bank and zip code). Because my data are not a panel, xtset is not applicable. Please help to advise me:
    1. Is there any way to add fixed effects to a logit/probit model when xtset cannot be used? I tried:
    Code:
    xtlogit outcome i.relief loan_control borrower_control i.bank i.zip i.date, vce(cluster bank)
    where relief is a dummy variable indicating whether the loan qualifies for the relief lending program or not. But because there are more than 900 zip codes in my data, Stata showed a "matsize too small" error.
    I also tried:
    Code:
    egen zip_date = group(zip date)
    areg outcome i.relief loan_control borrower_control, absorb(gse zip_date)
    however, absorb() does not allow me to include two variables.
    2. One of the two banks in my sample issues an overwhelming number of loans compared to the other. I wonder if I need to do anything about this imbalance?
    Thank you in advance.

  • #2
    Khloe: Replace xtlogit with logit in your first command and it should be fine.

    JW



    • #3
      Thank you very much Prof. Wooldridge, I really appreciate it. However, as I mentioned, i.zip made the results unable to show because of too many parameters (I have more than 900 zip codes). Please advise me on how I should fix this problem.



      • #4
        egen zip_date = group(zip date)
        areg outcome i.relief loan_control borrower_control, absorb(gse zip_date)
        If you are using areg with a binary outcome, then you are estimating a linear probability model. For linear models, thanks to high-dimensional fixed-effects estimators, there is no issue with having a large number of indicators. Install reghdfe from SSC and absorb all the indicators.

        Code:
        ssc install reghdfe
        egen zip_date = group(zip date)
        reghdfe outcome i.relief loan_control borrower_control, absorb(gse zip_date)
        *or
        reghdfe outcome loan_control borrower_control, absorb(relief gse zip_date)
        Code:
        logit outcome i.relief loan_control borrower_control i.bank i.zip i.date, vce(cluster bank)
        For nonlinear models, e.g., logit, I have yet to see anyone who has programmed an estimator with high-dimensional fixed effects [except for Poisson (poi2hdfe from SSC)]. Therefore, try seeing whether increasing the size limit of your matrix helps. If not, you may be able to estimate the model with Stata MP. See

        Code:
        help limits
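
        For instance, in versions of Stata that still support it (Stata 15 and earlier, with Stata/SE or MP), one could raise the matrix size limit before rerunning the logit. The sketch below reuses the command from #1 and is only illustrative:

        Code:
        * raise the matrix size limit (command removed in Stata 16+; the upper bound depends on flavor)
        set matsize 11000
        logit outcome i.relief loan_control borrower_control i.bank i.zip i.date, vce(cluster bank)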



        • #5
          Dear Andrew Musau,
          Thank you for your kind note.
          Yes, I am aware that I am estimating the LPM when I use areg. My question is whether there is any way to do the same with a logit/probit model. And thank you for your answer; I think I might choose another fixed effect for location instead of zip code (e.g., state level).
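
          A minimal sketch of that coarser location fixed effect, assuming a variable named state exists in the data (the name is illustrative):

          Code:
          * swap the 900+ zip indicators for far fewer state indicators (state is an assumed variable name)
          logit outcome i.relief loan_control borrower_control i.bank i.state i.date, vce(cluster bank)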



          • #6
            However, as I mentioned, the i.zip made the result unable to show due to too many parameters (I have more than 900 zip codes),

            Do you mean that the estimation is successful but you are not able to see some estimates or do you receive a -matsize too small- error? If the former, the fix is simple.



            • #7
              You might try

              Code:
              xtset zip
              xtlogit outcome i.relief i.bank i.year ..., fe
              This accounts for zip-code fixed effects, though it won't get you marginal effects.

              You could use a correlated random effects probit by including within-zip-code averages along with the other variables.
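
              A minimal sketch of that correlated random effects (Mundlak) device, assuming loan_control and borrower_control are the covariates whose within-zip averages get added (all variable names are illustrative):

              Code:
              * within-zip-code means of the covariates (Mundlak device)
              egen zbar_loan = mean(loan_control), by(zip)
              egen zbar_borr = mean(borrower_control), by(zip)
              * pooled probit with the zip-level means added as controls
              probit outcome i.relief loan_control borrower_control zbar_loan zbar_borr i.bank i.date, vce(cluster zip)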



              • #8
                Originally posted by Andrew Musau View Post


                Do you mean that the estimation is successful but you are not able to see some estimates or do you receive a -matsize too small- error? If the former, the fix is simple.
                I got the error of matsize too small.



                • #9
                  Originally posted by Jeff Wooldridge View Post
                  You might try

                  Code:
                  xtset zip
                  xtlogit outcome i.relief i.bank i.year ..., fe
                  This accounts for zip code fixed effects. This won’t get you marginal effects.

                  You could use a correlated random effects probit by including within-zip-code averages along with the other variables.
                  Thank you Prof., I'll try reading about the correlated random effects probit you mentioned.



                  • #10
                    Jeff Wooldridge , xtlogit implements conditional fixed effects. I would guess that the estimates differ from unconditional fixed effects, as the example below shows.

                    Code:
                    webuse grunfeld, clear
                    set seed 08162020
                    gen outcome=runiformint(0,1)
                    logit outcome mvalue kstock i.company
                    xtset company
                    xtlogit outcome mvalue kstock, fe
                    Res.:

                    Code:
                    . logit outcome mvalue kstock i.company
                    
                    Iteration 0:   log likelihood = -138.37933  
                    Iteration 1:   log likelihood = -128.30824  
                    Iteration 2:   log likelihood = -128.27376  
                    Iteration 3:   log likelihood = -128.27376  
                    
                    Logistic regression                             Number of obs     =        200
                                                                    LR chi2(11)       =      20.21
                                                                    Prob > chi2       =     0.0425
                    Log likelihood = -128.27376                     Pseudo R2         =     0.0730
                    
                    ------------------------------------------------------------------------------
                         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                          mvalue |  -.0003398   .0004719    -0.72   0.471    -.0012648    .0005851
                          kstock |    .000985   .0007127     1.38   0.167    -.0004119    .0023818
                                 |
                         company |
                              2  |  -1.778794   1.261069    -1.41   0.158    -4.250444    .6928568
                              3  |  -.5834973   1.254485    -0.47   0.642    -3.042243    1.875249
                              4  |   -.936505   1.739525    -0.54   0.590    -4.345912    2.472902
                              5  |  -2.847785   2.033451    -1.40   0.161    -6.833277    1.137706
                              6  |  -.1653838   1.865359    -0.09   0.929     -3.82142    3.490653
                              7  |  -1.513497   2.007012    -0.75   0.451    -5.447167    2.420174
                              8  |  -.5033782     1.7469    -0.29   0.773     -3.92724    2.920483
                              9  |  -1.031798   1.922085    -0.54   0.591    -4.799016     2.73542
                             10  |  -1.439941   1.998427    -0.72   0.471    -5.356786    2.476903
                                 |
                           _cons |   1.052722   1.976743     0.53   0.594    -2.821622    4.927066
                    ------------------------------------------------------------------------------
                    
                    . 
                    . xtset company
                           panel variable:  company (balanced)
                    
                    . 
                    . xtlogit outcome mvalue kstock, fe
                    note: multiple positive outcomes within groups encountered.
                    
                    Iteration 0:   log likelihood = -112.34032  
                    Iteration 1:   log likelihood = -111.46028  
                    Iteration 2:   log likelihood = -111.45918  
                    Iteration 3:   log likelihood = -111.45918  
                    
                    Conditional fixed-effects logistic regression   Number of obs     =        200
                    Group variable: company                         Number of groups  =         10
                    
                                                                    Obs per group:
                                                                                  min =         20
                                                                                  avg =       20.0
                                                                                  max =         20
                    
                                                                    LR chi2(2)        =       1.96
                    Log likelihood  = -111.45918                    Prob > chi2       =     0.3750
                    
                    ------------------------------------------------------------------------------
                         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                          mvalue |  -.0003223   .0004598    -0.70   0.483    -.0012235     .000579
                          kstock |   .0009351   .0006939     1.35   0.178    -.0004249     .002295
                    ------------------------------------------------------------------------------



                    • #11
                      Good point, Andrew. I guess that takes us to the question of how many observations per zip code. It's probably a lot, in which case putting in zip code dummies is ideal. But xtlogit will at least account for the zip code effects in a way that is computationally less intensive, I think.



                      • #12
                        Thanks Jeff. Here, I create a dataset of 25,000 observations with a varying number of observations per zip code, ranging from 10 to 5,000 (N=25,000). There are virtually no differences between the estimated conditional and unconditional logit coefficients and standard errors when the number of observations in a zip code is large. The code uses esttab from SSC to present the results.

                        Code:
                        eststo clear
                        foreach zip of numlist 10 500 1000 5000{
                            clear
                            set obs 25000
                            egen double zip=  seq(), block(`zip')
                            set seed 08182020
                            gen zip`zip'= runiformint(0,1)
                            forval i=1/3{
                                gen var`i'=int(rnormal(0,`=`i'/3'))
                            }
                            logit zip`zip' var* i.zip
                            eststo logit`zip'    
                            xtset zip
                            xtlogit zip`zip' var*
                            eststo xtlogit`zip'
                        }
                        esttab logit*, keep(var*)
                        esttab xtlogit*, keep(var*)
                        Res.:

                        Code:
                        . esttab logit*, keep(var*)
                        
                        ----------------------------------------------------------------------------
                                              (1)             (2)             (3)             (4)  
                                            zip10          zip500         zip1000         zip5000  
                        ----------------------------------------------------------------------------
                        main                                                                        
                        var1               0.0824           0.108           0.101           0.102  
                                           (0.29)          (0.41)          (0.39)          (0.39)  
                        
                        var2               0.0136         0.00795         0.00614         0.00652  
                                           (0.36)          (0.24)          (0.18)          (0.19)  
                        
                        var3               0.0248          0.0125          0.0126          0.0132  
                                           (1.19)          (0.66)          (0.67)          (0.70)  
                        ----------------------------------------------------------------------------
                        N                   24970           25000           25000           25000  
                        ----------------------------------------------------------------------------
                        t statistics in parentheses
                        * p<0.05, ** p<0.01, *** p<0.001
                        
                        .
                        . esttab xtlogit*, keep(var*)
                        
                        ----------------------------------------------------------------------------
                                              (1)             (2)             (3)             (4)  
                                            zip10          zip500         zip1000         zip5000  
                        ----------------------------------------------------------------------------
                        main                                                                        
                        var1                0.102           0.102           0.102           0.102  
                                           (0.39)          (0.39)          (0.39)          (0.39)  
                        
                        var2              0.00677         0.00672         0.00677         0.00673  
                                           (0.20)          (0.20)          (0.20)          (0.20)  
                        
                        var3               0.0131          0.0131          0.0131          0.0131  
                                           (0.70)          (0.70)          (0.70)          (0.70)  
                        ----------------------------------------------------------------------------
                        N                   25000           25000           25000           25000  
                        ----------------------------------------------------------------------------
                        t statistics in parentheses
                        * p<0.05, ** p<0.01, *** p<0.001
                        
                        .



                        • #13
                          Right, Andrew. With many observations per zip code -- maybe 20 or 30 is enough -- there will be little bias from the incidental parameters problem when just putting in zip code dummies. You can see that with 10 observations per zip code, putting in the zip code dummies does cause some bias: column (1) with logit is notably different from the others, while xtlogit produces stable estimates across all configurations.



                          • #14
                            Hello everyone,

                            I have a similar case.
                            I have a question about a logit regression model using repeated cross-sectional data.
                            I am using two survey waves (2002 and 2004) to estimate the impact of conflict on school attendance (binary). Conflict intensity is measured as the cumulative number of fatalities/conflict events in the previous academic year (2001 and 2003, respectively). I have year of birth, district of residence, individual ID, and household ID.

                            In Stata I have used the following code (cov must be defined as a global macro for $cov to work):
                            Code:
                            global cov male emigrated age_hhead malehh edyrshh hsize agrichh urban poor orphan distsch
                            logit attend fatalities5 $cov i.intvyr i.district [pweight=mult], vce(cluster district)
                            glm attend fatalities5 $cov i.intvyr i.district [pweight=mult], f(bin) l(logit) vce(cluster district)
                            where intvyr is the interview year and district is the place of residence.

                            Is this the correct way to account for the fixed effects (year and district)? What other FEs should I look out for or consider?

                            I also tried the following reghdfe code:
                            Code:
                            reghdfe attend conflict5 $cov [pweight=mult], absorb(district intvyr) vce(cluster district)

