
  • Multiple fixed effects for Binary outcomes model with Cross Sectional Data

    Hello Statalist,

    I'm currently analyzing cross-sectional data on individual loans originated by two banks in the country in 2017-2018 (each individual loan appears only once). My goal is to estimate the effect of a relief lending program on loan outcomes (overdue/default).
    Dependent variable Outcome: binary, equal to 1 if the loan is overdue/in default within 2 years of the origination date, and 0 otherwise.
    I would like to add bank fixed effects, time fixed effects, and zip-code fixed effects (note that the time variable is the quarter in which the loan was issued, so it repeats within bank and zip code). Because my data are not a panel, xtset is not applicable. Please help to advise me:
    1. Is there any way to add fixed effects to a logit/probit model when xtset cannot be used? I tried:
    Code:
    xtlogit outcome i.relief loan_control borrower_control i.bank i.zip i.date, vce(cluster bank)
    where relief is a dummy variable indicating whether the loan qualifies for the relief lending program or not. But because there are more than 900 zip codes in my data, Stata showed a "matsize too small" error.
    I also tried:
    Code:
    egen zip_date = group(zip date)
    areg outcome i.relief loan_control borrower_control, absorb(gse zip_date)
    however, absorb() does not allow me to include two variables.
    2. One of the two banks in my sample issues an overwhelming number of loans compared to the other. I wonder if I need to do anything about this imbalance?
    Thank you in advance.

  • #2
    Khloe: Replace xtlogit with logit in your first command and it should be fine.

    JW



    • #3
      Thank you very much Prof. Wooldridge, I really appreciate it. However, as I mentioned, i.zip made the results unable to show because of too many parameters (I have more than 900 zip codes). Please advise me on how I should fix this problem.



      • #4
        egen zip_date = group(zip date)
        areg outcome i.relief loan_control borrower_control, absorb(gse zip_date)
        If you are using areg with a binary outcome, then you are estimating a linear probability model. For linear models, thanks to high-dimensional fixed-effects estimators, there is no issue with having a large number of indicators. Install reghdfe from SSC and absorb all the indicators.

        Code:
        ssc install reghdfe
        egen zip_date = group(zip date)
        reghdfe outcome i.relief loan_control borrower_control, absorb(gse zip_date)
        *or
        reghdfe outcome loan_control borrower_control, absorb(relief gse zip_date)
        Code:
        logit outcome i.relief loan_control borrower_control i.bank i.zip i.date, vce(cluster bank)
        For nonlinear models, e.g., logit, I have yet to see anyone who has programmed an estimator with high-dimensional fixed effects [except for Poisson (poi2hdfe from SSC)]. Therefore, try seeing whether increasing the size limit of your matrix helps. If not, you may be able to estimate the model with Stata MP. See

        Code:
        help limits
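
        For instance, in versions of Stata that still support it (Stata 15 and earlier, with Stata/SE or MP), one could raise the matrix size limit before rerunning the logit. The sketch below reuses the command from #1 and is only illustrative:

        Code:
        * raise the matrix size limit (command removed in Stata 16+; the upper bound depends on flavor)
        set matsize 11000
        logit outcome i.relief loan_control borrower_control i.bank i.zip i.date, vce(cluster bank)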



        • #5
          Dear Andrew Musau,
          Thank you for your kind note.
          Yes, I am aware that I am estimating the LPM when I use areg. My question is whether there is any way to do the same with a logit/probit model. And thank you for your answer; I think I might choose another fixed effect for location instead of zip code (e.g., state level).
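
          A minimal sketch of that coarser location fixed effect, assuming a variable named state exists in the data (the name is illustrative):

          Code:
          * swap the 900+ zip indicators for far fewer state indicators (state is an assumed variable name)
          logit outcome i.relief loan_control borrower_control i.bank i.state i.date, vce(cluster bank)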



          • #6
            However, as I mentioned, the i.zip made the result unable to show due to too many parameters (I have more than 900 zip codes),

            Do you mean that the estimation is successful but you are not able to see some estimates or do you receive a -matsize too small- error? If the former, the fix is simple.



            • #7
              You might try

              Code:
              xtset zip
              xtlogit outcome i.relief i.bank i.year ..., fe
              This accounts for zip-code fixed effects, though it won't get you marginal effects.

              You could use a correlated random effects probit by including within-zip-code averages along with the other variables.
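
              A minimal sketch of that correlated random effects (Mundlak) device, assuming loan_control and borrower_control are the covariates whose within-zip averages get added (all variable names are illustrative):

              Code:
              * within-zip-code means of the covariates (Mundlak device)
              egen zbar_loan = mean(loan_control), by(zip)
              egen zbar_borr = mean(borrower_control), by(zip)
              * pooled probit with the zip-level means added as controls
              probit outcome i.relief loan_control borrower_control zbar_loan zbar_borr i.bank i.date, vce(cluster zip)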



              • #8
                Originally posted by Andrew Musau View Post


                Do you mean that the estimation is successful but you are not able to see some estimates or do you receive a -matsize too small- error? If the former, the fix is simple.
                I got the error of matsize too small.



                • #9
                  Originally posted by Jeff Wooldridge View Post
                  You might try

                  Code:
                  xtset zip
                  xtlogit outcome i.relief i.bank i.year ..., fe
                  This accounts for zip code fixed effects. This won’t get you marginal effects.

                  You could use a correlated random effects probit by including within-zip-code averages along with the other variables.
                  Thank you Prof., I'll try reading about the correlated random effects probit you mentioned.



                  • #10
                    Jeff Wooldridge , xtlogit implements conditional fixed effects. I would guess that the estimates differ from unconditional fixed effects, as the example below shows.

                    Code:
                    webuse grunfeld, clear
                    set seed 08162020
                    gen outcome=runiformint(0,1)
                    logit outcome mvalue kstock i.company
                    xtset company
                    xtlogit outcome mvalue kstock, fe
                    Res.:

                    Code:
                    . logit outcome mvalue kstock i.company
                    
                    Iteration 0:   log likelihood = -138.37933  
                    Iteration 1:   log likelihood = -128.30824  
                    Iteration 2:   log likelihood = -128.27376  
                    Iteration 3:   log likelihood = -128.27376  
                    
                    Logistic regression                             Number of obs     =        200
                                                                    LR chi2(11)       =      20.21
                                                                    Prob > chi2       =     0.0425
                    Log likelihood = -128.27376                     Pseudo R2         =     0.0730
                    
                    ------------------------------------------------------------------------------
                         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                          mvalue |  -.0003398   .0004719    -0.72   0.471    -.0012648    .0005851
                          kstock |    .000985   .0007127     1.38   0.167    -.0004119    .0023818
                                 |
                         company |
                              2  |  -1.778794   1.261069    -1.41   0.158    -4.250444    .6928568
                              3  |  -.5834973   1.254485    -0.47   0.642    -3.042243    1.875249
                              4  |   -.936505   1.739525    -0.54   0.590    -4.345912    2.472902
                              5  |  -2.847785   2.033451    -1.40   0.161    -6.833277    1.137706
                              6  |  -.1653838   1.865359    -0.09   0.929     -3.82142    3.490653
                              7  |  -1.513497   2.007012    -0.75   0.451    -5.447167    2.420174
                              8  |  -.5033782     1.7469    -0.29   0.773     -3.92724    2.920483
                              9  |  -1.031798   1.922085    -0.54   0.591    -4.799016     2.73542
                             10  |  -1.439941   1.998427    -0.72   0.471    -5.356786    2.476903
                                 |
                           _cons |   1.052722   1.976743     0.53   0.594    -2.821622    4.927066
                    ------------------------------------------------------------------------------
                    
                    . 
                    . xtset company
                           panel variable:  company (balanced)
                    
                    . 
                    . xtlogit outcome mvalue kstock, fe
                    note: multiple positive outcomes within groups encountered.
                    
                    Iteration 0:   log likelihood = -112.34032  
                    Iteration 1:   log likelihood = -111.46028  
                    Iteration 2:   log likelihood = -111.45918  
                    Iteration 3:   log likelihood = -111.45918  
                    
                    Conditional fixed-effects logistic regression   Number of obs     =        200
                    Group variable: company                         Number of groups  =         10
                    
                                                                    Obs per group:
                                                                                  min =         20
                                                                                  avg =       20.0
                                                                                  max =         20
                    
                                                                    LR chi2(2)        =       1.96
                    Log likelihood  = -111.45918                    Prob > chi2       =     0.3750
                    
                    ------------------------------------------------------------------------------
                         outcome |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                          mvalue |  -.0003223   .0004598    -0.70   0.483    -.0012235     .000579
                          kstock |   .0009351   .0006939     1.35   0.178    -.0004249     .002295
                    ------------------------------------------------------------------------------



                    • #11
                      Good point, Andrew. I guess that takes us to the question of how many observations per zip code. It's probably a lot, in which case putting in zip code dummies is ideal. But xtlogit will at least account for the zip code effects in a way that is computationally less intensive, I think.



                      • #12
                        Thanks Jeff. Here, I create a dataset of 25,000 observations with a varying number of observations per zip code, ranging from 10 to 5,000 (N=25,000). There are virtually no differences between the estimated conditional and unconditional logit coefficients and standard errors when the number of observations in a zip code is large. The code uses esttab from SSC to present the results.

                        Code:
                        eststo clear
                        foreach zip of numlist 10 500 1000 5000{
                            clear
                            set obs 25000
                            egen double zip=  seq(), block(`zip')
                            set seed 08182020
                            gen zip`zip'= runiformint(0,1)
                            forval i=1/3{
                                gen var`i'=int(rnormal(0,`=`i'/3'))
                            }
                            logit zip`zip' var* i.zip
                            eststo logit`zip'    
                            xtset zip
                            xtlogit zip`zip' var*
                            eststo xtlogit`zip'
                        }
                        esttab logit*, keep(var*)
                        esttab xtlogit*, keep(var*)
                        Res.:

                        Code:
                        . esttab logit*, keep(var*)
                        
                        ----------------------------------------------------------------------------
                                              (1)             (2)             (3)             (4)  
                                            zip10          zip500         zip1000         zip5000  
                        ----------------------------------------------------------------------------
                        main                                                                        
                        var1               0.0824           0.108           0.101           0.102  
                                           (0.29)          (0.41)          (0.39)          (0.39)  
                        
                        var2               0.0136         0.00795         0.00614         0.00652  
                                           (0.36)          (0.24)          (0.18)          (0.19)  
                        
                        var3               0.0248          0.0125          0.0126          0.0132  
                                           (1.19)          (0.66)          (0.67)          (0.70)  
                        ----------------------------------------------------------------------------
                        N                   24970           25000           25000           25000  
                        ----------------------------------------------------------------------------
                        t statistics in parentheses
                        * p<0.05, ** p<0.01, *** p<0.001
                        
                        .
                        . esttab xtlogit*, keep(var*)
                        
                        ----------------------------------------------------------------------------
                                              (1)             (2)             (3)             (4)  
                                            zip10          zip500         zip1000         zip5000  
                        ----------------------------------------------------------------------------
                        main                                                                        
                        var1                0.102           0.102           0.102           0.102  
                                           (0.39)          (0.39)          (0.39)          (0.39)  
                        
                        var2              0.00677         0.00672         0.00677         0.00673  
                                           (0.20)          (0.20)          (0.20)          (0.20)  
                        
                        var3               0.0131          0.0131          0.0131          0.0131  
                                           (0.70)          (0.70)          (0.70)          (0.70)  
                        ----------------------------------------------------------------------------
                        N                   25000           25000           25000           25000  
                        ----------------------------------------------------------------------------
                        t statistics in parentheses
                        * p<0.05, ** p<0.01, *** p<0.001
                        
                        .



                        • #13
                          Right, Andrew. With many observations per zip code -- maybe 20 or 30 is enough -- there will be little bias from the incidental parameters problem when just putting in zip code dummies. You can see that with 10 observations per zip code, putting in the zip code dummies does cause some bias: column (1) with logit is notably different from the others, while xtlogit produces stable estimates across all configurations.



                          • #14
                            Hello everyone,

                            I have a similar case.
                            I have a question about a logit regression model using repeated cross-sectional data.
                            I am using two survey waves (2002 and 2004) to estimate the impact of conflict on school attendance (binary). Conflict intensity is measured as the cumulative number of fatalities/conflict events in the previous academic year (2001 and 2003, respectively). I have year of birth, district of residence, individual ID, and household ID.

                            In Stata I have used the following code (cov must be defined as a global macro for $cov to work):
                            Code:
                            global cov male emigrated age_hhead malehh edyrshh hsize agrichh urban poor orphan distsch
                            logit attend fatalities5 $cov i.intvyr i.district [pweight=mult], vce(cluster district)
                            glm attend fatalities5 $cov i.intvyr i.district [pweight=mult], f(bin) l(logit) vce(cluster district)
                            where intvyr is the interview year and district is the place of residence.

                            Is this the correct way to account for the fixed effects (year and district)? What other FEs should I look out for or consider?

                            I also tried the following reghdfe code:
                            Code:
                            reghdfe attend conflict5 $cov [pweight=mult], absorb(district intvyr) vce(cluster district)

