Problem with observations quantity

Firangiz Aghayeva

Join Date: Oct 2021

Posts: 33
#1

30 Oct 2023, 05:20

Dear Forum Users,

I am estimating a binomial logistic regression on panel data with around 16 million observations. The model analyzes investors' decisions to invest. I use the logit command with fixed effects. The main fixed effects are investor IDs and the IDs of the firms they invest in; there are more than 7k investor IDs and 4k firm IDs. Running the model on the full data is very time-consuming, so I draw a random subsample of 100k observations to test the model. Even then, it runs for around 12-24 hours, depending on the number of independent variables and additional fixed effects. When I run a simple model without fixed effects, the number of observations is around 94k. However, when I run the same model with firm ID fixed effects, the number of observations drops to 10k. There are no missing values in the ID dummies. Do you have any idea why so few observations remain?

Kind regards,
Firangiz

Andrew Musau

Join Date: Oct 2014
Posts: 10254

30 Oct 2023, 05:40

Possibly, the outcome does not vary within a large number of units (investors or firms) in the dataset. Observations from units whose outcomes are all positive or all negative contribute nothing to a fixed-effects logit and are dropped. How large is your \(T\) dimension? If you have just a few observations per unit, you need to estimate a conditional FE logit model because of the incidental parameters problem, although that may itself be a problem given the size of your dataset. The example below illustrates the issue of outcomes that never vary over time for some units.

Code:

webuse union, clear
xtset idcode year
xtlogit union age grade i.not_smsa south##c.year, fe

Res.:

Code:

. xtlogit union age grade i.not_smsa south##c.year, fe
note: multiple positive outcomes within groups encountered.
note: 2,744 groups (14,165 obs) dropped because of all positive or
      all negative outcomes.

Iteration 0:   log likelihood = -4516.5881  
Iteration 1:   log likelihood = -4510.8906  
Iteration 2:   log likelihood =  -4510.888  
Iteration 3:   log likelihood =  -4510.888  

Conditional fixed-effects logistic regression   Number of obs     =     12,035
Group variable: idcode                          Number of groups  =      1,690

                                                Obs per group:
                                                              min =          2
                                                              avg =        7.1
                                                              max =         12

                                                LR chi2(6)        =      78.60
Log likelihood  =  -4510.888                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
       union | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0710973   .0960536     0.74   0.459    -.1171643    .2593589
       grade |   .0816111   .0419074     1.95   0.051    -.0005259     .163748
  1.not_smsa |   .0224809   .1131786     0.20   0.843     -.199345    .2443069
     1.south |  -2.856488   .6765694    -4.22   0.000    -4.182539   -1.530436
        year |  -.0636853   .0967747    -0.66   0.510    -.2533602    .1259896
             |
south#c.year |
          1  |   .0264136   .0083216     3.17   0.002     .0101036    .0427235
------------------------------------------------------------------------------

.
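To see ahead of estimation how many groups lack variation in the outcome, a rough check along the following lines may help. It is only a sketch on the same example dataset, and it assumes the outcome has no missing values. The variable names no_variation and tag_id are made up for illustration; on your own data you would substitute your firm identifier and outcome variable.

Code:

webuse union, clear
* Within each idcode, sort by the outcome. If the smallest and largest
* values coincide, the outcome never varies for that person.
bysort idcode (union): generate byte no_variation = union[1] == union[_N]
* Flag one observation per idcode so that groups can be counted
egen tag_id = tag(idcode)
* Number of groups with all-positive or all-negative outcomes
count if tag_id & no_variation
* Number of observations belonging to those groups
count if no_variation

If no observations are lost for other reasons (for example, missing covariates), the two counts should line up with the groups and observations reported as dropped in the note above.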


Firangiz Aghayeva

Join Date: Oct 2021

Posts: 33
#3

08 Nov 2023, 06:36

You are right, Andrew. Thank you for your answer. I will need to increase the number of observations in my sample to get proper results.
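One possible way to enlarge the test sample without thinning out each firm's observations is to sample whole firms rather than individual observations, so that every sampled firm keeps all of its rows. The lines below are only a sketch under assumptions: firm_id is a hypothetical name for the firm identifier, the seed and the 10% share are arbitrary, and the investor dimension is still thinned by this approach.

Code:

* Hypothetical names: firm_id identifies firms; 0.10 is an arbitrary sampling share
set seed 12345
* Flag one observation per firm and draw one uniform random number per firm
egen tag_firm = tag(firm_id)
generate double u = runiform() if tag_firm
* Copy the firm's draw to all of its observations (missing values sort last)
bysort firm_id (u): replace u = u[1]
* Keep all observations of roughly 10% of firms
keep if u < 0.10
drop tag_firm u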