
  • Estimating a binary outcome model with a large, unbalanced panel dataset

    I have been trying to estimate a binary outcome model on a large, unbalanced panel dataset. I have Stata MP and good hardware (SSD, 60 GB RAM, 8 cores), and yet it has been almost impossible to estimate logit or probit models: the estimation runs for several days before Stata becomes non-responsive and then crashes. Any suggestions?

    I have tried the following, and neither has worked (i.e., neither finished the estimation task):

    1. xi:logit y x1 x2 x3 ... x10 i.userid (also tried probit)
    2. xtlogit y x1 x2 ... x10, fe (also tried probit)

    My sample:
    N = 40,810
    t = 11,215

    Highly unbalanced, with some units having very long series and others only a few periods.

    So far a linear probability model seems to be the only feasible option (I tried -areg y x1 x2 ... x10, absorb(userid)-), but I was hoping to check whether its estimates are similar to those from a logit or probit model. Yet I have not been able to finish running either a logit or a probit model even once. Other than taking random samples of units, is there anything else I can try?

    Thanks to anyone who can share some suggestions.


  • #2
    As I understand it, your model #1 does not provide consistent estimates, and the estimates are biased unless the panel length is long. (Check standard texts such as Greene or Wooldridge; I don't have them to hand.) Model #2 (the logit version) is what economists call a conditional logit (also fitted by clogit) and does provide consistent estimates, except that you cannot identify the coefficients on covariates that are fixed over time. [You don't tell us about the nature of the x1, ..., x10 variables.] Have you tried xtlogit, re? Or have you ruled it out?
    More generally, I'm saying: decide which model features you are trying to learn about, and weigh the strengths and weaknesses of the different approaches to accounting for unobserved heterogeneity in the context of binary dependent variables.
    Also, your problems with, say, Model 2 may be issues of convergence rather than of the model per se or of the imbalance -- and convergence problems may in turn be related to, say, collinearity among the regressors. You need to provide direct evidence of what you typed (the command and the iteration log), reported within CODE delimiters (see the FAQ). Please also tell us more about the nature of your covariates and the relationships between them.
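
    For concreteness, here is a minimal sketch of the commands being discussed; the time variable name (timevar here) and the x1-x10 shorthand are placeholders for whatever you actually have:

    * placeholders: userid = panel identifier, timevar = time variable
    xtset userid timevar
    correlate x1-x10                    // quick check for collinearity among regressors
    xtlogit y x1-x10, fe                // conditional (fixed-effects) logit
    clogit  y x1-x10, group(userid)     // the same conditional logit, fitted by clogit
    xtlogit y x1-x10, re                // random-effects logit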

    Comment


    • #3
      First, I am curious what kind of data set has t = 11,215. As I will note later, that may just be too monstrous to do what you want.

      I suggest looking at Allison's green sage book on fixed effects models. See http://www.amazon.com/Effects-Regres.../dp/0761924973

      Building on Stephen's comments, Model 1 is definitely wrong:

      * Unless you are using some hopelessly antiquated version of Stata (e.g. Version 9, released around 1953, give or take a few decades) you should not use the xi: prefix. Use factor variables instead; type -help fvvarlist- for details, and see the sketch after this list. Also, use xtreg, fe instead of areg.

      * The model 1 approach requires the creation and analysis of nearly 41,000 dummy variables. I am sure that will be a bit challenging for Stata if it can do it at all.

      * To add insult to injury, even if it did eventually run, the results would be wrong. As Allison explains, the dummy variable approach is called Unconditional Maximum Likelihood. It can be more or less ok for xtreg. But, as Allison and Stephen both point out, the estimates are biased in a logit analysis.
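
      A minimal illustration of both points (a sketch only -- as noted above, the dummy-variable logit is not something you actually want to estimate):

      * factor-variable syntax replaces the xi: prefix, e.g.
      logit y x1-x10 i.userid           // i. builds the indicators on the fly; still the biased dummy-variable approach
      * and the linear fixed-effects alternative to areg:
      xtset userid
      xtreg y x1-x10, fe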

      Turning to your 2nd approach -- here, you are using Conditional Maximum Likelihood, which Allison says is the right way to go. Still, you may have some problems.

      * As Stephen says, there could be problems with your regressors. I would suggest starting really simple, e.g. with a model that has x1 only; if that works, add x2, and so on. Maybe you will find that everything works OK until x7 is added. Analyzing a subset of the data may also help to speed up the problem-solving process (see the sketch at the end of this post).

      * Even then I am not sure the problem is solvable. Here is an excerpt from p. 32 of Allison's book:

      [Attached image: Panel04-FixedVsRandom.jpg]

      So, if I am reading that right, it sounds like an absolutely monstrous calculation. Suppose a case had 11,000 records and the event occurred in 5,000 of them. That means there would be 11,000 choose 5,000 possible combinations, and you would have to compare the probability of the combination that did happen against all the combinations that did not. Maybe Stata has some super-efficient way of doing that, but if so it hasn't shown up so far in the current calculations. (I rarely run anything bigger than t = 5 though, so I'd be curious to hear whether I am basically right about this or totally off base.)
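
      Just to give a sense of the scale, a back-of-the-envelope check on the hypothetical 11,000 / 5,000 numbers above:

      * natural log of the binomial coefficient 11,000 choose 5,000 ...
      display lnfactorial(11000) - lnfactorial(5000) - lnfactorial(6000)
      * ... and the number of decimal digits in that count
      display (lnfactorial(11000) - lnfactorial(5000) - lnfactorial(6000)) / ln(10)

      That is a number with more than 3,000 decimal digits, so brute-force enumeration of the combinations is clearly out of the question.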

      So in short, my recommendations are: (a) abandon Approach 1 -- even if it did someday run, the results would be wrong; (b) try a simplified Approach 2 -- if you are lucky, maybe you will find that it runs once you get rid of one or two problematic variables; (c) don't be surprised if it doesn't run at all, though -- if it were legitimate to greatly reduce the number of time periods, maybe it would; (d) as Stephen says, the random-effects model may be the way to go, unless you have strong reasons for believing its results would be highly biased.
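
      Something along these lines is what I have in mind for (b) -- a sketch only, assuming the covariates really are named x1 through x10 and that the data have been xtset:

      * work on a random subset of units first, to speed up the trial and error
      set seed 12345
      by userid, sort: gen double u = runiform() if _n == 1
      by userid: replace u = u[1]
      keep if u < 0.10                  // keep roughly 10% of the units
      * then add covariates one at a time and watch where convergence breaks down
      xtlogit y x1, fe
      xtlogit y x1 x2, fe
      xtlogit y x1 x2 x3, fe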
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam

      Comment


      • #4
        Thank you Stephen and Richard. I will look at the variables more closely under Model 2. Regarding the RE option, I was hoping to do FE, then RE, and then compare the results (e.g. with a Hausman test). Regarding the monstrous number of periods: that came from high-frequency underlying data aggregated to hourly intervals. I will try a lower frequency instead to reduce the number of periods. But just out of curiosity: would a linear probability model make any sense at all under such circumstances?
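
        For what it's worth, the aggregation step I have in mind is roughly the following (variable names are placeholders for whatever the raw data actually contain):

        * collapse the hourly panel to daily frequency; y becomes 1 if the event
        * occurred in any hour of that day, covariates are averaged within the day
        gen day = dofc(hourlytime)        // hourlytime assumed to be a %tc datetime variable
        format day %td
        collapse (max) y (mean) x1-x10, by(userid day)
        xtset userid day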

        Comment


        • #5
          I was taught that the Linear Probability Model (LPM) is the work of Satan, but some economists seem to like it anyway, at least for some purposes. I don't know whether there is any work assessing the merits of a fixed-effects LPM. My intuition goes against it (for one thing, a fixed-effects logit could be analyzing very different cases than a fixed-effects LPM, since FE logit discards units whose dependent variable does not vary over time), but my intuition probably shouldn't be the decisive factor in this case. I'd be interested to hear what others say.
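
          One quick way to see how different the two estimation samples might be -- a sketch, assuming y is the 0/1 outcome and userid the panel identifier:

          * count the units whose outcome never varies over time; they contribute
          * nothing to the conditional (FE) logit but stay in a fixed-effects LPM
          egen miny = min(y), by(userid)
          egen maxy = max(y), by(userid)
          egen onepergroup = tag(userid)
          count if onepergroup & miny == maxy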
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam

          Comment


          • #6
            LOL I was taught the same way. I have to admit though (characteristic of Satan perhaps), it's very tempting in these cases!

            Comment


            • #7
              Don't give in to the dark side yet. You may be able to estimate a FE Logit using the suggestions above. Or maybe somebody will come up with some great citations for or against a FE LPM. You've never said what your variables measure or what your goals are, so maybe a RE model makes more sense anyway.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam

              Comment


              • #8
                If you wanted to use the LPM, this author would probably agree with you:

                http://marcfbellemare.com/wordpress/...les-technical/

                Among other things he says "The probit and logit are not well-suited to the use of fixed effects because of the incidental parameters problem" and "if you want to use fixed effects, and if you are not interested in forecasting the value of Y, you should prefer the LPM with robust standard errors."

                So, I interpret that as endorsing something like

                xtreg binarydv x1 x2 x3, fe vce(robust)

                It is just a blog entry though. I would want to see a stronger defense of the strategy before using it, and specific proof that xtreg, fe works ok. The xtlogit approach also deals with the problem of incidental parameters, but he doesn't mention it.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                StataNow Version: 19.5 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam

                Comment


                • #9
                  I think Bellemare's blog sums up the case for and against the LPM well, and I concur with his observation that:
                  "Ultimately, I think the preference for one or the other is largely generational, with people who went to graduate school prior to the Credibility Revolution preferring the probit or logit to the LPM, and with people who went to graduate school during or after the Credibility Revolution preferring the LPM."

                  Richard (Williams): I don't think you'll see a "stronger defence" (the elements are all in the blog post), but, if you're interested, the case is also made in Angrist and Pischke's Mostly Harmless Econometrics book (not only on their blog). The context is economics; the "Credibility Revolution" refers to the huge emphasis placed in empirical economics these days on estimating "causal effects" (almost to the exclusion of all else).
                  This brings us back to the original posting and comments on it. I still think that Richard Lin needs to reflect on what his research goals are -- what he wants to get out of his models. (It's not as easy as simply comparing FE and RE estimates.)

                  Comment


                  • #10
                    Here is a response to Bellemare and a response to the response:

                    http://prisonrodeo.tumblr.com/post/52055757707

                    http://marcfbellemare.com/wordpress/...ent-variables/
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
                    StataNow Version: 19.5 MP (2 processor)

                    EMAIL: [email protected]
                    WWW: https://www3.nd.edu/~rwilliam

                    Comment
