Zero-Inflated Negative Binomial Model for Panel Data

Bettina Wi

Join Date: Mar 2017

Posts: 6
#1

23 Mar 2017, 13:16

Hello everybody,

I am using Stata 14.2 and want to analyze unbalanced panel data. My dependent variable is a count variable; it is over-dispersed and has excess zeros (more than 40%).

That's why I am searching for a Stata command to run a zero-inflated negative binomial regression.
I am aware of the "zinb" command, but it doesn't take account of the panel structure of my data, does it?
I also know the "xtnbreg" command, but this one doesn't account for my excess zeros.

Do you know an appropriate Stata command for my data? What is the best way to do it?

Thanks a lot,
Bettina
daniel klein

Join Date: Mar 2014

Posts: 3886
#2

23 Mar 2017, 13:35

The mere presence of many zeros (and I would not necessarily call 40 per cent "excess") in the data neither requires nor justifies a zero-inflated model. This model assumes that the zeros are a mixture of two data generating processes. If your theory does not point in this direction, I would be reluctant to jump to such a model.
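
To make the mixture assumption concrete, a zero-inflated count model is usually written as below. The symbols are introduced here only for illustration and are not from the post: \pi is the probability of a "structural" zero and f is the density of the underlying count model (for example, negative binomial).

```latex
\Pr(y = 0 \mid x) = \pi + (1 - \pi)\, f(0 \mid x), \qquad
\Pr(y = k \mid x) = (1 - \pi)\, f(k \mid x), \quad k = 1, 2, \dots
```

Zeros therefore come from two sources: observations that can never have a positive count, and observations that could have but happened not to. The model is only meaningful if theory supports that distinction.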

That said, I am not aware of a zero-inflated version of count models for panel data. Someone else might know more than I do.

Best
Daniel
Richard Williams

Join Date: Apr 2014

Posts: 5024
#3

23 Mar 2017, 13:42

This is an interesting discussion (including the comments) on "Do We Really Need Zero-Inflated Models?"

http://statisticalhorizons.com/zero-inflated-models

Unfortunately, it isn't panel-data specific.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://academicweb.nd.edu/~rwilliam/
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#4

23 Mar 2017, 14:49

Dear Bettina,

Like Daniel and Richard, I would be reluctant to use a ZI model. Anyway, I am not aware that you can include fixed effects in a ZI NB regression, and a random effects model would be too sensitive to distributional assumptions.

You do not say what the purpose of your analysis is, but it is likely that the best thing to do is simply to estimate a Poisson regression with fixed effects. This is very robust and is likely to give you a good approximation to the conditional mean.
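
For readers who want to try this suggestion, a minimal sketch in Stata could look as follows. The variable names (id, year, y, x1, x2) are hypothetical placeholders, and adding year dummies is an assumption rather than part of the advice above.

```stata
* Hypothetical names: id (panel unit), year, y (count outcome), x1 x2 (regressors)
xtset id year

* Fixed-effects Poisson with year dummies; robust standard errors mean that
* inference does not rely on the Poisson variance assumption
xtpoisson y x1 x2 i.year, fe vce(robust)
```

Panels whose outcome is zero in every period carry no information for the fixed-effects estimator, so Stata drops them from the estimation sample.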

All best wishes,

Joao
George Ngan

Join Date: Dec 2017

Posts: 2
#5

11 Dec 2017, 02:32

Dear tenured members,

I am a first-year PhD student conducting quantitative research based on social theory. I have recently started preliminary tests on a 4-year longitudinal panel dataset of about 10,000 firm-year observations on Chinese listed companies. The dataset contains both firm- and industry-level variables, and the main predictor is an almost time-invariant factor variable (less than 2% of the observations show within-firm changes). I am planning to use a 2-level regression model (year nested within firm) with industry-class dummies in the first pass, and then switch to a 3-level model (year within firm within industry, still keeping the industry dummies in the fixed part) to make sure that the effects I observe are not in fact coming from between-industry differences. I intend to use both the re (random-intercept) estimator and a random-coefficient estimator to get a better feel for the consistency of the estimates, because omitted-variable issues are always quite likely in my area of study.

I have a number of questions regarding regression on count data because one of my main dependent variables is a non-negative count variable. I am currently reading the book MLMUS (Rabe-Hesketh & Skrondal, 2012) but I hope that you can point me to further reading to understand the issues that I am dealing with.

One of my main issues is that the dv is over-dispersed and zero-inflated (73.6% of obs with zero value).
(1) From reading and internet searches, it seems that the negative binomial model is the more appropriate model for over-dispersed outcomes, but I am not sure whether it is also suitable for data that are both over-dispersed and zero-inflated.
(2) I searched for Stata 14 commands and see that xtnbreg and menbreg are both feasible for a 2-level model. There is, however, no xtme* command for the negative binomial of the kind available for logit, Poisson and linear regression. From reading the Stata manual, my understanding is that xtnbreg can only handle a 2-level model with the re/fe options, while menbreg can handle random-coefficient estimators as well as multi-level regression models, as long as I specify the right cluster levels and cluster-related variables for the higher-level clusters (similar in command structure to xtmixed). Is that correct?
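
For reference, the command structures mentioned in (2) can be sketched as below. All variable names (firm, industry, year, y, x) are hypothetical, and the sketch only illustrates syntax; it is not a recommendation for any particular specification.

```stata
* Hypothetical names: firm and industry (cluster ids), year, y (count), x (predictor)
xtset firm year

* Two-level random-effects negative binomial
xtnbreg y x i.industry, re

* Two-level random-intercept negative binomial via menbreg
menbreg y x i.industry || firm:

* Random intercept plus a random coefficient on x at the firm level
menbreg y x i.industry || firm: x

* Three-level model: firms nested in industries (highest level listed first)
menbreg y x i.industry || industry: || firm:
```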

Thanks in advance
George

. sum csdictot

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    csdictot |     10,909    7.698964    31.73989          0       1428

. sum csdictot if csdictot!=0

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    csdictot |      2,877    29.19291    56.50862          1       1428
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#6

11 Dec 2017, 11:20

Dear George,

Notice that from the results you provide you cannot really say whether the data is over-dispersed because what matters is conditional over-dispersion. Also, you say your data is zero-inflated but again that is not meaningful because any count data distribution is compatible with arbitrarily high proportions of zeros.
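
A small simulation may help to see both points. The sketch below uses made-up parameter values (-1 and 1.5) and a single hypothetical regressor x; conditional on x the outcome is exactly Poisson, yet the unconditional summary looks heavily over-dispersed and has a large share of zeros.

```stata
clear
set seed 12345
set obs 10000

* One regressor; conditional on x, y is exactly Poisson (no conditional over-dispersion)
generate x = rnormal()
generate y = rpoisson(exp(-1 + 1.5*x))

* Marginal summary: the variance is far larger than the mean
summarize y

* Number of zeros in the simulated sample
count if y == 0

* Poisson regression of y on x; the estimates should be close to -1 and 1.5
poisson y x
```

The unconditional mean, variance and share of zeros therefore say little about whether the conditional model is adequate.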

Keep in mind that if you use a reasonably robust estimator such as Poisson regression with fixed effects, it is unlikely that you will need to worry about over-dispersion and zero-inflation, so I suggest you take that as a starting point. More sophisticated models are generally more fragile in the sense that their validity is likely to depend on unrealistic distributional assumptions.

Best wishes,

Joao
George Ngan

Join Date: Dec 2017

Posts: 2
#7

11 Dec 2017, 12:58

Dear Joao

Thanks very much for the advice. I will review more books on the concepts and regression models for count variables before jumping into using Stata.

Best regards
George
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#8

11 Dec 2017, 13:14

Dear George,

That is a good approach; check also the section on count data in Wooldridge's books.

Keep in mind that books about count data have a lot of things that are only relevant if you want to compute probabilities. If you just want to estimate the conditional mean, which is often the case, Poisson is generally the best approach.

Joao
Leonardo Riano

Join Date: Jan 2018

Posts: 1
#9

17 Jan 2018, 14:00

Hello,
I am interested in learning about zero-inflated models for panel data. I am a recent graduate in economics and administration, so my grounding in econometrics is recent and I am only now beginning to study zero-inflated models.

From what I have found, few programs support a model that handles both issues. However, my impression is that this can be modeled by combining two components: a zero-inflated model and either fixed or random effects.

My model studies the transition of firms from innovative to non-innovative and vice versa, but in the data for the country I study very few companies innovate, so there is a high number of zeros. The same is true of important independent variables, for example research and development, since few firms invest in them.

I would be grateful for any literature recommendations, and I would like to know whether anyone has tried to estimate this type of model, so that I can get an idea of how to program it for my case.

regards
Lars Pete

Join Date: Nov 2020

Posts: 118
#10

23 Mar 2024, 13:37

Dear Joao,

What would be your recommendation when the dependent variable has negative values? We won't be able to use Poisson there.

Regards.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#11

24 Mar 2024, 00:14

Dear Lars Pete,

We can use Poisson regression even if some observations are negative; the key condition is that the expectation is positive. So, if you only have a small number of negative observations, you may want to try the ivppml command that is available here (see the example on the same page that shows estimation with some negative observations).

Best wishes,

Joao
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2204
#12

24 Mar 2024, 08:16

If y can take on both negative and positive values, you have some work to do to argue against a linear FE analysis. Everything we do is an approximation. When y >=0 and we see zeros, we use an exponential conditional mean because we think it's a better approximation than a linear conditional mean. And, it gives us parameters that have percentage interpretations. When y can be negative, the argument becomes less compelling. There's an asymmetry here between the linear model and the exponential model. If you use linear FE when y >= 0, you might get negative predicted values, but at least you're not ruling out fitted values in the range of the observed y. If y can be negative and you use an exponential model, you know for sure you're going to have all positive predicted values when some y < 0. It's possible, of course, that an exponential model still fits better than a linear model but that's hardly guaranteed. And how should we define "a small number" of negative observations? The range of those negative observations must also be a consideration. I know prediction is not usually what we're interested in, but I still find it unnatural to use a model that cannot predict certain outcomes on y.

A similar example is using a linear model when 0 <= y <= 1. I may get fitted values outside the unit interval, but I know I can get any fitted value in [0,1]. If y can be negative or larger than one, would I use a logit functional form? I don't think so. This differs in one way from the exponential case in that y often takes on very large values along with some negative values, and the linear model might not fit well over the entire range.

So I'll respectfully disagree with Joao on this particular point while admitting that there are no obvious solutions.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#13

24 Mar 2024, 11:04

Thank you, Jeff, for sharing your views on this. We may agree more than it appears, so let me expand a bit on how I see it.

To motivate my point, suppose that the DGP is y = exp(x'b) + u, where u has a normal distribution. In this case, there is no doubt that an exponential model would be suitable, even though y can have many and large negative values. Maybe this is not a realistic DGP in many situations, but my point is just that negative values of y do not rule out Poisson regression. As you say, everything is an approximation, and it may be the case that the exponential model is preferable to the linear model, even if some observations are negative. The choice between the two, in my view, is an empirical question.
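
This DGP is easy to simulate. The sketch below is only an illustration under assumed parameter values (0.5 and 0.5) and a single hypothetical regressor x. Because poisson refuses a dependent variable with negative values, the sketch solves the Poisson first-order conditions directly with the official gmm command; this is a stand-in chosen for the example, not the ivppml command mentioned earlier in the thread.

```stata
clear
set seed 2024
set obs 5000

* Simulated data following the DGP above: exponential mean plus normal noise
generate x = rnormal()
generate y = exp(0.5 + 0.5*x) + rnormal()

* Some observations are negative
count if y < 0

* The Poisson first-order conditions E[(y - exp(x'b)) x] = 0, solved as an
* exactly identified GMM problem (a constant is included among the instruments by default)
gmm (y - exp({xb: x} + {b0})), instruments(x) onestep
```

Only the conditional mean has to be correctly specified for these moment conditions to identify the parameters, which is why negative values of y do not invalidate the estimator in this setup.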

As for being unnatural to use a model that cannot predict certain outcomes, we use Poisson even though it cannot predict zeros, and we use the logit and probit for binary data, even though they cannot predict the only possible values of the outcome. So, I am comfortable with this, but I understand that you and others may not be.

Anyway, thank you for providing food for thought.
Lars Pete

Join Date: Nov 2020

Posts: 118
#14

24 Mar 2024, 23:03

Hi Jeff Wooldridge and Joao Santos Silva, thanks for your feedback.

In Stata,

Code:

. sum diff_MDsNonFederalandFederalT

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
diff_MDsNo~T |     20,529    5.728677    30.62172       -550        806

Stata gives an error saying the values must be >=0. So, I have to use another variable, which is still zero inflated but only takes values >=0.

On that point, Arvis_Shepherd_Poisson_and_Gravity_Final_25October 2011.pdf (uni-muenchen.de) says that, for Poisson, Santos Silva and Tenreyro (2006) used P-QMLE. Is that correct?
I ask because your original paper says to use P-PMLE.

Can P-PMLE and P-QMLE be used interchangeably for a zero-inflated count dependent variable? Are they the same thing, or should I use both and compare the results?

I am using xtpoisson with i.year and vce(robust). P-QMLE assumes no unobserved heterogeneity. Stata doesn't allow me to use i.year with xtpqml (P-QMLE). Also, it is distribution free and only the conditional mean needs to be correctly specified: Wooldridge (1999) and Wooldridge (2023).

I can see that I am able to use ppmlhdfe with or without FE, but the results are different. What is the most apt thing to do? I am doing a DiD/DDD type analysis where I have 5 types of treatments for county-level data with state-level staggered treatments. Thanks. I really appreciate it.

Last edited by Lars Pete; 24 Mar 2024, 23:49. Reason: added word: staggered
Joao Santos Silva

Join Date: Apr 2014

Posts: 3022
#15

26 Mar 2024, 01:02

Dear Lars Pete,

I do not think I fully understand your post, but here are some notes.

From the summary statistics you provide, it is not at all clear why you would use Poisson regression in this context. Also, your data may have zeros, but you cannot say that it is zero inflated unless you have a benchmark with respect to which you have inflation.

There are different acronyms for Poisson regression, but they all mean the same thing; I prefer PPML.

Of course the results differ if you include fixed effects; that is why we include them.
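
As an illustration of why the two sets of results differ, the sketch below contrasts the specifications. The variable names (county, state, year, y, treat) are hypothetical, and clustering at the state level is an assumption made for this example because treatment is assigned by state.

```stata
* Hypothetical names: county, state, year, y (count outcome), treat (treatment indicator)
* ppmlhdfe is community-contributed: ssc install ppmlhdfe (it also needs reghdfe and ftools)

* No fixed effects: the coefficient on treat also reflects differences
* across counties and across years
ppmlhdfe y treat, vce(cluster state)

* County and year fixed effects: treat is identified from changes within counties over time
ppmlhdfe y treat, absorb(county year) vce(cluster state)

* Official command for the same mean specification; the coefficient on treat should
* match the previous line, but the standard errors are clustered by county here
xtset county year
xtpoisson y treat i.year, fe vce(robust)
```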

Best wishes,

Joao