Negative Binomial Regression on pooled panel data

Anton Kraft

Join Date: Nov 2017

Posts: 10
#1

22 Nov 2017, 10:59

Hello everyone,

I am new to this forum and Stata and have a question regarding my research project.

I want to analyse a set of variables and their influence on the location decisions of European multinational companies, i.e. find out whether factors like labor costs and infrastructure in a country increase the likelihood that European firms invest in that country.

I look at five different (target) countries, a big number of firms and a time span of 10 years (panel data set). I want to do the analysis using a count data regression. This means my dependent variable is a count variable counting the number of subsidiaries of firm i in country j in year t, which gives me a number of T*I*J observations. From related papers I know that a zero-inflated negative binomial regression model is most suitable. I also believe that I have understood the basic function of the negative binomial model and the reason for the zero-inflation. I was also able to create the count variables in Stata.

However, I don't know how to actually get the model to work from there on. From papers on similar research as well as statistics books I find that it is not possible to use the zero-inflated negative binomial model on panel data. The papers I read therefore ignore the panel structure and pool the data, i.e. use pooled estimation techniques. I believe I understand what that means. However, many problems seem to arise from this procedure, like "correlation of standard errors". The solutions seem to be using year-fixed and/or country-fixed and/or firm-fixed effects as well as "clustering standard errors". Unfortunately, this is where the papers don't go into more detail, and I have no idea what is meant by that.

I spent the last days reading statistics books and haven't really made any progress. That's why I am asking here for help. Could anyone please explain what I have to take care of when "pooling panel data" in my specific case (it is special, since I look at different countries AND different firms, right?). What kind of fixed-effects do I have to consider and why (and maybe already how can I do this in Stata)? What is clustering standard errors and why and how can I do this?

Thanks a lot in advance! Any advice, literature reference and explanation would be highly appreciated.

Best regards,
Anton
Bobby Wood

Join Date: Jul 2017

Posts: 39
#2

22 Nov 2017, 11:25

Hi Anton,
There is no built-in command for zinb models with fixed effects in Stata, but in your case you could simply add dummies for each country, which would do the trick.

Example: y is your count variable, x1 - x3 are your independent variables, year is your time variable and cid is a variable on the country IDs.

You could then estimate a zero inflated negative binomial model as follows:

zinb y x1 x2 x3 i.year i.cid , inflate(x1 x2 x3) cluster(cid)

In this case you would have (unconditional) country and year fixed effects and the standard errors would be clustered on year basis.

One further recommendation: instead of just stating that you use zinb models, you could use the user-written Stata program countfit to test whether a zinb model is indeed more appropriate than zip, standard poisson, or negbin models.
I can also highly recommend the textbook "Count Data" by Joseph Hilbe, which provides an excellent introduction to count data models with many empirical applications in Stata and R.
Anton Kraft

Join Date: Nov 2017

Posts: 10
#3

22 Nov 2017, 12:31

Hi Bobby,

thanks a lot for your fast response!

The book you recommended by Joseph Hilbe is indeed awesome and the most helpful book on this matter that I have encountered so far, but it doesn't talk about panel data.

Concerning your response:

1) Could you explain what it means that standard errors are clustered by countries on a year basis? What do I have to search for in the index of a statistics book or on google if I want to learn more about that? What kind of problem does it solve? I find a lot about cluster analysis, but that seems to be something different.

2) Could you please briefly explain, what exactly an unconditional country fixed effect does in this case?

3) Unconditional (and conditional) fixed effects are something I come across a lot when reading about panel data models. Is there a difference between using fixed effects as you do above and "fixed effects panel data models"? So is what you do above a panel data model, or does it neglect the panel data structure and "pool" the data to analyse it like cross-sectional data?

4) Does your above procedure eliminate all kinds of correlation of the data that arises from the panel data structure?

I hope my questions aren't too diffuse
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

23 Nov 2017, 00:08

Anton:
you may want to take a look at: https://www.stata.com/bookstore/micr...metrics-stata/

Kind regards,
Carlo
(Stata 19.0)
Bobby Wood

Join Date: Jul 2017

Posts: 39
#5

23 Nov 2017, 03:54

Hi Anton, sorry: the standard errors are clustered on cid, not on year (as you can see from the code).
I also encourage you to read an introductory econometrics textbook: if you want to estimate a panel model, you should at least know the absolute basics, such as standard error adjustments and what fixed effects are.

Best,
Bobby
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

23 Nov 2017, 04:11

Anton:
the reference I pointed you to in #4 covers what Bobby wisely suggests you should take a look at.

Kind regards,
Carlo
(Stata 19.0)
Anton Kraft

Join Date: Nov 2017

Posts: 10
#7

23 Nov 2017, 10:48

I went to the library right in the morning and worked through the book today. It is indeed very helpful and also written in a way that makes it understandable for people who work with statistics on this level for the first time. Thanks a lot for that advice! And you were both right, I was only looking through too specific books focusing on count data models, skipping a lot of basics.

Just to clarify my understanding: what you (@ Bobby) suggest in #2 is what is called a "two-way fixed-effects specification", which allows the intercept to vary over countries and time. That way I control for endogeneity problems. By clustering the standard errors by country I avoid problems of serial correlation and heteroscedasticity. Is that correct?

Best regards,
Anton
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#8

23 Nov 2017, 11:38

Anton:
you're correct that, under -xtreg-, clustering (or robustifying; unlike -regress-, under -xtreg- they do the same job) your standard errors takes both serial correlation and heteroskedasticity into account. This is not necessarily true for other -xt- commands: -xtnbreg-, for instance, does not offer the option -vce(cluster panelid)-. In any case, the prerequisite is that you have a large N, small T panel dataset.
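
As a minimal sketch of the point about -xtreg-, the two fixed-effects fits below should report the same standard errors. The dataset (nlswork, downloaded by -webuse-, so an internet connection is needed) and the regressors are chosen purely for illustration and have nothing to do with Anton's data.

```stata
* Illustration only: nlswork is a Stata example dataset, not the thread's data
webuse nlswork, clear
xtset idcode year

* Fixed-effects fit with "robust" standard errors
xtreg ln_wage age tenure, fe vce(robust)
estimates store robust

* Same model, standard errors explicitly clustered on the panel identifier
xtreg ln_wage age tenure, fe vce(cluster idcode)
estimates store clustered

* Coefficients and standard errors side by side for comparison
estimates table robust clustered, se
```

The same equivalence should not be assumed for other estimators; check which -vce()- types each command's help file lists.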

Kind regards,
Carlo
(Stata 19.0)
Anton Kraft

Join Date: Nov 2017

Posts: 10
#9

24 Nov 2017, 05:31

Ok thanks, that is the case, N is much larger than T.

I am still a bit confused about the "three-dimensional" character of my data (not just firms and years, but firms, countries and years). For example, if I use the xtset command (xtset firm_ID year), I get an error, because firm_ID and year together don't uniquely identify the observations. The structure needed would probably be this:

year   firm ID   depvar   indepvar
2001   01
2001   02
2002   01
2002   02

However, it is structured like this:
year   country    firm ID   depvar   indepvar
2001   country1   01
2001   country1   02
2001   country2   01
2001   country2   02
2002   country1   01
...    ...        ...

And firm_ID appears at least twice in one year. How can I solve this problem? Can I tell Stata at this point that the data is also grouped by countries or do I have to reshape the data in some way?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#10

24 Nov 2017, 06:05

Anton:
you may want to consider the -egen- -group()- function to create a new panelid that takes both firms and countries into account:
[CODE]egen new_panelid=group(firm country)[/CODE]
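
A minimal sketch of how this plays out, on a toy dataset typed in by hand. The variable names (year, country, firm, y) and all values are hypothetical and only mimic the layout described in #9.

```stata
* Hypothetical toy data: 2 firms x 2 countries x 2 years
clear
input int year str8 country int firm int y
2001 "country1" 1 0
2001 "country1" 2 3
2001 "country2" 1 1
2001 "country2" 2 0
2002 "country1" 1 2
2002 "country1" 2 1
2002 "country2" 1 0
2002 "country2" 2 4
end

* firm and year alone do not identify a row, so this xtset fails
* (capture noisily shows the error without stopping the do-file)
capture noisily xtset firm year

* One panel per firm-country pair
egen new_panelid = group(firm country)

* Confirm that panel id and year now uniquely identify each observation
isid new_panelid year
xtset new_panelid year
```

No reshaping is needed under this approach; the data stay in long form and only the panel identifier changes.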

Kind regards,
Carlo
(Stata 19.0)
Anton Kraft

Join Date: Nov 2017

Posts: 10
#11

30 Nov 2017, 07:39

Hello everyone,

I have worked my way towards implementing the model proposed in #2, but I come across two problems in Stata that I would like to share here:

The first question is about my independent variables. As I explained in #1, I want to analyse the effect of country-specific characteristics like labor cost, GDP, infrastructure etc. on the location decisions of multinational firms. Altogether, I have 15 different independent variables and I want to test them for multicollinearity. The two ways I know to do that are, first, to look at a correlation matrix and, second, to look at variance inflation factors. However, I am wondering if the results even make sense in my case (panel data covering both firms and countries). Is it correct to run the correlate command on my entire sample, or will the results not be interpretable because of my data structure [see second table in post #9]?

And concerning the variance inflation factors: I did a normal regression (regress command) on my data and then used the estat vif command, but the results are huge. I have VIF values of up to 455, which I cannot explain in a theoretical way at all, so I believe it is a statistical matter. Can I work around that or can I simply not use VIFs on my data?

The second question concerns goodness-of-fit tests for the model proposed in #2. I would like to see whether omitting, adding or transforming certain variables improves the statistical fit of the model, for example with a likelihood-ratio test. However, it seems to me that none of the tests I find in the books work when I use the cluster() option to cluster on countries and get robust standard errors. So far all I can do is look at the AIC and BIC using the estat ic command and check which model has the smaller values. Can anyone tell me if that's really all I can do, or are there other tests I can run while using cluster()? Or is it appropriate to leave out cluster() while doing the tests and add it only after I have found the best-fitting model?

Thanks a lot in advance for any advice