Panel Data with fixed and industry effects. Big confusion

Daniel Wegter

Join Date: May 2015

Posts: 1
#1


12 May 2015, 05:35

Dear Statalistforum,

This is my first post here and I am relatively new to Stata. Hi everyone! I have never used this forum before, but I am desperately in need of answers to some confusions I have in my research. After spending half a day searching I am only more confused, so maybe you could give some guidance.

I am writing my final year thesis on the following:

I am checking for the effect of family control during the 2008-2009 crisis on a couple of dependent variables (investment, financing, employees).
- 1 country (so no country fixed effects)
- 383 firms (with Firm ID)
- 4 years (2006,2007=0 and 2008,2009=1) so around 1500 observations
- 33 industries
- 5 or 6 control variables
- family control dummy * crisis period dummy and non-family control dummy * crisis period dummy (to measure differences between pre-crisis and crisis)

I use a panel data regression with firm fixed effects and I use the following commands before my regression:
gen crisis=0
replace crisis=1 if Year==2008
replace crisis=1 if Year==2009

gen familycontrol_crisis= familycontrol* crisis
gen nonfamilycontrol_crisis= nonfamilycontrol* crisis

gen industry_year= Industry* Year
quietly tabulate industry_year , generate( industry_year )

tsset FirmID Year

I hope the above commands and dummies are correct, and if so I have the following questions:
a. I use Industry*Year because my industry variable does not change over the years. However, this creates 132 dummy variables and my total number of observations is not that large. I was taught that you need around 10 observations per independent variable. Is this true? Does this matter? I am not really interested in the coefficients of the industry dummies, but would including these dummies give me wrong results?

b. If so, can I exclude the industry dummy and assume that the industry effect is captured in the firm fixed effect (Firm ID)?

c. I use the following command for my regression: xtreg dep variable familycontrol_crisis nonfamilycontrol_crisis other independent variables industry_year_full_1- industry_year_full_132, fe

This gives me several R^2 values, but no adjusted R^2. If I type the command to display the adjusted R^2, it shows a negative one. Even when I exclude all industry effects it stays negative, although I am sure that my model is correct (I follow another research paper). Is this bad? Which R^2 should I focus on? It gives me a within, an overall, and a between R^2.

d. I am interested in the difference in coefficient between familycontrol_crisis and nonfamilycontrol_crisis so after the xtreg, fe regression I type: test familycontrol_crisis=nonfamilycontrol_crisis. Is this the correct method to test for this?

e. I also experimented by reducing the number of industries, using only year effects, etc. This all gives me somewhat different results, but I have no idea which statistic to focus on to say OK, this is the model I go with. Do I need to focus on the R^2, the coefficients, the F-statistics, or something else?

f. I also use some log variables, for instance Log(Employees). This gives me a very high overall R^2. How can this be? Is this normal?

As you can probably see, I am not that experienced in statistics. However, I am very much struggling with these questions so any help would be highly appreciated. Many thanks in advance.

Best regards,
Daniel
Tags: fixed effects, industry effects, panel data, rsquared
Andrew Musau

Join Date: Oct 2014

Posts: 10221
#2

12 May 2015, 10:07

Since firms are within industries, the industry effect is captured by the firm dummies: industry dummies are therefore not necessary in the presence of firm dummies. Adjusted R2 is a meaningless statistic in the context of a fixed effects regression: See discussion below

http://www.statalist.org/forums/foru...sted-within-r2

So you can consider the within R2 (which is the fixed effects R2), or run the regression using LSDV (OLS with dummies) and pick out that R2 (this is what other software such as EViews reports, and what a good number of researchers report).

Code:

* LSDV estimation
reg y x1...xn i.firm i.year
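A minimal sketch of how the two R2 values could be put side by side, assuming two illustrative regressors x1 and x2 in place of x1...xn and panel identifiers named firm and year as in the line above. Here areg is swapped in for the explicit firm dummies; it absorbs the firm effects and reports the same R2 as the LSDV regression without printing one coefficient per firm.

```stata
* Within R2 from the fixed-effects estimator
xtset firm year
xtreg y x1 x2 i.year, fe
display "within R2 = " e(r2_w)

* LSDV R2 without listing every firm dummy: absorb the firm effects
areg y x1 x2 i.year, absorb(firm)
display "LSDV R2 = " e(r2)
```

The slope coefficients should be identical across the two commands; only the reported R2 differs, because the LSDV figure also credits the firm dummies with explaining variation.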

I recommend that you run an F-test to determine whether or not to include the time dummies in the regression. Use the following command after xtreg y x1...xn i.year, fe:

Code:

testparm i.year

If the F statistic is significant, you should include the time dummies; leaving them out would expose your estimates to omitted variable bias. If it is not significant, exclude them. Finally, you need a reason for transforming your variables, and that decision is made prior to estimation (it is a model selection issue rather than an estimation issue).
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

12 May 2015, 11:19

What you're not getting in the firm effect is the industry-by-year effect. I have sometimes calculated the industry average of the dependent variable (excluding the firm of interest) and used this as a control for industry-year effects.
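One way such a leave-one-out industry average might be built is sketched below. Industry and Year are the variable names from the original post; depvar, ind_sum, ind_n and ind_avg_excl are hypothetical names chosen for illustration.

```stata
* Industry-year sum and count of the dependent variable (non-missing values only)
bysort Industry Year: egen double ind_sum = total(depvar)
bysort Industry Year: egen long ind_n = count(depvar)

* Leave-one-out mean: remove the firm's own value from the industry-year sum.
* Left missing when the firm is alone in its industry-year cell.
gen double ind_avg_excl = (ind_sum - depvar) / (ind_n - 1) if ind_n > 1 & !missing(depvar)
```

The resulting ind_avg_excl would then enter the xtreg, fe model as a single control in place of the 132 industry-year dummies.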
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#4

12 May 2015, 11:42

A couple of tangential points:

1. It is unusual, and usually incorrect, to specify a model that has interactions but does not include the corresponding main effects. Now, perhaps the familycontrol variable doesn't vary within firm over time, in which case it will be automatically dropped even if you include it. But the crisis variable should be present in the model in its own right, I think.

2. Isn't non-family control just 1-family control? So doesn't one of these drop due to multicollinearity (along with its interactions with crisis)?

3. In any case, rather than generating your own interaction variables, you would be better off using factor variable notation, so that then you can go on and use -margins- to get predictions, marginal effects, etc. Thus

Code:

// NOTE: THIS WILL INCLUDE BOTH MAIN EFFECTS AND INTERACTIONS
// IF familycontrol DOES NOT VARY WITHIN FIRM, STATA WILL OMIT IT ANYWAY
xtreg dep_variable i.familycontrol##i.crisis i.nonfamilycontrol##i.crisis /*etc.*/, fe
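The same notation could also stand in for the hand-made crisis and industry-year dummies from the original post. The following is only a sketch: it assumes Industry holds non-negative integer codes (factor-variable notation requires this), that crisis has not already been created, and it keeps dep_variable as a placeholder.

```stata
* Crisis indicator in one line: 1 for 2008-2009, 0 for 2006-2007
gen byte crisis = inlist(Year, 2008, 2009) if !missing(Year)

* Industry-by-year effects through factor-variable notation
* instead of tabulate, generate().
* Stata drops collinear terms automatically; crisis is a function of Year,
* so its main effect is not separately identified once these cells are in.
xtset FirmID Year
xtreg dep_variable i.familycontrol##i.crisis i.Industry#i.Year, fe
```

Writing i.Industry#i.Year also avoids relying on the product Industry*Year to label each industry-year cell, which is only safe when no two cells happen to give the same product.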

With regard to your question d, leaving aside my concerns in #2, your approach is correct. But I am having a hard time grasping, substantively, what this test actually means.
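If non-family control is indeed just 1 - family control, as raised in point 2, the comparison in question d collapses to a single coefficient. A sketch under that assumption, and under the further assumption that familycontrol is constant within firm:

```stata
* Assumes nonfamilycontrol == 1 - familycontrol, familycontrol fixed within firm
xtset FirmID Year
xtreg dep_variable i.familycontrol##i.crisis /* controls */, fe

* The interaction coefficient is the family minus non-family
* difference in the crisis effect
test 1.familycontrol#1.crisis
```

In this parameterization the coefficient on 1.crisis is the crisis effect for non-family firms, and the interaction is the extra effect for family firms, so testing the interaction against zero corresponds to testing familycontrol_crisis = nonfamilycontrol_crisis in the original specification.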

With regard to your question e, the first and most important determinant of how a model should be specified is the underlying scientific theory. Trying multiple models and then selecting one based on R2 or an F statistic carries a high risk of overfitting the noise in the data, especially if your data set is less than enormous.

With regard to question f, there is no way to answer this without knowing what the variables are and how they relate to each other. It may well be the case that a logarithmic transform of a variable produces a great improvement in model fit. What does theory in your field say?

Finally, I note that you are doing this for a thesis. You clearly have a lot of uncertainty about how to proceed. Your institution owes you a thesis advisor with whom you can discuss these questions and who will provide informed answers. (You are welcome to ask on this Forum, of course, too. But you paid your tuition: insist that you get what you paid for.)
Mohammed Makhlouf

Join Date: Jan 2016

Posts: 1
#5

27 Jan 2016, 02:15

Excuse me, how can I get Daniel Wegter's email?