  • Confused about large dummies

    Hello everyone,

    I am using the Gini index as the main independent variable in research on the US. My sample covers each US state, and I use the Gini index for each state. The data are cross-sectional. In one model I control for year and industry effects (using dummies), and I cluster on both year and industry. My question is: should I control for state effects too? In this model, holding other things constant, the coefficient is positive when I add state dummies and negative without state dummies, and both results are significant.

    I am very confused now, because there is only one Gini index for each state (I don't use a Gini value for each year). What I am thinking is that the Gini index for each state may conflict with the state dummies, which would mean I should not use state dummies in this situation. Does anyone have any ideas? Thank you very much.

    Chen






  • #2
    Can anyone help?

    Comment


    • #3
      I find your question confusing. If you have only one Gini index value for each state, the inclusion of state indicators ("dummies") will just result in the Gini being excluded from the model due to collinearity with the state indicators. So I don't see how your result goes from positive to negative when including state indicators: it should go from positive to non-existent. So I don't understand what you have actually done.

      I suspect, at the end of the day, this is more a question about the appropriate model for your research question than it is about the specifics of estimating your model in Stata. So it would be very important for you to clearly state your research question.

      I think it would also help if you show the exact code you used and the exact responses you got from Stata so we can help you interpret them. Please read FAQ #12 before doing this so that you show them in the most helpful possible way.

      Comment


      • #4

        Hello Clyde,

        Thanks for your reply. By the Gini index for each state, I meant that I use each state's Gini index for a single year only. So there are 51 Gini indices.

        Now I am thinking the same: whether to include state dummies is more a matter of my research question. I include the commands and the two different sets of results below:

        The first includes year, industry, and state effects; the second includes only year and industry effects.
        Code:
        Logistic regression                              Number of obs   =      5,780
                                                         Wald chi2(92)   =    1662.11
                                                         Prob > chi2     =     0.0000
        Log pseudolikelihood = -2884.272                 Pseudo R2       =     0.2781

                                (Std. Err. adjusted for 335 clusters in yearffi17)
        ------------------------------------------------------------------------------
                     |               Robust
         wealthdummy |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                Gini |   .0707125   .0277964     2.54   0.011     .0162325    .1251924
           Education |     .05772   .0071271     8.10   0.000     .0437512    .0716888
              Income |   .3249641   .0425272     7.64   0.000     .2416123    .4083158
             Capital |  -6.282551   .5358669   -11.72   0.000    -7.332831   -5.232272
        ------------------------------------------------------------------------------

        Code:
        Logistic regression                              Number of obs   =      5,786
                                                         Wald chi2(45)   =     908.63
                                                         Prob > chi2     =     0.0000
        Log pseudolikelihood = -2959.0711                Pseudo R2       =     0.2601

                                (Std. Err. adjusted for 335 clusters in yearffi17)
        ------------------------------------------------------------------------------
                     |               Robust
         wealthdummy |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                Gini |  -.0079384   .0032614    -2.43   0.015    -.0143306   -.0015462
           Education |    .052284   .0060603     8.63   0.000      .040406    .0641621
              Income |   .3159207   .0413201     7.65   0.000     .2349349    .3969066
             Capital |  -6.457809   .5179353   -12.47   0.000    -7.472944   -5.442674
        ------------------------------------------------------------------------------









        Comment


        • #5
          OK. You did not show the commands that generated these outputs, nor did you post them in a code block as FAQ #12 suggests, but there is enough readable information here for me to get some idea of what you are doing. Thank you.

          The biggest problem is that you are using the -logistic- command. While you have to some extent compensated for lack of independence of observations by using cluster robust standard errors, you are still failing to capture the hierarchical nature of your data. You have a fairly complex design, as best I can tell, where observations represent combinations of industries and states and years. Perhaps your data are even at some finer level of observation such as firms within those? At the very least, it seems industries are nested within states, and years of observation are nested within industries, is that right?

          Anyway, you cannot get the equivalent of a fixed effects logistic regression by adding indicator variables for the fixed effects. That works for a linear model, but not for the logistic. (Estimations obtained using indicator variables in this way are not consistent.) At a minimum you will need to use an -xt- model such as -xtlogit-, and you may need to go to a multilevel model such as -melogit-. (In particular, in a state-fixed effects model you will not be able to estimate an effect of Gini given that in your data Gini is constant within states.)

          If you would like more guidance for developing your model, please provide a more detailed explanation of the structure of your data (perhaps it would be most clear if accompanied by a small, but representative, sample of your data.) Do read the FAQ, especially section 12. If you do post example data, be sure to use the -dataex- command. If you post more results, be sure to also show the code that led to them, and make sure you post code and results by copy/pasting directly from Stata's results window or your log file into a code block on the Forum.
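          For example, a data example can be posted with -dataex- along the following lines. This is just a sketch: the variable list is my guess based on what you have described and will need to match your actual variable names.

          Code:
          * install -dataex- from SSC if your copy of Stata does not already include it
          ssc install dataex
          * show a small, representative slice of the data (variable names assumed from this thread)
          dataex wealth gini education income capital year ffi17 states in 1/20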

          Comment


          • #6
            I tried to prepare a data example, but it seems such samples are not very useful for explaining the situation. Here are some facts about this model:

            1. The variable Gini is constant within a state, so there are 51 Gini coefficients, as you mentioned.

            2. My sample is indeed at the firm level. It consists of firms observed in a single, particular year, so the same firm never appears a second time in the sample (it is not panel data). But the firms come from different industries and states in the US.

            The command I used for this model is:

            Code:
            logit wealth gini education income capital i.year i.ffi17 i.states , cl(yearffi17)
            where ffi17 is the industry classification with 17 categories. The clustering variable has 364 clusters, and I created "yearffi17" with the command:

            Code:
            egen yearffi17=group(year ffi17)
            For the case without state effects, I simply remove i.states from the command, as shown below.
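            That is, the same command with i.states removed:

            Code:
            logit wealth gini education income capital i.year i.ffi17 , cl(yearffi17)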

            In this case, I think industries are not nested within states.

            Thanks in advance.


            Comment


            • #7
              OK, I have a clearer picture now.

              The reason you appear to get an estimate of the Gini effect when you include state effects (never mind whether it is a correct estimate or not) is that when Stata encounters a collinearity problem, it usually (and in your case) drops the last variable named in the command. Since you put gini early in your varlist and i.states at the end, what Stata did (and you probably did not notice) is omit the last state indicator variable due to collinearity, leaving gini in. Given that there are 51 states, there should be 50 state indicators, but if you look carefully you will see that only 49 were included. It's easier to see with a smaller example:

              Code:
              webuse grunfeld, clear
              //    CREATE A NEW VARIABLE THAT IS CONSTANT WITHIN COMPANY
              set seed 1234
              by company (year), sort: gen new_var = runiform() if _n == 1
              by company (year): replace new_var = new_var[1]
              
              //    ILLUSTRATE OMISSION OF COLLINEAR VARIABLES
              regress mvalue invest new_var i.company // OMITS ONE COMPANY IN ADDITION TO REFERENCE
              regress mvalue invest i.company new_var // OMITS new_var
              Consequently your model including gini and state effects does not give you an estimate of the gini effect adjusting for state effects. It only appears to because you did not notice that an insufficient number of states were represented. In fact, estimating the gini effect with state-level fixed effects is, in principle, impossible in your data.

              Since the same firms do not recur in different years, they do not form a level in your model. So I guess you really just have industry and state recurring in either a crossed or multiple-membership pattern. Probably the most compact, most efficient way to do this would be something like this:

              Code:
              xtset yearffi17
              xtlogit wealth gini education income capital, fe
              You will have absorbed the year- and industry-level effects, and you can get an estimate of the gini effect. But you may have omitted-variable bias due to unmodeled (and perhaps unobservable) state-level effects.

              In my own line of work, where we do not worry so much about the assumptions underlying random effects models, I would actually probably do this as a mixed model such as:

              Code:
              melogit wealth gini education income capital i.year || _all:R.ffi || state:
              which will treat year as a fixed effect and give you crossed random effects for industry and state. Because state effects are modeled as random, you can still get an estimate of gini effect. Whether the use of random effects modeling would be acceptable in your discipline is a question you should consult a colleague about.

              Comment


              • #8

                Thank you so much for your explanation. You mentioned that "estimating the gini effect with state-level fixed effects is, in principle, impossible in your data" just because Gini is constant within a state for the whole sample period. Does this mean I also cannot include state dummies in an OLS model? In other words, when I include the state dummies, the collinearity issue means there should be no result at all, but Stata just omitted part of the state dummies automatically and generated a "wrong" result for me. So what about clustering by states instead? Does that make sense?

                I actually noticed that something was omitted, but I thought it was normal... So basically, if an indicator variable is partially omitted, does that represent a problem, meaning we should not use that indicator variable?


                And the last command,

                Code:
                melogit wealth gini education income capital i.year || _all:R.ffi || state:
                returns me the following:

                Code:
                Fitting fixed-effects model:

                Iteration 0:   log likelihood = -2856.7272
                Iteration 1:   log likelihood = -2680.9148
                Iteration 2:   log likelihood = -2676.8233
                Iteration 3:   log likelihood = -2676.8219
                Iteration 4:   log likelihood = -2676.8219

                Refining starting values:

                Grid node 0:   log likelihood = -2619.4

                Fitting full model:
                Thanks,

                Chen


                Comment


                • #9
                  Thank you so much for your explanation. You mentioned that "estimating the gini effect with state-level fixed effects is, in principle, impossible in your data" just because Gini is constant within a state for the whole sample period. Does this mean I also cannot include state dummies in an OLS model? In other words, when I include the state dummies, the collinearity issue means there should be no result at all, but Stata just omitted part of the state dummies automatically and generated a "wrong" result for me. So what about clustering by states instead? Does that make sense?
                  That is right. When you have a variable that is constant within state, its effect cannot be estimated in any model that includes a complete representation of state in fixed effects, whether it be -xtreg, fe- or -regress...i.state-. No matter what you try, you cannot jointly estimate the effect of gini and fixed effects for state. If you think you have accomplished that in some analysis, look more closely and you will find that something, somewhere, is missing. The results are not what they appear to be.
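                  To see this concretely, here is a sketch that re-uses the toy -grunfeld- example from #7: a fixed-effects (within-company, analogous to within-state) regression simply drops the regressor that is constant within the panel.

                  Code:
                  webuse grunfeld, clear
                  //    new_var IS CONSTANT WITHIN COMPANY, JUST AS gini IS CONSTANT WITHIN STATE
                  set seed 1234
                  by company (year), sort: gen new_var = runiform() if _n == 1
                  by company (year): replace new_var = new_var[1]
                  
                  //    THE FIXED-EFFECTS ESTIMATOR CANNOT ESTIMATE ITS COEFFICIENT
                  xtset company year
                  xtreg mvalue invest new_var, fe   // new_var is omitted because of collinearity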

                  The term "clustering by states" can have many meanings, and I'm not sure what you're referring to. Using vce(cluster state) is perhaps better than nothing, but it is not sufficient to accomplish one of the main purposes of fixed-effects models: eliminating bias from unobserved variables at the state level.
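                  If what you have in mind is simply cluster-robust standard errors at the state level, that would look something like the following sketch (it uses the variable names from your earlier posts, which you may need to adapt):

                  Code:
                  logit wealth gini education income capital i.year i.ffi17 , vce(cluster states)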

                  The -melogit- command with crossed-effects is estimating a large number of parameters at three levels, and has to repeatedly calculate statistics from your rather large data set. It is going to run very slowly. Depending on your computer, it would not surprise me for a model like this to take a day or two to finish estimating. And multi-level logistic models sometimes fail to converge altogether. To satisfy yourself that you're on the right track, you might try running it first on a much smaller sample, say one with only 10 states and 5 industries. That will reduce both the number of parameters and the size of the estimation sample and should run in a reasonable amount of time. Then, having gained some experience with it, you can more confidently try running it on the full sample--just set it aside to run for a long time and be patient.
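                  One way to carve out such a test sample might be something along these lines. This is only a sketch: state_id and ind_id are hypothetical helper variables, and the -melogit- line mirrors the command in #7, so adjust the variable names to your data.

                  Code:
                  preserve
                  //    KEEP ROUGHLY 10 STATES AND 5 INDUSTRIES FOR A QUICK TEST RUN
                  egen state_id = group(state)
                  egen ind_id = group(ffi)
                  keep if state_id <= 10 & ind_id <= 5
                  melogit wealth gini education income capital i.year || _all:R.ffi || state:
                  restore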

                  If this model ultimately fails to converge, you might try -meqrlogit-. It estimates the same model as -melogit- but uses a different algorithm and often converges where -melogit- fails.
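                  That would be essentially the same command with -melogit- swapped out, along these lines (mirroring the specification from #7):

                  Code:
                  meqrlogit wealth gini education income capital i.year || _all:R.ffi || state: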

                  So basically, if an indicator variable is partially omitted, does that represent a problem, meaning we should not use that indicator variable?
                  It is normal for a set of indicator variables for a categorical variable to have one category omitted. When more categories are omitted something is anomalous. It may be that the estimation sample, due to the distribution of missing values, turns out not to represent some of those categories--which may or may not be a problem depending on whether your data set should have those missing values or not. Or, as in the situation here, it may indicate that you have tried to include in your model another variable that is constant within the categories of that variable. The point is that when the expected number of indicator variables does not show up, you need to investigate why. Not using that variable may be the solution; or dropping some other variable may be. Or perhaps your data set has missing values that shouldn't be there and your data set needs to be fixed up. Or maybe your estimation sample doesn't represent all the values of that variable: and this may be a sign of bad data or may be perfectly OK. So there is no general rule about what to do to fix it. But there is, indeed, a general rule that missing indicators often signal a problem and you need to find out what is going on.
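                  For example, one way to investigate would be something along these lines. This is a sketch: the -logit- line is your command from #6, and the checks after it are generic.

                  Code:
                  logit wealth gini education income capital i.year i.ffi17 i.states , cl(yearffi17)
                  //    HOW MANY DISTINCT STATES ARE ACTUALLY IN THE ESTIMATION SAMPLE?
                  quietly tabulate states if e(sample)
                  display "distinct states in estimation sample: " r(r)
                  //    ANY TERM STATA DROPPED FOR COLLINEARITY CARRIES AN o. OPERATOR IN e(b)
                  local names : colnames e(b)
                  foreach v of local names {
                      if strpos("`v'", "o.") display "omitted: `v'"
                  }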


                  Comment


                  • #10
                    Your comments are really helpful. Now everything is clear. Thank you again for the help!

                    Comment
