  • Clustering for repeated cross-section difference-in-differences

    Hello,

    I'm having some trouble deciding at which level to cluster my regressions. I have a repeated cross section of individuals: the dataset includes only individuals who are observed in the year they have a newborn child, so there may be repeat observations for anyone who has more than one child during the study period of 1990-2011. Approximately 17% of individuals have more than one child in the dataset, with one individual having 7 children in that period.

    The method I am using is difference-in-differences with staggered policy introduction (25 states have introduced a policy; 26 states have no policy), and I include state fixed effects (i.state) and year fixed effects (i.year) along with my policy variable. My outcome is a non-negative, overdispersed count variable, so I'm using a negative binomial model.

    I am wondering whether I should be clustering at the state level or the individual level to control for serial correlation. And if I do cluster at the state level, can I include state fixed effects as well?
    The problem is that I have few observations, and therefore not many clusters for this dataset: for example, there is only 1 individual in Alabama in the year 1993, and my Wald statistic is missing when I run -nbreg y policy x i.state i.year, vce(cluster ID_person)- and also when I run -nbreg y policy x i.state i.year, vce(cluster state)-.

    Could anyone offer advice as to what I should do in this case? At which level should I cluster, and how does a missing Wald statistic affect my interpretation?
    Is there another post-estimation test I could use to determine the joint significance of the variables after -nbreg- if the Wald statistic is missing?
    Or is 17% repeated observations not enough to warrant clustering?

    Thank you in advance for any help! I'm completely lost as to what is appropriate in this case.

    Surya

  • #2
    Surya:
    as far as I understand your query, given the features of your dataset, clustered standard errors do not seem to be the way to go.
    Anyway, before giving up on clustered standard errors, I would investigate whether -egen- with the -group()- function can help you out in grouping individuals and states.
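
    A minimal sketch of that idea, assuming the identifiers are named ID_person and state as in the original post:

    Code:
        * one numeric identifier per individual-within-state combination
        egen cluster_id = group(state ID_person)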
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      I think I would take a different approach altogether here. Since most of your parents occur only once in the data set, if you cluster at that level most of your clusters will be singletons, which is problematic. Moreover, your observations will still not be independent across parent-level clusters, because there is dependence within states. So I would be inclined to cluster at the state level. As for Alabama, you might get around the singleton problem there by combining Alabama with another state that is similar for the purposes of your study (Mississippi?) and treating those two states as a single cluster. You will be left with almost 50 clusters, which is probably adequate.
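
      A hedged sketch of that workaround, assuming state holds numeric FIPS codes (Alabama = 1, Mississippi = 28); the cluster_state variable is hypothetical:

      Code:
          * fold the Alabama singleton into Mississippi's cluster
          generate cluster_state = state
          replace cluster_state = 28 if state == 1
          nbreg y policy x i.state i.year, vce(cluster cluster_state)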

      Part of the difficulty you are encountering is that you are trying to squeeze a three-level data set into a two-level model. What you really have here is three nested levels: the individual observation, the parent (mostly singletons, but some not), and the state. So if I were working on this problem, I would resolve the dilemma in one of two ways:

      1. Make the data two-level. Since most parents have only one child, reduce the data set to just one child per parent. You might do that either by keeping only the first child for each parent having more than one, or by selecting one at random (see the sketch after this list).

      2. Go to a three-level model using -menbreg-. I don't know what discipline you are working in. I know that in finance and economics, mixed models are viewed askance because their estimates are consistent only under strong, unverifiable assumptions. Nevertheless, shoehorning a three-level data structure into a two-level model is not a way to get correct estimates either: it is a deliberate rather than an inadvertent mis-specification of the problem.
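
      A sketch of both options, using the variable names from the original post (the seed value is arbitrary):

      Code:
          * option 1a: keep only the first child observed for each parent
          bysort ID_person (year): keep if _n == 1

          * option 1b: or keep one observation per parent at random
          set seed 12345
          generate double u = runiform()
          bysort ID_person (u): keep if _n == 1

          * option 2: three-level mixed-effects negative binomial,
          * with parents nested within states and year dummies as fixed effects
          menbreg y policy x i.year || state: || ID_person: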


      Finally, as an aside, I should point out that if your intent is to have a fixed-effects nbreg model with state and year fixed effects, you cannot emulate that model by running -nbreg ... i.state i.year-. You must use -xtnbreg- after -xtset state year- to get the state fixed effects, and then you can include i.year in your list of regressor variables.
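
      A minimal sketch of that setup, using the variable names from the original post; declaring only -state- as the panel variable (no time variable) is one workable choice here, since individuals repeat within state-years:

      Code:
          xtset state
          * conditional fixed-effects negative binomial with year fixed effects
          xtnbreg y policy x i.year, fe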



      • #4
        Thanks Clyde for the suggestions! I had considered just taking the first observation for each parent before; I'll try this.
        After I run my complete model with demographic and economic covariates, I have 2 singleton states, with 3 states omitted because they have no observations, so if I were to combine the 2 singleton states with neighboring states I would then have fewer than 50 clusters. I read somewhere that if there are fewer than 50 clusters, it is not good to cluster?

        Final question, would including i.state as state fixed effects control for serial correlation?



        • #5
          Thanks Carlo for the suggestion! Would you not recommend clustering because of the few observations/clusters? I have 3,000 individuals across 50 states over 22 years, and there are some states with no individuals in certain years.



          • #6
            So it sounds like you would have 45 clusters. While it would be nice to have more, I think most people would accept 45 as sufficient. There is no hard-and-fast limit that I know of.

            One thing you could do is a robustness check: run it with both state-clustered and unclustered VCE and see if the results differ much. If they don't, you're on safe ground either way.
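
            A sketch of that check, using the -nbreg- specification from the original post:

            Code:
                * unclustered VCE
                nbreg y policy x i.state i.year
                estimates store unclustered
                * state-clustered VCE
                nbreg y policy x i.state i.year, vce(cluster state)
                estimates store clustered
                * compare coefficients and standard errors side by side
                estimates table unclustered clustered, b se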

            You should not be including i.state in your model. You should -xtset state-, or perhaps -xtset state year-, and run -xtnbreg ..., vce(cluster state)-. That will give you what you need. If you try to add i.state to the -xtnbreg- model, Stata will omit those variables because they will be collinear with the state-level fixed effects implicit in -xtnbreg-. If you plan on just running -nbreg ... i.state-, do not do that: it is not equivalent to a fixed-effects nbreg model. That trick only works for linear regression (whether you cluster the VCE on state or not).



            • #7
              Originally posted by Clyde Schechter
              One thing you could do is a robustness check: run it with both state-clustered and unclustered VCE and see if the results differ much. If they don't, you're on safe ground either way.
              Thanks Clyde, I just did that and my results don't change much, but the Wald statistic is missing when I use the state-clustered VCE. Could I do -testparm- after the regression to test the joint significance of the variables if the Wald statistic is missing?

              Originally posted by Clyde Schechter
              You should not be including i.state in your model. You should -xtset state-, or perhaps -xtset state year-, and run -xtnbreg ..., vce(cluster state)-.
              Ah ok, I've set it to -xtset state-, as -xtset state year- does not work since I have a repeated cross section of individuals. -xtnbreg- doesn't work with vce(cluster state) either; it seems to allow only vce(bootstrap), vce(jackknife), and vce(oim). But it does work with -xtpoisson-, so I will see if that is appropriate given the large number of zeros in my outcome variable.
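
              For what it's worth, a minimal sketch of that alternative; with -xtpoisson, fe-, vce(robust) is cluster-robust at the panel (here, state) level:

              Code:
                  xtset state
                  * fixed-effects Poisson with year dummies and state-clustered SEs
                  xtpoisson y policy x i.year, fe vce(robust)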



              • #8
                could I do testparm after the regression to test for joint significance of variables if the Wald statistic is missing?
                You could, but you will get the same missing result! -testparm- just does the same Wald test.
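
                For example, after the -nbreg- fit, jointly testing all the coefficients just reproduces the same (missing) statistic:

                Code:
                    testparm policy x i.state i.year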

                Why are you so concerned with getting that full-model test anyway? What research question does it answer? Usually nobody even bothers to look at it because it tests a hypothesis of absolutely no interest to anybody.



                • #9
                  Hi Clyde, I've been running my models with -xtset state- and -xtnbreg, fe- to include state fixed effects, as you suggested. I have also computed marginal effects after these models, and they differ greatly from those after the -nbreg- model. I know that this is because -xtset state- sets my panel variable to state rather than ID_person. What are the implications for interpreting the marginal effects? Am I right in thinking that the marginal effects now refer to states rather than individuals?



                  • #10
                    Am I right in thinking that the marginal effects refer to states now, rather than individuals?
                    I don't understand this question. Marginal effects are (an approximation to) the difference in outcome associated with a unit difference in a variable. Whether the difference in the variable arises from differences at the individual level or is a wholesale difference does not matter.

                    What is important to bear in mind, though, is that with fixed-effects models, all effects (and marginal effects) refer to within-panel (in your case, within-state) differences. The effects of differences between states are not estimated in fixed-effects models. By contrast, in -nbreg-, the effects measured are a blend of within- and between-state effects. Since the two models are estimating different effects, the results can be, and often are, very different.

