
  • Appropriate Dimension for Clustering of Standard Errors

    Hi,

    I am trying to estimate the impact of directors' remuneration on firm performance. My unbalanced data set comprises 1696 firms (spread across 68 industries) and 16 time periods (years). I have 7 independent variables in total. I am estimating an industry fixed effects regression (by including industry dummies) along with time dummies. Since my model suffers from heteroskedasticity and autocorrelation, I want to calculate clustered standard errors. Below is my model.

    Code:
    reg Profitability4 Size2 Leverage1 CurrentRatio SalesGro CapitalExpenditure2 WPromoterSharesin1 AD_Totalremuneration i.Year i.NICCodeFirst2Digits, vce(cluster CompanyID)
    Question: Given that I am applying industry effects (by introducing industry dummies), should I cluster my standard errors around industry or company?

    P.S.: Each company belongs to a particular industry and this does not change across time. For instance, a company called Biocon belongs to Pharmaceutical industry and remains in this industry throughout the time period.

  • #2
    If I understood your study design correctly, a multilevel hierarchical approach would tackle several of the issues reported in your message. Please type -help mixed- and check it out.
    Best regards,

    Marcos
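
    For concreteness, a minimal sketch of what such a model might look like with the variable names from #1 (the firms-nested-in-industries structure is my reading of the design, so treat this as a starting point rather than a final specification):

    Code:
    mixed Profitability4 Size2 Leverage1 CurrentRatio SalesGro CapitalExpenditure2 WPromoterSharesin1 AD_Totalremuneration i.Year || NICCodeFirst2Digits: || CompanyID:

    This fits random intercepts at the industry level and at the firm level nested within industries; see -help mixed- for the full syntax.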



    • #3
      Hi Marcos,

      Thanks for your response. I checked out -mixed-. However, I am not exactly trying to fit a mixed model to my data. I am just trying to estimate a fixed effects regression wherein my fixed effects correspond to the industries to which my firms (i.e. my actual cross-sectional units) belong. Industry fixed effects are nested within firm fixed effects in the sense that applying firm fixed effects controls for the industry fixed effects, but not the other way round. Since I am using industry fixed effects in my regression, I would like to confirm whether I should also cluster my standard errors at the industry level, or whether I should cluster them at the company level.



      • #4
        The theory is simple: you should cluster at the level where your error terms are correlated. Hence if you think that error terms of different companies from the same industry are uncorrelated, then you should cluster at the company level (as you seem to be doing now).

        Practically, I would say that you should cluster at the industry level. 68 industries are enough clusters, and it is safer this way. I can come up with various reasons why error terms of different companies are correlated within the same industry, even after the inclusion of industry fixed effects.
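
        Concretely, that would mean replacing the clustering variable in the command from #1 (assuming NICCodeFirst2Digits is the industry identifier, as the fixed effects suggest):

        Code:
        reg Profitability4 Size2 Leverage1 CurrentRatio SalesGro CapitalExpenditure2 WPromoterSharesin1 AD_Totalremuneration i.Year i.NICCodeFirst2Digits, vce(cluster NICCodeFirst2Digits)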





        • #5
          Thanks a lot, Joro Kolev, for explaining the rationale for selecting a particular dimension for clustering standard errors. Applying this logic, error terms of different companies can be correlated irrespective of their industry classification. In the same way, I agree with you that there may be sufficient theoretical reasons to believe that error terms of companies within a particular industry may also be correlated. In such a scenario, how do we make a choice? Should we prefer clustering at the company or the industry level?



          • #6
            That is exactly what a multilevel hierarchical model would tackle. There will be some level of shrinkage, which is helpful as well.
            Best regards,

            Marcos



            • #7
              I have to respectfully disagree with the suggestion that "you should cluster at the level where your error terms are correlated." As Prateek notes, this is not a well-defined concept. If anyone is interested, I'll post simulations that I've used in teaching clustering for the past five years.

              I generate a simple population model with a heterogeneous treatment effect in a large population -- millions of units. Then I draw a pretty large random sample, but still a small fraction of the population. After simple OLS regression, I show clustered standard errors at different levels of clustering. For example, suppose it is the United States with G = 50 states. The clustered standard errors are systematically much too large, resulting in unnecessarily conservative inference. Next, cluster at the H = 9 census regions in the United States. The standard errors are inflated even more, and it is not due to small-sample bias. The correct procedure is to use the Stats 1 formula for random sampling, because we are doing random sampling. Nothing else matters once we are taking a relatively small random sample from the population.

              Clustering mistakes neglected heterogeneity for cluster correlation. In fact, if you compute a within-cluster correlation, it is very significant -- and highly misleading: the correct answer is no clustering, no matter how large the within-cluster correlation is estimated to be.
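
              A bare-bones sketch of the kind of exercise described above (my own illustration, not the exact teaching simulations mentioned) might look like this, with slope heterogeneity at the state level so that a within-cluster correlation is present in the population:

              Code:
              clear all
              set seed 12345
              * population: 1,000,000 units in 50 states with state-level slopes
              set obs 50
              gen state = _n
              gen bstate = 1 + rnormal()
              expand 20000
              gen x = rnormal()
              gen y = bstate*x + rnormal()
              * draw a small random sample of units (not of clusters)
              sample 1
              reg y x, vce(robust)        // appropriate under random sampling
              reg y x, vce(cluster state) // typically much larger standard errors

              Because units are sampled at random from the population, the heteroskedasticity-robust standard errors are the appropriate ones; clustering on state treats the neglected heterogeneity as a reason for conservative inference.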

              What if Prateek has industry at the one, two, and three digit level? The old model-based approach cannot tell us at which level to cluster unless we arbitrarily assume the cluster correlation stops at a certain level.

              In Abadie, Athey, Imbens, and Wooldridge (2017) we argue that you can only figure out the proper level of clustering when you take a stand on two issues.

              1. Have the data been obtained from cluster sampling? If so, at what level? I'm guessing that Prateek's data were not obtained by sampling industries as clusters, so that is out as a reason to cluster at the industry level.
              2. What is the level of assignment of the key explanatory variables? This is usually the main consideration. So, if a policy is assigned at the industry level, then cluster at the industry level because that is what a potential outcomes framework implies. If it's at the county level, cluster at the county level -- but not at the state level, as that would be "over-clustering." In Prateek's case, it seems all explanatory variables are at the firm level. So cluster at the firm level. A panel data set is naturally clustered by the cross-sectional identifier because we take all time periods for each i, and the clustering is to account for serial correlation.

              I hope this helps.
              Jeff



              • #8
                Professor Jeff Wooldridge , starting backwards,

                1) Your last points 1. and 2. are uncontroversial (one might even say self-evident), and have a limited range of application, mainly to randomistas and experimentalists. Most economists do not fall into these two categories, and very few financial economists (if any) do. This particular application does not fall into these two groups either. So 1. and 2. do not offer any guidance here, as useful as they might be in randomised contexts.

                2) I do not think that either Prateek's line of thought in #5 or your "What if Prateek has industry at the one, two, and three digit level? The old model-based approach cannot tell us at which level to cluster unless we arbitrarily assume the cluster correlation stops at a certain level" leads to a solution. To the best of my knowledge, there are no statistical tests to tell us at which level we should cluster. A model cannot tell us this either--we are the ones choosing the appropriate model; models do not choose us.

                So all we are left with, at the end, is that we as economists have to decide whether the error terms and the regressors in our equation co-move at the one-, two-, or three-digit industry level. The question you ask can also be asked about the industry fixed effects Prateek is including in his equation: why industry effects at the 68-industry level, and not at a more aggregate (say 5 industries) or a more disaggregate (say 200 industries) level? The answer to both questions comes from the economic knowledge we have of the situation at hand. (If we do not have such economic knowledge, yes, we are in trouble. And this trouble extends far beyond at which level to cluster and at which level to include the fixed effects.)

                3) I think you are teaching us here what might turn out to be the new way of thinking about this problem. So yes, I am very interested in your posting these simulation experiments that you mention. I am also very interested in the effects of heterogeneity you allude to; I need to learn more about heterogeneity to be able to think about its consequences clearly. I have read maybe 2 papers by you on heterogeneous effects, I think at least one of them in Economics Letters, but they had more the structure of "if you estimate this standard model (say fixed effects without slope heterogeneity), you would still estimate correctly the average of the heterogeneous slopes."

                4) What I gave in #4 is more or less the standard advice and the conventional wisdom. For elaboration of this standard advice and conventional wisdom, one can look up
                Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. Journal of human resources, 50(2), 317-372. (as of now Cited by 2744)
                Section IV. What to Cluster Over? (working paper freely accessible version here http://cameron.econ.ucdavis.edu/rese...5_February.pdf)

                The authors write:

                "There are two guiding principles that determine what to cluster over.
                First, given V[𝜷�] defined in (7) and (9) whenever there is reason to believe that both the regressors and the errors might be correlated within cluster, we should think about clustering defined in a broad enough way to account for that clustering. Going the other way, if we think that either the regressors or the errors are likely to be uncorrelated within a potential group, then there is no need to cluster within that group.
                Second, V�clu[𝜷�] is an average of 𝐺 terms that gets closer to V[𝜷�] only as 𝐺 gets large. If we define very large clusters, so that there are very few clusters to average over in equation (11), then the resulting V�clu[𝜷�] can be a very poor estimate of V [𝜷�]. This complication, and discussion of how few is “few”, is the subject of Section VI. These two principles mirror the bias-variance trade-off that is common in many estimation problems – larger and fewer clusters have less bias but more variability. There is no general solution to this trade-off, and there is no formal test of the level at which to cluster. The consensus is to be conservative and avoid bias and use bigger and more aggregate clusters when possible, up to and including the point at which there is concern about having too few clusters.
                For example, suppose your dataset included individuals within counties within states, and you were considering whether to cluster at the county level or the state level. We have been inclined to recommend clustering at the state level. If there was within-state cross-county correlation of the regressors and errors, then ignoring this correlation (for example, by clustering at the county level) would lead to incorrect inference. In practice researchers often cluster at progressively higher (i.e., broader) levels and stop clustering when there is relatively little change in the standard errors. This seems to be a reasonable approach."
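
                Applied to the regression in #1, that "cluster at progressively broader levels and compare" check might look like the following (assuming NICCodeFirst2Digits identifies the industries):

                Code:
                reg Profitability4 Size2 Leverage1 CurrentRatio SalesGro CapitalExpenditure2 WPromoterSharesin1 AD_Totalremuneration i.Year i.NICCodeFirst2Digits, vce(cluster CompanyID)
                estimates store firm
                reg Profitability4 Size2 Leverage1 CurrentRatio SalesGro CapitalExpenditure2 WPromoterSharesin1 AD_Totalremuneration i.Year i.NICCodeFirst2Digits, vce(cluster NICCodeFirst2Digits)
                estimates store industry
                estimates table firm industry, se

                If the standard errors change little when moving from firm to industry clusters, firm-level clustering is arguably sufficient; if they grow substantially, the conservative choice described in this passage would be the industry level.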



                • #9
                  Thanks a lot, Jeff Wooldridge and Joro Kolev, for such detailed and insightful replies. I shall look at the papers mentioned by both of you. Thanks a lot once again!
