Double Clustering in a Multi-country Data Set up

lal mohan kumar

Join Date: May 2019
Posts: 265

31 Dec 2021, 02:14

Dear Stata Members,
First, heartfelt advance New Year wishes to all. I wish everyone a prosperous New Year.

I am dealing with a cross-country dataset in which the lowest units are firms. Firms are grouped into industries, and the broadest level is the country. I have 22 countries, 18 industries, 17,252 firms, and 22 years.
For panel data I usually cluster at a single level, namely the firm. However, some articles cluster at both the firm and the year level in a cross-country setup.

What does double clustering (firm and year) mean?
As far as I know, clustering in a panel context accounts for correlation of the errors within units. For instance, if the residuals are likely to be correlated within an industry, one should cluster the standard errors by industry. But with double clustering on firm and year, does it make sense to cluster the standard errors within each unique firm-year pair?
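
A minimal sketch of the distinction, not run on the thread's data. In -reghdfe-, listing two variables in the cluster option requests two-way clustering: errors may be correlated within a firm across years and within a year across firms, and the variance is built as V(firm) + V(year) - V(firm-year pair), following Cameron, Gelbach & Miller. Joining the two variables with # instead clusters on each firm-year pair. The example assumes the community-contributed -reghdfe- is installed and uses Stata's nlswork example data, where idcode merely stands in for a firm identifier.

```stata
* Illustration only: nlswork is a worker panel; idcode plays the role of the firm id.
* Assumes reghdfe is installed (ssc install ftools, then ssc install reghdfe).
webuse nlswork, clear

* One-way: arbitrary correlation within idcode across years
reghdfe ln_wage tenure ttl_exp, absorb(idcode year) vce(cluster idcode)
estimates store by_id

* Two-way: correlation allowed within idcode and within year
reghdfe ln_wage tenure ttl_exp, absorb(idcode year) vce(cluster idcode year)
estimates store by_both

* Intersection: each idcode-year pair is its own cluster
reghdfe ln_wage tenure ttl_exp, absorb(idcode year) vce(cluster idcode#year)
estimates store by_pair

* Same coefficients in every column; only the standard errors differ
estimates table by_id by_both by_pair, b(%9.4f) se(%9.4f)
```

When each firm-year pair holds a single observation, as in a firm-year panel, the third version is essentially the heteroskedasticity-robust standard error, so it is usually not what papers mean by "clustering by firm and year".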

Similarly, I have seen in a post that clustering is not advisable with fewer than 30 clusters (https://www.statalist.org/forums/for...72#post1603472). Does this also apply to double clustering, given that my number of years is below 30?

Code:

. xtset id year

Panel variable: id (unbalanced)
 Time variable: year, 1999 to 2020, but with gaps
         Delta: 1 unit

. reghdfe dividends risk  roa_w size_w lev_w sg_w cash_ta1_w tangib_w age mb_w, absorb(id year) cluster (id )
(dropped 1846 singleton observations)
(MWFE estimator converged in 8 iterations)

HDFE Linear regression                            Number of obs   =     92,159
Absorbing 2 HDFE groups                           F(   9,  10505) =     192.48
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.3821
                                                  Adj R-squared   =     0.3024
                                                  Within R-sq.    =     0.0373
Number of clusters (id)      =     10,506         Root MSE        =     0.1669

                                (Std. err. adjusted for 10,506 clusters in id)
------------------------------------------------------------------------------
             |               Robust
   dividends | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        risk |   .0089675    .003735     2.40   0.016     .0016462    .0162888
       roa_w |  -.7158403   .0192201   -37.24   0.000    -.7535152   -.6781653
      size_w |   .0051734   .0023954     2.16   0.031      .000478    .0098688
       lev_w |  -.0614293   .0088244    -6.96   0.000    -.0787268   -.0441318
        sg_w |  -.0029462   .0003515    -8.38   0.000    -.0036352   -.0022572
  cash_ta1_w |  -.0693444    .010555    -6.57   0.000    -.0900342   -.0486545
    tangib_w |  -.0245404   .0092626    -2.65   0.008    -.0426969    -.006384
         age |   .0165564   .0036146     4.58   0.000     .0094712    .0236417
        mb_w |  -.0006307   .0001642    -3.84   0.000    -.0009526   -.0003089
       _cons |   .2522908   .0239573    10.53   0.000     .2053299    .2992517
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          id |     10506       10506           0    *|
        year |        21           0          21     |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

. reghdfe dividends risk  roa_w size_w lev_w sg_w cash_ta1_w tangib_w age mb_w, absorb(id year) cluster (id year )
(dropped 1846 singleton observations)
(MWFE estimator converged in 8 iterations)
Warning: VCV matrix was non-positive semi-definite; adjustment from Cameron, Gelbach & Miller applied.

HDFE Linear regression                            Number of obs   =     92,159
Absorbing 2 HDFE groups                           F(   9,     20) =      94.49
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.3821
                                                  Adj R-squared   =     0.3024
Number of clusters (id)      =     10,506         Within R-sq.    =     0.0373
Number of clusters (year)    =         21         Root MSE        =     0.1669

                               (Std. err. adjusted for 21 clusters in id year)
------------------------------------------------------------------------------
             |               Robust
   dividends | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        risk |   .0089675   .0122294     0.73   0.472    -.0165426    .0344776
       roa_w |  -.7158403    .046722   -15.32   0.000    -.8133007   -.6183799
      size_w |   .0051734    .004472     1.16   0.261     -.004155    .0145018
       lev_w |  -.0614293     .01249    -4.92   0.000     -.087483   -.0353757
        sg_w |  -.0029462   .0006372    -4.62   0.000    -.0042754    -.001617
  cash_ta1_w |  -.0693444   .0108852    -6.37   0.000    -.0920505   -.0466382
    tangib_w |  -.0245404   .0096214    -2.55   0.019    -.0446104   -.0044705
         age |   .0165564   .0060575     2.73   0.013     .0039207    .0291922
        mb_w |  -.0006307   .0002081    -3.03   0.007    -.0010648   -.0001967
       _cons |   .2522908   .0807704     3.12   0.005     .0838068    .4207749
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          id |     10506       10506           0    *|
        year |        21          21           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

With double clustering, the output reports that the standard errors are adjusted for 21 clusters (id year), and the significance levels have changed. What could explain this drop in significance when moving from single to double clustering?
Any thoughts or suggestions would be helpful, as this is for my general learning.
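
The two tables above already contain what is needed to see where the change comes from. The coefficient on risk is the same in both runs; what changes is its standard error (about three times larger) and the reference t distribution, which uses 20 degrees of freedom (21 year clusters minus 1) instead of 10,505. A small sketch using only numbers copied from the output; the scalar names are arbitrary:

```stata
* Numbers copied from the two reghdfe tables above (coefficient on risk)
scalar b      = .0089675
scalar se_one = .003735     // clustered on id
scalar se_two = .0122294    // clustered on id and year

* t statistics and two-sided p-values with the degrees of freedom reghdfe reports
display "id only : t = " %5.2f b/se_one "  p = " %5.3f 2*ttail(10505, abs(b/se_one))
display "id, year: t = " %5.2f b/se_two "  p = " %5.3f 2*ttail(20, abs(b/se_two))

* Critical values for a 95% interval under each reference distribution
display "critical t, 10505 df: " %5.3f invttail(10505, 0.025)
display "critical t,    20 df: " %5.3f invttail(20, 0.025)
```

The displayed t statistics and p-values should match the risk rows of the two tables. Most of the change comes from the larger standard error; the wider critical value with 20 degrees of freedom adds a little on top.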

Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

31 Dec 2021, 02:31

Lal:
actually, double clustering makes only two predictors (risk and size_w) lose their previous statistical significance (not a big deal indeed).
The reason might be that the number of years is too limited to cluster on (so the SEs are affected by that), or the opposite (that is, despite there being fewer than 30 years, clustering on them increases the SEs, and rightly so).
Be that as it may, I think that this is not the most relevant issue in your model: I would focus on model specification instead, as in both regressions the within R-squared is really low.

Kind regards,
Carlo
(Stata 19.0)
lal mohan kumar

Join Date: May 2019

Posts: 265
#3

31 Dec 2021, 02:45

Dear Carlo Lazzaro
Thanks for the instant reply and for suggesting plausible reasons. It seems to me at this stage, as you suggested, that double clustering won't change much in my model. I will leave it as it is, then.
My advance New Year wishes to you!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#4

31 Dec 2021, 02:57

Lal:
Your interpretation is correct.
I do reciprocate the very same to you. Thanks.

Kind regards,
Carlo
(Stata 19.0)
ALKEBSEE RADWAN

Join Date: Mar 2019

Posts: 240
#5

25 Apr 2023, 08:12

Carlo Lazzaro
Dear Carlo,
I was looking for an answer to my question about clustering standard errors.

Does it make sense to cluster the standard errors of an OLS regression by year?
And why?
Below is my OLS model:

Code:

reg Y X size btm_w roa_w loss bind Dual boardsize work finance i.Indus i.year, robust cluster(year)
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#6

25 Apr 2023, 08:15

Alkebsee:
it depends.
My guess is that clustering on industry makes more sense.

Kind regards,
Carlo
(Stata 19.0)
ALKEBSEE RADWAN

Join Date: Mar 2019

Posts: 240
#7

25 Apr 2023, 08:19

Originally posted by Carlo Lazzaro:

Alkebsee:
it depends.
My guess is that clustering on industry makes more sense.

Could you kindly explain further? You said it depends; on what does it depend?

I need to justify clustering by year, please.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#8

25 Apr 2023, 08:26

Alkebsee:
if you think that residuals are more correlated within years than within industries (something that I would find strange), cluster on years.

Kind regards,
Carlo
(Stata 19.0)
ALKEBSEE RADWAN

Join Date: Mar 2019

Posts: 240
#9

25 Apr 2023, 08:30

Originally posted by Carlo Lazzaro:

Alkebsee:
if you think that residuals are more correlated within years than within industries (something that I would find strange), cluster on years.

I got it
Thank you very much
ALKEBSEE RADWAN

Join Date: Mar 2019

Posts: 240
#10

25 Apr 2023, 09:30

Originally posted by Carlo Lazzaro:

Alkebsee:
if you think that residuals are more correlated within years than within industries (something that I would find strange), cluster on years.

Sorry for disturbing you again. A question just occurred to me.
How can I decide whether the residuals are more correlated within years than within industries, or vice versa?
Does it depend on the nature of what I am exploring (the relationship between Y and X), or is there a test for it?
Maybe the question is not reasonable, but I would really like to know.

Thank you in advance
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#11

25 Apr 2023, 11:53

Alkebsee:
there's no test for that, as far as I know.
Knowledge of the data-generating process rules here.
That said, I would cluster the standard errors on industry and include -i.year- among the predictors on the right-hand side of the regression equation.
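
A hedged sketch of that specification, not run on the poster's data: year dummies enter as regressors and the standard errors are clustered on industry. Stata's nlswork example data stand in for the original variables (ind_code plays the role of Indus), so the numbers mean nothing for the question at hand. The second model clusters on year over the same sample, purely to show how the standard errors can be put side by side; this comparison is informal and is not a test.

```stata
* Illustration only: nlswork variables stand in for the poster's data
* (ind_code plays the role of Indus).
webuse nlswork, clear

* Year dummies as regressors, standard errors clustered on industry
regress ln_wage tenure ttl_exp i.year, vce(cluster ind_code)
display "industry clusters: " e(N_clust)
estimates store by_ind

* Same model and sample, clustered on year instead
regress ln_wage tenure ttl_exp i.year if !missing(ind_code), vce(cluster year)
display "year clusters: " e(N_clust)
estimates store by_year

* Coefficients are identical; only the standard errors differ
estimates table by_ind by_year, keep(tenure ttl_exp) b(%9.4f) se(%9.4f)
```

In this example both groupings have only a handful of clusters, so the few-clusters caveat raised earlier in the thread applies to either choice.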

Kind regards,
Carlo
(Stata 19.0)
