  • Double Clustering in a Multi-country Data Set up

    Dear Stata Members,
    First, heartfelt New Year wishes to all in advance. I wish everyone a prosperous New Year.

    I am working with a cross-country dataset in which the lowest-level units are firms. Firms are grouped into industries, and the broadest level is the country. I have 22 countries, 18 industries, 17,252 firms, and 22 years.
    For panel data I usually cluster at a single level, the firm. However, some articles cluster at both the firm and year levels in a cross-country setup.

    What does double clustering (firm and year) mean?
    As far as I know, clustering in a panel context accounts for correlation within units. For instance, if the residuals are likely to be correlated within industry, one should cluster the standard errors by industry. But with double clustering by firm and year, does it make sense to cluster the standard errors within each unique firm-year pair?

    Similarly, I have seen in a post that clustering with fewer than 30 clusters is not advisable (https://www.statalist.org/forums/for...72#post1603472). Does this apply to double clustering, where my number of years is below 30?


    Code:
    . xtset id year
    
    Panel variable: id (unbalanced)
     Time variable: year, 1999 to 2020, but with gaps
             Delta: 1 unit
    
    . reghdfe dividends risk  roa_w size_w lev_w sg_w cash_ta1_w tangib_w age mb_w, absorb(id year) cluster (id )
    (dropped 1846 singleton observations)
    (MWFE estimator converged in 8 iterations)
    
    HDFE Linear regression                            Number of obs   =     92,159
    Absorbing 2 HDFE groups                           F(   9,  10505) =     192.48
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.3821
                                                      Adj R-squared   =     0.3024
                                                      Within R-sq.    =     0.0373
    Number of clusters (id)      =     10,506         Root MSE        =     0.1669
    
                                    (Std. err. adjusted for 10,506 clusters in id)
    ------------------------------------------------------------------------------
                 |               Robust
       dividends | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            risk |   .0089675    .003735     2.40   0.016     .0016462    .0162888
           roa_w |  -.7158403   .0192201   -37.24   0.000    -.7535152   -.6781653
          size_w |   .0051734   .0023954     2.16   0.031      .000478    .0098688
           lev_w |  -.0614293   .0088244    -6.96   0.000    -.0787268   -.0441318
            sg_w |  -.0029462   .0003515    -8.38   0.000    -.0036352   -.0022572
      cash_ta1_w |  -.0693444    .010555    -6.57   0.000    -.0900342   -.0486545
        tangib_w |  -.0245404   .0092626    -2.65   0.008    -.0426969    -.006384
             age |   .0165564   .0036146     4.58   0.000     .0094712    .0236417
            mb_w |  -.0006307   .0001642    -3.84   0.000    -.0009526   -.0003089
           _cons |   .2522908   .0239573    10.53   0.000     .2053299    .2992517
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
              id |     10506       10506           0    *|
            year |        21           0          21     |
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    
    . reghdfe dividends risk  roa_w size_w lev_w sg_w cash_ta1_w tangib_w age mb_w, absorb(id year) cluster (id year )
    (dropped 1846 singleton observations)
    (MWFE estimator converged in 8 iterations)
    Warning: VCV matrix was non-positive semi-definite; adjustment from Cameron, Gelbach & Miller applied.
    
    HDFE Linear regression                            Number of obs   =     92,159
    Absorbing 2 HDFE groups                           F(   9,     20) =      94.49
    Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                      R-squared       =     0.3821
                                                      Adj R-squared   =     0.3024
    Number of clusters (id)      =     10,506         Within R-sq.    =     0.0373
    Number of clusters (year)    =         21         Root MSE        =     0.1669
    
                                   (Std. err. adjusted for 21 clusters in id year)
    ------------------------------------------------------------------------------
                 |               Robust
       dividends | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            risk |   .0089675   .0122294     0.73   0.472    -.0165426    .0344776
           roa_w |  -.7158403    .046722   -15.32   0.000    -.8133007   -.6183799
          size_w |   .0051734    .004472     1.16   0.261     -.004155    .0145018
           lev_w |  -.0614293     .01249    -4.92   0.000     -.087483   -.0353757
            sg_w |  -.0029462   .0006372    -4.62   0.000    -.0042754    -.001617
      cash_ta1_w |  -.0693444   .0108852    -6.37   0.000    -.0920505   -.0466382
        tangib_w |  -.0245404   .0096214    -2.55   0.019    -.0446104   -.0044705
             age |   .0165564   .0060575     2.73   0.013     .0039207    .0291922
            mb_w |  -.0006307   .0002081    -3.03   0.007    -.0010648   -.0001967
           _cons |   .2522908   .0807704     3.12   0.005     .0838068    .4207749
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    -----------------------------------------------------+
     Absorbed FE | Categories  - Redundant  = Num. Coefs |
    -------------+---------------------------------------|
              id |     10506       10506           0    *|
            year |        21          21           0    *|
    -----------------------------------------------------+
    * = FE nested within cluster; treated as redundant for DoF computation
    The double-clustered output reports that standard errors are adjusted for 21 clusters (id year). The significance levels have also changed. What could explain this drop in significance from single to double clustering?
    Any thoughts or suggestions would be helpful, as this is for my general learning.
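
    The change in significance for -risk- can be traced with a little arithmetic. The sketch below only copies the coefficient, the two standard errors, and the residual degrees of freedom (10,505 and 20, from the two F lines) out of the outputs above; the scalar names are made up for illustration. The results should agree with the t and P>|t| columns up to rounding.

    ```stata
    * Coefficient and standard errors for risk, copied from the two outputs above
    scalar b_risk = .0089675
    * standard error with cluster(id)
    scalar se_one = .003735
    * standard error with cluster(id year)
    scalar se_two = .0122294

    * Same coefficient, different standard errors, hence different t statistics
    scalar t_one = b_risk/se_one
    scalar t_two = b_risk/se_two

    * Two-sided p-values, using the residual degrees of freedom from each F line
    scalar p_one = 2*ttail(10505, abs(t_one))
    scalar p_two = 2*ttail(20, abs(t_two))

    display "one-way: t = " %5.2f t_one "   p = " %5.3f p_one
    display "two-way: t = " %5.2f t_two "   p = " %5.3f p_two

    * 5% two-sided critical values under each degrees-of-freedom count
    display "critical values: " %5.3f invttail(10505, .025) "  vs  " %5.3f invttail(20, .025)
    ```

    The point estimate is identical in both runs; only the standard error (roughly three times larger) and the reference t distribution (20 instead of 10,505 degrees of freedom) differ.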



  • #2
    Ial:
    actually, double clustering makes only two predictors (risk and size_w) lose their previous statistical significance (not a big deal indeed).
    The reason might be that the number of years is too limited to cluster on (so the SEs are influenced by that), or the opposite (that is, despite there being fewer than 30 years, clustering on them increases the SEs, and rightfully so).
    Be that as it may, I think this is not the most relevant issue in your model: I would focus on model specification instead, as in both regressions the within R-sq. is really low.
    Kind regards,
    Carlo
    (Stata 19.0)
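
    On what two-way clustering by firm and year computes: the Cameron, Gelbach & Miller approach named in the reghdfe warning above combines three one-way cluster-robust variance matrices (by firm, plus by year, minus by the firm-year intersection). So it allows correlation within a firm across years and within a year across firms; it is not clustering within each firm-year pair. Below is a minimal sketch on simulated data using only official -regress-. All variable names and parameter values are hypothetical, and because each -regress- call applies its own small-sample correction, the result is not expected to match reghdfe's two-way output exactly.

    ```stata
    * Simulated firm-year panel: 200 firms, 10 years (hypothetical data)
    clear
    set seed 12345
    set obs 200
    generate id = _n
    generate firm_effect = rnormal()
    expand 10
    bysort id: generate year = 1998 + _n

    * One common shock per year, shared by all firms in that year
    generate u = rnormal() if id == 1
    bysort year (id): generate year_shock = u[1]
    drop u

    generate x = rnormal() + 0.5*firm_effect + 0.5*year_shock
    generate y = 1 + 0.5*x + firm_effect + year_shock + rnormal()

    * Three one-way cluster-robust variance matrices
    quietly regress y x, vce(cluster id)
    matrix V_id = e(V)
    quietly regress y x, vce(cluster year)
    matrix V_year = e(V)
    egen id_year = group(id year)
    quietly regress y x, vce(cluster id_year)
    matrix V_both = e(V)

    * Two-way variance: firm part + year part - intersection part
    matrix V_twoway = V_id + V_year - V_both

    display "one-way (id) SE of x: " sqrt(V_id[1,1])
    display "two-way SE of x:      " sqrt(V_twoway[1,1])
    ```

    The subtraction can leave the combined matrix non-positive semi-definite in some samples, which is the situation the warning in the original output refers to.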


    • #3
      Dear Carlo Lazzaro
      Thanks for the instant reply and for suggesting plausible reasons. As you suggested, it seems at this stage that double clustering won't change much in my model, so let me take it as it is.
      My advance New Year wishes to you!


      • #4
        Ial:
        Your interpretation is correct.
        I do reciprocate the very same to you. Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          Carlo Lazzaro
          Dear Carlo,
          I was looking for an answer to my question about clustering standard errors.

          Does clustering an OLS regression model by year make sense or not?
          Also, why?
          Below is my OLS model:
          Code:
          reg Y X   size btm_w roa_w loss bind Dual boardsize work finance i.Indus i.year, robust cluster(year)


          • #6
            Alkebsee:
            it depends.
            My guess is that clustering on industry makes more sense.
            Kind regards,
            Carlo
            (Stata 19.0)


            • #7
              Originally posted by Carlo Lazzaro
              Alkebsee:
              it depends.
              My guess is that clustering on industry makes more sense.

              Could you kindly explain further? You said it depends; on what does it depend?

              I need to justify clustering by year, please.


              • #8
                Alkebsee:
                if you think that residuals are more correlated with years than with industry (something that I would find strange), cluster on years.
                Kind regards,
                Carlo
                (Stata 19.0)


                • #9
                  Originally posted by Carlo Lazzaro
                  Alkebsee:
                  if you think that residuals are more correlated with years than with industry (something that I would find strange), cluster on years.
                  I got it
                  Thank you very much


                  • #10
                    Originally posted by Carlo Lazzaro
                    Alkebsee:
                    if you think that residuals are more correlated with years than with industry (something that I would find strange), cluster on years.
                    Sorry to disturb you again. I have just come up with a question.
                    How can I decide whether the residuals are more correlated within years than within industries, or vice versa?
                    Does it depend on the nature of what I am exploring (the relationship between Y and X), or is there a test for that?
                    Maybe the question is not reasonable, but frankly I need to know.

                    Thank you in advance


                    • #11
                      Alkebsee:
                      there's no test for that, as far as I know.
                      Knowledge of the data-generating process rules here.
                      That said, I would cluster the standard errors on -industry- and include -i.year- among the predictors on the right-hand side of the regression equation.
                      Kind regards,
                      Carlo
                      (Stata 19.0)
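
                      A minimal sketch of that suggestion on simulated data, so that it runs without the original dataset. The variable names Y, X, Indus, and year follow post #5; the sample size, number of industries, and coefficients are invented for illustration. -vce(cluster ...)- already implies robust standard errors, so a separate -robust- option is not needed.

                      ```stata
                      * Hypothetical data: 3,000 observations, 40 industries, 10 years
                      clear
                      set seed 2024
                      set obs 3000
                      generate Indus = ceil(40*runiform())
                      generate year  = 2000 + ceil(10*runiform())
                      generate X = rnormal()
                      generate Y = 0.3*X + rnormal()

                      * Suggested specification: year dummies as predictors, SEs clustered on industry
                      regress Y X i.Indus i.year, vce(cluster Indus)
                      estimates store by_industry

                      * Alternative from post #5, for comparison: SEs clustered on year
                      regress Y X i.Indus i.year, vce(cluster year)
                      estimates store by_year

                      * Same coefficient on X in both columns; only the standard error differs
                      estimates table by_industry by_year, keep(X) se
                      ```

                      Comparing the two columns shows how sensitive the standard error of X is to the clustering choice, but it does not say which choice is right; that still rests on knowledge of the data-generating process.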
