  • Clustering Standard errors at industry versus company level

    Dear Statalist members,
    I am running a regression with log stock returns over the first quarter of 2020 as my dependent variable:
    Code:
    Ln (Stock price on 31-03-2020/ Stock price on 01-01-2020)
    and industries classified into 48 groups (Fama-French 48) as my independent variables. My intention is to check whether stock returns varied across industries during the outbreak of the coronavirus pandemic (which is logically obvious). Since it is advised to cluster the standard errors at the aggregate level, I ran the following command, clustering at the industry level; a subset of my results is attached.

    Code:
    . reg quart_ret i.ff48,vce(cluster ff48) 
    
    Linear regression                                      Number of obs =    1924
                                                           F(  0,    29) =       .
                                                           Prob > F      =       .
                                                           R-squared     =  0.0586
                                                           Root MSE      =  .30878
    
                                      (Std. Err. adjusted for 30 clusters in ff48)
    ------------------------------------------------------------------------------
                 |               Robust
       quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            ff48 |
              5  |  -.0006924   1.29e-15 -5.4e+11   0.000    -.0006924   -.0006924
              7  |   .0184726   1.29e-15  1.4e+13   0.000     .0184726    .0184726
              8  |  -.0564544   1.29e-15 -4.4e+13   0.000    -.0564544   -.0564544
              9  |  -.0447615   1.29e-15 -3.5e+13   0.000    -.0447615   -.0447615
             10  |   .1450836   1.29e-15  1.1e+14   0.000     .1450836    .1450836
             11  |   .1361723   1.29e-15  1.1e+14   0.000     .1361723    .1361723
             13  |   .1527319   1.43e-15  1.1e+14   0.000     .1527319    .1527319
             14  |   .0309398   1.30e-15  2.4e+13   0.000     .0309398    .0309398
             16  |   .0442209   1.29e-15  3.4e+13   0.000     .0442209    .0442209

    As the results indicate, all my standard errors are bizarre, the t-statistics are huge (both +ve and -ve), and the F statistic is missing. My intention in clustering by industry is to account for correlation among firms in the same industry, but I know that during this period correlation can exist across industries as well. Hence I tried the following command, clustering at the company level; my results are attached.

    Code:
    reg quart_ret i.ff48, vce(cluster companyname)
    
    Linear regression                                      Number of obs =    1924
                                                           F( 29,  1923) =    4.70
                                                           Prob > F      =  0.0000
                                                           R-squared     =  0.0586
                                                           Root MSE      =  .30878
    
                             (Std. Err. adjusted for 1924 clusters in companyname)
    ------------------------------------------------------------------------------
                 |               Robust
       quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            ff48 |
              5  |  -.0006924   .0914865    -0.01   0.994    -.1801155    .1787308
              7  |   .0184726   .0776587     0.24   0.812    -.1338315    .1707766
              8  |  -.0564544   .0526913    -1.07   0.284    -.1597924    .0468836
              9  |  -.0447615   .0512153    -0.87   0.382    -.1452049    .0556818
             10  |   .1450836   .0650334     2.23   0.026     .0175402    .2726271
             11  |   .1361723   .0814712     1.67   0.095    -.0236089    .2959536
             13  |   .1527319   .0408991     3.73   0.000     .0725207    .2329432
             14  |   .0309398   .0304394     1.02   0.310    -.0287578    .0906375
             16  |   .0442209   .0357533     1.24   0.216    -.0258985    .1143403
             17  |  -.0232826   .0417664    -0.56   0.577    -.1051948    .0586297
             18  |  -.1212549   .0403227    -3.01   0.003    -.2003357    -.042174
             19  |  -.0859658    .036367    -2.36   0.018    -.1572888   -.0146428
             21  |  -.0131784   .0371308    -0.35   0.723    -.0859993    .059642
    Now, which one should I use for interpretation? I doubt that the model clustered at the industry level is usable, since all p-values are significant there. If further clarification is required, I am happy to provide it.

    Thanks in advance

  • #2
    In your first table your standard errors are small (basically all are 0s) and your t-statistics are huge.

    If your data are a cross-section (which seems to be the case from your explanation), clustering on company is equivalent to plain robust standard errors, since each cluster then contains a single observation. So in fact you are not clustering in your second table.
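    A numeric sketch of that equivalence, in Python/NumPy rather than Stata (my own construction, not from this thread): when every cluster contains exactly one observation, the clustered "meat" matrix sum_g (X_g'u_g)(X_g'u_g)' reduces term by term to the heteroskedasticity-robust meat sum_i u_i^2 x_i x_i'.

```python
import numpy as np

# Singleton clusters: the cluster-robust "meat" coincides with the plain
# heteroskedasticity-robust "meat". Data and names are illustrative.
rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS coefficients
u = y - X @ beta                              # OLS residuals

# robust meat: sum_i u_i^2 x_i x_i'
meat_robust = (X * (u ** 2)[:, None]).T @ X

# clustered meat with each observation as its own cluster
meat_cluster = sum(np.outer(X[i] * u[i], X[i] * u[i]) for i in range(n))

print(np.allclose(meat_robust, meat_cluster))  # True
```

    (If I remember Stata's finite-sample adjustments correctly, n/(n-k) for vce(robust) and G/(G-1)*(n-1)/(n-k) for vce(cluster), they also coincide when G = n, so the two commands should print identical standard errors in a pure cross-section.)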





    • #3
      Why you get such results in your first table I cannot tell without further troubleshooting, but it indicates some data problem. Most probably, for many industries you have only one firm per industry. Here is a (sort of) replication of your problem:

      Code:
      . set obs 10
      number of observations (_N) was 0, now 10
      
      . gen x = rnormal()
      
      . gen i = _n
      
      . replace i = 1 in 1/3
      (2 real changes made)
      
      . reg x i.i
      
            Source |       SS           df       MS      Number of obs   =        10
      -------------+----------------------------------   F(7, 2)         =      0.15
             Model |  .917285063         7  .131040723   Prob > F        =    0.9772
          Residual |  1.78519588         2   .89259794   R-squared       =    0.3394
      -------------+----------------------------------   Adj R-squared   =   -1.9726
             Total |  2.70248094         9   .30027566   Root MSE        =    .94477
      
      ------------------------------------------------------------------------------
                 x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                 i |
                4  |   .6182951   1.090931     0.57   0.628    -4.075602    5.312193
                5  |  -.1002139   1.090931    -0.09   0.935    -4.794111    4.593684
                6  |   .5437604   1.090931     0.50   0.668    -4.150137    5.237658
                7  |   .1238122   1.090931     0.11   0.920    -4.570085     4.81771
                8  |   .6928872   1.090931     0.64   0.590     -4.00101    5.386785
                9  |  -.1715198   1.090931    -0.16   0.890    -4.865417    4.522378
               10  |   .3666998   1.090931     0.34   0.769    -4.327198    5.060597
                   |
             _cons |  -.3420687   .5454655    -0.63   0.595    -2.689017     2.00488
      ------------------------------------------------------------------------------
      
      . reg x i.i, cluster(i)
      
      Linear regression                               Number of obs     =         10
                                                      F(0, 7)           =          .
                                                      Prob > F          =          .
                                                      R-squared         =     0.3394
                                                      Root MSE          =     .94477
      
                                            (Std. Err. adjusted for 8 clusters in i)
      ------------------------------------------------------------------------------
                   |               Robust
                 x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                 i |
                4  |   .6182951          .        .       .            .           .
                5  |  -.1002139   1.26e-16 -8.0e+14   0.000    -.1002139   -.1002139
                6  |   .5437604   2.52e-16  2.2e+15   0.000     .5437604    .5437604
                7  |   .1238122   1.26e-16  9.8e+14   0.000     .1238122    .1238122
                8  |   .6928872          .        .       .            .           .
                9  |  -.1715198   2.52e-16 -6.8e+14   0.000    -.1715198   -.1715198
               10  |   .3666998   1.26e-16  2.9e+15   0.000     .3666998    .3666998
                   |
             _cons |  -.3420687          .        .       .            .           .
      ------------------------------------------------------------------------------



      • #4
        Dear Joro, thanks for the timely help. So in the second case my standard errors are robust but not clustered, while in the first case they are clustered (are they also robust?). Now, which one should I consider based on economic intuition and logic? Of course, mine is a cross-section.

        Most probably for many industries you have one firm per industry
        Does this mean that I must remove industries having fewer firms, say <10, and then proceed?
        Thanks in advance
        Last edited by lal mohan kumar; 03 Aug 2020, 00:14.



        • #5
          Clustered standard errors are always robust, so yes, in your first table they are robust and clustered.

          I think that, based on economics, you should cluster on industry as you are doing in the first table: shocks to companies are shared within industries, so stock returns are correlated within industries.

          No, I do not propose that you immediately remove data. Just troubleshoot and see what is going on. E.g., you can write
          Code:
           
           tabulate ff48
          and this will show you in the column Freq. how many firms you have per industry.

          You can also run the regression from your first table with only robust standard errors, to confirm that the problem does not arise there.





          • #6
            Dear Joro
            Thanks for those insights on clustering and the robustness of standard errors. I have a minimum of 6 firms and a maximum of 276 firms per industry in my classification.

            Code:
             tabulate ff48
            
                   ff48 |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      1 |        141        7.34        7.34
                      5 |          6        0.31        7.65
                      7 |         22        1.14        8.79
                      8 |         21        1.09        9.89
                      9 |         55        2.86       12.75
                     10 |          7        0.36       13.11
                     11 |         21        1.09       14.20
                     13 |        111        5.78       19.98
                     14 |        276       14.36       34.34
                     16 |         95        4.94       39.28
                     17 |         67        3.49       42.77
                     18 |        131        6.82       49.58
                     19 |        110        5.72       55.31
                     21 |         89        4.63       59.94
                     22 |         75        3.90       63.84
                     23 |         70        3.64       67.48
                     28 |          6        0.31       67.79
                     30 |         10        0.52       68.31
                     31 |         43        2.24       70.55
                     32 |         27        1.40       71.96
                     33 |         10        0.52       72.48
                     34 |        159        8.27       80.75
                     36 |         28        1.46       82.21
                     38 |         28        1.46       83.66
                     40 |         12        0.62       84.29
                     41 |        207       10.77       95.06
                     42 |         30        1.56       96.62
                     43 |         32        1.66       98.28
                     49 |         33        1.72      100.00
            ------------+-----------------------------------
                  Total |      1,922      100.00
            In the meantime, I also counted the firms (id) per ff48 industry, based on the code you gave me earlier, to get summary statistics:

            Code:
             egen count_co = count(id), by(ff48)
            
            . univar count_co
                                                    -------------- Quantiles --------------
            Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
            -------------------------------------------------------------------------------
            count_co    1922   129.94    79.95     6.00    70.00   111.00   207.00   276.00
            -------------------------------------------------------------------------------
            These two analyses indicate that I have at least 6 observations per industry. My fear with clustering at the industry level is that I would have to report that all industries were significantly affected by the pandemic, which others may take sceptically, since economists believe that services, construction, and transportation were the worst affected. Can I still proceed with industry-level clustering?
            Thanks in advance
            Last edited by lal mohan kumar; 03 Aug 2020, 01:22.



            • #7
              Everything you are showing above looks fine, in the sense that you have enough firms per industry to do both industry fixed effects and clustering on industry.

              And yet there is some numerical problem, because the regression in your first table thinks that you have 30 industries
              (Std. Err. adjusted for 30 clusters in ff48), but you have only 29 (I can count only 29 industries in your tabulation output).

              Why don't you try

              Code:
              areg quart_ret, absorb(ff48) vce(robust)
              areg quart_ret, absorb(ff48) vce(cluster ff48)
              to see what happens? In principle -areg- and what you are doing should be numerically the same, and the way you are doing it is the right way, because you are interested in the industry dummies.

              But this might reveal something about where the numerical problem is.



              • #8
                In your first regression you are basically calculating the mean return for each industry (those are the coefficients you get for each industry), and then you are clustering on the industry variable. If you work out the algebra for the clustered covariance matrix, you will get zero standard errors.
                Another example is the well-known Griliches data set on wages. If you regress the log of wages on age dummies and cluster on age, you get zero standard errors.
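                For what it is worth, a sketch of that algebra in standard notation (my own reconstruction, not Eric's actual derivation), starting from the usual sandwich formula for the cluster-robust variance estimator:

```latex
% Cluster-robust (sandwich) variance estimator:
\widehat V_{\mathrm{cluster}}
  = (X'X)^{-1}\Big(\sum_{g=1}^{G} X_g'\hat u_g\,\hat u_g' X_g\Big)(X'X)^{-1}
% With a full set of industry dummies, the fitted value of every firm in
% industry g is that industry's mean return, so residuals sum to zero
% within each industry:
\sum_{i\in g}\hat u_i = 0 \qquad \text{for every } g
% Within cluster g, every nonzero column of X_g (the constant and the
% dummy for industry g) is a column of ones, hence each cluster score
% vanishes and so does the middle ("meat") matrix:
X_g'\hat u_g = 0 \;\Rightarrow\; \sum_{g=1}^{G} X_g'\hat u_g\,\hat u_g' X_g = 0
```

                With a zero meat matrix, every clustered standard error is exactly zero; the 1.29e-15 values in the first table are just floating-point noise on top of that exact zero.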



                • #9
                  You appear to be right, Eric (see below).

                  But it is not clear to me why you are right. The first part of what you say is self-evident; calculating the mean return for each industry is exactly what this regression does.

                  But why, if you cluster by industry on top of the industry fixed effects, should the standard errors come out as 0?

                  Can you point to a source where you have seen this algebra worked out?

                  Code:
                  . set obs 1000
                  number of observations (_N) was 0, now 1,000
                  
                  . gen y = rnormal()
                  
                  . egen industry = seq(), block(100)
                  
                  . reg y i.industry, cluster(industry)
                  
                  Linear regression                               Number of obs     =      1,000
                                                                  F(0, 9)           =          .
                                                                  Prob > F          =          .
                                                                  R-squared         =     0.0048
                                                                  Root MSE          =     .96352
                  
                                                (Std. Err. adjusted for 10 clusters in industry)
                  ------------------------------------------------------------------------------
                               |               Robust
                             y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                      industry |
                            2  |  -.0483758   9.91e-16 -4.9e+13   0.000    -.0483758   -.0483758
                            3  |   .0048997   9.56e-16  5.1e+12   0.000     .0048997    .0048997
                            4  |  -.0599166   9.45e-16 -6.3e+13   0.000    -.0599166   -.0599166
                            5  |   .1386841   9.81e-16  1.4e+14   0.000     .1386841    .1386841
                            6  |  -.0474743   9.45e-16 -5.0e+13   0.000    -.0474743   -.0474743
                            7  |  -.0019279   9.54e-16 -2.0e+12   0.000    -.0019279   -.0019279
                            8  |   .0275062   9.52e-16  2.9e+13   0.000     .0275062    .0275062
                            9  |  -.1336482   9.63e-16 -1.4e+14   0.000    -.1336482   -.1336482
                           10  |   .0009313   9.45e-16  9.9e+11   0.000     .0009313    .0009313
                               |
                         _cons |  -.0417158   9.45e-16 -4.4e+13   0.000    -.0417158   -.0417158
                  ------------------------------------------------------------------------------





                  • #10
                    @Joro: I just took the formula for clustered variance matrices from one of James MacKinnon's papers, assumed that there were two groups, and did some simple calculations. To check what I got, I did the exercise with the Griliches data set and another, very different, data set and got the same result.
                    I haven't worked out the algebra in detail for n groups, but I would be very surprised if it were different.
                    On edit: this is not a proof, but it was enough to convince me. It is what I thought, which is why I tried it out.
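                    The same kind of check can be scripted outside Stata; here is a small Python/NumPy sketch (my own construction, not the two-group calculation described above): regress pure noise on a full set of group dummies, accumulate the clustered meat matrix cluster by cluster, and confirm it is zero to machine precision.

```python
import numpy as np

# Group-dummy regression clustered on the same groups: every within-cluster
# score X_g'u_g is a within-group residual sum, which OLS forces to zero,
# so the clustered "meat" matrix vanishes. Data and names are illustrative.
rng = np.random.default_rng(0)
G, n_per = 10, 100                        # 10 "industries", 100 firms each
g = np.repeat(np.arange(G), n_per)        # cluster ids
y = rng.normal(size=G * n_per)

X = (g[:, None] == np.arange(G)).astype(float)  # full dummy set, no constant
beta = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS coefficients = group means
u = y - X @ beta                                # deviations from group means

meat = np.zeros((G, G))
for j in range(G):
    score = X[g == j].T @ u[g == j]       # residual sum within group j
    meat += np.outer(score, score)

print(np.abs(meat).max())                 # ~0 (floating-point noise only)
```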
                    Last edited by Eric de Souza; 03 Aug 2020, 04:17.



                    • #11
                      Thank you, Joro and Eric, for the replies.

                      I ran the codes that you gave and got the following results
                      Code:
                      . areg quart_ret , absorb(ff48) vce(robust)
                      
                      Linear regression, absorbing indicators           Number of obs   =       1924
                                                                        F(   0,   1894) =          .
                                                                        Prob > F        =          .
                                                                        R-squared       =     0.0586
                                                                        Adj R-squared   =     0.0442
                                                                        Root MSE        =     0.3088
                      
                      ------------------------------------------------------------------------------
                                   |               Robust
                         quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                             _cons |  -.4204698   .0070396   -59.73   0.000    -.4342761   -.4066635
                      -------------+----------------------------------------------------------------
                              ff48 |   absorbed                                      (30 categories)
                      
                      .
                      .  areg quart_ret , absorb(ff48) vce(cluster ff48)
                      
                      Linear regression, absorbing indicators           Number of obs   =       1924
                                                                        F(   0,     29) =          .
                                                                        Prob > F        =          .
                                                                        R-squared       =     0.0586
                                                                        Adj R-squared   =     0.0442
                                                                        Root MSE        =     0.3088
                      
                                                        (Std. Err. adjusted for 30 clusters in ff48)
                      ------------------------------------------------------------------------------
                                   |               Robust
                         quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                             _cons |  -.4204698   1.50e-17 -2.8e+16   0.000    -.4204698   -.4204698
                      -------------+----------------------------------------------------------------
                              ff48 |   absorbed                                      (30 categories)
                      
                      .


                      Both display 30 categories. I think somewhere in my code I dropped industry ff47 (financial industries), and that is why the earlier tabulate showed 29 industries. Thanks for pointing it out (I shall be careful in future, and I apologize). But as for your question, "why if you cluster by industry on top of the industry fixed effects should the standard errors show up as 0", I couldn't find the answer.
                      Once again thank you Joro
                      Last edited by lal mohan kumar; 03 Aug 2020, 04:51.



                      • #12
                        Dear Joro and other Stata members
                        I think I must keep that question open: why does clustering by industry on top of the industry fixed effects make the standard errors approximately equal to 0? If someone can point out the reason, I shall be grateful. Thanks.



                        • #13
                          With a bit of thinking about this, I see the intuitive reason. To do inference (and to calculate standard errors at all), we need degrees of freedom; that is, we need more data points than parameters. However, in your model we effectively have 30 parameters (the industry dummies) and also only 30 independent observations (= 30 clusters). This is why clustering gives zero standard errors: we are effectively estimating 30 parameters on 30 data points, so in a way the model fits the data perfectly, without any error.

                          One way or another, you cannot cluster by industry in your model. Just calculate robust standard errors, and argue in your paper that the industry fixed effects subsume the aggregate shocks to industry.



                          • #14
                            Dear Joro
                            Once again, thanks for replying to my query. The point you mentioned is intuitive and logically valid (we are effectively estimating 30 parameters on 30 data points, so in a way the model fits the data perfectly, without any error).
                            As you suggested, it is better to go with robust standard errors in my case.
                            Thanks a lot for the proper guidance and clarification.



                            • #15
                              Another way to formulate what Joro says is that the effective sample size is the number of clusters. If you have G clusters, the degrees of freedom for the t-statistic on a coefficient is G - k - 1, where k is the number of coefficients estimated. But in fact you have k = G. Hence you have negative degrees of freedom.
