  • Clustering Standard errors at industry versus company level

    Dear Statalist members,
    I am running a regression with log stock returns over the first quarter of 2020 as my dependent variable:
    Code:
    Ln (Stock price on 31-03-2020/ Stock price on 01-01-2020)
    and industries classified into 48 groups (Fama-French 48) as my independent variables. My intention is to check whether stock returns varied across industries during the outbreak of the coronavirus pandemic (which is logically obvious). Since it is advised to cluster the standard errors at the aggregate level, I ran the following command, clustering at the industry level; a subset of my results is attached.

    Code:
    . reg quart_ret i.ff48,vce(cluster ff48) 
    
    Linear regression                                      Number of obs =    1924
                                                           F(  0,    29) =       .
                                                           Prob > F      =       .
                                                           R-squared     =  0.0586
                                                           Root MSE      =  .30878
    
                                      (Std. Err. adjusted for 30 clusters in ff48)
    ------------------------------------------------------------------------------
                 |               Robust
       quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            ff48 |
              5  |  -.0006924   1.29e-15 -5.4e+11   0.000    -.0006924   -.0006924
              7  |   .0184726   1.29e-15  1.4e+13   0.000     .0184726    .0184726
              8  |  -.0564544   1.29e-15 -4.4e+13   0.000    -.0564544   -.0564544
              9  |  -.0447615   1.29e-15 -3.5e+13   0.000    -.0447615   -.0447615
             10  |   .1450836   1.29e-15  1.1e+14   0.000     .1450836    .1450836
             11  |   .1361723   1.29e-15  1.1e+14   0.000     .1361723    .1361723
             13  |   .1527319   1.43e-15  1.1e+14   0.000     .1527319    .1527319
             14  |   .0309398   1.30e-15  2.4e+13   0.000     .0309398    .0309398
             16  |   .0442209   1.29e-15  3.4e+13   0.000     .0442209    .0442209

    As the results indicate, all my standard errors are bizarre, the t-statistics are huge (both +ve and -ve), and the F statistic is missing. My intention in clustering by industry is to account for correlation among firms in the same industry, but I know that during this period correlation can exist across industries as well. Hence I tried the following command, clustering at the company level; my results are attached.

    Code:
    reg quart_ret i.ff48, vce(cluster companyname)
    
    Linear regression                                      Number of obs =    1924
                                                           F( 29,  1923) =    4.70
                                                           Prob > F      =  0.0000
                                                           R-squared     =  0.0586
                                                           Root MSE      =  .30878
    
                             (Std. Err. adjusted for 1924 clusters in companyname)
    ------------------------------------------------------------------------------
                 |               Robust
       quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            ff48 |
              5  |  -.0006924   .0914865    -0.01   0.994    -.1801155    .1787308
              7  |   .0184726   .0776587     0.24   0.812    -.1338315    .1707766
              8  |  -.0564544   .0526913    -1.07   0.284    -.1597924    .0468836
              9  |  -.0447615   .0512153    -0.87   0.382    -.1452049    .0556818
             10  |   .1450836   .0650334     2.23   0.026     .0175402    .2726271
             11  |   .1361723   .0814712     1.67   0.095    -.0236089    .2959536
             13  |   .1527319   .0408991     3.73   0.000     .0725207    .2329432
             14  |   .0309398   .0304394     1.02   0.310    -.0287578    .0906375
             16  |   .0442209   .0357533     1.24   0.216    -.0258985    .1143403
             17  |  -.0232826   .0417664    -0.56   0.577    -.1051948    .0586297
             18  |  -.1212549   .0403227    -3.01   0.003    -.2003357    -.042174
             19  |  -.0859658    .036367    -2.36   0.018    -.1572888   -.0146428
             21  |  -.0131784   .0371308    -0.35   0.723    -.0859993    .059642
    Now, which one should I use for interpretation? I doubt that the model clustered at the industry level is usable, since all p-values are significant there. If further clarification is required, I am happy to provide it.

    Thanks in advance

  • #2
    In your first table your standard errors are small (basically all are 0s) and your t-statistics are huge.

    If your data are a cross-section (which seems to be the case from your explanation), clustering on company is equivalent to plain robust standard errors, since each cluster then contains a single observation. So in fact you are not clustering in your second table.
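    A numeric sketch of that equivalence, in Python/NumPy rather than Stata (my own construction, not from this thread): when every cluster contains exactly one observation, the clustered "meat" matrix sum_g (X_g'u_g)(X_g'u_g)' reduces term by term to the heteroskedasticity-robust meat sum_i u_i^2 x_i x_i'.

```python
import numpy as np

# Singleton clusters: the cluster-robust "meat" coincides with the plain
# heteroskedasticity-robust "meat". Data and names are illustrative.
rng = np.random.default_rng(1)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS coefficients
u = y - X @ beta                              # OLS residuals

# robust meat: sum_i u_i^2 x_i x_i'
meat_robust = (X * (u ** 2)[:, None]).T @ X

# clustered meat with each observation as its own cluster
meat_cluster = sum(np.outer(X[i] * u[i], X[i] * u[i]) for i in range(n))

print(np.allclose(meat_robust, meat_cluster))  # True
```

    (If I remember Stata's finite-sample adjustments correctly, n/(n-k) for vce(robust) and G/(G-1)*(n-1)/(n-k) for vce(cluster), they also coincide when G = n, so the two commands should print identical standard errors in a pure cross-section.)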





    • #3
      Why you get such results in your first table I cannot tell without further troubleshooting, but it indicates some data problem. Most probably, for many industries you have only one firm per industry. Here is a (sort of) replication of your problem:

      Code:
      . set obs 10
      number of observations (_N) was 0, now 10
      
      . gen x = rnormal()
      
      . gen i = _n
      
      . replace i = 1 in 1/3
      (2 real changes made)
      
      . reg x i.i
      
            Source |       SS           df       MS      Number of obs   =        10
      -------------+----------------------------------   F(7, 2)         =      0.15
             Model |  .917285063         7  .131040723   Prob > F        =    0.9772
          Residual |  1.78519588         2   .89259794   R-squared       =    0.3394
      -------------+----------------------------------   Adj R-squared   =   -1.9726
             Total |  2.70248094         9   .30027566   Root MSE        =    .94477
      
      ------------------------------------------------------------------------------
                 x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                 i |
                4  |   .6182951   1.090931     0.57   0.628    -4.075602    5.312193
                5  |  -.1002139   1.090931    -0.09   0.935    -4.794111    4.593684
                6  |   .5437604   1.090931     0.50   0.668    -4.150137    5.237658
                7  |   .1238122   1.090931     0.11   0.920    -4.570085     4.81771
                8  |   .6928872   1.090931     0.64   0.590     -4.00101    5.386785
                9  |  -.1715198   1.090931    -0.16   0.890    -4.865417    4.522378
               10  |   .3666998   1.090931     0.34   0.769    -4.327198    5.060597
                   |
             _cons |  -.3420687   .5454655    -0.63   0.595    -2.689017     2.00488
      ------------------------------------------------------------------------------
      
      . reg x i.i, cluster(i)
      
      Linear regression                               Number of obs     =         10
                                                      F(0, 7)           =          .
                                                      Prob > F          =          .
                                                      R-squared         =     0.3394
                                                      Root MSE          =     .94477
      
                                            (Std. Err. adjusted for 8 clusters in i)
      ------------------------------------------------------------------------------
                   |               Robust
                 x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                 i |
                4  |   .6182951          .        .       .            .           .
                5  |  -.1002139   1.26e-16 -8.0e+14   0.000    -.1002139   -.1002139
                6  |   .5437604   2.52e-16  2.2e+15   0.000     .5437604    .5437604
                7  |   .1238122   1.26e-16  9.8e+14   0.000     .1238122    .1238122
                8  |   .6928872          .        .       .            .           .
                9  |  -.1715198   2.52e-16 -6.8e+14   0.000    -.1715198   -.1715198
               10  |   .3666998   1.26e-16  2.9e+15   0.000     .3666998    .3666998
                   |
             _cons |  -.3420687          .        .       .            .           .
      ------------------------------------------------------------------------------



      • #4
        Dear Joro, thanks for the timely help. So in the second case my standard errors are robust but not clustered, while in the first case they are clustered (are they also robust?). Now, which one should I consider based on economic intuition and logic? Of course, mine is a cross-section.

        Most probably for many industries you have one firm per industry
        Does this mean that I must remove industries having fewer firms, say <10, and then proceed?
        Thanks in advance
        Last edited by lal mohan kumar; 03 Aug 2020, 00:14.



        • #5
          Clustered standard errors are always robust, so yes, in your first table they are robust and clustered.

          I think that, based on economics, you should cluster on industry as you are doing in the first table: shocks to companies are shared within industries, so stock returns are correlated within industries.

          No, I do not propose that you immediately remove data. Just troubleshoot and see what is going on. E.g., you can write
          Code:
           
           tabulate ff48
          and this will show you in the column Freq. how many firms you have per industry.

          You can also run the regression from your first table with only robust standard errors, to confirm that the problem does not arise there.





          • #6
            Dear Joro
            Thanks for those insights on clustering and the robustness of standard errors. I have a minimum of 6 firms and a maximum of 276 firms per industry in my classification.

            Code:
             tabulate ff48
            
                   ff48 |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                      1 |        141        7.34        7.34
                      5 |          6        0.31        7.65
                      7 |         22        1.14        8.79
                      8 |         21        1.09        9.89
                      9 |         55        2.86       12.75
                     10 |          7        0.36       13.11
                     11 |         21        1.09       14.20
                     13 |        111        5.78       19.98
                     14 |        276       14.36       34.34
                     16 |         95        4.94       39.28
                     17 |         67        3.49       42.77
                     18 |        131        6.82       49.58
                     19 |        110        5.72       55.31
                     21 |         89        4.63       59.94
                     22 |         75        3.90       63.84
                     23 |         70        3.64       67.48
                     28 |          6        0.31       67.79
                     30 |         10        0.52       68.31
                     31 |         43        2.24       70.55
                     32 |         27        1.40       71.96
                     33 |         10        0.52       72.48
                     34 |        159        8.27       80.75
                     36 |         28        1.46       82.21
                     38 |         28        1.46       83.66
                     40 |         12        0.62       84.29
                     41 |        207       10.77       95.06
                     42 |         30        1.56       96.62
                     43 |         32        1.66       98.28
                     49 |         33        1.72      100.00
            ------------+-----------------------------------
                  Total |      1,922      100.00
            In the meantime, I also counted the firms (id) per ff48 industry, based on the code you gave me earlier, to get summary statistics:

            Code:
             egen count_co = count(id), by(ff48)
            
            . univar count_co
                                                    -------------- Quantiles --------------
            Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
            -------------------------------------------------------------------------------
            count_co    1922   129.94    79.95     6.00    70.00   111.00   207.00   276.00
            -------------------------------------------------------------------------------
            These two analyses indicate that I have at least 6 observations per industry. My fear with clustering at the industry level is that I would have to report that all industries were significantly affected by the pandemic, which others may take sceptically, since economists believe that services, construction, and transportation were the worst affected. Can I still proceed with industry-level clustering?
            Thanks in advance
            Last edited by lal mohan kumar; 03 Aug 2020, 01:22.



            • #7
              Everything you are showing above looks fine, in the sense that you have enough firms per industry to do both industry fixed effects and clustering on industry.

              And yet there is some numerical problem, because the regression in your first table thinks that you have 30 industries
              (Std. Err. adjusted for 30 clusters in ff48), but you have only 29 (I can count only 29 industries in your tabulation output).

              Why don't you try

              Code:
              areg quart_ret, absorb(ff48) vce(robust)
              areg quart_ret, absorb(ff48) vce(cluster ff48)
              to see what happens? In principle -areg- and what you are doing should be numerically the same, and the way you are doing it is the right way, because you are interested in the industry dummies.

              But this might reveal something about where the numerical problem is.



              • #8
                In your first regression you are basically calculating the mean return for each industry (those are the coefficients you get for each industry), and then you are clustering on the industry variable. If you work out the algebra for the clustered covariance matrix, you will get zero standard errors.
                Another example is the well-known Griliches data set on wages. If you regress the log of wages on age dummies and cluster on age, you get zero standard errors.
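                For what it is worth, a sketch of that algebra in standard notation (my own reconstruction, not Eric's actual derivation), starting from the usual sandwich formula for the cluster-robust variance estimator:

```latex
% Cluster-robust (sandwich) variance estimator:
\widehat V_{\mathrm{cluster}}
  = (X'X)^{-1}\Big(\sum_{g=1}^{G} X_g'\hat u_g\,\hat u_g' X_g\Big)(X'X)^{-1}
% With a full set of industry dummies, the fitted value of every firm in
% industry g is that industry's mean return, so residuals sum to zero
% within each industry:
\sum_{i\in g}\hat u_i = 0 \qquad \text{for every } g
% Within cluster g, every nonzero column of X_g (the constant and the
% dummy for industry g) is a column of ones, hence each cluster score
% vanishes and so does the middle ("meat") matrix:
X_g'\hat u_g = 0 \;\Rightarrow\; \sum_{g=1}^{G} X_g'\hat u_g\,\hat u_g' X_g = 0
```

                With a zero meat matrix, every clustered standard error is exactly zero; the 1.29e-15 values in the first table are just floating-point noise on top of that exact zero.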



                • #9
                  You appear to be right, Eric (see below).

                  But it is not clear to me why you are right. The first part of what you say is self-evident; calculating the mean return for each industry is exactly what this regression does.

                  But why, if you cluster by industry on top of the industry fixed effects, should the standard errors come out as 0?

                  Can you point to a source where you have seen this algebra worked out?

                  Code:
                  . set obs 1000
                  number of observations (_N) was 0, now 1,000
                  
                  . gen y = rnormal()
                  
                  . egen industry = seq(), block(100)
                  
                  . reg y i.industry, cluster(industry)
                  
                  Linear regression                               Number of obs     =      1,000
                                                                  F(0, 9)           =          .
                                                                  Prob > F          =          .
                                                                  R-squared         =     0.0048
                                                                  Root MSE          =     .96352
                  
                                                (Std. Err. adjusted for 10 clusters in industry)
                  ------------------------------------------------------------------------------
                               |               Robust
                             y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                      industry |
                            2  |  -.0483758   9.91e-16 -4.9e+13   0.000    -.0483758   -.0483758
                            3  |   .0048997   9.56e-16  5.1e+12   0.000     .0048997    .0048997
                            4  |  -.0599166   9.45e-16 -6.3e+13   0.000    -.0599166   -.0599166
                            5  |   .1386841   9.81e-16  1.4e+14   0.000     .1386841    .1386841
                            6  |  -.0474743   9.45e-16 -5.0e+13   0.000    -.0474743   -.0474743
                            7  |  -.0019279   9.54e-16 -2.0e+12   0.000    -.0019279   -.0019279
                            8  |   .0275062   9.52e-16  2.9e+13   0.000     .0275062    .0275062
                            9  |  -.1336482   9.63e-16 -1.4e+14   0.000    -.1336482   -.1336482
                           10  |   .0009313   9.45e-16  9.9e+11   0.000     .0009313    .0009313
                               |
                         _cons |  -.0417158   9.45e-16 -4.4e+13   0.000    -.0417158   -.0417158
                  ------------------------------------------------------------------------------





                  • #10
                    @Joro: I just took the formula for clustered variance matrices from one of James MacKinnon's papers, assumed that there were two groups, and did some simple calculations. To check what I got, I did the exercise with the Griliches data set and another, very different, data set and got the same result.
                    I haven't worked out the algebra in detail for n groups, but I would be very surprised if it were different.
                    On edit: this is not a proof, but it was enough to convince me. It is what I thought, which is why I tried it out.
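                    The same kind of check can be scripted outside Stata; here is a small Python/NumPy sketch (my own construction, not the two-group calculation described above): regress pure noise on a full set of group dummies, accumulate the clustered meat matrix cluster by cluster, and confirm it is zero to machine precision.

```python
import numpy as np

# Group-dummy regression clustered on the same groups: every within-cluster
# score X_g'u_g is a within-group residual sum, which OLS forces to zero,
# so the clustered "meat" matrix vanishes. Data and names are illustrative.
rng = np.random.default_rng(0)
G, n_per = 10, 100                        # 10 "industries", 100 firms each
g = np.repeat(np.arange(G), n_per)        # cluster ids
y = rng.normal(size=G * n_per)

X = (g[:, None] == np.arange(G)).astype(float)  # full dummy set, no constant
beta = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS coefficients = group means
u = y - X @ beta                                # deviations from group means

meat = np.zeros((G, G))
for j in range(G):
    score = X[g == j].T @ u[g == j]       # residual sum within group j
    meat += np.outer(score, score)

print(np.abs(meat).max())                 # ~0 (floating-point noise only)
```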
                    Last edited by Eric de Souza; 03 Aug 2020, 04:17.



                    • #11
                      Thank you, Joro and Eric, for the replies.

                      I ran the codes that you gave and got the following results
                      Code:
                      . areg quart_ret , absorb(ff48) vce(robust)
                      
                      Linear regression, absorbing indicators           Number of obs   =       1924
                                                                        F(   0,   1894) =          .
                                                                        Prob > F        =          .
                                                                        R-squared       =     0.0586
                                                                        Adj R-squared   =     0.0442
                                                                        Root MSE        =     0.3088
                      
                      ------------------------------------------------------------------------------
                                   |               Robust
                         quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                             _cons |  -.4204698   .0070396   -59.73   0.000    -.4342761   -.4066635
                      -------------+----------------------------------------------------------------
                              ff48 |   absorbed                                      (30 categories)
                      
                      .
                      .  areg quart_ret , absorb(ff48) vce(cluster ff48)
                      
                      Linear regression, absorbing indicators           Number of obs   =       1924
                                                                        F(   0,     29) =          .
                                                                        Prob > F        =          .
                                                                        R-squared       =     0.0586
                                                                        Adj R-squared   =     0.0442
                                                                        Root MSE        =     0.3088
                      
                                                        (Std. Err. adjusted for 30 clusters in ff48)
                      ------------------------------------------------------------------------------
                                   |               Robust
                         quart_ret |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                             _cons |  -.4204698   1.50e-17 -2.8e+16   0.000    -.4204698   -.4204698
                      -------------+----------------------------------------------------------------
                              ff48 |   absorbed                                      (30 categories)
                      
                      .


                      Both display 30 categories. I think somewhere in my code I dropped industry ff47 (financial industries), and that is why the earlier tabulate showed 29 industries. Thanks for pointing it out (I shall be careful in future, and I apologize). But as for your question, "why if you cluster by industry on top of the industry fixed effects should the standard errors show up as 0", I couldn't find the answer.
                      Once again thank you Joro
                      Last edited by lal mohan kumar; 03 Aug 2020, 04:51.



                      • #12
                        Dear Joro and other Stata members
                        I think I must keep that question open: why does clustering by industry on top of the industry fixed effects make the standard errors approximately equal to 0? If someone can point out the reason, I shall be grateful. Thanks.



                        • #13
                          With a bit of thinking about this, I see the intuitive reason. To do inference (and to calculate standard errors at all), we need degrees of freedom; that is, we need more data points than parameters. However, in your model we effectively have 30 parameters (the industry dummies) and also only 30 independent observations (= 30 clusters). This is why clustering gives zero standard errors: we are effectively estimating 30 parameters on 30 data points, so in a way the model fits the data perfectly, without any error.

                          One way or another, you cannot cluster by industry in your model. Just calculate robust standard errors, and argue in your paper that the industry fixed effects subsume the aggregate shocks to industry.



                          • #14
                            Dear Joro
                            Once again, thanks for replying to my query. The point you mentioned is intuitive and logically valid (we are effectively estimating 30 parameters on 30 data points, so in a way the model fits the data perfectly, without any error).
                            As you suggested, it is better to go with robust standard errors in my case.
                            Thanks a lot for the proper guidance and clarification.



                            • #15
                              Another way to formulate what Joro says is that the effective sample size is the number of clusters. If you have G clusters, the degrees of freedom for the t-statistic on a coefficient is G - k - 1, where k is the number of coefficients estimated. But in fact you have k = G. Hence you have negative degrees of freedom.
