  • reghdfe and t-statistics

    Dear Statalist Community,

    I'm currently running regressions with the reghdfe command which, according to the Stata help files and the Statalist Forum, is probably the most convenient and efficient way to include several fixed effects and allow for multi-way clustering.

    I'm running the following regression, absorbing two sets of fixed effects and clustering on both of them:

    Code:
    reghdfe shavol postEDGAR, absorb(permno CFx) vce(cluster permno CFx)
    The variable permno corresponds to a unique identifier for firms (so including it in absorb implies controlling for firm fixed effects).

    The variable CFx is a categorical variable that takes different values for different time periods (so including it in absorb implies controlling for time fixed effects). It is similar to a year variable, but the periods I want to control for vary in length: for example, it takes the value 0 for all dates between 01jan1993 and 25apr1993, the value 1 for all dates between 26apr1993 and 18jul1993, the value 2 for all dates between 19jul1993 and 04oct1993, and so on up to the value 10 for all dates between 01may1996 and 30may1997.
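
    For readers who want to reproduce this setup, a period variable like CFx could be built roughly as follows. This is only a sketch: the name of the daily date variable (date) is an assumption, and only the periods spelled out above are shown.

    Code:
    * Sketch: period indicator built from date cutoffs (the variable name "date" is assumed)
    generate byte CFx = .
    replace CFx = 0 if inrange(date, td(01jan1993), td(25apr1993))
    replace CFx = 1 if inrange(date, td(26apr1993), td(18jul1993))
    replace CFx = 2 if inrange(date, td(19jul1993), td(04oct1993))
    * ... periods 3 to 9 follow the same pattern with their own cutoffs ...
    replace CFx = 10 if inrange(date, td(01may1996), td(30may1997))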

    Running the above regression produces the following output:

    Code:
    . // Regression with both Time and Firm Fixed Effects / Clustering: Firms and Time
    . eststo: reghdfe shavol postEDGAR, absorb(CFx permno) vce(cluster permno CFx)
    (converged in 6 iterations)
    
    HDFE Linear regression                            Number of obs   =  3,910,523
    Absorbing 2 HDFE groups                           F(   1,     10) =       8.37
    Statistics robust to heteroskedasticity           Prob > F        =     0.0160
                                                      R-squared       =     0.5144
                                                      Adj R-squared   =     0.5138
    Number of clusters (permno)  =      4,647         Within R-sq.    =     0.0003
    Number of clusters (CFx)     =         11         Root MSE        =  3.254e+05
    
                                (Std. Err. adjusted for 11 clusters in permno CFx)
    ------------------------------------------------------------------------------
                 |               Robust
          shavol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       postEDGAR |  -21628.26   7476.212    -2.89   0.016     -38286.3   -4970.226
    ------------------------------------------------------------------------------
    
    Absorbed degrees of freedom:
    ---------------------------------------------------------------+
     Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     | 
    -------------+-------------------------------------------------|
             CFx |            0              11             11 *   | 
          permno |            0            4647           4647 *   | 
    ---------------------------------------------------------------+
    * = fixed effect nested within cluster; treated as redundant for DoF computation
    (est1 stored)
    I was very surprised when I exported the regression results to Excel with esttab and found only ** (significance at the 5% level) for my variable of interest (postEDGAR), even though the t-statistic is -2.89, which would normally correspond clearly to significance at the 1% level.

    As far as I understand, this must mean that I lose a large number of degrees of freedom. According to my output, the t-value of -2.89 corresponds to a p-value of "only" 0.016, which explains the significance at just the 5% level.

    What causes this extreme loss of degrees of freedom, assuming that the reghdfe command runs correctly (which should be the case)?
    I have around 3.9 million observations, and the two cluster/absorb variables have 4,647 and 11 categories respectively (I know that reghdfe recommends at least 50 clusters for every cluster variable, but my supervisor still advised me to use reghdfe with two-dimensional clustering).

    In the stored results I saw that e(df_a) = 0, e(df_m) = 1 and e(df_r) = 10. Does that mean that my t-statistics are following a t distribution with 10 df's?


    Thank you in advance for all your help, suggestions and corrections!

    Best regards
    Phil

  • #2
    In the stored results I saw that e(df_a) = 0, e(df_m) = 1 and e(df_r) = 10. Does that mean that my t-statistics are following a t distribution with 10 df's?
    Yes, that's precisely what it means. That is because you have used the cluster-robust variance estimator: its degrees of freedom are limited to the number of clusters minus 1. (With multiple clustering variables, the smallest number of clusters is the one that counts; here that is the 11 clusters in CFx, giving 10 df.)
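
    As a rough check, the reported p-value can be reproduced by hand right after the regression. This is a sketch that assumes the reghdfe results above are still the active estimates; the scalar name tstat is just illustrative, while ttail() and normal() are built-in Stata functions.

    Code:
    * t-statistic for postEDGAR from the stored estimates
    scalar tstat = _b[postEDGAR] / _se[postEDGAR]
    * two-sided p-value using the residual df stored by reghdfe (10 here)
    display 2 * ttail(e(df_r), abs(tstat))
    * for comparison: two-sided p-value under a standard normal
    display 2 * normal(-abs(tstat))

    The first display should match the 0.016 shown in the output, while the normal approximation gives a value below 0.01, which is where the expectation of 1% significance comes from.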



    • #3
      Thanks Clyde for the response.

      So that is probably also the reason why the reghdfe help advises that the number of clusters should be high enough (rule of thumb: 50+). With 50 df we should be relatively close to a standard normal distribution with its well-known p-values.
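
      To see how close these distributions are, the two-sided 5% critical values can be compared directly. A small sketch using Stata's built-in invttail() and invnormal() functions:

      Code:
      * two-sided 5% critical value, t distribution with 10 df (about 2.23)
      display invttail(10, 0.025)
      * two-sided 5% critical value, t distribution with 50 df (about 2.01)
      display invttail(50, 0.025)
      * two-sided 5% critical value, standard normal (about 1.96)
      display invnormal(0.975)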



      • #4
        So that is probably also the reason why the reghdfe help advises that the number of clusters should be high enough (rule of thumb: 50+). With 50 df we should be relatively close to a standard normal distribution with its well-known p-values.
        -reghdfe- was written by Sergio Correia, so you would have to ask him what he had in mind. Personally, I suspect it was a different motivation. Cluster-robust standard errors have been shown in simulations to perform poorly when the number of clusters is small. While there is no clear consensus on how many clusters are adequate, you can certainly find guidelines that urge they not be used with fewer than 50 clusters because they simply aren't valid. But you can find others who will tell you that 20 or even 10 clusters suffice.

        In any case, I don't think anybody would advise you to avoid them on the basis of seeking normality: nobody tells you not to do a Student t-test in a 10 df context.
