Dear Statalist Community,
I am currently running regressions with the reghdfe command, which, according to the Stata help files and the Statalist forum, is probably the most convenient and efficient way to include several fixed effects and allow for multi-way clustering.
I'm running the following regression, absorbing two sets of fixed effects and clustering on both of them:
Code:
reghdfe shavol postEDGAR, absorb(permno CFx) vce(cluster permno CFx)
The variable permno is a unique firm identifier (so including it in absorb() controls for firm fixed effects).
The variable CFx is a categorical variable that takes different values for different time periods (so including it in absorb() controls for time fixed effects). It is similar to a year variable, but the time periods I want to control for vary in length. For example, it takes the value 0 for all dates between 01jan1993 and 25apr1993, the value 1 for all dates between 26apr1993 and 18jul1993, the value 2 for all dates between 19jul1993 and 04oct1993, and so on up to the value 10 for all dates between 01may1996 and 30may1997.
Running the above regression produces the following output:
Code:
. // Regression with both Time and Firm Fixed Effects / Clustering: Firms and Time
. eststo: reghdfe shavol postEDGAR, absorb(CFx permno) vce(cluster permno CFx)
(converged in 6 iterations)

HDFE Linear regression                            Number of obs   =  3,910,523
Absorbing 2 HDFE groups                           F(   1,     10) =       8.37
Statistics robust to heteroskedasticity           Prob > F        =     0.0160
                                                  R-squared       =     0.5144
                                                  Adj R-squared   =     0.5138
Number of clusters (permno)  =      4,647         Within R-sq.    =     0.0003
Number of clusters (CFx)     =         11         Root MSE        =  3.254e+05

                             (Std. Err. adjusted for 11 clusters in permno CFx)
------------------------------------------------------------------------------
             |               Robust
      shavol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   postEDGAR |  -21628.26   7476.212    -2.89   0.016     -38286.3   -4970.226
------------------------------------------------------------------------------

Absorbed degrees of freedom:
---------------------------------------------------------------+
 Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     |
-------------+-------------------------------------------------|
         CFx |            0              11             11 *   |
      permno |            0            4647           4647 *   |
---------------------------------------------------------------+
* = fixed effect nested within cluster; treated as redundant for DoF computation
(est1 stored)
I was very surprised when I exported the regression results to Excel with esttab and found only ** (significance at the 5% level) for my variable of interest (postEDGAR), even though the t-statistic is -2.89, which would normally correspond clearly to significance at the 1% level.
As far as I understand, this must mean that I am losing a large number of degrees of freedom. According to my output, the t-value of -2.89 corresponds to a p-value of "only" 0.016, which explains why the coefficient is significant only at the 5% level.
What causes this extreme loss of degrees of freedom, assuming that reghdfe is running correctly (which should be the case)?
I have around 3.9 million observations, and the two cluster and absorb variables have 4,647 and 11 categories respectively (I know that reghdfe suggests having at least 50 clusters for every cluster variable, but my supervisor still advised me to use reghdfe with two-way clustering).
In the stored results I saw that e(df_a) = 0, e(df_m) = 1 and e(df_r) = 10. Does that mean that my t-statistics follow a t distribution with 10 degrees of freedom?
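The reported p-value can be reproduced by hand, which makes the role of the degrees of freedom explicit. Below is a minimal sketch, assuming the reghdfe estimates from the regression above are still the active estimation results. Treating e(df_r) as the degrees of freedom behind the reported p-value is an assumption here, not something taken from the reghdfe documentation:

```stata
* Two-sided p-value for |t| = 2.89 with 10 degrees of freedom
* (should be close to the 0.016 shown in the output)
display 2*ttail(10, 2.89)

* Same t-statistic under the standard normal, i.e. what one would
* expect with a very large number of residual degrees of freedom
display 2*normal(-2.89)

* Same calculation using the stored results
* (assumes the reghdfe estimates are still in memory)
display 2*ttail(e(df_r), abs(_b[postEDGAR]/_se[postEDGAR]))
```

If the first and third lines agree with the reported p-value, then the 10 degrees of freedom (presumably the 11 CFx clusters minus 1) would be what moves the result from the 1% to the 5% level.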
Thank you in advance for all your help, suggestions and corrections!
Best regards
Phil