Reghdfe 4.x standard errors, and categorical variables

Luca Baum

Join Date: Sep 2017

Posts: 2
#1

11 Sep 2017, 08:26

Dear all,

I have two questions that I need to understand, and I would really appreciate any help that I can get.

Firstly, why does the 4.x version of -reghdfe- give smaller standard errors, and hence smaller p-values, than the 3.x version? And which one is better to use?

Secondly, I have a control variable in my estimation, which is the lagged log of Income. I also created a categorical variable out of Income, and I include it as an interaction with another (continuous) variable. My question is: when Income is included as a categorical variable, should I still keep the lagged log of Income too, or exclude it from the regression?

My basic specification looks something like this:

Code:

* L.logIncome is the one-period lag; assumes the panel was declared with xtset
xtreg y x1 x2 x3 L.logIncome, fe vce(cluster id)

now:

Code:

xtreg y x1 c.x2##i.Incomecat x3 L.logIncome, fe vce(cluster id)

Would this make sense? My instinct is to exclude the lagged log of Income from the regression, though I have some doubts, as I do not have much experience with empirical analysis.
I apologize if the question is a rather basic one, but I would really appreciate any guidance!

Thank you in advance for your time.

Last edited by Luca Baum; 11 Sep 2017, 08:38.
Luca Baum

Join Date: Sep 2017

Posts: 2
#2

12 Sep 2017, 02:52

Anyone please? Any ideas?
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#3

14 Sep 2017, 17:56

Hi Luca,

Just read your post. Have you checked whether the degrees of freedom have changed?

Note that you can run the 3.x version by calling reghdfe with the -old- option, so the two are easy to compare without uninstalling and reinstalling.

My best guess is that you might have different variables, or at least a different number of degrees of freedom and of absorbed variables (i.e., check e(df_a), e(df_r), e(df_m), etc.).
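
A minimal sketch of that comparison, using Stata's auto dataset as a stand-in for your data (the variables and the absorb()/cluster() choices here are placeholders, not your model):

Code:

sysuse auto, clear

* Current (4.x) code path
reghdfe price weight, absorb(turn) vce(cluster turn)
display "4.x: df_a = " e(df_a) "  df_r = " e(df_r) "  df_m = " e(df_m)

* Same model through the 3.x code path, via the -old- option
reghdfe price weight, absorb(turn) vce(cluster turn) old
display "3.x: df_a = " e(df_a) "  df_r = " e(df_r) "  df_m = " e(df_m)

If the two lines differ only in e(df_a), the gap in standard errors is most likely coming from the degrees-of-freedom adjustment rather than from the point estimates.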

Best,
Sergio
Comment
Luca Baumm

Join Date: Sep 2017

Posts: 19
#4

23 Sep 2017, 09:26

Hi Sergio,

Thank you for your response.
The model that I am estimating is the same with both the 3.x and the 4.x version, so I have the same variables.

Actually, e(df_r) and e(df_m) are the same in both, but e(df_a) is different. From what I can tell, the two versions count different absorbed variables. The 4.x version counts the year, whereas the 3.x version counts the firm id and absorbs one year too. (I include both year and firm fixed effects.)
I am not sure if I am making any sense right now. As I do not understand this well, I am not sure which version to continue the analysis with.
I appreciate your guidance!

Thank you!
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#5

23 Sep 2017, 17:58

Originally posted by Luca Baumm

Hi Sergio,
The 4.x. version uses the year, whereas the 3.x. version uses the firm id and absorbs one year too. (I include both year and firm-fixed effects.)

Without knowing the exact results, this is my guess:

On 3.x, if you had the same variable in absorb() and cluster() then e(df_a) wouldn't be affected by the variable, because we are already applying a penalty by adjusting the number of obs. to the number of clusters in the DoF calculation. However, on 4.x we do apply a penalty of one degree of freedom, because the mean of the partialled-out variable is zero.

This shouldn't really matter, because if you are using reghdfe you are likely to have many obs., in which case one degree of freedom makes little difference.

Best,
S
Comment

Luca Baumm

Join Date: Sep 2017

Posts: 19
#6

24 Sep 2017, 11:37

Sergio,
Thank you very much for the explanation!

This is what I get at the end of the output.
With 3.x:

Code:

Absorbed degrees of freedom:
---------------------------------------------------------------+
 Absorbed FE |  Num. Coefs.  =   Categories  -   Redundant     |
-------------+-------------------------------------------------|
          id |        32260           32260              0     |
        year |            9              10              1     |
---------------------------------------------------------------+

With 4.x:

Code:

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          id |     32260      32260            0    *|
        year |        10           0          10     |
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

So, I have the same variables in absorb (id year), and I cluster at the same level for both (country).

If I understand you correctly, it shouldn't matter which version I use?

Thank you for your patience and help!

Comment

Sergio Correia

Join Date: Apr 2014

Posts: 420
#7

24 Sep 2017, 11:47

Yeah, I only see a difference of one due to the 9 in year for 3.x and 10 in 4.x

Given your number of obs., this shouldn't really matter for your degrees of freedom
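
As a rough illustration, assuming the usual small-sample factor in which the variance is scaled by 1/(N - K) (an assumption here, not something checked against the reghdfe source), one extra degree of freedom rescales every standard error by sqrt((N - K)/(N - K - 1)). The N and K below are hypothetical placeholders, not numbers from this thread:

Code:

* Hypothetical values for illustration only; substitute e(N) and the
* model plus absorbed degrees of freedom from your own output.
local N = 100000
local K = 14
display "SE ratio from one extra DoF: " %10.8f sqrt((`N' - `K') / (`N' - `K' - 1))

With N in that range the ratio is barely above 1.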
Comment
Luca Baumm

Join Date: Sep 2017

Posts: 19
#8

25 Sep 2017, 03:22

Thank you very much Sergio!!
Comment
Luca Baumm

Join Date: Sep 2017

Posts: 19
#9

25 Sep 2017, 04:36

Originally posted by Sergio Correia

Yeah, I only see a difference of one due to the 9 in year for 3.x and 10 in 4.x

Sorry again Sergio, but what confuses me is that the number of redundant categories of id is 0 with the 3.x version, whereas with 4.x all of them are redundant (* = FE nested within cluster; treated as redundant for DoF computation).
Am I missing something very straightforward here?
Thank you!

Best,
Luca
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#10

25 Sep 2017, 07:14

I see. Are you using the same clustering in both cases?
Comment
Luca Baumm

Join Date: Sep 2017

Posts: 19
#11

25 Sep 2017, 07:26

Yes, I use exactly the same command in both cases.

I just uninstall and reinstall the versions and run the regression.

Code:

reghdfe y x1 x2 x3 x4 , absorb(id year) vce(cluster country#year)

This is the exact command I am using in both cases.

However, I just noticed that I get the same results with 3.x and 4.x if I cluster on the group variable countryyear instead of country#year.

I understand that the two are equivalent (countryyear and country#year), and with the 3.x version they always give me the same results; with the 4.x version, apparently, they do not.

Last edited by Luca Baumm; 25 Sep 2017, 07:54.
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#12

25 Sep 2017, 09:24

Ok, I can reproduce this issue with Stata's example dataset:

Code:

clear all
cls
set more off
sysuse auto

reghdfe price weight, a(turn#trunk foreign) vce(cluster turn#foreign)
reghdfe price weight, a(turn#trunk foreign) vce(cluster turn#foreign) old

egen turn_foreign = group(turn foreign)
reghdfe price weight, a(turn#trunk foreign) vce(cluster turn_foreign)

(Note that the -old- option in reghdfe 4 calls reghdfe 3, which helps when comparing the two.)

It seems there is a bug in reghdfe 4: it treats -id- as nested within -country-, the first clustering variable, when it should have checked nesting within country#year. Will try to push an update later today.
Comment
Luca Baumm

Join Date: Sep 2017

Posts: 19
#13

25 Sep 2017, 09:35

Ohh okay, I see, I understand now. That would be great!

Your help is greatly appreciated! Thank you!
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#14

25 Sep 2017, 09:43

The bug was (as it often is) due to a line of code that caused reghdfe to only use the first variable in vce(cluster var1#var2).

This commit should be fixing the problem. Let me know if reinstalling reghdfe fixes the problem.
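
One way to check, sketched with the auto example from post #12 (the scalar names are arbitrary): after the fix, clustering on turn#foreign and on the equivalent group variable should give the same standard error for weight.

Code:

sysuse auto, clear
egen turn_foreign = group(turn foreign)

* Cluster on the interaction
reghdfe price weight, a(turn#trunk foreign) vce(cluster turn#foreign)
scalar se_interaction = _se[weight]

* Cluster on the equivalent group variable
reghdfe price weight, a(turn#trunk foreign) vce(cluster turn_foreign)
scalar se_group = _se[weight]

* Stops with an error if the two standard errors differ
assert reldif(se_interaction, se_group) < 1e-8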
Comment
Luca Baumm

Join Date: Sep 2017

Posts: 19
#15

25 Sep 2017, 10:00

Dear Sergio,

Yes! It did fix the problem.

Thank you!
Comment
