reghdfe takes hours - how to run panel regressions with FE faster?

Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#1

14 May 2021, 03:09

Good morning,

I have a dataset with 4 million observations on 2 million students at two points in time (grade 5 and grade 8) from 5 different cohorts (observed between 2012 and 2019) from 15,000 schools. I am running the following panel regression with fixed effects:

Code:

qui reghdfe test_score proportion_females, absorb(school year student grade i.school#c.year) vce(cluster i.school)

which takes more than 2 hours. With another model specification (with control variables instead of the student FE), one whole night was not enough to run a single regression.

Is there a way of running the regression faster? I need to run 20+ of those regressions because of different model specifications. Are there tricks to implement that reduce the computation time (e.g. reducing the number of decimals of the dependent and the independent variables...)?

I am using Stata/SE 15.1 for Mac (64-bit Intel) on a 2014 Macbook Pro (2.4 GHz Quad-Core Intel Core i7) with macOS Catalina 10.15.2.

Thank you in advance for your help and tips.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#2

14 May 2021, 04:22

Hi Pietro
I don't think there is anything within Stata that can help you run these models faster. You have a large dataset with a very complicated set of fixed effects. Even though -reghdfe- has an efficient algorithm to account for fixed effects (beyond a simple demeaning process), the computing time will increase the more fixed effects you want to estimate.
An alternative is to modify your model by moving some of the "absorbed" fixed effects back into the model specification.

Code:

* Instead of this:
reghdfe test_score proportion_females, absorb(school year student grade i.school#c.year) vce(cluster i.school)
* use this:
reghdfe test_score proportion_females i.grade i.year, absorb(school student i.school#c.year) vce(cluster school)

That way, the internal algorithm has to handle 3 absorbed terms rather than 5, which may help with the speed.
HTH
Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#3

14 May 2021, 05:12

Hi FernandoRios, thank you very much for your reply and help. I tried your suggestion, ran a couple of tests on a subsample with timer, and found that moving some variables into the model specification actually triples the computation time...
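For anyone who wants to repeat this kind of comparison, a timing test on a subsample can be sketched roughly as follows. This is only an illustration, not the exact test run here: the 5% share of schools, the seed, and the helper variable u are assumptions, and whole schools are kept so that each student's two observations stay together.

```stata
* Sketch only: compare run times of two specifications on a subsample.
preserve
set seed 12345                          // arbitrary seed, for reproducibility

* Draw one random number per school and keep roughly 5% of schools
* (u is a hypothetical helper variable)
bysort school: gen double u = runiform() if _n == 1
by school: replace u = u[1]
keep if u < 0.05

timer clear

* Timer 1: all fixed effects absorbed
timer on 1
quietly reghdfe test_score proportion_females, absorb(school year student grade i.school#c.year) vce(cluster school)
timer off 1

* Timer 2: grade and year moved into the model specification
timer on 2
quietly reghdfe test_score proportion_females i.grade i.year, absorb(school student i.school#c.year) vce(cluster school)
timer off 2

timer list                              // elapsed seconds for timers 1 and 2
restore
```

Timings on a small subsample need not scale linearly to the full data, so the ratio between the two timers is more informative than the absolute numbers.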
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#4

14 May 2021, 05:34

Not that I think there is anything you can do to speed this up, but why are you defining your year as a continuous variable?

That is, if I were you I would try to work with this term:
i.school#c.year
What happens if you use
i.school#i.year
instead (as I think you should)? What happens if you do

Code:

egen scholyear = group(school year)

and then instead of the term above you do

Code:

reghdfe test_score proportion_females, absorb(school year student grade scholyear) vce(cluster school)
FernandoRios

Join Date: Apr 2014

Posts: 2469
#5

14 May 2021, 06:43

mmm, that is interesting.
Two questions (one related to Joro's suggestion):
1. Do you -need- the school-specific time trends?
That is probably the component that takes the longest to incorporate into the model, so excluding it may help increase the speed. Again, you may want to justify why you include or exclude that effect.

2. Do all your model specifications share the same structure of "fixed" effects?
If all models include the same set of fixed effects, you can "detrend/demean" all variables first (check hdfe), before you actually run the regressions on the demeaned variables.
So for more complex models, you won't need to repeat the demeaning step.
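A rough sketch of that idea is below. Note that it uses -reghdfe-'s own residuals() option rather than the hdfe command, and it assumes the absorb list from post #1, that the variables are non-missing on the same observations, and that the _res suffix and the local macro fe are free to use (both are names made up for this illustration).

```stata
* Sketch only: partial out the shared fixed effects once, then reuse the residuals.
local fe school year student grade i.school#c.year

foreach v of varlist test_score proportion_females {
    * With no regressors, reghdfe only absorbs the fixed effects;
    * residuals() stores the demeaned/detrended version of `v'
    quietly reghdfe `v', absorb(`fe') residuals(`v'_res)
}

* By the Frisch-Waugh-Lovell theorem, the slope should match the one from the
* full reghdfe fit on the same sample. The standard errors here are NOT
* adjusted for the absorbed degrees of freedom, so treat them with caution.
regress test_score_res proportion_females_res, vce(cluster school)
```

Additional control variables would need to be residualized in the same loop before they enter the final regression.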

HTH
Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#6

15 May 2021, 02:55

Joro Kolev Thank you for your reply and for your suggestion. It definitely makes more sense, and incidentally the regressions are also faster to run, thank you very much!
FernandoRios 1. Yes I do because I want to catch the deviation of the proportion of female students in each year from the school specific long-term time trend.
2. Thank you very much for pointing out Correia's hdfe command, I checked it out but did not understand how to use it.... anyway with Joro's suggestion the regressions for some reason are much faster already!

I have a follow-up question: in an alternative model specification, I use the strategy of Caroline Hoxby (2000, doi:10.3386/w7867) on a dataset of test scores taken in grade 8 by 10 cohorts (2010-2019) of around 500,000 students each, from around 5,000 schools. I therefore have only one observation for each individual, but I observe the whole grade 8 cohort of each school every year. So the dataset is a repeated cross-section of the population of grade 8 students in all Italian schools. In this setting my regression is

Code:

reghdfe test_score proportion_females, absorb(school year school_specific_time_trend) vce(cluster school)

My independent variable is proportion of female students in school s in year t; adding the school_year FE as Joro suggested above, i.e.

Code:

egen scholyear = group(school year)

leads to collinearity in this model, since I have one observation of my independent variable for every school in every year, which is collinear with the school_year FE. Is there a way I can add the school-specific time trend so that it is not collinear? I initially thought of using

Code:

i.school#c.year

but as Joro pointed out it does not make a lot of sense. Thank you in advance!