Panel regression with FE takes ~ 10 hours: how can I make it faster?

Pietro Guglielmi

Join Date: Apr 2021
Posts: 13

13 Apr 2021, 09:57

Good evening,

I am working on a panel dataset of test scores for 400,000 students, each observed in three years (2015, 2016 and 2017), for a total of 1,200,000 observations. The students come from 8,700 different schools. The dataset includes a number of characteristics; an example is provided below. Test scores and family income are standardised so that 100 is the average result of the cohort in that year.

student_ID	score	year	school_code	female	family_income	italian
1	100	2015	1000	0	100	0
1	99	2016	1000	0	99	0
1	102	2017	1000	0	101	0
2	104	2015	1000	1	88	0
2	105	2016	1000	1	89	0
2	101	2017	1000	1	88	0
3	98	2015	1001	1	96	1
3	97	2016	1001	1	96	1
3	99	2017	1002	1	94	1
4	105	2015	1002	0	104	1
4	107	2016	1002	0	105	1
4	105	2017	1002	0	104	1
5	94	2015	1002	0	110	0
5	95	2016	1002	0	109	0
5	97	2017	1002	0	112	0

I am running panel regressions with year FE, individual FE and school FE, plus interaction terms, to investigate the impact of some variables (e.g. italian and female) on test scores in different years. The problem is that Stata takes hours to run a single regression: the following one took at least 7 hours.

Code:

set maxvar 10000
set matsize 9000

xtset student_ID year
xtreg score italian##i.year female##i.year i.school_code, fe

It also takes hours to run the same regression with only the 3 FEs (year, school and individual), i.e. without the two interaction terms. The individual and school FEs seem to be what slows Stata down, given that I have 400,000 different individuals from 8,700 different schools.

I am using Stata 15 on macOS Catalina 10.15.2 (Stata was last updated on 3 February 2020) on a 2014 MacBook Pro. Am I making a mistake that prevents Stata from running this regression in a shorter time, or is the run time inevitable given the size of the dataset and the limited computing power of my machine?

Thank you in advance,
Pietro

Andrew Musau

Join Date: Oct 2014

Posts: 10482
#2

13 Apr 2021, 10:05

Absorb any indicators whose coefficients are of no direct interest. The following uses reghdfe from SSC.

Code:

* reghdfe depends on the ftools package, which is also on SSC
ssc install ftools, replace
ssc install reghdfe, replace

reghdfe score italian##i.year female##i.year, absorb(student_ID school_code)

In fact, this could be further simplified to:

Code:

reghdfe score italian female italian#i.year female#i.year, absorb(student_ID school_code year)

or

Code:

reghdfe score italian female, absorb(student_ID school_code year italian#i.year female#i.year)
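Before committing to the full 1,200,000 observations, it may be worth checking on a small random subsample that reghdfe reproduces the xtreg point estimates, and seeing how the run times compare. The following is only a sketch under some assumptions: variable names follow the example data in #1, the helper variable u and the 1% sampling fraction are arbitrary illustrative choices, and the maxvar and matsize settings from #1 are assumed to still be in effect. The point estimates should agree up to numerical tolerance, but the reported number of observations and the standard errors may differ slightly, because reghdfe drops singleton groups by default.

```stata
* Sketch only: compare xtreg and reghdfe on roughly 1% of students.
* u is a hypothetical helper variable; the seed and 0.01 cutoff are arbitrary.
preserve

set seed 12345
* draw one random number per student and copy it to all of that student's rows
bysort student_ID (year): gen double u = runiform() if _n == 1
bysort student_ID (year): replace u = u[1]
keep if u < 0.01

xtset student_ID year
timer clear

* original specification: school dummies estimated explicitly
timer on 1
xtreg score italian##i.year female##i.year i.school_code, fe
timer off 1

* same model with student and school effects absorbed
timer on 2
reghdfe score italian##i.year female##i.year, absorb(student_ID school_code)
timer off 2

* timer 1 = xtreg, timer 2 = reghdfe (seconds)
timer list

restore
```

If the coefficients on the interaction terms match across the two outputs, the absorbed version can then be run on the full dataset.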

Last edited by Andrew Musau; 13 Apr 2021, 10:11.
Pietro Guglielmi

Join Date: Apr 2021

Posts: 13
#3

20 Apr 2021, 04:06

Andrew Musau, apologies for the late reply. Thank you very much for your suggestion: it's exactly what I was looking for, and so much quicker. Thanks again!