  • Panel regression with FE takes ~ 10 hours: how can I make it faster?

    Good evening,

    I am working on a panel dataset of test scores for 400,000 students from 8,700 different schools, each observed in 3 different years (2015, 2016, 2017), for a total of 1,200,000 observations. The dataset includes a number of characteristics; an example is provided below. Test scores and family income are standardised so that 100 is the cohort average in that year.
    student_ID  score  year  school_code  female  family_income  italian
    1           100    2015  1000         0       100            0
    1           99     2016  1000         0       99             0
    1           102    2017  1000         0       101            0
    2           104    2015  1000         1       88             0
    2           105    2016  1000         1       89             0
    2           101    2017  1000         1       88             0
    3           98     2015  1001         1       96             1
    3           97     2016  1001         1       96             1
    3           99     2017  1002         1       94             1
    4           105    2015  1002         0       104            1
    4           107    2016  1002         0       105            1
    4           105    2017  1002         0       104            1
    5           94     2015  1002         0       110            0
    5           95     2016  1002         0       109            0
    5           97     2017  1002         0       112            0

    I am running panel regressions with year, individual and school FEs, plus interaction terms, to investigate the impact of some variables (e.g. italian and female) on test scores in different years. The problem is that Stata takes hours to run a single regression: the following one took at least 7 hours.

    Code:
    set maxvar 10000
    set matsize 9000
    
    xtset student_ID year
    xtreg score italian##i.year female##i.year i.school_code, fe
    Running the same regression with only the 3 FEs (year, school and individual), i.e. without the two interaction terms, also takes hours. The bottleneck seems to be the individual and school FEs, given that I have 400,000 different individuals from 8,700 different schools.

    I am using Stata 15 on macOS Catalina 10.15.2 (Stata was last updated on the 3rd of February 2020) on a 2014 MacBook Pro. Am I making some mistake that prevents Stata from running such a regression in a shorter amount of time, or is it inevitable given the size of the dataset and the limited computing power of my machine?

    Thank you in advance,
    Pietro

  • #2
    Absorb any indicators whose coefficients are of no direct interest. The following uses reghdfe from SSC.

    Code:
    ssc install reghdfe, replace
    reghdfe score italian##i.year female##i.year, absorb(student_ID school_code)

    In fact, this could be further simplified to:

    Code:
    reghdfe score italian female italian#i.year female#i.year, absorb(student_ID school_code year)
    or

    Code:
    reghdfe score italian female, absorb(student_ID school_code year italian#i.year female#i.year)
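
    As a rough check before committing to the full sample, both estimators can be timed on a random subset of schools. This is only a sketch: the 2% share, the seed and the helper variable u are arbitrary choices for illustration, not something from the original question, and dropping whole schools truncates the history of students who change school. The reported slope estimates from the two commands should agree. Keep in mind that reghdfe does not report coefficients for anything placed in absorb(), so the first form above is the one to use when the interaction terms themselves are of interest.

    ```stata
    * Sketch: compare run time of xtreg with school dummies vs. reghdfe
    * on a random ~2% subset of schools (share and seed are arbitrary)
    preserve
    set seed 12345
    bysort school_code: gen double u = runiform() if _n == 1
    by school_code: replace u = u[1]      // copy the school's draw to all its rows
    keep if u < 0.02

    timer clear
    xtset student_ID year

    timer on 1
    xtreg score italian##i.year female##i.year i.school_code, fe
    timer off 1

    timer on 2
    reghdfe score italian##i.year female##i.year, absorb(student_ID school_code)
    timer off 2

    timer list                            // timer 1 = xtreg, timer 2 = reghdfe
    restore
    ```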
    Last edited by Andrew Musau; 13 Apr 2021, 10:11.


    • #3
      Andrew Musau, apologies for the late reply. Thank you very much for your suggestion: it's exactly what I was looking for, and so much quicker. Thanks again!
