Good evening,
I am working with a panel dataset of test scores for 400,000 students from 8,700 different schools, with one observation per student in each of three years (2015, 2016, 2017), for a total of 1,200,000 observations. The dataset includes a number of characteristics; an example is provided below. Test scores and family income are standardised so that 100 is the cohort average in each year.
I am running panel regressions with year, individual and school fixed effects (FE), plus interaction terms, to investigate the impact of some variables (e.g. italian and female) on test scores in different years. The problem is that Stata takes hours to run a single regression: the regression shown in the code below took at least 7 hours.
It also takes hours to run the same regression with only the three FEs (year, school and individual), i.e. without the two interaction terms. The bottleneck therefore seems to be the individual and school FEs, given that I have 400,000 different individuals from 8,700 different schools.
student_ID | score | year | school_code | female | family_income | italian
1          | 100   | 2015 | 1000        | 0      | 100           | 0
1          | 99    | 2016 | 1000        | 0      | 99            | 0
1          | 102   | 2017 | 1000        | 0      | 101           | 0
2          | 104   | 2015 | 1000        | 1      | 88            | 0
2          | 105   | 2016 | 1000        | 1      | 89            | 0
2          | 101   | 2017 | 1000        | 1      | 88            | 0
3          | 98    | 2015 | 1001        | 1      | 96            | 1
3          | 97    | 2016 | 1001        | 1      | 96            | 1
3          | 99    | 2017 | 1002        | 1      | 94            | 1
4          | 105   | 2015 | 1002        | 0      | 104           | 1
4          | 107   | 2016 | 1002        | 0      | 105           | 1
4          | 105   | 2017 | 1002        | 0      | 104           | 1
5          | 94    | 2015 | 1002        | 0      | 110           | 0
5          | 95    | 2016 | 1002        | 0      | 109           | 0
5          | 97    | 2017 | 1002        | 0      | 112           | 0
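Code:
* raise the limits so that the 8,700 school dummies fit
set maxvar 10000
set matsize 9000
* declare the panel: student is the unit, year is the time variable
xtset student_ID year
* individual FE via the fe option, school FE as explicit dummies
xtreg score italian##i.year female##i.year i.school_code, fe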
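For reference, the reduced specification mentioned above (year, school and individual FE only, without the interaction terms) can be sketched as follows, wrapped in Stata's built-in timer so that the run time is recorded. The last command is only a sketch of an alternative and rests on an assumption: it uses the community-contributed reghdfe command, which is not part of official Stata, would have to be installed separately (it may also require the ftools package), and has not been tested on this dataset. It absorbs the student and school FE instead of estimating the school dummies explicitly.

```stata
* Reduced specification: year, school and individual FE, no interactions
xtset student_ID year

timer clear 1
timer on 1
xtreg score i.year i.school_code, fe
timer off 1
timer list 1          // elapsed time in seconds

* ASSUMPTION: reghdfe is installed (e.g. via: ssc install reghdfe)
* Same terms as the full regression, with student and school FE absorbed
* rather than entered as 8,700 dummies
reghdfe score italian##i.year female##i.year, absorb(student_ID school_code)
```

Comparing the timer output for the reduced and the full specification would show how much of the run time comes from the FEs rather than from the interaction terms.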
I am using Stata 15 on macOS Catalina 10.15.2 (Stata was last updated on the 3rd of February 2020) on a 2014 MacBook Pro. Am I making some mistake that prevents Stata from running such a regression in a shorter amount of time, or is this inevitable given the size of the dataset and the limited computing capacity of my machine?
Thank you in advance,
Pietro