Working with large Panels (regressions)

Dominique Chaufree

Join Date: Mar 2019

Posts: 14
#1

22 Mar 2019, 11:54

Hi,

I have a large wide panel dataset: T = 804 months, N = 45,464 individuals. In long form, this equals 804 × 45,464 = 36,553,056 observations. I am wondering which of the following procedures is, in general, faster:
1. Transform the full wide dataset into one large long dataset. Run regressions on subsamples, perhaps dropping unneeded observations before each regression.

2. Transform only one subsample into a long dataset. Run the analysis on that subsample. Then move on to the next subsample.

What is, in general, faster?

Thanks so much!
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

22 Mar 2019, 12:24

The -reshape- command, although one of Stata's best features IMHO, is also one of its slowest commands. It has a lot of overhead verifying that the data are suitable for what is requested, and it also typically thrashes the disk a lot. So you probably only want to do it once, for the whole data set, rather than repeatedly.
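
As a minimal sketch of the reshape-once approach: the toy data below stand in for the real panel, and the variable names (id, y1-y3, x1-x3, month) and the file name panel_long are assumptions made for illustration, not taken from the original data set.

```stata
* Toy wide panel: 5 individuals, 3 months (hypothetical names y1-y3, x1-x3)
clear
set seed 12345
set obs 5
gen long id = _n
forvalues t = 1/3 {
    gen double y`t' = rnormal()
    gen double x`t' = rnormal()
}

* Reshape the whole data set once and save the long version for reuse
reshape long y x, i(id) j(month)
save panel_long, replace

* Later, load only the observations a given analysis needs
use if month <= 2 using panel_long, clear
```

With the long file saved, each later analysis starts from -use- rather than from another -reshape-, so the expensive step is paid only once.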

Once you have transformed the entire data set into long, if your subsamples are mutually exclusive, as is often the case, and are indicated (or could be indicated) by a variable or group of variables, then you can run the regressions fairly quickly after that. If you do not need to store regression outputs and just need the results in your analysis log, then the -by:- prefix will very quickly take you through all of them. If after each regression you need to store or do something else with the regression results, then you can wrap the regression and its post-processing into a little program and run it with -runby- (by Robert Picard and me, available from SSC).
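
A rough sketch of both routes on toy data follows. The variable names (group, y, x), the program name one_regression, and the choice of which results to keep are all assumptions for illustration; the second part requires -runby- to be installed (ssc install runby).

```stata
* Toy long panel with a hypothetical subsample indicator "group"
clear
set seed 12345
set obs 300
gen long id = _n
gen byte group = mod(_n, 3) + 1
expand 10
bysort id: gen int month = _n
gen double x = rnormal()
gen double y = 1 + 0.5*x + rnormal()

* Route 1: results are only needed in the log
bysort group: regress y x

* Route 2: keep selected results from each regression
capture program drop one_regression
program define one_regression
    regress y x
    gen double b_x = _b[x]
    gen double se_x = _se[x]
    gen long n_obs = e(N)
    * Whatever is left in memory is what -runby- accumulates
    keep in 1
    keep group b_x se_x n_obs
end

runby one_regression, by(group)
list
```

After -runby- finishes, the data in memory are replaced by the stacked results, here one observation per group, so save the long data set first if it is needed again.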
Dominique Chaufree

Join Date: Mar 2019

Posts: 14
#3

22 Mar 2019, 13:02

Thanks a lot. This is very helpful!