faster to re-load data or preserve/restore?

Jack Reimer

Join Date: Sep 2018

Posts: 52
#1

22 Aug 2019, 13:28

I have a workflow that requires frequently collapsing data according to different specifications and then running regressions. The data are very large, and what feels like any operation typically takes 15-30 seconds or more. In principle, is it computationally less expensive to use preserve/restore repeatedly, or to simply clear and re-load the data set every time?

I know that my workflow is probably more cumbersome than it needs to be, and that the real answer is to bite the bullet and learn how to leverage Mata, but it would be helpful to know for future reference.
Tags: collapse, data, load data, preserve
Sergio Correia

Join Date: Apr 2014

Posts: 420
#2

22 Aug 2019, 13:42

You probably won't see much difference unless you are using Stata 16, where preserve/restore can work in memory.

Besides that, to answer your question we would need to know a bit more about your data. A few miscellaneous tips:
If you can, trim the data as much as possible (keep only the required observations and variables). Then make sure you are not using storage types that are larger than the data need (a "double" variable takes 8 times the space of a "byte"). See "describe" for that, as well as "compress".

When you collapse, use the "fast" option!

Even better, use gcollapse (ssc install gtools)

It might also help to use the collapse+append trick of fcollapse: fcollapse has an append option that saves the collapsed data at the end of the dataset, so you can collapse by many things without having to preserve/save (as long as you use the fast option).
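As a rough illustration of the first few tips, here is a minimal sketch. The file name and the variable names (y, x, group_id, year) are hypothetical placeholders, and Option B assumes the gtools package is installed.

```stata
* Sketch only: mydata.dta, y, x, group_id and year are hypothetical names.

* Load only the variables the collapse and regression need
use y x group_id year using "mydata.dta", clear

* Shrink storage types where this loses no information, then inspect them
compress
describe

* Option A: built-in collapse with the fast option
* (fast skips the safeguard that restores the data if you press Break)
collapse (mean) y x, by(group_id year) fast

* Option B (use instead of Option A): gcollapse from gtools
* ssc install gtools
* gcollapse (mean) y x, by(group_id year)
```

Whether Option B is faster in practice depends on the data, so it is worth timing both on a subsample.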

Best
S
Richard Williams

Join Date: Apr 2014

Posts: 4987
#3

22 Aug 2019, 16:58

To elaborate on Sergio's point, I wonder if Stata 16's new frames feature would make your life easier. Frames make it easy to have multiple data sets open at once. StataCorp claims "The do- and ado-files that you have previously written that use preserve and restore will run faster if you use Stata/MP because it secretly uses frames in place of temporary files to preserve data. The speed-up is sometimes remarkable. We have old do- and ado-files that run 20 percent faster."

https://www.stata.com/new-in-stata/m...ets-in-memory/
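For the collapse-then-regress workflow in the original question, using frames explicitly might look roughly like the sketch below. It assumes Stata 16 or later and that the current frame still has its startup name, default; the file, variable, and frame names are hypothetical.

```stata
* Sketch only (Stata 16+): mydata.dta, y, x, group_id, year, spec1 and spec2
* are hypothetical names.

* The full data are loaded once and stay untouched in the default frame
use "mydata.dta", clear

* Specification 1: collapse a copy in its own frame, then regress there
frame copy default spec1, replace
frame spec1 {
    collapse (mean) y x, by(group_id year)
    regress y x
}

* Specification 2: a fresh copy with a different grouping
frame copy default spec2, replace
frame spec2 {
    collapse (mean) y x, by(group_id)
    regress y x
}

* Free the memory held by the working frames
frame drop spec1
frame drop spec2
```

Each frame copy duplicates the data in memory, so with very large data this only helps if there is enough RAM for at least two copies.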

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Eric Goldstein

Join Date: Sep 2016

Posts: 5
#4

22 Aug 2019, 17:53

However, keep in mind that if the dataset is over a certain size (1 GB by default), preserve/restore will work the same in Stata 16 as it did in Stata 15. Use set max_preservemem to change that default to something larger if appropriate.
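A minimal sketch of checking and raising that limit. The 4g value is only an illustration; choose one that fits the machine's RAM.

```stata
* Show the current memory settings, including max_preservemem
query memory

* Allow preserve to keep up to 4 GB in memory (hypothetical value)
set max_preservemem 4g
```

Adding the permanently option to the set command keeps the setting across sessions.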