  • #16
    As Paul has pointed out in #2, Daniel Feenberg observed years ago that reshape long is inexplicably slow with larger datasets. To illustrate his point, he showed a technique that is significantly faster at performing a reshape to long form. The reason why reshape long is much slower was left to speculation. If you scan the code for reshape long, you find the following:

    Code:
            while "`1'"!="" {
                restore, preserve
                noisily Longdo `1'
                append using "`new'"
                save "`new'", replace
                mac shift
            }
    The code loops over the variables to reshape to long form. At each pass, it reloads the whole original wide dataset, keeps the i and j identifier variables plus the variable being reshaped (Longdo), and then performs an append-and-save cycle. As I have pointed out regularly to file appenders on Statalist, such a loop is very inefficient because the file that is constantly re-saved grows at each pass. Here's an example that illustrates just the I/O involved:

    Code:
    . * the width of i and j vars in bytes
    . local ij 12
    
    . 
    . * the width of ij vars and 1 variable to reshape
    . local ijvars = `ij' + 4
    
    . 
    . * the number of vars to reshape long
    . local nvars 100
    
    . 
    . * the number of observations
    . local nobs 10000
    
    . 
    . * the size of the original dataset
    . local dta = (`ij' + `nvars' * 4) * `nobs'
    
    . dis "size of data in MB = " `dta' / 1e6
    size of data in MB = 4.12
    
    . 
    . * the size of the saved dataset after each pass
    . clear
    
    . set obs `nvars'
    number of observations (_N) was 0, now 100
    
    . gen saved_size = sum(`nobs' * `ijvars')
    
    . 
    . * the I/O required at each pass:
    . * 1. restore
    . * 2. append using the saved_size at that point
    . * 3. save it again
    . gen cumulative_io = sum(`dta' + saved_size[_n-1] + saved_size)
    
    . dis %20.0fc cumulative_io[_N]
           2,007,719,936
    
    . 
    . * if you save each variable separately and then append them all
    . local pass1 = `nvars' * (`dta' + `nobs' * `ijvars')
    
    . local pass2 = `nvars' * `nobs' * `ijvars'
    
    . dis %20.0fc `pass1' + `pass2'
             444,000,000
    
    . 
    . * Daniel Feenberg goes one further, the first pass loads just
    . * what's needed and then saves each variable separately
    . local pass1 = `nvars' * `nobs' * `ijvars' * 2
    
    . local pass2 = `nvars' * `nobs' * `ijvars'
    
    . dis %20.0fc `pass1' + `pass2'
              48,000,000
    So reshape long is slower than needed because it combines two inefficient patterns: it reloads the full wide dataset at each pass, and it repeatedly re-saves a file that grows at each pass. If you play with the parameters of the example, e.g. increase the number of variables by a factor of 10, then you get

    Code:
    . dis %20.0fc cumulative_io[_N]
         200,079,720,448
    
    . dis %20.0fc `pass1' + `pass2'
             480,000,000
    So the I/O grows by a factor of 10 with Daniel's approach while the I/O with reshape long grows by a factor of 100.
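    To make the arithmetic above easy to check, here is a small Python sketch that recomputes the three I/O totals in exact integer arithmetic. (The posted Stata cumulative totals are a few megabytes below the exact figures, apparently because generate defaults to float; the growth rates are the same.)

```python
# Exact integer recomputation of the I/O totals from the example above.

def reshape_long_io(nvars, nobs, ij=12, width=4):
    """Bytes moved by reshape long's restore/append/save loop."""
    ijvars = ij + width                  # width of one long-form observation
    dta = (ij + nvars * width) * nobs    # size of the original wide dataset
    total, saved = 0, 0                  # saved = size of the growing long file
    for _ in range(nvars):
        total += dta                     # 1. restore the wide data
        total += saved                   # 2. append the file saved so far
        saved += nobs * ijvars           #    the file grows by one variable
        total += saved                   # 3. save it again
    return total

def save_then_append_io(nvars, nobs, ij=12, width=4):
    """Save each variable to its own file, then append them all once."""
    ijvars = ij + width
    dta = (ij + nvars * width) * nobs
    pass1 = nvars * (dta + nobs * ijvars)   # reload the wide data, write each file
    pass2 = nvars * nobs * ijvars           # read the files back to append
    return pass1 + pass2

def feenberg_io(nvars, nobs, ij=12, width=4):
    """Feenberg: load only what's needed once, save per variable, append once."""
    ijvars = ij + width
    return nvars * nobs * ijvars * 2 + nvars * nobs * ijvars

print(reshape_long_io(100, 10_000))      # 2,012,000,000 (exact)
print(save_then_append_io(100, 10_000))  # 444,000,000
print(feenberg_io(100, 10_000))          # 48,000,000

# With 10x the variables, reshape long's I/O grows ~100x, Feenberg's only 10x:
print(reshape_long_io(1_000, 10_000))    # 200,120,000,000
print(feenberg_io(1_000, 10_000))        # 480,000,000
```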



    • #17
      Robert Picard: Very good! The inexplicable becomes explicable. This also explains why my approach using lagged variables is faster: it involves no saves or appends.

      Is there a simple way to edit the inefficient paragraph of code that you've found in -reshape-? Or does the whole command need an overhaul to address this issue?



      • #18
        Sorry, but this is StataCorp's code, and I'm not going to offer advice on editing it. I think the best you can do is to voice your request in the Wishlist for Stata 15 thread.



        • #19
          oops.
          Last edited by Rebecca Boehm; 30 Aug 2016, 13:16. Reason: Wrong board!



          • #20
            I would like to restart this discussion with a question re: reshape.

            I am finding that -reshape- is very slow, even with Stata MP and with matsize increased to 3,000.

            I am curious whether, when I do -reshape wide- on a fairly large dataset where i = 4,826 and j = 150, the speed will depend on how much I have previously done in the current Stata session. Put another way, does it help to close and reopen Stata after it has been running for a while, wiping out everything in the Review window, etc.? I don't know exactly how the guts of Stata work, but I thought this might help to speed up the reshape process.

            Thanks for your information!



            • #21
              Rebecca, imho:
              • Stata MP doesn't help much. See the MP performance report: expect about 1.1x faster execution on MP2 than on SE.
              • Matsize will not help; as far as I can tell, matsize only needs to accommodate the largest j, which is 150 in your case.
              • The previous session state should be of no consequence. Stata has an excellent memory manager, and any detectable leaks are found and fixed before a version is released to the public; only rarely do we hear about one, such as here. The clear command should be just as good as restarting.
              • The screen buffer eats about 200 KB of memory (by default) and is limited to 2 MB, a negligible amount.
              • Robert Picard above has shown that the I/O operations (save, append, restore) constitute the main part of reshape's work, which means your STATATMP directory should point to the fastest available drive (RAM drive, SSD, etc.). You can then expect a reasonable boost.
              Best, Sergiy Radyakin
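              On the last point: as I understand it, Stata picks up its temporary directory from the STATATMP environment variable at startup, so it must be set in the environment Stata is launched from. A minimal Python sketch of the mechanics (the directory below is just a stand-in for a real SSD or RAM-drive path):

```python
import os
import tempfile

# Stand-in for a directory on your fastest drive (SSD, RAM drive, ...).
fast_dir = tempfile.mkdtemp(prefix="stata_tmp_")

# Stata reads STATATMP from the environment it is launched in, so set it
# before starting Stata (e.g. before invoking it via subprocess).
os.environ["STATATMP"] = fast_dir
print(os.environ["STATATMP"])
```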



              • #22
                Hello everyone,
                I have tried to reshape this data so that each household has a single observation; so far I have run into errors and can't figure out how to do it.
                I want each household to be represented without losing membership (mem), age, and b06. Help please
                hhid mem Gender Marital_stat Age b06
                1 4 female Never married 13 10
                1 2 female Married 32 8
                1 3 female Never married 21 6
                1 5 male Never married 18 4
                1 4 male Married 73 2
                1 6 male Never married 8 6
                1 7 female Never married 4 10
                2 3 female Never married 21 4
                2 6 male Never married 15 1
                2 4 male Never married 18 2
                2 5 female Never married 11 6
                2 7 male Never married 9 9
                2 2 female Married 40 1
                2 8 male Never married 7 10
                2 1 male Married 50 0
                3 2 female Married 28 4
                3 5 male Never married 1 6
                3 4 female Never married 5 2
                3 3 male Never married 12 5
                3 1 male Married 39 2
                4 4 male Never married 2 1
                4 2 female Married 28 2
                4 1 male Married 30 4
                4 3 male Never married 4 9
                5 2 female Married 22 9
                5 3 female Never married 1 11




                • #23
                  If interested in Co Ar's question, please follow the other thread at
                  http://www.statalist.org/forums/foru...reshaping-data

                  Last edited by Nick Cox; 01 Sep 2016, 06:02.



                  • #24
                    Very simple: use the following code:

                    Code:
                    ssc install parallel
                    parallel: reshape ...
                    In my experience, using parallel can make the reshape up to a hundred times faster.

                    Code:
                    reshape
                    command is actually a key niche of Stata.

                    Saving and growing data on disk rather than in RAM matters too, because no matter how much RAM you have, it is never enough from my perspective.

                    That is why I endorse Stata's approach of keeping a single .dta dataset in memory: RAM is scarce while storage is abundant.

                    To put it simply, RAM is measured in GB, while disk is measured in TB.

                    What first motivated me to learn Stata in school was the
                    Code:
                    reshape
                    command's power.

                    It's fantastic for dealing with unbalanced panel data and for checking data integrity.

                    The logical observation identifier (called "i") is not necessarily a consecutive number; for me, it is usually a random string (called a "key" in database language).

                    And since the panel data are unbalanced, the subobservation value (called "j"), although numeric, is not consecutive either.

                    For this kind of real-world data,
                    Code:
                    reshape
                    is quite error-proof, and its log messages are valuable.

                    Usually I
                    Code:
                    reshape wide
                    and
                    Code:
                    reshape long
                    , i.e., reshape twice, to find missing values.

                    For big data, I partition the data into blocks, then
                    Code:
                    reshape
                    each block and append the results back together.

                    I think using the blocking method and the
                    Code:
                    parallel: reshape
                    command is the key to dealing with the performance issue, not a user-written reshape command.
                    Last edited by Jimmy Yang; 04 Oct 2016, 21:40.
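                    For readers outside Stata, the blocking idea above can be sketched in Python/pandas (the dataset, column names, and helper function here are hypothetical; the point is reshaping one slice of the wide columns at a time and then appending the pieces):

```python
import pandas as pd

# Hypothetical wide dataset: an id column plus value columns inc1..inc6.
wide = pd.DataFrame({
    "id": [1, 2, 3],
    **{f"inc{j}": [j * 10 + i for i in range(3)] for j in range(1, 7)},
})

def blocked_reshape_long(df, stub, id_col, block_size):
    """Reshape wide -> long one block of stub columns at a time, then append."""
    value_cols = [c for c in df.columns if c.startswith(stub)]
    pieces = []
    for start in range(0, len(value_cols), block_size):
        block = value_cols[start:start + block_size]
        # melt only this block of columns (the per-block "reshape long")
        piece = df[[id_col] + block].melt(
            id_vars=id_col, var_name="j", value_name=stub)
        piece["j"] = piece["j"].str[len(stub):].astype(int)
        pieces.append(piece)
    # append the per-block results back together
    out = pd.concat(pieces, ignore_index=True)
    return out.sort_values([id_col, "j"]).reset_index(drop=True)

long_df = blocked_reshape_long(wide, "inc", "id", block_size=2)
```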



                    • #25
                      Given that this is the first thread that shows up on Google (for me, anyway) when you search for "stata faster reshape", I wanted to drop a reference here to sreshape by Kenneth L. Simons, fully documented in The Stata Journal. It accepts the same syntax as the standard reshape command, so no extra learning is required: simply add an "s" to your "reshape". For my particular application it was 10x faster in a subsample test, which might save me hours if not days in the full sample (teffects nnmatch with large data creates a lot of variables).



                      • #26
                        I tried -sreshape-, but it seems to have a maximum variable limit. I have 7,000 variables in wide format that I wanted to -sreshape-, but I got an error saying "numlist too large", r(123).

                        any thoughts?

