Stata's reshape command is an essential tool for data prep work. However, it is well known that reshape performs poorly on large datasets -- see these benchmarking results and this Statalist topic for additional context. Because this poor performance often imposes a significant barrier to my research team's workflow (our reshapes can take hours!), I went ahead and coded up the suggestions from the previously mentioned Statalist topic into an .ado file, fastreshape, that should work for any kind of reshape. I imagine that this program will be useful to anyone who uses Stata to process large datasets.
In short, fastreshape is significantly faster than reshape in most use cases, particularly for wide-to-long reshapes. I ran a number of benchmarks with Stata-MP on Stanford's cluster computing service, and the results show that wide-to-long reshapes run between 2 and 15 times faster when using fastreshape. Long-to-wide reshapes see a smaller, but still substantial, speedup of 1.5 to 5 times.
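For anyone who wants to check the speedup on their own machine, here is a minimal timing sketch. The dataset size, the variable names (id, x, t), and the seed are arbitrary choices for illustration, not the setup used in my benchmarks, and it assumes fastreshape is already installed on your adopath. Timings will vary with your Stata flavor and hardware.

```stata
* Build a hypothetical wide dataset: 100,000 ids, 20 stub variables x1..x20
clear
set seed 12345
set obs 100000
generate id = _n
forvalues t = 1/20 {
    generate x`t' = runiform()
}

timer clear

* Timer 1: built-in reshape, wide to long
preserve
timer on 1
reshape long x, i(id) j(t)
timer off 1
restore

* Timer 2: fastreshape with the same arguments
preserve
timer on 2
fastreshape long x, i(id) j(t)
timer off 2
restore

* Report elapsed seconds for both timers
timer list
```

The preserve/restore pairs put the wide data back before each run, so both commands start from the same dataset.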
The syntax and output of fastreshape mirror those of reshape, with a few notable exceptions. First, -fastreshape error- does not identify problem observations in cases where the program fails (as reshape does). Second, the atwl(chars) option is not yet supported. Lastly, fastreshape does not yet return all of the information in macros that reshape does. In my experience, these features are not particularly important, but I would like to implement them in the near future, and I don't imagine they will slow the program down at all. In addition, I have incorporated a new optional argument ('fast') that lets the user skip sorting the dataset post-reshape for an additional modest performance boost. The default behavior, as with -reshape-, is to sort by i and j for wide-to-long reshapes and by i for long-to-wide reshapes.
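To make the syntax concrete, here is a small sketch on a toy dataset. The data and the variable names (id, inc, year) are made up for illustration, and I am assuming that 'fast' is passed after the comma like any other option.

```stata
* Toy wide dataset: one row per id, with inc2001..inc2003 (hypothetical data)
clear
set obs 5
generate id = _n
forvalues y = 2001/2003 {
    generate inc`y' = 100 * id + `y'
}

* Standard reshape call, for comparison
preserve
reshape long inc, i(id) j(year)
restore

* Same arguments with fastreshape; by default the result is sorted by id and year
fastreshape long inc, i(id) j(year)

* Back to wide, skipping the final sort by id with the 'fast' option
fastreshape wide inc, i(id) j(year) fast
```

With 'fast', do not rely on the row order of the result; sort explicitly afterwards if a later step needs it.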
Although I think the program can replace the vast majority of reshape calls out of the box with no modification of syntax, I should caution that it has not been tested by anyone other than myself, so there may be bugs. If you have any suggestions for additional functionality or would like to report a bug, please let me know in this topic, or alternatively create an issue / pull request on GitHub. I will continue to test the program over the next week or two before submitting to SSC. Thanks!
Read more here: https://github.com/mdroste/stata-fastreshape
Shout-outs to Robert Picard, Paul von Hippel, and Daniel Feenberg for the Statalist commentary that inspired this program.