
  • fastreshape - more efficient implementation of reshape for big datasets

    Stata's reshape program is an essential tool for data prep work. However, it is well known that reshape performs poorly on large datasets (see these benchmarking results and this Statalist topic for additional context). Because this poor performance often imposes a significant barrier to my research team's workflow (our reshapes can take hours!), I went ahead and coded up the suggestions from the previously mentioned Statalist topic into an .ado-file that should work for any kind of reshape. I imagine that this program will be useful to anyone who uses Stata to process large datasets.

    In short, fastreshape is significantly faster than reshape in most use cases, particularly for wide-to-long reshapes. I ran a number of benchmarks with Stata-MP on Stanford's cluster computing service. They show that wide-to-long reshapes run between 2 and 15 times faster with fastreshape, and that long-to-wide reshapes run a more modest (but still substantial) 1.5 to 5 times faster.
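
    For anyone who wants to check the speedup on their own machine, here is a minimal timing sketch. It is not the benchmark described above: the dataset size, the seed, and the variable names (id, x1 to x20, t) are arbitrary assumptions, and it assumes fastreshape is already installed.

    ```stata
    * Hypothetical test data: 100,000 ids by 20 wide variables (sizes are arbitrary)
    clear
    set seed 12345
    set obs 100000
    generate long id = _n
    forvalues t = 1/20 {
        generate double x`t' = runiform()
    }

    timer clear

    * Time the built-in command, then restore the wide data
    preserve
    timer on 1
    reshape long x, i(id) j(t)
    timer off 1
    restore

    * Time fastreshape on the same wide data
    preserve
    timer on 2
    fastreshape long x, i(id) j(t)
    timer off 2
    restore

    * Timer 1 is reshape, timer 2 is fastreshape (seconds)
    timer list
    ```

    Timings will depend on the Stata flavor, the number of cores, and the shape of the data, so a single run like this is only a rough indication.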

    The syntax and output of fastreshape mirror those of reshape, with a few notable exceptions. First, -fastreshape error- does not identify problem observations in cases where the program fails (as reshape does). Second, the atwl(chars) option is not yet supported. Lastly, fastreshape does not yet return all of the information in macro objects that reshape does. In my experience these features are not particularly important, but I would like to implement them in the near future, and I don't imagine they will slow the program down at all. In addition, I have incorporated a new optional argument ('fast') that lets the user skip sorting the dataset post-reshape for an additional modest performance boost. The default behavior, as with -reshape-, is to sort by i and j for wide-to-long reshapes and by i for long-to-wide reshapes.
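
    As a rough illustration of the syntax described above, here is a small sketch on a toy dataset. The data and variable names (id, inc2016 to inc2018, year) are made up for the example, and passing 'fast' as an option after the comma is an assumption based on the description above.

    ```stata
    * Toy wide dataset; the variable names are hypothetical
    clear
    input id inc2016 inc2017 inc2018
    1 100 110 120
    2 200 210 220
    end

    * Same syntax as -reshape-; by default the result is sorted by id and year
    fastreshape long inc, i(id) j(year)
    list

    * Back to wide, skipping the post-reshape sort with the 'fast' option
    fastreshape wide inc, i(id) j(year) fast
    list
    ```

    Since the syntax mirrors reshape, replacing "fastreshape" with "reshape" (and dropping 'fast') should run the same example with the built-in command.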

    Although I think the program can replace the vast majority of reshape calls out of the box with no modification of syntax, I should caution that it has not been tested by anyone other than myself, so there may be bugs. If you have any suggestions for additional functionality or would like to report a bug, please let me know in this topic, or alternatively create an issue / pull request on GitHub. I will continue to test the program over the next week or two before submitting to SSC. Thanks!

    Read more here: https://github.com/mdroste/stata-fastreshape

    Shout-outs to Robert Picard, Paul von Hippel, and Daniel Feenberg for the Statalist commentary that inspired this program.
    Last edited by Michael Droste; 18 Jan 2018, 15:07.

  • #2
    I have submitted this program to SSC!



    • #3
      Thank you very much for the reshape command. Can you please help me reshape my dataset?

      I have a dataset in long format, so a household with multiple sources of income has multiple rows. I want to reshape it to wide, with each income source becoming a variable (column), but the issue is that a household can have two incomes from the same source. How can I reshape so that I get the following?
      ID incomesource1 amount incomesource2 amount and so on ....

      example
      ID  Source  Amount
      1   Apple   2000
      1   Apple   400
      1   Orange  400
      2   Apple   400
      2   Apple   400


      I want the data set as follows:


      ID  Apple1  Apple2  Orange
      1   2000    400     400
      2   400     400


      Thank you



      • #4
        Hi there - welcome to Statalist. Your question is a general one about how reshaping works and its syntax, rather than about fastreshape, so I would recommend starting another topic. You will find that the dataset you want cannot generally be constructed from your input dataset because the value of 'source' is not unique within 'id'. It is ambiguous (and left to sort order) what is meant by apple1, apple2, etc. in the case you describe; for instance, should apple1 for id=1 be 2000 or 400? That is basically why reshape prohibits you from doing what you're asking.
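
        If the current row order is an acceptable way to resolve that ambiguity, one common workaround is to number the duplicates within id and source and fold that number into the j variable. The sketch below uses the example data from the question; the helper variables (order, seq, key) are hypothetical names, and treating the existing row order as meaningful is an assumption.

        ```stata
        * Example data from the question
        clear
        input id str10 source amount
        1 "Apple"  2000
        1 "Apple"  400
        1 "Orange" 400
        2 "Apple"  400
        2 "Apple"  400
        end

        * Record the current row order so the numbering is reproducible
        generate long order = _n

        * Number repeated sources within each id: Apple -> 1, 2; Orange -> 1
        bysort id source (order): generate seq = _n

        * Build a j variable that is unique within id, e.g. "Apple1", "Apple2"
        generate key = source + string(seq)
        drop source seq order

        * key is now unique within id, so reshape wide is allowed
        reshape wide amount, i(id) j(key) string
        list
        ```

        This yields one row per id with variables named amountApple1, amountApple2, and amountOrange1. Which amount ends up in Apple1 versus Apple2 depends entirely on the row order at the start, which is exactly the ambiguity described above.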

        BTW, fastreshape is now up on SSC.



        • #5
          Hi Michael, I searched for the -fastreshape- command but cannot find it. Any suggestions?

          Ho-Chuan (River) Huang
          Stata 19.0, MP(4)



          • #6
            The command ssc describe fastreshape shows it; since it went up today, it probably won't appear in Stata's index for search until tomorrow.
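
            In the meantime, the package can be checked and installed directly from SSC without going through search. A short sketch, assuming an internet connection and that the package name on SSC is fastreshape:

            ```stata
            * Show the package description and file list on SSC
            ssc describe fastreshape

            * Install it (add the replace option to update an existing copy)
            ssc install fastreshape

            * Confirm where the installed .ado-file lives
            which fastreshape
            ```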



            • #7
              Hi William, thanks for your suggestion. I have installed the package.

              Ho-Chuan (River) Huang
              Stata 19.0, MP(4)
