
  • Making reshape faster

    I'm not the first to notice that the reshape command can be unnecessarily slow. In a big program I'm finding that the reshape command is taking a substantial fraction of the runtime, and that I can reshape the data more quickly using other commands. Granted, I'm solving a specific problem, and reshape is designed to be general. But still it's surprising to me that I can outperform a core command with very simple alternatives.

    I would suggest that Stata update the command to make it more efficient, or that some user develop an alternative command, reshape_fast, that accomplishes the same things faster.

    Here is a simple example of what I'm talking about.


    /* Simulate some long data. */
    clear
    set obs 999
    gen id = 1
    gen i = _n
    gen v = rnormal()
    save long, replace

    /* I can make it wide, using reshape. */
    timer clear 1
    timer on 1
    reshape wide v, i(id) j(i)
    timer off 1
    timer list 1
    /* That took 6 seconds. Why? */

    /* Look how much faster I can do the same thing with explicit subscripting. */
    use long, clear
    timer clear 2
    timer on 2
    forvalues i = 1/999 {
        gen v`i' = v[`i'] in 1
    }
    drop v i
    keep in 1
    timer off 2
    timer list 2
    /* That took 0.03 sec! */

    /* Likewise I can go from wide to long using reshape. */
    save wide, replace

    timer clear 3
    timer on 3
    reshape long v, i(id) j(i)
    timer off 3
    timer list 3
    /* That took 4 seconds. Why? */

    /* Look how much faster I can do the same thing with expand. */

    use wide, clear

    timer clear 4
    timer on 4
    expand 999
    gen v = .
    gen i = .
    forvalues i = 1/999 {
        replace v = v`i' in `i'
        replace i = `i' in `i'
    }
    drop v1-v999  // drop the wide variables so the result matches -reshape long-
    timer off 4
    timer list 4
    /* That took 0.01 sec! */

    Last edited by paulvonhippel; 30 Apr 2016, 19:15.

  • #2
    Dan Feenberg and Friedrich Huebler have previously posted evidence that they can outperform reshape under some circumstances:
    http://www.nber.org/stata/efficient/reshape.html
    http://www.stata.com/statalist/archi.../msg00888.html
    My approach is even simpler than theirs, and even faster, at least for my problem.



    • #3
      Well, yes, the actual reshaping of the data, once you have identified all the variables involved, etc., can be done far more quickly. But that is not the same as a general-purpose -reshape- command that a) parses and interprets the details of just what to reshape, taking on all comers, b) troubleshoots the data set to make sure that the desired reshaping is compatible with the data as it exists, c) exits gracefully (i.e., without leaving the data in some useless, unrecoverable state), and d) gives informative error messages if there is a problem.

      For small and medium size data sets, -reshape- runs at a reasonable speed. I have had a few occasions working with large data sets where -reshape-'s performance was aggravatingly slow, and also had some situations where -reshape- failed because I didn't have enough memory to hold the intermediate version of the data set that needed to be created before winnowing down to the final -reshape-d version. In one case, I hand coded the process, using code analogous to what you describe in your post.

      But in general, I don't advise doing that. Any really fast program that you write as a Stata ado to reshape a large data set is likely to have special attributes of that data set embedded in the code. (For instance, your code works fine when there is only one value of the i() variable, but it will not easily generalize.) It will not be a general-purpose -reshape- command and retain that speed. The danger is that you will then try to use it on other data sets for which it is not customized. Perhaps you will try to tweak the code to adapt it to the newer data set, but you could easily get it wrong, or mistakenly do only a partial tweak. If you are lucky, you will fail in an obvious way. If you are unlucky, it will look like you succeeded, but you will have incorrect results that you don't realize are wrong until much later, when your analyses begin to produce garbage. If you are really unlucky, you won't find out that it's wrong until after you have published your results, or shown them to a client, etc.

      StataCorp has developed, maintained, and improved its programs over the years and has thoroughly tested them on a broad range of use cases. While bugs do turn up now and then, Stata has proven itself extremely reliable. When you write your own programs, you bring to them expertise on the specific data and problems you are working on, but can you match the expertise in statistical programming that StataCorp offers? You attempt it at your own risk. As Edsger Dijkstra used to say about people who would revise robust code to make it faster but less reliable: "why is everybody in such a hurry to get the wrong answers?"

      There is also the economic aspect of it. Computers are cheap; programming is labor-intensive and expensive. If Stata is chronically taking too long to do analyses that are a routine part of your work, then rather than trying to rewrite Stata, you will probably be better off procuring a faster computer (or a computer with more memory, or more cores). Even if you were an expert Stata programmer, writing general-purpose robust programs that substantially improve on Stata's performance without sacrificing correctness would take you many, many hours of labor--it probably isn't worth it.

      Another possibility is that Stata is actually performing well and you are just impatient. We do live in a world that is geared towards instant gratification, and that is not always achievable with very large data problems. If getting a faster computer doesn't solve the problem, consider cognitive behavioral therapy.



      • #4
        I've got a fast computer (3.6 GHz) with a lot of memory (32 GB). I'm reasonably patient, but I'm running a program on a large dataset, and about 2 hours of the runtime is due to reshape. And when I run the same program on an even larger dataset, reshape takes so long that I'm not sure it's going to finish at all.

        Hence my custom solution. And I do wonder if reshape could be rewritten to run faster.



        • #5
          The fact that reshape runs in 6 sec while an alternative runs in 0.03 sec is no big deal if you're only reshaping once. But I'm reshaping 78,000 times -- it adds up!
          6 seconds repeated 78,000 times is 468,000 seconds, or about 5.4 days.



          • #6
            Perhaps the inefficiency is the result of reshaping the data 78k times? I suspect you might be able to get the same or similar results using other methods (e.g., using appends/joins from different files, throwing the data into an RDBMS backend and using SQL to handle some of the data management before importing into Stata, etc.). Maybe you could share a bit about your use case?



            • #7
              Oh, we've looked at a bunch of different options, and I have a working solution in the form of the expand/replace code I presented earlier as an alternative to reshape long.

              This thread is more about what Dan Feenberg has called the "inexplicable slowness" of reshape.



              • #8
                A few thoughts about how to approach your problem.

                1. As wbuchanan suggests, doing 78K -reshape-s sounds to me like a very inefficient approach. It is likely that there is a way to revise the algorithm so as to avoid that. Presumably on each of these 78K iterations the data set has changed from the previous iteration. (If not, the obvious solution is to just save the data once in long layout and once in wide, and then re-load whichever is needed at each point in the calculations.) It may be possible to save the data in long layout, -reshape- it once to wide layout and save that, and then mirror in both layouts whatever calculations change the data in each iteration. Admittedly there are some calculations in long that are very difficult to mirror in wide, but most are possible. For that matter, it may be that you really need the data in only one of the two layouts and can modify the calculations you have done in one to apply to the other.

                2. If you must actually -reshape- in each iteration, prepare your data set as well as possible. Before you -reshape-, get rid of any observations that you will not actually use after the -reshape-. Equally critically, or perhaps more so, before -reshape-ing, drop any variables that will not be needed afterward. When you -reshape wide-, you obviously need to retain the i() and j() variables. The rest may or may not be needed. Those that are not needed are very expensive to carry along in the reshape: each of them must be checked for internal consistency within the i()-groups, a very slow process.

                3. Do you actually need to fully reshape the data? If, for example, you are -reshape-ing from long to wide because you have panel data and a variable X for which you need separate variables by panel, then instead of a full-blown -reshape wide X, i(timevar) j(panelvar)- you might be able to use the results of -separate X, by(panelvar)- instead (see the sketch below).
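
                To make points 2 and 3 concrete, here is a hedged sketch with invented names (mydata.dta, timevar, panelvar, X, and the range kept); it illustrates the ideas, not a recipe for any particular data set.

                * Hypothetical long-layout panel data with variables timevar, panelvar, X,
                * plus extra variables that are not needed after reshaping.
                use mydata, clear

                * Point 2: carry only what the wide layout actually needs before -reshape-.
                keep if inrange(timevar, 1, 100)   // drop observations not used afterward
                keep timevar panelvar X            // unneeded variables are costly to check and carry
                reshape wide X, i(timevar) j(panelvar)

                * Point 3: sometimes -separate- is enough and no -reshape- is needed at all.
                use mydata, clear
                separate X, by(panelvar)           // creates X1, X2, ... without restructuring the data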

                I don't know if any of these are applicable to your problem, nor am I sure just how much time they will save, but they are probably worth looking into.

                By the way, there is something odd going on with your computer. You report both more RAM and a faster clock speed than I have on my laptop, but your code in #1 runs about 30% faster on my laptop than what you report. Do you perhaps have other processes running in the background that are dragging down your machine's performance?



                • #9
                  This conversation is straying from my intended subject of why the reshape command isn't faster!

                  I have 78k by-groups, each of which needs something like a -reshape long- once during processing. Since -reshape long- is slow, I came up with the alternative of using -expand- and -replace- instead (as described in my original post).

                  I am satisfied with the solution. I am puzzled by the slow performance of -reshape long-.



                  • #10
                    Well, to return to your question about why -reshape long- is slow, it is what I pointed out in #3. -reshape- does not just move the data around. It first has to figure out from the command line what data you want moved in what way, then it has to check the consistency of the non-i()/j() variables within i(), and then it has to figure out how much to expand each observation and where to put everything. The speed penalty comes from its general-purpose nature and the time it spends figuring out exactly what to move where. In your specific situation you can take advantage of your knowledge of the data structure, skip most of these steps, and go straight to moving things around.



                    • #11
                      I've had the same experience, namely that a simple user-written program can be much faster than -reshape- for a given simple case. One example I've had occasion to work with starts with network-type data in edge (long) format, i.e., just three variables: ego, alter, distance, so that each observation is a pair; ego and alter are id numbers, and distance is some measure of the connection of each pair. Converting this to wide format, of the form ego dist1 dist2 ... distN (one distance variable per alter), can often be useful, but it is quite slow with reshape, and it's easy enough to do much faster with a user-written program. (The conversion is sketched below.)
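
                      For concreteness, the slow baseline being described is roughly the following (a sketch only; the file name pairs.dta is invented, while ego, alter, and distance are the variables named above):

                      * Edge-list (long) layout: one observation per (ego, alter) pair.
                      use pairs, clear
                      * One row per ego, one distance variable per alter; slow for large N.
                      reshape wide distance, i(ego) j(alter)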

                      However, I just ran a small set of experiments and found that, in this situation, the total time grows well less than linearly with N: I took a series of data sets of N pairs (i.e., sqrt(N) individuals paired up), ranging from N = 2,500 pairs up to N = 250,000, and the time per pair was about 1/3 as long at N = 250,000 as at N = 2,500. So the -reshape- code does at least amortize its fixed costs over larger problems, which is consistent with the overhead (of error checking, etc.) being a roughly constant load.

                      However, I still find it surprising that a program in which all the reshaping code is written in C (i.e., does not have to be interpreted) can be massively slower on a given problem than a special-purpose program written in Mata (which I have tried; results not shown here). I would think that, for large N, the overhead cost for -reshape- would be trivial and the speed of non-interpreted code would be superior.

                      I think that the point here is that, if one has a large but simple reshape, it's worth taking the do-it-yourself approach. Perhaps someone would like to try writing a version of -reshape- with restricted functionality, but better speed, say -big_reshape-, that would be designed for situations such as are described here.



                      • #12
                        Originally posted by Mike Lacy
                        However, I still find it surprising that a program in which all the reshaping code is written in C (i.e., does not have to be interpreted) can be massively slower on a given problem than a special-purpose program written in Mata (which I have tried; results not shown here). I would think that, for large N, the overhead cost for -reshape- would be trivial and the speed of non-interpreted code would be superior.
                        The code for reshape is entirely written in the ado language; neither C nor Mata is used. The core process involves loading, saving, -merge-ing, and -append-ing temporary files (a toy illustration follows below). My guess is that the code could probably be improved, but then again, other things are likely considered more important by StataCorp. Also, typically you do not reshape a thousand times, but once or twice to solve a given problem. It is then a minor inconvenience to wait a minute instead of a second.
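
                        To illustrate why that strategy can be slow, here is a minimal, hedged toy version of a wide-to-long step built on temporary files and -append- (invented variables id and v1-v3; an illustration of the general approach described above, not StataCorp's actual code). Each pass through the loop re-loads and re-saves files, so the I/O cost grows with the number of stub values.

                        * Assume a wide data set in memory with variables id, v1, v2, v3.
                        tempfile master stacked
                        save `master'
                        forvalues j = 1/3 {
                            use `master', clear
                            keep id v`j'
                            rename v`j' v
                            generate j = `j'
                            if `j' == 1 {
                                save `stacked'
                            }
                            else {
                                append using `stacked'
                                save `stacked', replace
                            }
                        }
                        use `stacked', clear
                        sort id j
                        * Result: long layout with variables id, j, v.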

                        Best
                        Daniel
                        Last edited by daniel klein; 02 May 2016, 03:37.



                        • #13
                          I wonder if -reshape- was written a while back when datasets were smaller. Now that large data are more common, it would be worth rewriting.



                          • #14
                            daniel klein : Very interesting. I took a quick look at the source, and amidst tons of parsing of input and the like, I didn't see much of anything that looked like actual "work." I stand corrected.



                            • #15
                              A compiled version of -reshape- should be much faster, eh?

