Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Big data: Recalling previous sort orders

    Dear Statalist,

    To give some background - I frequently work with big data in Stata (on the order of 5-50gb, say, with 500million-1billion observations). Since using Stata on data of this size, I've found that the biggest bottleneck in terms of time is sorting. Often when I find a command taking a long time, I realize that it's due to an internal sort taking place (collapse, merge, etc). A single sort can take 1-2 hours on data of this size.

    A very large number of Stata commands are very time-inefficient with such data because of unnecessary sorts and other conventions. ( Eg, many programs generate a "touse" variable, whether or not if/in conditions were specified. Why not have -if `"`if'`in'"' != "" marksample..."- ? Moreover, Stata programs typically sort by the touse variable, rather than just viewing the data in mata with the touse variable as a selectvar [O(n) time rather than O(nlogn) time?] ). Ideally Statacorp could write internal program more efficiently. (Eg, the most common command I use for data analysis is collapse, which could be written vastly more efficiently using a hash table in C, rather than the current sort-method).

    Notwithstanding trying to rewrite core Stata commands, I'm trying to efficiently save different sort orders of the data and be able to return to the orders without having to wait hours while Stata -sort-s. Writing some commands that require a different sort order into a program marked with sortpreserve sometimes works, but this will only preserve one particular sort order at a time. See below for a few attempts at storing/recovering sort orders efficiently, but I want to see if anyone has feedback.

    Attempt 1
    If you write a program with the , sortpreserve option, list and macro dir, within the program, you can see that Stata creates a temporary variable equal to the initial sort order (eg gen long `tempvar' = _n) and stores a local macro called _sortindex with the name of the temporary variable. I've tried both a) changing the contents of the local _sortindex to a different order variable of my choosing, and b) changing the values of the temporary variable, but neither method seems to work.

    Code:
    program define _sortrestore, sortpreserve
    syntax varname
    
    // method 1
    local _sortindex `varname'
    
    // method 2
    // replace `_sortindex' = `varname'
    
    end
    
    clear
    set obs 10
    gen n = _n // initial sort order
    gen x = ceil(10*runiform()) // example data
    sort x
    
    // now try to return to the original sort order, n
    list
    _sortrestore n
    list
    
    // clearly we haven't reverted to the original sort order.
    // what went wrong?
    Attempt 2
    Use mata to reorder each variable separately. Ideally we would just st_view the data and use collate(), but collate() isn't allowed on a view. Why? As a work around, we need to do the somewhat wasteful task of making a copy of each stata variable in mata, re-ordering it, then storing it back over the stata variable. This works, but I would hope that there is a more efficient method available.

    Define Program
    Code:
    cap program drop fastorder
    program define fastorder
        syntax varname(numeric)
        /*
        Must specify an ID variable that maps each
        observation to the row that it should be
        mapped to. Ie - each observation of the
        variable must be a unique integer between
        1 and the total number of observations.
        
        Note: there is no error check to confirm
        that the user specified a valid ID variable
        */
        
        mata _st_fastorder(`"`varlist'"')
        
    end
    
    cap mata mata drop _st_fastorder()
    mata
    void _st_fastorder(string scalar idcol) {
    
        // Initializations
        id = st_data(.,idcol)
        V = st_nvar()
        
        // re-order each stata variable
        for (v=1;v<=V;v++) {
            if (st_isnumvar(v)==1) st_store(id,v,st_data(.,v))
            else st_sstore(id,v,st_sdata(.,v))
        }
        
    }
    end
    Test program
    Code:
    clear all
    set obs 10
    
    gen x1 = ceil(100*runiform())
    gen x2 = ceil(100*runiform())
    
    // Suppose we wish to save our original sort
    // order and come back to it later
    gen long n = _n
    order n
    /*
    .... some analysis
    */
    
    // And we want to do some commands with a new
    // sort order. Perhaps we also wish to return
    // to this ordering later on
    sort x1
    gen long n1 = _n
    /*
    .... some more analysis
    */
    
    sort x2
    /*
    .... even more analysis
    */
    
    // Now we can return to the original order
    // with faster (O(n) time), rather than
    // needing to sort by n (O(nlogn) time)
    li
    fastorder n
    li
    // this correctly restores the data to the original order
    Thank you,

    Andrew Maurer

  • #2
    Update:
    Previously I was seeing that even if the `_sortindex' variable was replaced with a user specified n, programs specifying sortpreserve would still somehow revert to the initial order, not the ordering given by n. However, and I have no idea why, but it appears to not be the case if the `: sortedby' extended macro is empty. Thus, you can force a specific sort order by first clearing the `: sortedby' flag and then running a program with the sortpreserve option and replacing the values of `_sortindex'.

    See below for the programs and benchmark.

    Results
    Benchmarked one iteration each on 10 million and 100 million observations
    • Stata's sort: 11.2 seconds and 178 seconds
    • fastorder: 4.9 seconds and 58 seconds
    Define Programs
    Code:
    // fastorder v2.0 8/21/2014 Andrew Maurer
    // - exploits trick with , sortpreserve 
    // benchmarking fastorder vs sort:
    // 4.9 seconds vs 11.2 seconds in one test of 10 million observations
    // 58 seconds vs 178 seconds in one test of 100 million observations
    
    cap program drop fastorder
    program define fastorder
    syntax varname(numeric)
        // In order to force a given sort order, the values of `_sortindex' must
        // be changed AND `: sortedby' must be missing when the , sortpreserve 
        // program is called
        
        // First ensure that `: sortedby' is empty
        // If not empty, use trick to force it empty
        local sorted_var1 : word 1 of `: sortedby'
        if "`sorted_var1'" != "" {
            /*
            Pedantic code here to be positive that we're changing
            the value of `sorted_var1'. If the value in 1 was already
            . and we "replace... = .', the `:sortedby' flag wouldn't
            change.
            */
            local strvar = ( substr(`"`: type `sorted_var1''"',1,3) == "str")
            local oldval = `sorted_var1'[1]
            if !`strvar' {
                replace `sorted_var1' = cond(!mi(`oldval'),.,1) in 1
                replace `sorted_var1' = `oldval' in 1
            }
            else {
                replace `sorted_var1' = cond(!mi(`oldval'),"","a") in 1
                replace `sorted_var1' = `"`oldval'"' in 1
            }
        }
        
        fastorder_inner `varlist'
        
        // dataset already sorted by `varlist', but flag `: sortedby' flag not set
        sort `varlist' 
        
    end
    
    cap program drop fastorder_inner
    program define fastorder_inner, sortpreserve
    syntax varname(numeric)
        
        replace `_sortindex' = `varlist'
        
    end
    Benchmark sort vs fastorder
    Code:
    local N = 10^8
    set obs `N'
    gen long n = _n
    gen x1 = ceil(100*runiform())
    sort x1
    //replace n = _N in `=_N-1'
    //replace n = _N-1 in `=_N'
    tempfile temp
    save `temp'
    
    use `temp', clear
    timer on 10
    fastorder n
    timer off 10
    
    use `temp', clear
    timer on 20
    sort n
    timer off 20
    
    timer list

    Comment


    • #3
      That's a *really* cool trick! (but dangerous of course)!
      That said, I agree that we could benefit *a lot* if some of the most used commands (looking at you, collapse and egen) are tweaked to make them more careful with sorting and memory usage

      Comment


      • #4
        I agree that approach 1 is a bit dangerous. If this is an undocumented feature, StataCorp has the right to change their implementation of sortpreserve at any time.

        I wonder if subscripting with permutation vectors could help. For instance, _st_fastorder() could look like this:

        Code:
        void _st_fastorder(string scalar idcol)
        {
            real matrix V
        
            pragma unset V
            st_view(V, ., .)
            V[st_data(., idcol),] = V
        }
        I think this won't create a copy of the dataset: you can stick to the view. Of course, you'd have to update the code to allow for string variables, and views won't work for strLs. When I benchmark this, it seems a bit faster than your fastorder of approach 2.

        help m1_permutation is a great resource, as is this Stata journal article: http://www.stata-journal.com/sjpdf.h...iclenum=pr0028

        Comment


        • #5
          Thanks, Andrew, I think the problems with sorting are reasonably well known, and also came up in different forms on the Wish list for Stata 14 thread.

          collate and permutation are great resources, though I am not sure how great idea it is to sort views, which might reorder the specific variable in RAM, making them inconsistent with the other variables.

          Also, I'd worry about how general one could make the imposition of a sort order. Missing values or selection could easily mess up the sort order we think one could derive for simpler cases (e.g. for a binary variable where the order within 0 and 1 bins don't matter, calculating indices should be trivial after a simple -sum, meanonly-, then imposing just that). But in general, yes, this kind of approach would be very useful to use more wisely, at least for StataCorp, but also for third-party developers, if done robustly.

          Note though that your own examples with 100 million observations took no more than 3 minutes. Did you really mean 1-2 hours? Or maybe if Stata needs to move around lots of covariates in RAM, and RAM is slow on the system? (In any case, I am no CS expert, but it sounds wasteful to move data around in RAM for operations that just need different indexing.)

          I am not sure about your hash tables in C, but surely -collapse- is orders of magnitude faster with different approaches like MonetDB's or Python pandas smarter indexing (and column-based storage), and/or distributing computation better (BigQuery and other MapReduce derivatives on Hadoop or otherwise). Even old SAS and some SQL implementations (incl. MonetDB) have much faster merges.

          Comment

          Working...
          X