Big data: Recalling previous sort orders

Andrew Maurer

Join Date: Apr 2014

Posts: 28
#1

Big data: Recalling previous sort orders

21 Aug 2014, 12:16

Dear Statalist,

To give some background - I frequently work with big data in Stata (on the order of 5-50gb, say, with 500million-1billion observations). Since using Stata on data of this size, I've found that the biggest bottleneck in terms of time is sorting. Often when I find a command taking a long time, I realize that it's due to an internal sort taking place (collapse, merge, etc). A single sort can take 1-2 hours on data of this size.

A very large number of Stata commands are very time-inefficient with such data because of unnecessary sorts and other conventions. ( Eg, many programs generate a "touse" variable, whether or not if/in conditions were specified. Why not have -if `"`if'`in'"' != "" marksample..."- ? Moreover, Stata programs typically sort by the touse variable, rather than just viewing the data in mata with the touse variable as a selectvar [O(n) time rather than O(nlogn) time?] ). Ideally Statacorp could write internal program more efficiently. (Eg, the most common command I use for data analysis is collapse, which could be written vastly more efficiently using a hash table in C, rather than the current sort-method).

Notwithstanding trying to rewrite core Stata commands, I'm trying to efficiently save different sort orders of the data and be able to return to the orders without having to wait hours while Stata -sort-s. Writing some commands that require a different sort order into a program marked with sortpreserve sometimes works, but this will only preserve one particular sort order at a time. See below for a few attempts at storing/recovering sort orders efficiently, but I want to see if anyone has feedback.

Attempt 1
If you write a program with the , sortpreserve option, list and macro dir, within the program, you can see that Stata creates a temporary variable equal to the initial sort order (eg gen long `tempvar' = _n) and stores a local macro called _sortindex with the name of the temporary variable. I've tried both a) changing the contents of the local _sortindex to a different order variable of my choosing, and b) changing the values of the temporary variable, but neither method seems to work.

Code:

program define _sortrestore, sortpreserve syntax varname // method 1 local _sortindex `varname' // method 2 // replace `_sortindex' = `varname' end clear set obs 10 gen n = _n // initial sort order gen x = ceil(10*runiform()) // example data sort x // now try to return to the original sort order, n list _sortrestore n list // clearly we haven't reverted to the original sort order. // what went wrong?

Attempt 2
Use mata to reorder each variable separately. Ideally we would just st_view the data and use collate(), but collate() isn't allowed on a view. Why? As a work around, we need to do the somewhat wasteful task of making a copy of each stata variable in mata, re-ordering it, then storing it back over the stata variable. This works, but I would hope that there is a more efficient method available.

Define Program

Code:

cap program drop fastorder program define fastorder syntax varname(numeric) /* Must specify an ID variable that maps each observation to the row that it should be mapped to. Ie - each observation of the variable must be a unique integer between 1 and the total number of observations. Note: there is no error check to confirm that the user specified a valid ID variable */ mata _st_fastorder(`"`varlist'"') end cap mata mata drop _st_fastorder() mata void _st_fastorder(string scalar idcol) { // Initializations id = st_data(.,idcol) V = st_nvar() // re-order each stata variable for (v=1;v<=V;v++) { if (st_isnumvar(v)==1) st_store(id,v,st_data(.,v)) else st_sstore(id,v,st_sdata(.,v)) } } end

Test program

Code:

clear all set obs 10 gen x1 = ceil(100*runiform()) gen x2 = ceil(100*runiform()) // Suppose we wish to save our original sort // order and come back to it later gen long n = _n order n /* .... some analysis */ // And we want to do some commands with a new // sort order. Perhaps we also wish to return // to this ordering later on sort x1 gen long n1 = _n /* .... some more analysis */ sort x2 /* .... even more analysis */ // Now we can return to the original order // with faster (O(n) time), rather than // needing to sort by n (O(nlogn) time) li fastorder n li // this correctly restores the data to the original order

Thank you,

Andrew Maurer
Tags: None

Andrew Maurer

Join Date: Apr 2014
Posts: 28

21 Aug 2014, 14:25

Update:
Previously I was seeing that even if the `_sortindex' variable was replaced with a user specified n, programs specifying sortpreserve would still somehow revert to the initial order, not the ordering given by n. However, and I have no idea why, but it appears to not be the case if the `: sortedby' extended macro is empty. Thus, you can force a specific sort order by first clearing the `: sortedby' flag and then running a program with the sortpreserve option and replacing the values of `_sortindex'.

See below for the programs and benchmark.

Results
Benchmarked one iteration each on 10 million and 100 million observations

Stata's sort: 11.2 seconds and 178 seconds
fastorder: 4.9 seconds and 58 seconds

Define Programs

Code:

// fastorder v2.0 8/21/2014 Andrew Maurer
// - exploits trick with , sortpreserve 
// benchmarking fastorder vs sort:
// 4.9 seconds vs 11.2 seconds in one test of 10 million observations
// 58 seconds vs 178 seconds in one test of 100 million observations

cap program drop fastorder
program define fastorder
syntax varname(numeric)
    // In order to force a given sort order, the values of `_sortindex' must
    // be changed AND `: sortedby' must be missing when the , sortpreserve 
    // program is called
    
    // First ensure that `: sortedby' is empty
    // If not empty, use trick to force it empty
    local sorted_var1 : word 1 of `: sortedby'
    if "`sorted_var1'" != "" {
        /*
        Pedantic code here to be positive that we're changing
        the value of `sorted_var1'. If the value in 1 was already
        . and we "replace... = .', the `:sortedby' flag wouldn't
        change.
        */
        local strvar = ( substr(`"`: type `sorted_var1''"',1,3) == "str")
        local oldval = `sorted_var1'[1]
        if !`strvar' {
            replace `sorted_var1' = cond(!mi(`oldval'),.,1) in 1
            replace `sorted_var1' = `oldval' in 1
        }
        else {
            replace `sorted_var1' = cond(!mi(`oldval'),"","a") in 1
            replace `sorted_var1' = `"`oldval'"' in 1
        }
    }
    
    fastorder_inner `varlist'
    
    // dataset already sorted by `varlist', but flag `: sortedby' flag not set
    sort `varlist' 
    
end

cap program drop fastorder_inner
program define fastorder_inner, sortpreserve
syntax varname(numeric)
    
    replace `_sortindex' = `varlist'
    
end

Benchmark sort vs fastorder

Code:

local N = 10^8
set obs `N'
gen long n = _n
gen x1 = ceil(100*runiform())
sort x1
//replace n = _N in `=_N-1'
//replace n = _N-1 in `=_N'
tempfile temp
save `temp'

use `temp', clear
timer on 10
fastorder n
timer off 10

use `temp', clear
timer on 20
sort n
timer off 20

timer list

Comment

Sergio Correia

Join Date: Apr 2014

Posts: 420
#3

21 Aug 2014, 18:38

That's a *really* cool trick! (but dangerous of course)!
That said, I agree that we could benefit *a lot* if some of the most used commands (looking at you, collapse and egen) are tweaked to make them more careful with sorting and memory usage
Comment
Matthew White

Join Date: Apr 2014

Posts: 29
#4

28 Aug 2014, 10:21

I agree that approach 1 is a bit dangerous. If this is an undocumented feature, StataCorp has the right to change their implementation of sortpreserve at any time.

I wonder if subscripting with permutation vectors could help. For instance, _st_fastorder() could look like this:

Code:

void _st_fastorder(string scalar idcol) { real matrix V pragma unset V st_view(V, ., .) V[st_data(., idcol),] = V }

I think this won't create a copy of the dataset: you can stick to the view. Of course, you'd have to update the code to allow for string variables, and views won't work for strLs. When I benchmark this, it seems a bit faster than your fastorder of approach 2.

help m1_permutation is a great resource, as is this Stata journal article: http://www.stata-journal.com/sjpdf.h...iclenum=pr0028
1 like
Comment
László Sándor

Join Date: Apr 2014

Posts: 120
#5

31 Aug 2014, 05:57

Thanks, Andrew, I think the problems with sorting are reasonably well known, and also came up in different forms on the Wish list for Stata 14 thread.

collate and permutation are great resources, though I am not sure how great idea it is to sort views, which might reorder the specific variable in RAM, making them inconsistent with the other variables.

Also, I'd worry about how general one could make the imposition of a sort order. Missing values or selection could easily mess up the sort order we think one could derive for simpler cases (e.g. for a binary variable where the order within 0 and 1 bins don't matter, calculating indices should be trivial after a simple -sum, meanonly-, then imposing just that). But in general, yes, this kind of approach would be very useful to use more wisely, at least for StataCorp, but also for third-party developers, if done robustly.

Note though that your own examples with 100 million observations took no more than 3 minutes. Did you really mean 1-2 hours? Or maybe if Stata needs to move around lots of covariates in RAM, and RAM is slow on the system? (In any case, I am no CS expert, but it sounds wasteful to move data around in RAM for operations that just need different indexing.)

I am not sure about your hash tables in C, but surely -collapse- is orders of magnitude faster with different approaches like MonetDB's or Python pandas smarter indexing (and column-based storage), and/or distributing computation better (BigQuery and other MapReduce derivatives on Hadoop or otherwise). Even old SAS and some SQL implementations (incl. MonetDB) have much faster merges.
Comment

Announcement