
  • Re-opening a large dataset vs. preserve/restore inside a loop

    I have a very very large dataset (~200 mil obs) which I am opening inside of a loop. I'm wondering whether it would be faster to open it once and use preserve/restore inside of the loop instead. Sample code below and any help would be greatly appreciated.

    forvalues i = 1/$max {
        use "XXX\large_data.dta", clear
        keep if geocode == "`i'"
        save "XXX\large_data_`i'.dta", replace
    }

    or

    use "XXX\large_data.dta", clear

    forvalues i = 1/$max {
        preserve
        keep if geocode == "`i'"
        save "XXX\large_data_`i'.dta", replace
        restore
    }


  • #2
    You can use timer to find out which code is faster.
    Code:
    help timer
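
    In outline, the usual pattern looks like this (just a sketch; the placeholder comments stand in for your two code blocks):
    Code:
    timer clear
    timer on 1
    * ... first variant (repeated -use-) here ...
    timer off 1
    timer on 2
    * ... second variant (-preserve-/-restore-) here ...
    timer off 2
    timer list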



    • #3
      From my understanding, -preserve- does nothing more than (temporarily) save a copy of the dataset to your hard drive, and -restore- re-opens this temporary copy.

      Thus, I would expect your second code block to be slower: it has to (1) -use- the data once, and then, inside the loop, repeatedly (2) save the dataset and (3) re-open it.

      Proof with example data:
      Code:
      tempfile largedata
      
      // create a large dataset
      sysuse auto , clear
      expand 100000
      save `"`largedata'"'
      
      // clear timers
      timer clear
      
      // variant 1: with repeated -use-
      timer on 1
      forvalues num=1/20 {
          use `"`largedata'"' , clear
          keep if make=="AMC Pacer"
          * do something here
          
      }
      timer off 1
      
      // variant 2: with repeated -preserve- and -restore-
      timer on 2
      use `"`largedata'"' , clear
      forvalues num=1/20 {
          preserve
          keep if make=="AMC Pacer"
          * do something here
          restore
      }
      timer off 2
      
      // variant 3: with repeated -use using-
      timer on 3
      forvalues num=1/20 {
          use if make=="AMC Pacer" using `"`largedata'"' , clear
          * do something here
          
      }
      timer off 3
      
      // compare execution times
      timer list
      The only substantial speedup I can imagine is to read only the subset of data you want to process in the first place, as illustrated in the third variant of my code: -use- can be combined directly with -if-, so that Stata does not have to read the whole dataset from your file system.

      Regards
      Bela

      PS: Crossed with Friedrich Huebler's answer; he is, of course, correct.
      Last edited by Daniel Bela; 06 Sep 2017, 10:26. Reason: added PS



      • #4
        If I recall correctly, there is an earlier thread on this question and it was concluded that saving the data in a tempfile and re-reading that tempfile each time is faster than using -preserve- and -restore-.

        But here's a thought that might be faster still for your particular application. It looks like what you are doing is chopping up the large data set into a series of smaller data sets, each of the latter corresponding to a single value of geocode.

        So start by reading the full data set. Sort it on geocode and save it in a tempfile. Then create an index of the observation numbers corresponding to each geocode in the sorted data set. So something like this:

        Code:
        use XXX/large_data, clear
        gen long obs_no = _n
        sort geocode
        tempfile holding
        save `holding'
        by geocode: gen long gcount = _N
        local index
        local pointer 1
        while `pointer' <= _N {
            local index `pointer'
            local pointer = `pointer' + gcount[`pointer']
        }
        Now I would loop over the local macro index, reading in from the tempfile only those observations in the range of the current value of geocode:

        Code:
        tokenize index
        while "`1'" != "" {
            use in `1'/`=`2'-1' using `holding', clear
            local i = geocode[1]
            save XXX/large_data_`i', replace
            macro shift
        }
        The advantage of this is that -in- is much faster than -if- (O(1) vs. O(N)), and you also don't have to read in the whole file each time and then select what to keep: you only read in what you need (though there is some overhead in finding the range of observations desired).

        I haven't actually tried this out, so there may be some typos or I may have end-point errors on the loop handling, but I think this will be your best bet.

        Added: Crossed with #2 and #3.
        Last edited by Clyde Schechter; 06 Sep 2017, 10:37.



        • #5
          I hadn't thought about using the -while- loop in this way - it's really interesting/useful, thanks Clyde! A few minor edits to your code to get it to work:

          Code:
          local index
          local pointer 1
          while `pointer' <= _N {
              local index `index' `pointer'
              local pointer = `pointer' + gcount[`pointer']
          }
          tokenize `"`index'"'
          di `"`index'"'
          while "`1'" != "" {
              use in `1'/`=`2'-1' using `holding', clear
              local i = geocode[1]
              save XXX/large_data_`i', replace
              macro shift
          }
          Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX



          • #6
            -in- doesn't actually save any time. In a test program:


            Code:
            set rmsg on
            set more off
            set obs 1000000000
            generate x=_n
            save stuff,replace
            clear
            use stuff
            clear
            use in 1/1 using stuff
            clear
            use stuff

            the three -use- statements all take about the same amount of time (5.37 to 5.71 seconds) on a small server. The -in- statement isn't even the run with the shortest time. My understanding is that recent .dta files keep some dataset metadata at the end of the file, so the whole file needs to be read anyway. This is really unfortunate, and I wish it would be reversed at some point.

            I realize the read is taking place out of cache but that doesn't really affect my argument.



            • #7
              Here's what I get:

              Code:
              . clear*
              r; t=0.00 12:37:10
              
              . set rmsg on
              r; t=0.00 12:37:10
              
              . set more off
              r; t=0.00 12:37:10
              
              . set obs 100000000
              number of observations (_N) was 0, now 100,000,000
              r; t=1.06 12:37:11
              
              . generate long x=_n
              r; t=1.99 12:37:13
              
              . save stuff,replace
              file stuff.dta saved
              r; t=2.13 12:37:15
              
              . clear
              r; t=0.00 12:37:15
              
              . use stuff
              r; t=1.88 12:37:17
              
              . clear
              r; t=0.00 12:37:17
              
              . use in 1/1000000 using stuff
              r; t=0.08 12:37:17
              
              . clear
              r; t=0.00 12:37:17
              
              . use stuff
              r; t=1.93 12:37:19
              The full file has 10^8 observations. The full read takes about 1.9 seconds. The chunk read with -in- is 10^6 observations, and takes 0.08 seconds to read. If you were breaking up the file into 100 chunks of 10^6 observations each, those would each take 0.08 seconds, so a total of 8 seconds spent reading all 100 chunks. By contrast, if you had to read the entire file 100 times you would have 190 seconds of reading. That's a huge speedup.

              I evidently changed your test case around, reading in chunks of 10^6 observations instead of 1 observation. But I think that's a more realistic test case. And if you make it 1,000 chunks of 10^5 observations instead, you get very similar results.
              Last edited by Clyde Schechter; 24 Sep 2017, 14:00. Reason: Correct typos.



              • #8
                A few notes:
                • For some reason, I don't see the speedups described by Clyde: my runtime is the same whether I "use in 1/10" or use the whole thing (on Stata 14.2, Windows).
                • Nonetheless, the suggestion about sorting and then loading by chunks makes a lot of sense, except that Stata's implementation of sort is O(N log N), which removes all the time savings of using an O(1) operation instead of an O(N) operation.
                Now, I'm not sure what MB ultimately wanted to do, but one option is to halve the dataset in a depth-first fashion (a rough sketch follows after this list):
                1. Start with a dataset with N obs that you want to split in 16
                2. Use the dataset, keep the first half
                3. Use the first half, then keep the first half
                4. Repeat until you reach the desired size, and back up to the other halves
                In this example, you would only end up -use-ing roughly 2x the data of the original dataset, instead of 16x if you use/keep/save.
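
                Purely as an untested sketch of that idea (the program name -halve-, the output file names, and the depth are my own choices for illustration; the number of pieces is assumed to be a power of 2):
                Code:
                capture program drop halve
                program define halve
                    // recursively split the data in memory into `pieces' files,
                    // named with consecutive numbers starting at `stub'_`first'
                    args stub first pieces
                    if `pieces' == 1 {
                        save `stub'_`first', replace
                        exit
                    }
                    local N = _N
                    local half = ceil(_N/2)
                    tempfile whole
                    save `whole'                                  // park the current piece
                    keep in 1/`half'
                    halve `stub' `first' `=`pieces'/2'            // descend into the first half
                    use in `=`half'+1'/`N' using `whole', clear   // come back for the second half
                    halve `stub' `=`first' + `pieces'/2' `=`pieces'/2'
                end
                
                use "XXX\large_data.dta", clear
                halve large_data_piece 1 16    // writes large_data_piece_1 ... large_data_piece_16
                Whether this actually beats a plain use/keep/save loop will depend on how much of the file -use in- really reads in your Stata version, per the discussion above.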



                • #9
                  1/1 or 1/1000000 doesn't make any difference in my runs. I am using Stata 14.2 for my tests - could it be that Stata has fixed this problem in version 15? Are you using 15? I'd still like to see row selection on the save statement - a lot simpler than "halve the dataset, in a depth-first algorithm".



                  • #10
                    Yes, I am running 15. But my understanding is that the data set structure has not changed between versions 14 and 15.

                    With regard to #8, the logic of the code in #4 is based on replacing -if- with -in-. -if- is an O(N) operation, and -in- is O(1). If you use -if- to partition a data set of N observations into C chunks, you are doing C O(N) operations. To use -in-, you pay a one-time toll of O(N log N), in return for the ability to then complete the task with C O(1) operations. Depending on C and N, that can result in a substantial reduction in time. Of course, for other configurations of C and N, the sort time will outweigh the savings. What I can tell you from my experience using this technique is that, at least for the C's and N's that typically come up, it's usually a very substantial gain. Also note that often in these situations, the data have already been sorted in order to do something else. In that case, with the O(N log N) toll prepaid, it's pure gain. Your mileage may vary.
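
                    To put rough numbers on that trade-off (my own back-of-the-envelope illustration, borrowing the approximate per-read timings reported in #7 for a 10^8-observation file split into C = 100 chunks):
                    Code:
                    * illustration only, using the approximate timings from #7
                    display "100 full reads (one per -if- pass):   " 100 * 1.9  " seconds"
                    display "one sort + 100 -in- reads: sort time + " 100 * 0.08 " seconds"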


                    That said, I do agree that being able to select observations with -if- or -in- in the -save- statement would be desirable.

                    Added: It also dawns on me that Sergio Correia (who wrote #8 above) has an -fsort- command that can do the sorts in O(N) time. So using that, in combination with replacing -if- with -in-, should give spectacular savings in time.
                    Last edited by Clyde Schechter; 24 Sep 2017, 16:07.



                    • #11
                      Interesting addition to #10. I decided to run
                      Code:
                      clear
                      set rmsg on
                      use stuff
                      clear
                      use in 1/1000000 using stuff
                      clear
                      use stuff
                      on my Stata 14.2. Like feenberg and Sergio Correia, it takes just as long to read the 1/1000000 segment of the data set as it does to read the whole thing! But I used the same data set that I had used in #7. So it is not about the data set. It is about how -use- works in these two versions of Stata.



                      • #12
                        Interesting; I haven't heard of any instances of Stata 15 being faster than Stata 14 on any commonly used program until this case.

                        By the way, I ran a timer on a modified version of fsort. This is only faster than plain -sort- if you have lots of observations (20 million or more), mostly due to the extra overhead of Mata compared to C. However, it also saves the ranges that you need to pass to -in- in a matrix, which is useful. Code below:

                        #### partition.ado
                        #### (requires ftools)

                        Code:
                        cap pr drop partition
                        program define partition, rclass
                            syntax varlist, [Verbose]
                            loc verbose = ("`verbose'" != "")
                            mata: F = factor("`varlist'", "`touse'", `verbose', "", ., ., ., 0)
                            mata: F.panelsetup()
                            
                            loc sortedby : sortedby
                            if ("`sortedby'" != "`varlist'") {
                                qui ds, has(type numeric)
                                loc numvars `r(varlist)'
                                
                                qui ds, has(type string)
                                loc strvars `r(varlist)'
                                
                                if ("`numvars'" != "") {
                                        mata: st_view(data = ., ., "`numvars'")
                                        mata: data[.,.] = F.sort(data)
                                }
                        
                                if ("`strvars'" != "") {
                                        mata: st_sview(data = ., ., "`strvars'")
                                        mata: data[.,.] = F.sort(data)
                                }
                            }
                            
                            tempname num_levels info
                            mata: st_numscalar("`num_levels'", F.num_levels)
                            mata: st_matrix("`info'", F.info)
                            mata: printf("{txt}(dataset sorted into %g partitions)\n", F.num_levels)
                            mata: mata drop F
                            return scalar num_levels = `num_levels'
                            return matrix info = `info'
                        end
                        
                        ftools, check
                        exit
                        #### example

                        Code:
                        sysuse auto, clear
                        
                        partition turn
                        return list
                        matrix list r(info)
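
                        Continuing the example, an untested sketch of how the returned ranges might then be fed to -use in- (the tempfile, locals, and loop below are my own, and I am assuming each row of r(info) holds the first and last observation of a level in the now-sorted data, as with Mata's panelsetup()):
                        Code:
                        local G = r(num_levels)
                        matrix info = r(info)
                        
                        tempfile sorted
                        save `sorted'                  // -partition- left the data sorted by turn
                        
                        forvalues g = 1/`G' {
                            local lo = info[`g',1]
                            local hi = info[`g',2]
                            use in `lo'/`hi' using `sorted', clear
                            * process this chunk here
                        }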
