Runtime and size of data

paulvonhippel

Join Date: Apr 2014

Posts: 499
#16

13 Apr 2016, 21:17

Good suggestions, thanks. I will test on my own data.
Comment
paulvonhippel

Join Date: Apr 2014

Posts: 499
#17

27 Apr 2016, 08:47

Thanks for all the great suggestions. We've cut runtime by a factor of 3-10 using Nick Cox's savesome command, which is a wrapper for the preserve-restore approach. We haven't switched from -if- to -in- yet. I am sure it would be faster, but doesn't it entail some loss of flexibility? What if some subsets have more rows than others? What if there are 20 rows per subset, but the user mistakenly says there are 21?

I have another question. Many database programs can add "indices" which allow users to access subsets of a large dataset quickly:
https://en.wikipedia.org/wiki/Database_index

SAS can add indices to data as well:
http://www.lexjansen.com/nesug/nesug02/bt/bt014.pdf

Can Stata?
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1361
#18

27 Apr 2016, 09:27

Depending on the type of index, you might get comparable performance by sorting first with gsort. The PostgreSQL documentation has some decent information on the different types of indexing algorithms it implements: http://www.postgresql.org/docs/9.5/s...xes-types.html. Basically, an index saves the system from doing the equivalent of a full table scan by relying on predefined locations. In your case you could create a numeric sequence given an if/in condition and use that to operate on chunks of the data.
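A minimal sketch of the "numeric sequence" idea, assuming the data are sorted on the grouping variable so that each group occupies one contiguous block of rows. The auto dataset, the variable obsno, and the choice of rep78 == 3 are placeholders for illustration only, not anything from the original poster's data. The bounds are looked up once with -if- and every later command uses -in-:

```stata
* Illustration only: sysuse auto and the variable obsno are stand-ins
sysuse auto, clear
sort rep78                      // groups become contiguous blocks of rows
gen long obsno = _n             // numeric sequence: position of each row

* One scan with -if- to find the first and last row of the group
summarize obsno if rep78 == 3, meanonly
local start = r(min)
local end   = r(max)

* Every later command can address the block directly with -in-
summarize price in `start'/`end'
regress price mpg in `start'/`end'
```

The lookup itself still scans the data once, so this pays off only when several commands are run on the same block.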
Comment
paulvonhippel

Join Date: Apr 2014

Posts: 499
#19

27 Apr 2016, 09:50

We are sorting the data initially (using sort rather than gsort). Does Stata remember the data have been sorted when it looks for an -if- subset? That would speed performance.

I'm not sure why you are referencing the documentation for PostgreSQL. Can Stata take advantage of PostgreSQL indices?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#20

27 Apr 2016, 11:15

"We haven't switched from -if- to -in- yet. I am sure it would be faster, but doesn't it entail some loss of flexibility? What if some subsets have more rows than others? What if there are 20 rows per subset, but the user mistakenly says there are 21?"

Yes, it entails loss of flexibility. If the user misreports the group size, the code shown detects the error and aborts.

But the code below overcomes these limitations. It is slightly more complicated, and requires an additional sort at the beginning (or before you call the program). But that will probably still run faster than conditioning the commands with -if-. The idea is that instead of having a single group size stored in a local macro, we simply calculate the size of each group at the beginning and save it in a variable, then we work our way down through the data, updating `start' and `end' based on the size of the current group. This appears to be a rare situation where the older -while- command is necessary and cannot be replaced with a -foreach- (at least not as far as I can see).

Code:

capture program drop myprogram
program define myprogram // perhaps rclass, eclass, etc.
    syntax whatever, by(varlist) // MAYBE OTHER OPTIONS, TOO

    // IDENTIFY THE SIZE OF EACH GROUP (SIZES MAY DIFFER ACROSS GROUPS)
    // long, NOT THE DEFAULT float, SO LARGE COUNTS ARE STORED EXACTLY
    tempvar size
    by `by', sort: gen long `size' = _N

    // NOW DO IT
    local start = 1
    local end = `size'[1]
    while `start' <= _N {
        command1 in `start'/`end'
        command2 in `start'/`end'
        ...
        // ADVANCE TO THE FIRST ROW OF THE NEXT GROUP
        local start = `start' + `size'[`start']
        local end = `start' + `size'[`start'] - 1
    }

    // MAYBE OTHER STUFF
end
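For readers who want something they can paste and run, here is a small self-contained variation on the same loop. It is only a sketch: the toy data, the variable names group, x and gsize, and the use of -summarize- as the per-group command are all made up for illustration. It computes `end' at the top of each pass rather than at the bottom, which avoids indexing past the last observation but is otherwise the same walk down the data:

```stata
* Toy data with unequal group sizes (2, 3 and 1 rows); illustration only
clear
input byte group float x
1 10
1 20
2 5
2 6
2 7
3 100
end

* Size of each group, stored on every row of that group
bysort group: gen long gsize = _N

local start = 1
while `start' <= _N {
    * Last row of the current group
    local end = `start' + gsize[`start'] - 1
    quietly summarize x in `start'/`end'
    display "group " group[`start'] ": obs `start'/`end', mean = " r(mean)
    * First row of the next group
    local start = `end' + 1
}
```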

Last edited by Clyde Schechter; 27 Apr 2016, 11:17. Reason: Correct errors in code
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1361
#21

27 Apr 2016, 11:27

paulvonhippel The PostgreSQL reference was just to illustrate some of the different typical index types since they each are optimized for specific applications. There is a sortpreserve option that can be added to a program definition to preserve the sort order of the data, but it isn't completely clear what the implementation in your program would look like.

Code:

sortpreserve states that the program changes the sort order of the data and that Stata is to restore the original order when the program concludes. See [P] byable and [P] sortpreserve for a discussion of this important option.

So this could be useful if your program does sorting internally or if you wanted to sort on the if condition before calling subroutines that might sort the data for other purposes. If the sort ordering is simple (e.g., sorted over one or two variables) it probably wouldn't make a ton of difference unless there was a large number of unique values.
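A hedged sketch of what that might look like. The program name groupcount and its by() option are hypothetical, and the auto dataset is just a convenient stand-in. The program sorts internally, and because it is declared with sortpreserve, the caller's sort order should be back in place afterwards:

```stata
capture program drop groupcount
program define groupcount, sortpreserve
    // hypothetical helper: reports the distribution of group sizes
    syntax , by(varlist)
    tempvar n
    bysort `by': gen long `n' = _N   // this re-sorts the data internally
    summarize `n'
end

sysuse auto, clear
sort price                           // the caller's own sort order
groupcount, by(foreign)

* Check which variables Stata now records the data as sorted by
local s : sortedby
display "sorted by: `s'"
```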
Comment
paulvonhippel

Join Date: Apr 2014

Posts: 499
#22

27 Apr 2016, 13:04

Clyde Schechter: That is a really clever solution which we can probably implement. Do you have any intuition regarding how much faster it would be than -if- with -savesome-?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#23

27 Apr 2016, 15:21

Well, in general the only thing slower in computing than reading from a disk is writing to a disk. So it is hard to imagine a scenario where the -in- approach won't be faster than repeatedly reading and writing data. That said, the extent of the speedup depends a lot on what the commands in the program do. In isolation, execution time for selecting the computing sample with -in- will be independent of sample size, whereas with -if- it will scale linearly with sample size. Commands that pass -in `start'/`end'- into -marksample touse- and then invoke subsequent commands with -if `touse'- will not be sped up by this process, but commands that do not themselves use that pattern will benefit.

I don't think there's a good way to figure it out a priori. My advice would be to develop a few relatively small use cases, try it both ways, and compare the times over a large enough number of reps to detect the difference.
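A rough sketch of such a comparison using Stata's -timer- command. Everything here is an assumption made for illustration: the synthetic data, the constant group size of 20, the choice of group 500, the 200 reps, and -summarize- as the command being timed. Substitute the commands your program actually runs:

```stata
* Synthetic data: 1,000,000 rows in groups of 20 (illustration only)
clear
set obs 1000000
gen long group = ceil(_n/20)
gen double x = runiform()
sort group

* Group 500 occupies rows 9981 through 10000 when every group has 20 rows
timer clear
forvalues r = 1/200 {
    timer on 1
    quietly summarize x if group == 500
    timer off 1

    timer on 2
    quietly summarize x in 9981/10000
    timer off 2
}
timer list      // timer 1 = -if-, timer 2 = -in-
```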
Comment
paulvonhippel

Join Date: Apr 2014

Posts: 499
#24

27 Apr 2016, 15:33

Clyde Schechter: Thanks! Another way to speed runtime is by parallelizing. And I wonder if that would be incompatible with your solution, which seems very serial?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#25

27 Apr 2016, 15:46

We're really moving out of my scope of knowledge. Yes, my approach is clearly serial. It is likely that if you could implement a parallel approach, it would be faster, running several -in `start'/`end'- chunks at a time. Although Stata has parallel computation built into some of its commands, as far as I know, Stata does not enable user-programmers to build parallelization into their ado files. So I'm not sure this is an option for you.

If somebody on the list knows otherwise, I hope he/she will chime in.
Comment
paulvonhippel

Join Date: Apr 2014

Posts: 499
#26

28 Apr 2016, 07:59

Stata/MP parallelizes some tasks. I don't know if it parallelizes by-processing, or under what circumstances.

There's a 2013 user command called parallel that can parallelize by-processing:
https://ideas.repec.org/c/boc/bocode/s457527.html
I've been experimenting with it. It may not be compatible with some other things that I'm doing, but it should be very useful for other purposes. And perhaps I can tailor my application to work with it.
Comment
