
  • Runtime and size of data

    We are finding that our runtime slows way down if our dataset is large, even if we are only analyzing a small subset of the data at a time. What can we do about this?

    We are developing a new Stata command called mgbe. We are testing it on data from all US counties. There are 3,100 counties but we only have 15 lines of data per county, so all in all there are about 45,000 rows of data.

    We are running the command on one county at a time, like this:
    bysort county: mgbe <snip>
    And it runs really, really slowly, taking several minutes per county.

    But here's the strange thing: if we drop a lot of the data, say keeping only 10 counties, then the same command runs much, much faster, in just a few seconds per county.

    What's going on? Why does the total number of counties in the dataset affect runtime when we're only running one county at a time? And given that this is the case, is there a trick we can use to speed up runtime without dropping most of the data?

    You will notice I haven't provided any detail on what mgbe does. That's not because it's secret. It's because I suspect the issue I'm describing is general and the details of mgbe wouldn't help and might be distracting. I can say, though, that mgbe relies on "ml model" and that it implements the estimator described in this paper: http://smx.sagepub.com/content/early...81175015599807

  • #2
    Have you looked at memory usage? If it's hogging memory and dips into virtual memory (which is where the hard drive pretends to be RAM), it will slow down dramatically.



    • #3
      Well, I ran "set matsize 11000" and "set maxvar 32767" and then ran "bysort county: mgbe <snip>". It ran just as slowly as before.

      Before I reset matsize and maxvar, this is what my memory usage looked like.

      Code:
      Memory usage
                                                 used        allocated
      -------------------------------------------------------
      data                     2,102,668,800    2,449,473,536
      strLs                                0                0
      -------------------------------------------------------
      data & strLs             2,102,668,800    2,449,473,536

      -------------------------------------------------------
      data & strLs             2,102,668,800    2,449,473,536
      var. names, %fmts, ...         329,491          361,987
      overhead                     1,081,912        1,082,152

      Stata matrices                     784              784
      ado-files                      307,748          307,748
      stored results                 211,892          211,892

      Mata matrices                  111,616          111,616
      Mata functions                 156,416          156,416

      set maxvar usage             5,271,736        5,271,736

      other                          102,765          102,765
      -------------------------------------------------------
      grand total              2,109,756,492    2,457,080,632


      • #4
        More on memory usage. I ran
        set matsize 11000
        set maxvar 32767
        set niceness 0
        set min_memory 16g
        and my command still runs very slowly. But if I drop most of the data, it runs quickly.



        • #5
          As this is a command you are writing, and you are making it -by-able, and you are using the typical approach of tagging cases with -marksample- and then processing them with -if `touse'-, you are being slowed down by the -if- calculations. Within each -by- group, Stata has to go through the entire data set, evaluating the -if- condition to figure out which observations belong and which don't. So the run time per by-group is proportional to the size of the entire data set.

          There is usually no way around this. The one situation I know of where you can do better is when the by-groups correspond to consecutive blocks of observations whose beginning and end are known or easily computed on entry into the program. In that case, the -if `touse'- clauses can be replaced with -in start/end- clauses. These run much faster: each group's execution time is on the order of the group size and does not depend on the size of the entire data set. But situations like this are relatively uncommon.
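
          To make the pattern concrete, here is a minimal sketch of the usual -marksample-/-if `touse'- setup being described (the command and variable names are invented for illustration, not taken from -mgbe-):

          Code:
          program define mycmd, byable(recall)   // -byable(recall)- is what lets the command run under -by-
              version 14
              syntax varlist(numeric max=1) [if] [in]
              marksample touse                   // 0/1 marker; under -by- it marks only the current group
              // but every -if `touse'- below still scans all _N observations in memory,
              // so the time per by-group grows with the size of the whole data set
              quietly summarize `varlist' if `touse'
              display as result r(mean)
          end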



          • #6
            Have you tried letting it automatically manage memory? By forcing it to always use 16 GB, you may actually be slowing it down. Is it possible to monitor memory usage while you're running your code?
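
            For example, something along these lines hands memory management back to Stata and reports what is currently using memory (0 and 5 are, I believe, the factory defaults):

            Code:
            set min_memory 0      // default: reserve no memory up front
            set niceness 5        // default: let Stata give memory back when idle
            memory                // detailed report of current memory usage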



            • #7
              Does the timing improve regardless of the subset of data you are working with? For each group of observations you're fitting an iterative model, so perhaps there are issues fitting the likelihood function related to the values for a given subset of observations. Is everything written in Stata, or do you have portions written in Mata as well? Moving some of your codebase to Mata should also help with the computational speed.
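
              As a small, self-contained illustration of moving a calculation into Mata (the function name and the use of the auto data are just for the example):

              Code:
              sysuse auto, clear

              mata:
              // mean of one Stata variable, computed in Mata
              real scalar colmean(string scalar varname)
              {
                  real colvector x
                  x = st_data(., varname)   // copy the variable from Stata into Mata
                  return(mean(x))
              }
              end

              mata: colmean("price")        // displays the mean of price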



              • #8
                Clyde, I think your -in start/end- approach would work for us. Can you point to an example?



                • #9
                  I don't have a worked example handy, and it would probably take me hours to find one in my archived files. But let's imagine that our program only applies to a data set in which all of the groups are equal size. Then rather than doing it as a -by-able program, I would use a -by- option along with a -groupsize()- option, both required. So it would look something like this:

                  Code:
                  program define myprogram // perhaps rclass, eclass, etc.
                  syntax whatever, by(varlist) groupsize(integer) // MAYBE OTHER OPTIONS, TOO
                  
                  // GETTING SET UP TO USE -in-
                  local n_groups = _N/`groupsize'
                  assert `n_groups' == floor(`n_groups')
                   bysort `by': assert _N == `groupsize' // THIS STEP DEPENDS ON SIZE OF ENTIRE DATA SET
                  
                  // NOW DO IT
                  local start = 1
                  local end = `groupsize'
                  
                  forvalues i = 1/`n_groups' {
                      command1 in `start'/`end'
                      command2 in `start'/`end'
                      ...
                      local start = `start' + `groupsize'
                      local end = `end' + `groupsize'
                  }
                  
                  // MAYBE OTHER STUFF
                  
                  end
                  The initial sorting and verification that all groups actually are of size `groupsize' will depend on the size of the entire data set. (But to use a -by-able program you have to -sort- the data first anyway, and the sorting is very much the lion's share of the computing time for this one command.)




                  • #10
                    I don't see the actual code here, but I want to emphasize Clyde's point that -if- is slow. So if you are using -if- more than once in your program, you can speed things up by first keeping (or dropping) only the observations you want, doing the rest of the program without any -if-s, saving the result to a file, and so on, then just appending all the files at the end. This is a strategy I have used to great effect in the past. I don't know whether it will work in your case, or is even relevant, since you don't show your code.
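
                    A minimal sketch of that strategy, assuming the full data are saved as bigdata.dta with a numeric county identifier (both names are illustrative):

                    Code:
                    use bigdata, clear
                    levelsof county, local(counties)       // list of county codes

                    foreach c of local counties {
                        use bigdata, clear
                        keep if county == `c'              // the only -if- needed for this county
                        * ... do the per-county work here, with no further -if-s ...
                        tempfile part`c'
                        save `part`c''
                    }

                    * put the per-county results back together
                    clear
                    foreach c of local counties {
                        append using `part`c''
                    }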



                    • #11
                      The slowness of -if- compared with the equivalent -in- was often emphasised on Statalist years ago by Michael Blasnik.

                      see e.g. http://www.stata.com/statalist/archi.../msg01270.html



                      • #12
                        This is all very helpful, thanks. It sounds like there are two suggestions.

                        1. One (from Clyde & Nick) is to use -in- instead of -if-.
                        2. The other (from Rich) is to read the big dataset repeatedly, each time keeping only the relevant subset of cases.

                        Which of these approaches would run faster? Approach 1 keeps more data in memory, but approach 2 has to read and subset the data multiple times.



                        • #13
                          paulvonhippel You can get something analogous to both approaches, if there is only a single -if- condition you need to test, by using -preserve- and -restore, preserve-. That way you can drop the records that are not needed without permanently changing the source data and without the I/O penalties of writing to and reading from disk.
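
                          For instance, a minimal sketch of that pattern in a loop over counties (the variable name and the analysis step are placeholders):

                          Code:
                          levelsof county, local(counties)
                          preserve                         // snapshot the full data set once
                          foreach c of local counties {
                              keep if county == `c'        // the single -if- for this county
                              * ... per-county analysis here ...
                              restore, preserve            // reload the full data, keep the snapshot
                          }
                          restore, not                     // data are already back; just cancel the snapshot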



                          • #14
                             Clyde makes a convincing case that looping over -in `start'/`end'- should run faster than looping over -if `touse'-. And wbuchanan points out that looping over -preserve- and -restore- should run faster than looping over -save- and -use-.

                            What I'm wondering about now is whether it makes any difference whether I actually drop cases or just use -if- to focus on the cases that are relevant in each loop.

                            That is, does /*1*/ run any faster or slower than /*2*/ below?

                            /* 1 */
                            preserve
                            keep if in `start'/`end'
                            cmd
                            restore, preserve

                            /* 2*/
                            cmd if in `start'/`end'







                            • #15
                              -keep if in `start'/`end'- is a syntax error. It won't run at all. It should be just -keep in `start'/`end'-.

                               Same thing for /*2*/: -if in- is not valid syntax.

                              The question as to which of /*1*/ and /*2*/ will run faster (assuming you take the -if- out of them) can't really be answered generally and in the abstract. At first glance, since /*1*/ includes everything in /*2*/, /*2*/ should be faster. But, it actually depends on what -cmd- does. For example, if -cmd- itself translates -in `start'/`end'- to -if `touse'- after applying -syntax- and -marksample-, then /*1*/ might indeed be faster. Or if in some other way -cmd-'s execution time depends on the size of the entire data in memory, not just the number of observations -in `start'/`end'-, then /*1*/ might be better. You really have to try it both ways to see.

                               Generally, when I want to test two approaches to see which is faster, I don't do it with a really long, time-consuming case. I generally test on a case that is of moderate size but relatively representative of the use-cases I envision. Then I run it both ways, with -timer-s on, and compare. If the resulting difference is small, I might iterate the process in a loop so that the difference becomes more apparent.

                              Not relevant to #14, but earlier in the thread the question arose whether -preserve- and -restore- is faster than -save- and -use-. On page 390 of the [P] manual it says:
                              To preserve the data, preserve must make a copy of it on disk.
                              So I would think it makes no difference. But, empirically, there does seem to be a slight advantage to -preserve- and -restore-:

                              Code:
                              . clear*
                              
                              . sysuse auto
                              (1978 Automobile Data)
                              
                              . 
                              . tempfile holding
                              
                              . 
                              . timer on 1
                              
                              . forvalues i = 1/10000 {
                                2.         quietly {
                                3.                 save `"`holding'"', replace
                                4.                 use `holding', clear
                                5.         }
                                6. }
                              
                              . 
                              . timer off 1
                              
                              . timer list 1
                                 1:     18.66 /        1 =      18.6640
                              
                              . 
                              . timer on 2
                              
                              . forvalues i = 1/10000 {
                                2.         quietly {
                                3.                 preserve
                                4.                 restore
                                5.         }
                                6. }
                              
                              . timer off 2
                              
                              . timer list 2
                                 2:     17.59 /        1 =      17.5900
                              I don't understand why this is so, but I've run it a few times with different files and different numbers of replications, and the 5-6% difference is consistent across all trials.

