A faster summarize command.

Diego Villacreses

Join Date: Oct 2015

Posts: 10
#1

A faster summarize command.

25 Nov 2015, 21:03

Hello everyone, I have a do file that basically does this:

clear all
set obs 10000
gen x = rnormal()
gen y = .
forval x =1/`=_N'{
qui summ x if _n>= `x'-2 & _n <= `x'
qui replace y = r(sum) in `x'
}

I'm trying to run this code for huge datasets and is really slow. I think that perhaps if I write a faster program this problem would be reduced, so I wrote the following mata code:

cap mata mata drop csum()
mata
void csum(string scalar varlist, string scalar touse)
{
real scalar sum
sum = sum(tokens(varlist))
st_numscalar("sum", sum)
}
end
cap program drop csum
program define csum, rclass
syntax varlist [if] [in]
marksample touse
mata: csum("`varlist'", "`touse'")
return scalar sum = sum
end

clear all
set obs 10000
gen x = rnormal()
gen y = .
forval x =1/`=_N'{
csum x if _n>= `x'-2 & _n <= `x'
qui replace y = r(sum) in `x'
}

Which is actually slower than STATA summarize program. I have two questions:
1) How could I access to the summarize ado file? (I've already search in C:\ado and \stata\ado)
2) Do you have any idea of how could I make the first or the second code faster?

Thank you everyone
Tags: None
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#2

25 Nov 2015, 21:35

You don't have to use summarize.

Code:

clear all set obs 10000 gen x = rnormal() gen y = . replace y = x in 1 replace y = x[1] + x[2] in 2 replace y = x[_n-2] + x[_n-1] + x[_n] if _n > 2

Edit: The problem with your code is the unnecessary loop.

Last edited by Friedrich Huebler; 25 Nov 2015, 21:45.
1 like
Comment
Diego Villacreses

Join Date: Oct 2015

Posts: 10
#3

25 Nov 2015, 22:22

Dear Friedrich, you have my respect. This problem was cracking my head for days. Thank you.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35616
#4

26 Nov 2015, 01:16

Friedrich's suggestion is excellent. But those attracted to this thread by the title should perhaps check out

Code:

summarize, meanonly

It is certainly documented, but nevertheless often overlooked. See the puff at http://www.stata-journal.com/sjpdf.h...iclenum=st0135

My guess is that people learn summarize early in their Stata career and feel they know what it does. They know about summarize, detail but argue quite rightly that can't be a faster solution as it slows down summarize by calculating extra stuff you usually don't need or want. Either way, people think they know what summarize does rather well, and overlook the meanonly option. Or they have heard of the meanonly option but think it calculates the mean, only. That one really is down to StataCorp, as unusually, it is a bad name choice, as pointed out in the article cited.
1 like
Comment
Diego Villacreses

Join Date: Oct 2015

Posts: 10
#5

26 Nov 2015, 09:04

Thank you Nick, your solution is very useful. But it's possible to get only the standard deviation or the sum? Also, how can I see the ado-file of the summarize command? Because starting for there you can optimize the algorithm.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35616
#6

26 Nov 2015, 09:11

You can't see the ado file of summarize, as there isn't one: it is built-in code.

Code:

. which summarize built-in command: summarize

The documentation tells you what you can do. As explained already in the linked document, and in the help, the sum is provided by summarize, meanonly. For the SD, you must specify summarize. As getting the variance (equivalently the SD) is much of the work, it's hard to see how you could go much faster, but you could try your own Mata code, as there are Mata functions for getting the variance (and, naturally, the square root too).
1 like
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#7

26 Nov 2015, 09:34

I'm a bit late to the party but it appears that Diego is trying to calculate statistics over a rolling window. That's something that's easy to do with tsegen (from SSC). Here's a quick adaptation to the original example, including the standard deviation:

Code:

clear all set obs 10000 gen x = rnormal() gen long time = _n tsset time tsegen ysum = rowtotal(L(0/2).x) tsegen ysd = rowsd(L(0/2).x)
1 like
Comment
Diego Villacreses

Join Date: Oct 2015

Posts: 10
#8

26 Nov 2015, 10:47

Thank you Robert, your solution is really useful for my for my purposes (I'm building a program), this simplifies the code. Also, thank you Nick, you're right, summarize is a built in option, in that case, what can I do? I can't modify or understand the command algorithm? Because I'd like to get only one or two descriptive statistics. I've already written a MATA function and a STATA command that invoke that function, as you can see in the first post of this discussion, that command is actually a little slower than the STATA command summarize. Actually (this is one of the main reasons I started this post) summarize command is faster than MATA as you can see in the following code:
clear all
set seed 1234

set obs 100000
gen x = rnormal()

mata: st_view(x=.,.,x)

timer on 1
forval x =1/1000{
mata: s = mean(x)
mata: st_numscalar("s", s)
}
timer off 1

timer on 2
forval x =1/1000{
qui summ x
}
timer off 2

timer on 3
forval x =1/1000{
qui summ x, meanonly
}
timer off 3

timer list
Here is the output generated in my pc (Intel i7 - 2670, 8gb ram):

1: 3.29 / 1 = 3.2910
2: 2.23 / 1 = 2.2280
3: 1.20 / 1 = 1.1960

As you can see the summarize , meanonly is the fastest option, followed by summarize and the slowest one is the MATA option. This results worry me, because MATA isn't fast enough, for example if there isn't other option than summarize you're stock in your pc capacity. Finally that's the reason because I wanted to see the summarize code (I'd like to add options to it).
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35616
#9

26 Nov 2015, 10:57

Bottom line: To see the summarize source code, which is C code, you need to work at StataCorp. There is no other way.

Mata is fast, but I can't see that you could beat the compiled C code for the same task.
1 like
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#10

26 Nov 2015, 11:34

1. mean is based on quadcross() which is built-in, hence C-code.
2. You can reduce the execution time, by avoiding the loop. Select all the variables and put them in a single matrix. and then use the mean function. It will be faster than alternative 3.

Code:

mata: x = st_data(.,"x1-x1000") // or x = J(1,1000,st_data(.,"x")) mata: s = mean(x) mata: st_matrix("s",s")

3. I guess that doing 1000 times st_numscalar("s", s) is what slows down execution time.
Comment
Diego Villacreses

Join Date: Oct 2015

Posts: 10
#11

26 Nov 2015, 11:37

Thank you Nick, I think I can't go faster, all the posts were very useful, I had already finished my little program. My only wish is that STATA add a few options to summarize.
Comment
Diego Villacreses

Join Date: Oct 2015

Posts: 10
#12

26 Nov 2015, 12:39

Thank you Christophe, your solution actually is faster than summarize but with a different data structure, which could be reliable under certain circumstances, the solution is actually useful.
In the other hand, you could also try without st_numscalar("s", s), summarize still is the fastest option.
Comment
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#13

26 Nov 2015, 13:57

Well, it was a wrong guess then
But if you change st_view() to st_data() then mean() becomes slightly faster. I think it is documented in the manual that the advantage of views over copies of the data is that they take less memory, but will slow down execution (probably because views are essentially pointers to the data).

Code:

clear all set seed 1234 set obs 100000 gen x = rnormal() // mata: st_view(x=.,.,"x") mata: x = st_data(.,"x") timer on 1 forval x =1/1000{ mata:s = mean(x) } timer off 1 timer on 3 forval x =1/1000{ qui summ x, meanonly } timer off 3 timer list

Code:

timer list 1: 2.11 / 1 = 2.1110 3: 2.70 / 1 = 2.6990
Comment

Christophe Kolodziejczyk

Join Date: Mar 2014
Posts: 377

#14

26 Nov 2015, 14:10

Here is a variant of your program where you can compare the impact of st_view() and st_data()

Code:

clear
set obs 10000

drawnorm x1-x10

mata: x = st_data(.,.)
** mata: st_view(x=.,.,.)
timer clear

timer on 1
forval i=1/1000 {
qui sum x* , meanonly
}
timer off 1

timer on 2
forval i=1/1000 {
mata: s=mean(x)
}
timer off 2

With st_data()

Code:

. timer list
   1:      1.78 /        1 =       1.7780
   2:      0.76 /        1 =       0.7620

With st_view()

Code:

. timer list
   1:      2.11 /        1 =       2.1130
   2:      2.48 /        1 =       2.4770

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#15

26 Nov 2015, 15:34

In case the point is missed by those skimming this thread, the problem is not with summarize or the use of an equivalent function in mata. Stata is very efficient when you perform operations over all observations at the same time. The issue here is that you are repeating the summarize command over each observation. Even if you apply a condition, Stata has to cycle over each observation to determine which ones to use with the summarize command. So as the problem grows, so will the execution time (exponentially).

In contrast, tsegen will only perform the calculation once. The difference in execution time is dramatic. Here's how it compares with 10,000 obs

Code:

. clear all

. set obs 10000
number of observations (_N) was 0, now 10,000

. gen x = rnormal()

. 
. timer clear

. timer on 1

. gen long time = _n

. tsset time
        time variable:  time, 1 to 10000
                delta:  1 unit

. tsegen ysum = rowtotal(L(0/2).x)

. timer off 1

. 
. timer on 2

. gen y = .
(10,000 missing values generated)

. forval x =1/`=_N'{
  2. qui summ x if _n>= `x'-2 & _n <= `x'
  3. qui replace y = r(sum) in `x'
  4. }

. timer off 2

. timer list
   1:      0.01 /        1 =       0.0100
   2:      2.75 /        1 =       2.7460

Here's with 20,000 obs

Code:

. clear all

. set obs 20000
number of observations (_N) was 0, now 20,000

. gen x = rnormal()

. 
. timer clear

. timer on 1

. gen long time = _n

. tsset time
        time variable:  time, 1 to 20000
                delta:  1 unit

. tsegen ysum = rowtotal(L(0/2).x)

. timer off 1

. 
. timer on 2

. gen y = .
(20,000 missing values generated)

. forval x =1/`=_N'{
  2. qui summ x if _n>= `x'-2 & _n <= `x'
  3. qui replace y = r(sum) in `x'
  4. }

. timer off 2

. timer list
   1:      0.02 /        1 =       0.0170
   2:     10.15 /        1 =      10.1540

The moral of the story is to avoid looping over observations unless there's no other way to do it.

Announcement