  • computation time of egen function "skew"

    Dear Statalist,

    I am generating a set of variables in the following fashion:

    by panel_id: egen pop_mean=mean(pop)
    by panel_id: egen pop_sd=sd(pop)
    by panel_id: egen pop_sk=skew(pop)

    I have not encountered any errors so far. The problem is that I started executing the code a few days ago (my data contain tens of millions of observations). Based on what I see in the results window, the first two lines (mean, sd) take only minutes, but the third line (skew) looks like it will take forever. I would be grateful for any advice: does skew really take much longer than mean and sd, or is there a faster approach to getting the skewness (or an approximation to it)?

    Thanks in advance.

    Charlie.


  • #2
    For what it's worth, this is the code used to calculate the skewness:
    Code:
    *! version 1.2.3  09feb2015
    program define _gskew
            version 6.0, missing
            syntax newvarname =/exp [if] [in] [, BY(varlist)]        
            quietly {
                    tempvar touse group
                    mark `touse' `if' `in'
                    sort `touse' `by'
                    by `touse' `by' : gen `c(obs_t)' `group' = _n == 1 if `touse'
                    replace `group' = sum(`group')
                    local max = `group'[_N]
                    gen `typlist' `varlist' = .
                    local i 1
                    while `i' <= `max' {
                            su `exp' if `group' == `i', detail
                            replace `varlist' = r(skewness) if `group' == `i'
                            local i = `i' + 1
                    }
            }
    end
    It seems the skewness is simply extracted from the summarize, detail command. The problem is that summarize, detail calculates a lot of statistics (quantiles, variance, kurtosis, ...) that you don't need and which slow down the calculation. It might be faster to calculate the skewness manually. You can find Stata's approach to calculating skewness in the manual entry for summarize, available online at www.stata.com/manuals13/rsummarize.pdf on page 8. It is just a matter of subtracting the mean and raising to certain powers, so it should not take excessively long. You can check your results against the summarize command using a small subsample.

    If that is still too slow, you can try implementing it in Mata, but it's not exactly the most user friendly part of Stata...
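
    As a concrete illustration of the manual route, here is a minimal sketch using the moment formula from the summarize manual (skewness = m3 / m2^(3/2), with the moments taken around the group mean). The temporary variable names (mu, d, m2, m3, pop_sk_manual) are made up for this example, and it assumes the data are already sorted by panel_id as in #1.

    Code:
    * rough sketch of a manual per-group skewness (illustrative names only)
    by panel_id: egen double mu = mean(pop)
    gen double d  = pop - mu
    gen double d2 = d^2
    gen double d3 = d^3
    by panel_id: egen double m2 = mean(d2)
    by panel_id: egen double m3 = mean(d3)
    gen double pop_sk_manual = m3 / m2^(3/2)
    drop mu d d2 d3 m2 m3
    On a small subsample this should agree with egen's skew() up to float precision.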



    • #3
      Thanks for the reference, Jesse. I am a bit surprised that egen skew makes Stata calculate unnecessary statistics which slow things down; I would have imagined that every function serves the purpose of that function and nothing more.



      • #4
        You can perform the same calculations using rangestat (from SSC). The most recent version includes a skewness function and is much faster than the egen skew() function. Here's a quick example that shows how to duplicate the egen results on a small number of panels.

        Code:
        clear
        set seed 3214
        set obs 100
        gen long panel_id = _n
        expand 100
        bysort panel_id: gen time = _n
        gen pop = runiform()
        
        by panel_id: egen pop_mean0=mean(pop)
        by panel_id: egen pop_sd0=sd(pop)
        by panel_id: egen pop_sk0=skew(pop)
        
        rangestat (mean) pop (sd) pop (skewness) pop, interval(time . .) by(panel_id)
        assert pop_mean0 == float(pop_mean)
        assert pop_sd0 == float(pop_sd)
        assert pop_sk0 == float(pop_skewness)



        • #5
          It's always a tradeoff between efficiency and coding time. A lot of egen functions are just a wrapper around summarize, and because it's so damn fast this works fine even for larger datasets. It's only once you start talking really big data that this becomes more tricky and you might need to write a more efficient single-purpose program.

          Sometimes others have already done this for us. For a lot of statistics you could use Sergio Correia's ftools suite (fegen/fcollapse), where the f stands for fast (I think). If you plan on doing regressions, definitely check out his reghdfe command, which is at times dozens if not hundreds of times faster than reg/xtreg and is crazy versatile. If you want to calculate pairwise correlations, you can use pwcorrf, which also allows for fast within-group correlation calculations.

          timeit is also somewhat useful: it allows you to prefix any line with timeit <timer number>: and then use the standard timer list to see how long that line took. It's less cumbersome than having to put timer on/timer off everywhere. (Disclaimer: I wrote pwcorrf and timeit.)
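
          For instance, a minimal sketch of timing the original egen calls with the timeit prefix (this assumes timeit is installed and reuses the variable names from #1):

          Code:
          * illustrative only: timing each of the egen calls from #1 with timeit
          capture drop pop_mean pop_sd pop_sk
          timer clear
          timeit 1: egen pop_mean = mean(pop), by(panel_id)
          timeit 2: egen pop_sd   = sd(pop),   by(panel_id)
          timeit 3: egen pop_sk   = skew(pop), by(panel_id)
          timer list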



          • #6
            Originally posted by Jesse Wursten View Post
            It's always a tradeoff between efficiency and coding time. A lot of egen functions are just a wrapper around summarize, and because it's so damn fast this works fine even for larger datasets. It's only once you start talking really big data that this becomes more tricky and you might need to write a more efficient single-purpose program.

            Sometimes others have already done this for us. For a lot of statistics you could use Sergio Correia's ftools suite (fegen/fcollapse), where the f stands for fast (I think). If you plan on doing regressions, definitely check out his reghdfe command, which is at times dozens if not hundreds of times faster than reg/xtreg and is crazy versatile. If you want to calculate pairwise correlations, you can use pwcorrf, which also allows for fast within-group correlation calculations.

            timeit is also somewhat useful: it allows you to prefix any line with timeit <timer number>: and then use the standard timer list to see how long that line took. It's less cumbersome than having to put timer on/timer off everywhere. (Disclaimer: I wrote pwcorrf and timeit.)
            Thanks a lot Jesse. All very useful advice.



            • #7
              Originally posted by Robert Picard View Post
              You can perform the same calculations using rangestat (from SSC). The most recent version includes a skewness function and is much faster than the egen skew() function. Here's a quick example that shows how to duplicate the egen results on a small number of panels.

              Code:
              clear
              set seed 3214
              set obs 100
              gen long panel_id = _n
              expand 100
              bysort panel_id: gen time = _n
              gen pop = runiform()
              
              by panel_id: egen pop_mean0=mean(pop)
              by panel_id: egen pop_sd0=sd(pop)
              by panel_id: egen pop_sk0=skew(pop)
              
              rangestat (mean) pop (sd) pop (skewness) pop, interval(time . .) by(panel_id)
              assert pop_mean0 == float(pop_mean)
              assert pop_sd0 == float(pop_sd)
              assert pop_sk0 == float(pop_skewness)
              Thanks a lot, Robert. The idea of using rangestat to calculate skewness had indeed come to my mind when I first learnt about rangestat today.



              • #8
                Jesse is right. Indeed archaeology shows that the official function code is not much different from my 1999 original.

                Some experiments suggest that substituting Mata code by itself doesn't give massive improvements. (See http://www.stata-journal.com/sjpdf.h...iclenum=st0204 for some sample code.)

                I guess that the real culprits here may include how many groups you have. At worst you're looping over many, many groups and using lots of if. Both can spend time extravagantly.

                I got timings of ~10 s for about 1e7 observations and 5 groups on a rather old PC, whatever way I did it, so I can't reproduce or explain "days", but I have not tried hard to slow things down.
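
                For the curious, here is a rough sketch of what a single-pass Mata version might look like (an illustration only, not the code from the article cited above; the new variable name pop_sk_mata is made up). It sorts once and uses panelsetup() rather than looping with if qualifiers, applying the same moment formula summarize uses:

                Code:
                sort panel_id
                mata:
                    x    = st_data(., "pop")
                    id   = st_data(., "panel_id")
                    info = panelsetup(id, 1)            // one row per group
                    sk   = J(rows(x), 1, .)
                    for (g = 1; g <= rows(info); g++) {
                        xg = panelsubmatrix(x, g, info)
                        d  = xg :- mean(xg)
                        // skewness = m3 / m2^(3/2), moments about the group mean
                        sk[|info[g,1] \ info[g,2]|] = J(rows(xg), 1, mean(d:^3) / mean(d:^2)^1.5)
                    }
                    st_store(., st_addvar("double", "pop_sk_mata"), sk)
                end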



                • #9
                  Originally posted by Nick Cox View Post
                  Jesse is right. Indeed archaeology shows that the official function code is not much different from my 1999 original.

                  Some experiments suggest that substituting Mata code by itself doesn't give massive improvements. (See http://www.stata-journal.com/sjpdf.h...iclenum=st0204 for some sample code.)

                  I guess that the real culprits here may include how many groups you have. At worst you're looping over many, many groups and using lots of if. Both can spend time extravagantly.

                  I got timings of ~10 s for about 1e7 observations and 5 groups on a rather old PC, whatever way I did it, so I can't reproduce or explain "days", but I have not tried hard to slow things down.
                  Thanks a lot, Nick - you have addressed some of the questions in my head. You are right - I have many groups (around 1 million). But what surprised me is the difference in computation time between skew and the others (e.g. mean).



                  • #10
                    Woosh. I wouldn't much trust skewness for samples of order 10. The paper cited in #8 says much more. (mean - median) / SD often works quite well as a measure of skewness. It's nicely bounded by [-1, 1].
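
                    A minimal sketch of that alternative measure by group, assuming the pop_mean and pop_sd variables from #1 already exist (pop_med and pop_sk_alt are illustrative names):

                    Code:
                    * (mean - median) / SD as a bounded skewness measure, per group
                    by panel_id: egen double pop_med = median(pop)
                    gen double pop_sk_alt = (pop_mean - pop_med) / pop_sd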



                    • #11
                      I was going to say that mean/sd use the plain summarize command instead of sum, d and that's why they are much faster, but actually this is not true. They use the sum() function and take the last value per group (because sum() is a running, "incremental" sum). I would expect a version based on summarize (the command) to be much faster, but I am too lazy to test it, especially because sum, meanonly is really, really fast.

                      Code:
                      clear all
                      set obs 10000000
                      gen x = rnormal()
                      timeit 1: sum x
                      timeit 2: sum x, d
                      timeit 3: sum x, meanonly
                      timer list
                      gives

                      Code:
                         1:      0.20 /        1 =       0.2010
                         2:     13.93 /        1 =      13.9290
                         3:      0.11 /        1 =       0.1070
                      0.11 seconds to calculate the mean across 10 million observations is really mad when you think about it. Anyway, the code used for mean() and sd() dates from October 2004, so that might be why they use sum() rather than sum, meanonly. In fact, this might very well still be Nick's original code ...
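
                      For illustration, this is roughly what that sum()-based approach looks like for a grouped mean (a sketch that ignores missing values; the names running and pop_mean2 are made up):

                      Code:
                      * running sum: the last value within each group is the group total
                      bysort panel_id: gen double running = sum(pop)
                      by panel_id: gen double pop_mean2 = running[_N] / _N
                      drop running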



                      • #12
                        Originally posted by Nick Cox View Post
                        Woosh. I wouldn't much trust skewness for samples of order 10. The paper cited in #8 says much more. (mean - median) / SD often works quite well as a measure of skewness. It's nicely bounded by [-1, 1].
                        I guess you are probably right, Nick. I will see whether skewness is of any use in my model.



                        • #13
                          Originally posted by Jesse Wursten View Post
                          I was going to say that mean/sd use the plain summarize command instead of sum, d and that's why they are much faster, but actually this is not true. They use the sum() function and take the last value per group (because sum() is a running, "incremental" sum). I would expect a version based on summarize (the command) to be much faster, but I am too lazy to test it, especially because sum, meanonly is really, really fast.

                          Code:
                          clear all
                          set obs 10000000
                          gen x = rnormal()
                          timeit 1: sum x
                          timeit 2: sum x, d
                          timeit 3: sum x, meanonly
                          timer list
                          gives

                          Code:
                          1: 0.20 / 1 = 0.2010
                          2: 13.93 / 1 = 13.9290
                          3: 0.11 / 1 = 0.1070
                          0.11 seconds to calculate the mean across 10 million observations is really mad when you think about it. Anyway, the code used for mean() and sd() dates from October 2004, so that might be why they use sum() rather than sum, meanonly. In fact, this might very well still be Nick's original code ...
                          Thanks, Jesse. I ran your code and got:

                          Code:
                          . timer list
                             1:      0.14 /        1 =       0.1400
                             2:      7.03 /        1 =       7.0310
                             3:      0.07 /        1 =       0.0680

                          Sorry, I am not quite following you - does the 7.03 vs. 0.07 here shed light on the difference in time required for calculating skew versus mean?



                          • #14
                            With the detail option, summarize needs to sort the data. If there are lots of by() groups, the data will need to be sorted repeatedly when using the egen skew() function. No doubt this could be sped up. The rangestat solution is available now and plenty fast.

                            Code:
                            clear all
                            set seed 3214
                            set obs 10000
                            gen long panel_id = _n
                            expand 100
                            bysort panel_id: gen time = _n
                            gen pop = runiform()
                            
                            timer on 1
                            by panel_id: egen pop_sk0=skew(pop)
                            timer off 1
                            
                            timer on 2
                            rangestat (skewness) pop, interval(time . .) by(panel_id)
                            timer off 2
                            
                            timer list
                            
                            assert pop_sk0 == float(pop_skewness)
                            and the results:
                            Code:
                            . timer list
                               1:    472.91 /        1 =     472.9100
                               2:      2.27 /        1 =       2.2650



                            • #15
                              Originally posted by Robert Picard View Post
                              With the detail option, summarize needs to sort the data. If there are lots of by() groups, the data will need to be sorted repeatedly when using the egen skew() function. No doubt this could be sped up. The rangestat solution is available now and plenty fast.

                              Code:
                              clear all
                              set seed 3214
                              set obs 10000
                              gen long panel_id = _n
                              expand 100
                              bysort panel_id: gen time = _n
                              gen pop = runiform()
                              
                              timer on 1
                              by panel_id: egen pop_sk0=skew(pop)
                              timer off 1
                              
                              timer on 2
                              rangestat (skewness) pop, interval(time . .) by(panel_id)
                              timer off 2
                              
                              timer list
                              
                              assert pop_sk0 == float(pop_skewness)
                              and the results:
                              Code:
                              . timer list
                              1: 472.91 / 1 = 472.9100
                              2: 2.27 / 1 = 2.2650
                              Robert, I was just about to do this comparison... thanks.

