  • #16
    Originally posted by charlie wong:

    Thanks Jesse. I used your code and get:

    . timer list
    1: 0.14 / 1 = 0.1400
    2: 7.03 / 1 = 7.0310
    3: 0.07 / 1 = 0.0680

    Sorry I am not quite following you - is it that the 7.03 vs 0.07 here sheds light on the time required for calculating skew and mean?
    Your computer is considerably faster than mine; that's one conclusion :D. But I realise now the comparison is not entirely correct. It should be:
    Code:
    clear all
    set obs 10000000
    gen x = rnormal()
    * timeit is a user-written prefix; the same timings can be obtained by
    * wrapping each command in -timer on #- / -timer off #- and then -timer list-
    timeit 1: gen sum = sum(x)
    timeit 2: sum x, d
    timeit 3: sum x, meanonly
    timer list
    . timer list
    1: 0.52 / 1 = 0.5220
    2: 14.73 / 1 = 14.7260
    3: 0.11 / 1 = 0.1050

    If we compare 1 and 2, we see that sum x, d takes far longer than sum(x); these are, respectively, the commands used by skew() and by mean/sd(), which explains why performance for mean/sd was still fine while skew was not. Furthermore, comparing 1 and 3, using sum x instead of sum(x) could potentially speed up mean/sd by a further factor of 5. Of course, this ignores some details, but the contrast might actually be starker still ...
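    As a sketch of the faster route hinted at above (variable names are illustrative, not the original setup): after -summarize, meanonly-, the mean is returned in r(mean), so no extra running-sum variable is needed.

    Code:
    * Sketch: fetch the mean via the fast -meanonly- path
    clear
    set obs 1000000
    gen x = rnormal()
    summarize x, meanonly
    display r(mean)
    * r(N), r(sum), r(min), r(max) are also returned; for r(sd),
    * use plain -summarize x- (still much cheaper than -summarize, detail-)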

    Some might say I'm obsessed with speed tests (I may or may not have a folder on my PC with a bunch of different speed comparisons...). Did you know that egen tag = tag(<varlist>) followed by drop if tag == 0 is considerably faster (roughly 20-50% in my stylised test) than duplicates drop, force? Or that a single drop <varlist> (multiple variables at once) is massively faster than a succession of drop <varname> commands (one variable at a time)? Well, now you do.
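    A minimal sketch of that duplicates comparison, assuming made-up key variables id and grp (timings will of course vary by machine):

    Code:
    clear all
    set obs 1000000
    gen id  = floor(runiform()*100000)
    gen grp = floor(runiform()*10)
    timer clear
    * method 1: duplicates drop
    preserve
    timer on 1
    duplicates drop id grp, force
    timer off 1
    restore
    * method 2: tag one observation per group, drop the rest
    timer on 2
    egen byte tagged = tag(id grp)
    drop if tagged == 0
    drop tagged
    timer off 2
    timer list

    Both methods keep one observation per (id, grp) group; the preserve/restore ensures each method runs on the same data.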



    • #17
      Originally posted by Jesse Wursten in #16 (quoted in full above).
      Speed is king... now that I'm dealing with a huge dataset. Thanks so much for sharing these tips on speeding things up!



      • #18
        Riffs on speed tests in other problems aside, as his co-author I would like to underline Robert Picard's point in #14 that rangestat already provides faster code.

        Anyone following the forum closely may wonder why I didn't make this point myself in #8. The answer lies in an otherwise uninteresting cautionary tale. I tried some speed tests with rangestat and was surprised not to see a massive speed-up, so left the point on one side. Only later did it become clear that there were other problems on my machine and/or local network which were responsible for the slowdown. So, as everyone tells you, apparent speeds depend on your computer and what else it is doing or not doing.
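        For reference, a minimal sketch of the rangestat route (rangestat is a community-contributed command from SSC; the variable names and window below are illustrative, not the original poster's setup):

        Code:
        * ssc install rangestat
        * Rolling 5-period mean and sd of x over a running time variable t
        clear
        set obs 1000
        gen t = _n
        gen x = rnormal()
        rangestat (mean) x (sd) x, interval(t -4 0)
        * results arrive as new variables x_mean and x_sd;
        * skewness is also among rangestat's supported statistics

        Because rangestat computes all requested statistics in a single pass over each window, it avoids the repeated -summarize, detail- calls that made the earlier approach slow.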



        • #19
          Thank you, Robert.
