Good morning,
I'm working with some large files, and I am trying to accelerate egen commands.
Looking at the implementations of _gmean and _gtotal, the functions that egen x = total(...), by(...) and egen x = mean(...), by(...) call under the hood, I noticed that they sort the data before using a by prefix to generate the desired result.
In my particular case, the files are already sorted; Stata just doesn't know about it.
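For reference, here is a minimal sketch of the sort-then-by pattern as I understand it. It is a simplification, not the actual source of _gtotal, and the tiny dataset and variable names are made up for illustration.

```stata
* Simplified sketch of the sort-then-by pattern; not the actual _gtotal source
clear
set obs 9
gen a = int(_n / 3)
gen b = _n
sort a                          // the sort step that egen performs internally
by a: gen double c = sum(b)     // running sum within each group
by a: replace c = c[_N]         // copy each group's final running sum to every row
```

The sort step is the part I would like to skip when the data are already in the right order.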
Additionally, when running something like:
Code:
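* Raw egen: observations are in order of a, but no sort key is recorded
clear
set obs 100000000
gen a = int(_n / 3)
gen b = _n * _n
egen c = sum(b), by(a)
clear
* Sort before: sort a records a as the sort key before egen runs
set obs 100000000
gen a = int(_n / 3)
gen b = _n * _n
sort a
egen c = sum(b), by(a)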
the call to egen in the second section is about 25% faster, which would make a tremendous difference for my use case.
Is there something I can do to tell Stata that the data is already sorted?
I was thinking that somehow manually setting the value that is returned by:
Code:
describe, varlist
di "`r(sortlist)'"
could maybe be an option, but I haven't found a way to do so.
Thank you in advance,
Jaqueline