Manually setting Stata's sortlist without running a sorting command

Jaqueline Simmons

Join Date: Mar 2023

Posts: 2
#1

10 Mar 2023, 09:05

Good morning,

I'm working with some large files, and I am trying to accelerate egen commands.
Looking into the implementation of _gmean and _gtotal, the functions that egen x = total(...), by(...) and egen x = mean(...), by(...) call under the hood, I noticed that they essentially sort the data before using a by command to generate the desired result.
In my particular case, the files happen to be sorted already; Stata just doesn't know about it.

Additionally, when running something like:

Code:

* Raw egen
set obs 100000000
gen a = int(_n / 3)
gen b = _n * _n
egen c = sum(b), by(a)

clear

* Sort before
set obs 100000000
gen a = int(_n / 3)
gen b = _n * _n
sort a
egen c = sum(b), by(a)

the call to egen in the second section is about 25% faster, which would make a tremendous difference for my use case.
Is there something I can do to tell Stata the data is already sorted?

I was thinking that somehow manually setting the value that is returned by:

Code:

describe, varlist
di "`r(sortlist)'"

could maybe be an option, but I haven't found a way to do so.

Thank you in advance,
Jaqueline

Last edited by Jaqueline Simmons; 10 Mar 2023, 09:48.
Mike Lacy

Join Date: Apr 2014

Posts: 2413
#2

10 Mar 2023, 11:16

My understanding is that -by- checks to see that the data is sorted and doesn't sort it if the data set is sorted appropriately. Given that, you could just delete the sort commands out of _gmean.ado and _gtotal.ado and save your own versions of those commands as, say, mygmean.ado and mygtotal.ado and use those instead. I presume that -by- would halt the command for you if the current sort order doesn't match what it expects.
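Before editing the ado-files, the assumption about how -by- reacts to data that are physically ordered but not flagged as sorted can be checked on a small example. This is only a sketch: the variable names follow post #1, the sample size is arbitrary, and -capture noisily- is used so the do-file keeps running whatever -by- decides.

```stata
clear
set obs 1000
gen a = int(_n / 3)
gen double b = _n * _n

* The rows are physically in order of a, but no sort command has been run.
* The extended macro function -sortedby- reports what Stata has recorded.
local sorted : sortedby
display "sortedby before sort: `sorted'"

* Try -by- without sorting first and show the return code either way.
capture noisily by a: gen double c = sum(b)
display "return code from -by-: " _rc

* For comparison, the recorded sort order after an explicit -sort-.
sort a
local sorted : sortedby
display "sortedby after sort: `sorted'"
```

If the return code is nonzero here, a copy of _gtotal.ado with the sort simply removed would stop at the same point.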
Mike Lacy

Join Date: Apr 2014

Posts: 2413
#3

10 Mar 2023, 11:21

I just noticed that the user-written module -gtools- (see -net describe gtools, from(http://fmwww.bc.edu/RePEc/bocode/g)-) contains a -gegen- command that executes -egen- calculations faster via some C code, so that might be worth a try, too. (I haven't used this module myself.)
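A rough way to compare the two on your own machine is sketched below. It assumes -gtools- installs cleanly from SSC and that -gegen- accepts the same total(...), by(...) call as -egen-; the sample size is arbitrary and the timings will depend on the data.

```stata
* One-time install from SSC (assumes internet access)
ssc install gtools

clear
set obs 1000000
gen a = int(_n / 3)
gen double b = _n * _n

timer clear

* Timer 1: built-in egen
timer on 1
egen double c1 = total(b), by(a)
timer off 1

* Timer 2: gegen from gtools
timer on 2
gegen double c2 = total(b), by(a)
timer off 2

timer list

* Check that both commands agree up to floating-point tolerance
assert reldif(c1, c2) < 1e-12
```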
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2400
#4

10 Mar 2023, 12:42

You may also consider -hashsort-, part of the -gtools- package mentioned by Mike Lacy, to sort your data first; it may be faster than the built-in -sort-.
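A minimal sketch of that workflow, assuming -gtools- is already installed; the data are made up for illustration, and displaying the recorded sort order afterwards shows whether Stata treats the result as sorted by a.

```stata
clear
set obs 1000000
gen a = int(runiform() * 1000)
gen double b = _n

* Sort with hashsort instead of the built-in sort
hashsort a

* Show the sort order Stata has recorded after hashsort
local sorted : sortedby
display "sortedby after hashsort: `sorted'"

egen double c = total(b), by(a)
```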
Jaqueline Simmons

Join Date: Mar 2023

Posts: 2
#5

13 Mar 2023, 04:08

Thank you for your help. I will probably use some custom version of these ados, as mentioned, but I will check out gtools' options first.
