  • Manually setting stata's sortlist without running a sorting command

    Good morning,

    I'm working with some large files, and I am trying to speed up egen commands.
    Looking into the implementations of _gtotal and _gmean, the functions that egen x = total(...), by(...) and egen x = mean(...), by(...) call under the hood, I noticed that they essentially sort the data and then use a by command to generate the desired result.
    In my particular case, the files happen to be sorted already; Stata just doesn't know about it.

    Additionally, when running something like:

    Code:
    * Raw egen
    set obs 100000000
    gen a = int(_n / 3)
    gen b = _n * _n
    egen c = sum(b), by(a)
    clear
    
    * Sort before
    set obs 100000000
    gen a = int(_n / 3)
    gen b = _n * _n
    sort a
    egen c = sum(b), by(a)

    the call to egen in the second section is about 25% faster, which would make a tremendous difference for my use case.
    Is there something I can do to tell Stata the data is already sorted?

    I was thinking that somehow manually setting the value that is returned by:
    Code:
    describe, varlist
    di "`r(sortlist)'"
    could maybe be an option, but I haven't found a way to do so.

    Thank you in advance,
    Jaqueline
    Last edited by Jaqueline Simmons; 10 Mar 2023, 09:48.

  • #2
    My understanding is that -by- checks whether the data is already sorted and does not sort it again if the sort order is appropriate. Given that, you could delete the sort commands from _gmean.ado and _gtotal.ado, save your own versions of those commands as, say, mygmean.ado and mygtotal.ado, and use those instead. I presume that -by- would halt the command for you if the current sort order doesn't match what it expects.
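
    As a rough sketch of what such a stripped-down total boils down to (this is not the actual contents of _gtotal.ado, and the small dataset is only for illustration), the group total can be computed with -by- directly. Note that -by- without -sort- stops with a "not sorted" error unless Stata has the data flagged as sorted, so this only helps once the sort order is registered:

    ```stata
    clear
    set obs 9
    generate a = int(_n / 3)
    generate double b = _n * _n

    * Sort once; a second -sort a- on data already flagged as sorted does no work
    sort a

    * Show which variables Stata believes the data are sorted by
    local sortvars : sortedby
    display "`sortvars'"

    * Running sum within each group, then copy the group's last value to every row
    by a: generate double c = sum(b)
    by a: replace c = c[_N]
    ```

    Like egen's total(), sum() treats missing values of b as zero, so the two should agree on the result.
    
    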



    • #3
      I just noticed that the user-written package -gtools- (see -net describe gtools, from(http://fmwww.bc.edu/RePEc/bocode/g)-) contains a -gegen- command that runs -egen- functions faster via some C code, so that might be worth a try, too. (I haven't used this package myself.)
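
      For reference, a minimal sketch of the -gegen- equivalent of the egen call in the first post, assuming -gtools- installs cleanly from SSC on your platform (the small dataset is only for illustration):

      ```stata
      * One-time install from SSC (needs internet access)
      ssc install gtools

      clear
      set obs 9
      generate a = int(_n / 3)
      generate double b = _n * _n

      * Same syntax as egen, with the by() groups handled by the gtools plugin
      gegen double c = total(b), by(a)
      ```

      Timing both versions on a realistic subset of the data would show whether the gain carries over to this use case.
      
      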



      • #4
        You may also consider -hashsort-, part of the -gtools- package mentioned by Mike Lacy, to sort your data faster than the built-in -sort-.
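
        A minimal sketch, assuming -gtools- is already installed; the -describe- call at the end is there only to confirm what sort order Stata has registered afterwards, which is worth checking rather than taking for granted:

        ```stata
        clear
        set obs 9
        generate a = mod(_n, 3)

        * Sort by a using the gtools plugin instead of the built-in sort
        hashsort a

        * Inspect the sort order Stata now reports
        describe, varlist
        display "`r(sortlist)'"
        ```
        
        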



        • #5
          Thank you for your help. I will probably use custom versions of these ado-files, as suggested, but I will check out the gtools options first.
