
  • egen min but faster

    Hi
    I often find myself working with large data files in long format.

    Typically, I am looking to tag a unique row and then fill its value across the rest of the panel.

    Code:
    by panel, sort : egen filled = min(tag)
    However, this is painfully slow across 10 million rows.

    By contrast, a piece of code from a Nick Cox presentation is much faster:
    Code:
    gen filled = .
    bysort panel (tag) : replace filled = min(filled[_n-1], tag)
    egen takes 40 seconds across 10 million rows, whereas bysort with replace takes 20 seconds.

    I would have thought they would be doing similar things, so why the time difference?

    Or does anyone have any faster suggestions?

    bw

    Adrian


  • #2
    I think looking at the code is the main answer.

    Code:
    viewsource egen.ado
    egen, plus whatever egen function you call, amounts to several command lines for Stata to interpret, even at the best of times. It also does things on your behalf, such as creating a temporary variable holding the sort order, restoring that order afterwards, and deleting the temporary variable. Most of those lines will be trivial to implement, but that's the basic difference. To get the groupwise minimum,

    Code:
    bysort panel (myvar) : gen min  = myvar[1]
    is just one line of Stata to interpret, and the corresponding code is compiled.
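
    As a minimal sketch of why the one-liner gives the same answer as egen, here is a toy dataset (the values are made up for illustration). Sorting on myvar within panel puts the smallest non-missing value first, because missing values sort last, so myvar[1] is the groupwise minimum:

    ```stata
    clear
    input panel myvar
    1 5
    1 2
    1 .
    2 7
    2 3
    end

    * One-line approach: sort by myvar within panel, then take the first value
    bysort panel (myvar) : gen min1 = myvar[1]

    * egen equivalent, for comparison (data are already sorted by panel)
    by panel : egen min2 = min(myvar)

    * Both variables should hold the same groupwise minimum
    assert min1 == min2
    list, sepby(panel)
    ```

    If every value in a panel were missing, both approaches would return missing for that panel.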



    • #3
      Thanks Nick,

      Your suggestion is even faster, amazing.

      Much appreciated.
      A



      • #4
        Hi Adrian,

        If you have a large dataset, the ftools package might help you:

        (It's on GitHub at https://github.com/sergiocorreia/ftools/ and works best with Stata 13 or newer.)

        In the code below, I compare the three approaches (ftools, egen, and the one-liner from Nick's post). With 10 million observations and 1000 levels of panel, ftools takes about 7 seconds, versus 31 for egen and 14 for Nick's one-liner.

        Code:
           1:      7.19 /        1 =       7.1870
           2:     31.36 /        1 =      31.3550
           3:     14.76 /        1 =      14.7610


        Code:
        /*
        // Install with these lines:
        cap ado uninstall ftools
        net install ftools, from(https://github.com/sergiocorreia/ftools/raw/master/src/)
        ftools, compile
        */
        
        clear all
        timer clear
        set obs 10000000
        
        gen long panel = int(runiform()*1000)
        gen double tag = rnormal()
        
        preserve
        timer on 1
        // This is the same as collapse, but the -merge- option will merge back the results
        fcollapse (min) filled1=tag, by(panel) merge
        timer off 1
        restore
        
        preserve
        timer on 2
        bysort panel: egen double filled2 = min(tag)
        timer off 2
        restore
        
        preserve
        timer on 3
        bysort panel (tag) : gen double filled3  = tag[1]
        timer off 3
        restore
        
        timer list
        
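        // Left commented out: each -restore- above discards its filled* variable,
        // so these checks only work if the preserve/restore pairs are removed first.
        //assert filled1==filled2
        //assert filled1==filled3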
        exit
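
        To check that the approaches agree, one option is to rerun them without preserve/restore so the result variables survive. The sketch below does this for the egen and bysort versions only. The seed and the smaller number of observations are my own assumptions, chosen to keep the check quick. The ftools result could be compared in the same way if the package is installed.

        ```stata
        clear all
        set seed 12345
        set obs 100000

        gen long panel = int(runiform()*1000)
        gen double tag = rnormal()

        * No preserve/restore here, so both result variables are kept
        bysort panel : egen double filled2 = min(tag)
        bysort panel (tag) : gen double filled3 = tag[1]

        * The two approaches should agree exactly in every row
        assert filled2 == filled3
        ```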
