  • Gtools update available on SSC: greshape, gstats winsor, gstats tab, and more!

    Thanks to Kit Baum, an update to the package gtools is now available for download from SSC. In Stata 13.1 or later, run

    Code:
    ssc install gtools, replace
    See the original announcement here. In short, gtools implements faster versions of several Stata commands, including: collapse, reshape, xtile, tabstat, isid, egen, pctile, winsor, contract, levelsof, duplicates, and unique/distinct. For details on the package, see the official documentation. For details on the update, see the release notes. Some highlights:

    New commands:
    • greshape long/wide, 4-20x faster than reshape long/wide (additionally accepts any number of i or j variables).
    • greshape gather/spread, similar to long/wide but made to mimic the gather and spread commands in R's tidyr package.
    • gstats tab, 5-40x faster than tabstat (additionally accepts any number of grouping variables).
    • gstats sum, 5-10x faster than sum, detail (regular summarize is not slow, but -detail- is slow to compute all the percentiles).
    • gstats winsor, 10-20x faster than winsor2.
    New features:
    • gcollapse, gegen, and gstats tab now allow the following statistics:
      • select# and select-#, to select the #th smallest or largest value
      • rawselect# and rawselect-#, as above but ignoring weights.
      • cv, to compute the coefficient of variation
      • variance
      • range
    • gtop and glevelsof can save their results in a mata object via mata(name).
    • gtop (gtoplevelsof) can list all the levels via ntop(.), similar to tablist; ntop(-.) lists them from least to most common, and the option -alpha- lists the top levels in variable order instead of frequency order.
    • greshape allows varlist syntax for long to wide reshapes (though this cannot be combined with @ in the same stub); wide to long matches do not allow varlist syntax, but complex matches can be achieved via the option match(regex), which takes the stubs to be regular expressions (details here).
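    As a rough illustration of the new statistics listed above, the sketch below collapses the auto dataset shipped with Stata by foreign. The dataset and the output variable names (p_min, p_max, and so on) are chosen here for illustration only and do not come from the release notes; the statistic names follow the list above.

    ```stata
    sysuse auto, clear
    * select1 / select-1: smallest and largest price within each group
    * range, cv, variance: the other statistics named in the list above
    gcollapse (select1)  p_min = price ///
              (select-1) p_max = price ///
              (range)    p_rng = price ///
              (cv)       p_cv  = price ///
              (variance) p_var = price, by(foreign)
    list
    ```

    If the statistics behave as described, p_rng should equal p_max minus p_min in each row.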
    Some quick benchmarks for the new commands (run on Stata 15/MP for Unix, 8 cores):

    Code:
    clear all
    ssc install winsor2
    
    program bench
        gettoken timer call: 0,    p(:)
        gettoken colon call: call, p(:)
        cap timer clear `timer'
        timer on `timer'
        `call'
        timer off `timer'
        qui timer list
        c_local r`timer' `=r(t`timer')'
    end
    
    set obs 10000000
    gen groups = int(runiform() * 1000)
    gen smallg = mod(groups, 10)
    gen rsort  = rnormal()
    gen rvar   = rnormal()
    gen ix     = _n
    sort rsort
    
    preserve
        rename (rsort rvar) (r1 r2)
        bench 11: greshape long r, i(ix) j(j)
    restore, preserve
        rename (rsort rvar) (r1 r2)
        greshape long r, i(ix) j(j) nochecks
        bench 16: greshape wide r, i(ix) j(j)
    restore, preserve
        rename (rsort rvar) (r1 r2)
        bench 10: reshape long r, i(ix) j(j)
    restore, preserve
        rename (rsort rvar) (r1 r2)
        greshape long r, i(ix) j(j) nochecks
        bench 15: reshape wide r, i(ix) j(j)
    restore
    
    bench 21: qui gstats winsor rvar, s(_wg)
    bench 20: qui winsor2 groups
    
    bench 26: qui gstats sum rvar
    bench 25: qui sum rvar, detail
    
    bench 31: qui gstats tab rvar, by(smallg) s(n mean min max)
    bench 30: qui tabstat rvar,    by(smallg) s(n mean min max)
    
    local commands       ///
            reshape_long ///
            reshape_wide ///
            winsor       ///
            sum_detail   ///
            tabstat
    
    local bench_table `"       Versus | Native | gtools | % faster "'
    local bench_table `"`bench_table'"' _n(1) `" ------------ | ------ | ------ | -------- "'
    forvalues i = 10(5)30 {
        gettoken cmd commands: commands
        local pct      "`:disp %7.2f  100 * (`r`i'' - `r`=`i'+1'') / `r`i'''"
        local dnative  "`:disp %6.2f `r`i'''"
        local dgtools  "`:disp %6.2f `r`=`i'+1'''"
        local cmd      `"`:disp %12s "`cmd'"'"'
        local bench_table `"`bench_table'"' _n(1) `" `cmd' | `dnative' | `dgtools' | `pct'% "'
    }
    disp _n(1) `"`bench_table'"'
    Results

    Code:
          Versus | Native | gtools | % faster 
    ------------ | ------ | ------ | -------- 
    reshape_long | 111.63 |   8.21 |   92.65% 
    reshape_wide | 127.61 |  16.52 |   87.05% 
          winsor |  28.87 |   1.17 |   95.96% 
      sum_detail |  30.50 |   1.63 |   94.65% 
         tabstat |  32.63 |   1.03 |   96.83%
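
    For reading the table: the "% faster" column is the share of the native run time that is saved, as computed in the loop above. A quick check on the first row, with the two timings copied from the table (the local names native and gtools are only for illustration):

    ```stata
    * First row of the table: reshape long, native vs. gtools (seconds)
    local native = 111.63
    local gtools = 8.21
    * "% faster" as computed in the benchmark loop: share of native time saved
    display %7.2f 100 * (`native' - `gtools') / `native'
    * The same comparison expressed as a speed-up multiple
    display %7.1f `native' / `gtools'
    ```

    The first line reproduces the 92.65 in the table; the second shows the same row as roughly a 13.6x speed-up.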

  • #2
    Dear Mauricio,
    I am curious to test the performance with your code on my own system. I changed the current directory to a folder to which I am certain files can be written. However, after the line:
    Code:
    bench 15: reshape wide r, i(ix) j(j)
    I get the following error message:
    Code:
    (note: j = 1 2)
    file C:\Users\Vught\AppData\Local\Temp\ST_1a70_000004.tmp cannot be modified or erased;
        likely cause is read-only directory or file
    r(608);
    which is probably correct. But I do not see why your code is not saving into my current directory and is instead looking for some Windows or Stata system folder.
    Possibly you can tell me what to do to make this work.
    Best regards,
    Eric
    http://publicationslist.org/eric.melse


    • #3
      ericmelse That is a very interesting error. I am merely using Stata's own "tempfile" functionality. Nothing more than "tempfile a" and using the local "a" to save temporary files. My idea was trying to let Stata determine where to save temporary files. I coded a workaround to this, however. Can you try

      Code:
      gtools, upgrade branch(develop)
      and define

      Code:
      global GTOOLS_TEMPDIR .
      before running the benchmarks? (Make sure you do this after "clear all"; last, "." is the current directory, but it can be set to any directory that exists.)
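
      To see where Stata itself places temporary files, and to point the workaround at a specific folder, something like the following may help. c(tmpdir), c(pwd), and tempfile are standard Stata; passing the absolute path of the working directory to GTOOLS_TEMPDIR is an assumption based on the note above that any existing directory can be used.

      ```stata
      clear all
      * Folder Stata uses for its own temporary files
      display c(tmpdir)
      * A tempfile name is just a path inside that folder
      tempfile a
      display "`a'"
      * Point gtools at the current working directory by absolute path
      global GTOOLS_TEMPDIR "`c(pwd)'"
      display "$GTOOLS_TEMPDIR"
      ```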

      Hope this helps,
      Mauricio


      • #4
        Dear Mauricio,
        Thanks for the quick follow up. From a fresh start of Stata, I ran your code lines and indeed I can see the temporary files being stored in the current directory.
        My statistics are, running Stata MP 15.1 rev. 20190321, on an 'older system' Windows 7 SP1, i7 CPU 860 @ 2.8GHz, 16GB RAM:
        Code:
               Versus | Native | gtools | % faster
         ------------ | ------ | ------ | --------
         reshape_long |  64.31 |  11.88 |   81.52%
         reshape_wide | 119.45 |  16.48 |   86.20%
               winsor |   8.13 |   1.31 |   83.89%
           sum_detail |   8.45 |   1.94 |   77.01%
              tabstat |  19.18 |   1.24 |   93.55%
        http://publicationslist.org/eric.melse


        • #5
          Speeding things up is always welcome.

          I just have a small flag for people reading this thread and/or using these tools. Authors (including me) are increasingly sensitive to documentation of sources and inspirations. Broadly, what might seem a bit too much explanation is much better than a bit too little.

          That is, not just winsor2 but also unique and distinct are community-contributed commands. Even if you didn't use any of the code in those, explaining these programs more fully would be welcome, meaning mentioning the authors and saying where the code is to be found. The same goes for anything in gegen based on the work of other users.


          • #6
            For anyone wondering, yes gtools is just as great as claimed in the original post. Together with reghdfe these are probably the two tools you simply cannot do without when dealing with (very) large data in Stata.


            • #7
              Nick Cox I note in the README which commands are native to Stata and which are community-contributed. In the help file for "gstats winsor", for example, I do note this was based on "winsor2" and acknowledge that package's author by name.

              Nevertheless, the broader point is well-taken. I do not reference the authors directly in the README, and I will fix that shortly. Further, I do not name the authors in "distinct" and "unique", as I do in "gstats winsor", which I will also fix. I appreciate you mentioning that. (Though I will note that other than bearing the name and functionality, very little of the original file's code remains, but I will comb through those ado files to make sure I am not copying anything at length without attribution.)

              I am less sure what you mean by "gegen." I thought "egen.ado" was by StataCorp? Despite bearing the name, "gegen.ado" does not have that much in common with "egen.ado". All the functions that are mentioned in the README I wrote separately and are implemented internally in C. If gegen is called with a function not implemented internally, the log file will reflect it, and the help file for that function (which would not be mentioned in the help file for "gegen") will reflect that as well.

              EDIT: Attribution is now fixed, as outlined in the paragraphs above, in the help files of the latest version (type "gtools, upgrade") as well as in the online documentation. This mainly pertains to the idea for the commands and the options therein.
              Last edited by Mauricio Caceres; 04 Apr 2019, 10:49.


              • #8
                Thanks for attending to this. I much appreciate that gtools is a large, substantial project and that documentation alone was a lot of work. Your approach seems to have been to look to see what commands did and then to write code afresh. Fine, but even sources of inspiration should be documented, in your interests too.

                A possible principle for you is to distinguish which features are intended to do exactly what other commands do (just faster) and which are written without reference to existing software. Users are going to care about reproducibility in many cases.

                The point about egen and your version of it is that several functions there don't match functions in the official distribution, yet by accident if not by design some have similar functionality to some of the functions in egenmore (SSC).

                An awkward although not painful detail is that the author of winsor2 (SSC) did not include all the features in winsor (SSC, that is me). I had occasion to explain this only recently. https://www.statalist.org/forums/for...lue-in-winsor2


                • #9
                  Nick Cox I document differences between gtools functionality and native or community-contributed commands in some detail here (sub-section "Differences and Extras"). And I have sought to, as you say, document "sources of inspiration" in each of the command's help files, as applicable to community-contributed commands, in response to you bringing that to my attention (which was the right thing to do, clearly, so I do appreciate you bringing it up).

                  I intended "gstats winsor" to cover the functionality of "winsor2". If you think that what I have currently written is inadequate or insufficient wrt "winsor" (see here), I would be happy to elaborate.

                  With respect to "egenmore", I thought the only overlap was "first", "last", "firstnm", and "lastnm", which I note in the README for "gtools" (and I make sure to note they are different). However, it looks like there are a few more commands that overlap, which I had not noticed (it's not that difficult to independently name "var" your command that will compute the variance, for example; those are, in fact, accidents, and I will be adding additional notes to make sure users can tell they are not quite the same as the "egenmore" counterparts).


                  • #10
                    All fine by me. Thanks again.

                    PS It's not that surprising that people often have the same problems and sometimes write very similar programs. I once wrote a command (the predecessor of contract) and then realised that I had written essentially the same command about a year before.


                    • #11
                      Mauricio Caceres Thank you for the awesome package. I was following the examples from https://gtools.readthedocs.io/en/lat....html#examples

                      I got the error 'Uknown transformations: range_mean|-3|0|year' when I try

                      Code:
                      webuse grunfeld, clear
                      gstats range (mean -3 0 year) x10 = invest
                      The following works fine though:
                      Code:
                      gstats transform (range mean -3 0 year) x1 = invest
                      I am running Stata/MP 14.2 on Windows 10. Grateful for any advice.


                      • #12
                        Not sure if I am missing something, but in the following I expected the same results from gstats and rangerun (from SSC):

                        Code:
                        webuse grunfeld, clear
                        local i (range mean . . mvalue) x = invest
                        gstats transform `i', interval(-0 0 year)
                        
                        cap program drop myprog
                        program myprog
                            su invest if inrange(mvalue, ., .), meanonly
                            g xx = r(mean)
                        end
                        
                        rangerun myprog , i(year -0 0) sprefix(rr_) use(mvalue invest)
                        su x  xx
                        gstats however seems to ignore the global interval on year.


                        • #13
                          charlie wong Sorry for the late reply. The first case is a bug in my code, now fixed. For the second case, I am not sure what you mean by specifying both ". . mvalue" and "-0 0 year" at the same time. The global interval is intended for range stats that don't specify their own interval, so

                          Code:
                          local i (range mean) x = invest
                          gstats transform `i', interval(-0 0 year)
                          is the syntax to use in this case, I think.

                          "local i (range mean . . mvalue) x = invest" says to take the mean with no upper or lower bound for mvalue, which is to say over the entire range. This is equivalent to just taking the mean, so I am not sure why you would specify both if you want "gstats range" to use the global interval.
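
                          A small sketch of the two ways of giving the interval, using the grunfeld data from the posts above. The output names x_own and x_glob are made up for illustration, and whether the two results agree exactly has not been checked here; per the explanation above they are meant to be the same statistic.

                          ```stata
                          webuse grunfeld, clear
                          * Interval given inside the range statistic itself
                          gstats transform (range mean -3 0 year) x_own = invest
                          * No interval in the statistic, so the global interval() applies
                          gstats transform (range mean) x_glob = invest, interval(-3 0 year)
                          summarize x_own x_glob
                          ```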


                          • #14
                            Thank you for the replies, Mauricio Caceres. So I misunderstood the purpose of the global interval. Thank you for the clarification. Incidentally, I wonder if gstats range can handle multiple intervals as in the rangerun example I quoted above. That is, the first interval selects a subset based on year, and on this subset, the second interval selects yet another subset based on mvalue. For my work I rely on rangerun to do this, but it is usually the bottleneck, as rangerun is orders of magnitude slower than rangestat. So if gstats range could support such a feature, that would be awesome.


                            • #15
                              To be clear, I modify the rangerun example by adding a reference to the current mvalue in the second interval, as follows:
                              Code:
                              cap program drop myprog
                              program myprog
                                  su invest if inrange(mvalue, rr_mvalue - 100, rr_mvalue + 100), meanonly
                                  g xx = r(mean)
                              end
                              rangerun myprog , i(year -0 0) sprefix(rr_) use(mvalue invest)
