  • computation time of egen function "skew"

    Dear Statalist,

    I am generating a set of variables in the following fashion:

    by panel_id: egen pop_mean=mean(pop)
    by panel_id: egen pop_sd=sd(pop)
    by panel_id: egen pop_sk=skew(pop)

    I have not encountered any errors so far. The problem is that I started executing the code a few days ago (my data contain tens of millions of observations). Based on what I see in the results window, the first two lines (mean, sd) take only minutes, but the third line (skew) looks like it will take forever. I would be grateful for any advice: does skew really take much longer than mean and sd, or is there a faster approach to getting the skewness (or an approximation to it)?

    Thanks in advance.

    Charlie.


  • #2
    For what it's worth, this is the code used to calculate the skewness:
    Code:
    *! version 1.2.3  09feb2015
    program define _gskew
            version 6.0, missing
            syntax newvarname =/exp [if] [in] [, BY(varlist)]        
            quietly {
                    tempvar touse group
                    mark `touse' `if' `in'
                    sort `touse' `by'
                    by `touse' `by' : gen `c(obs_t)' `group' = _n == 1 if `touse'
                    replace `group' = sum(`group')
                    local max = `group'[_N]
                    gen `typlist' `varlist' = .
                    local i 1
                    while `i' <= `max' {
                            su `exp' if `group' == `i', detail
                            replace `varlist' = r(skewness) if `group' == `i'
                            local i = `i' + 1
                    }
            }
    end
    It seems the skewness is simply extracted from the summarize, detail command. The problem is that summarize, detail calculates a lot of statistics (quantiles, variance, kurtosis, ...) that you don't need and which slow down the calculation. It might be faster to calculate the skewness manually. You can find Stata's approach to calculating skewness in the manual entry for summarize, available online at www.stata.com/manuals13/rsummarize.pdf on page 8. It is just a matter of subtracting the mean and raising to certain powers, so it should not take excessively long. You can check your results against the summarize command using a small subsample.

    If that is still too slow, you can try implementing it in Mata, but it's not exactly the most user friendly part of Stata...
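
    As a concrete illustration of the manual route, here is a minimal sketch using the moment formula from the summarize manual (skewness = m3 / m2^(3/2), with the moments taken around the group mean). The temporary variable names (mu, d, m2, m3, pop_sk_manual) are made up for this example, and it assumes the data are already sorted by panel_id as in #1.

    Code:
    * rough sketch of a manual per-group skewness (illustrative names only)
    by panel_id: egen double mu = mean(pop)
    gen double d  = pop - mu
    gen double d2 = d^2
    gen double d3 = d^3
    by panel_id: egen double m2 = mean(d2)
    by panel_id: egen double m3 = mean(d3)
    gen double pop_sk_manual = m3 / m2^(3/2)
    drop mu d d2 d3 m2 m3
    On a small subsample this should agree with egen's skew() up to float precision.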



    • #3
      Thanks for the reference, Jesse. I am a bit surprised that egen skew makes Stata calculate unnecessary statistics which slow things down; I would have imagined that every function serves the purpose of that function and nothing more.



      • #4
        You can perform the same calculations using rangestat (from SSC). The most recent version includes a skewness function and is much faster than the egen skew() function. Here's a quick example that shows how to duplicate the egen results on a small number of panels.

        Code:
        clear
        set seed 3214
        set obs 100
        gen long panel_id = _n
        expand 100
        bysort panel_id: gen time = _n
        gen pop = runiform()
        
        by panel_id: egen pop_mean0=mean(pop)
        by panel_id: egen pop_sd0=sd(pop)
        by panel_id: egen pop_sk0=skew(pop)
        
        rangestat (mean) pop (sd) pop (skewness) pop, interval(time . .) by(panel_id)
        assert pop_mean0 == float(pop_mean)
        assert pop_sd0 == float(pop_sd)
        assert pop_sk0 == float(pop_skewness)



        • #5
          It's always a tradeoff between efficiency and coding time. A lot of egen functions are just a wrapper around summarize, and because it's so damn fast this works fine even for larger datasets. It's only once you start talking really big data that this becomes more tricky and you might need to write a more efficient single-purpose program.

          Sometimes others have already done this for us. For a lot of statistics you could use Sergio Correia's ftools suite (fegen/fcollapse), where the f stands for fast (I think). If you plan on doing regressions, definitely check out his reghdfe command, which is at times dozens if not hundreds of times faster than reg/xtreg and is crazy versatile. If you want to calculate pairwise correlations, you can use pwcorrf, which also allows for fast within-group correlation calculations.

          timeit is also somewhat useful: it allows you to prefix any line with timeit <timer number>: and then use the standard timer list to see how long that line took. It's less cumbersome than having to put timer on/timer off everywhere. (Disclaimer: I wrote pwcorrf and timeit.)
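
          For instance, a minimal sketch of timing the original egen calls with the timeit prefix (this assumes timeit is installed and reuses the variable names from #1):

          Code:
          * illustrative only: timing each of the egen calls from #1 with timeit
          capture drop pop_mean pop_sd pop_sk
          timer clear
          timeit 1: egen pop_mean = mean(pop), by(panel_id)
          timeit 2: egen pop_sd   = sd(pop),   by(panel_id)
          timeit 3: egen pop_sk   = skew(pop), by(panel_id)
          timer list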



          • #6
            Originally posted by Jesse Wursten View Post
            It's always a tradeoff between efficiency and coding time. A lot of egen functions are just a wrapper around summarize, and because it's so damn fast this works fine even for larger datasets. It's only once you start talking really big data that this becomes more tricky and you might need to write a more efficient single-purpose program.

            Sometimes others have already done this for us. For a lot of statistics you could use Sergio Correia's ftools suite (fegen/fcollapse), where the f stands for fast (I think). If you plan on doing regressions, definitely check out his reghdfe command, which is at times dozens if not hundreds of times faster than reg/xtreg and is crazy versatile. If you want to calculate pairwise correlations, you can use pwcorrf, which also allows for fast within-group correlation calculations.

            timeit is also somewhat useful: it allows you to prefix any line with timeit <timer number>: and then use the standard timer list to see how long that line took. It's less cumbersome than having to put timer on/timer off everywhere. (Disclaimer: I wrote pwcorrf and timeit.)
            Thanks a lot Jesse. All very useful advice.



            • #7
              Originally posted by Robert Picard View Post
              You can perform the same calculations using rangestat (from SSC). The most recent version includes a skewness function and is much faster than the egen skew() function. Here's a quick example that shows how to duplicate the egen results on a small number of panels.

              Code:
              clear
              set seed 3214
              set obs 100
              gen long panel_id = _n
              expand 100
              bysort panel_id: gen time = _n
              gen pop = runiform()
              
              by panel_id: egen pop_mean0=mean(pop)
              by panel_id: egen pop_sd0=sd(pop)
              by panel_id: egen pop_sk0=skew(pop)
              
              rangestat (mean) pop (sd) pop (skewness) pop, interval(time . .) by(panel_id)
              assert pop_mean0 == float(pop_mean)
              assert pop_sd0 == float(pop_sd)
              assert pop_sk0 == float(pop_skewness)
              Thanks a lot, Robert. The idea of using rangestat to calculate skewness had indeed come to my mind when I first learnt about rangestat today.



              • #8
                Jesse is right. Indeed archaeology shows that the official function code is not much different from my 1999 original.

                Some experiments suggest that substituting Mata code by itself doesn't give massive improvements. (See http://www.stata-journal.com/sjpdf.h...iclenum=st0204 for some sample code.)

                I guess that the real culprits here may include how many groups you have. At worst you're looping over many, many groups and using lots of if. Both can spend time extravagantly.

                I got timings of ~10 s for about 1e7 observations and 5 groups on a rather old PC, whatever way I did it, so I can't reproduce or explain "days", but I have not tried hard to slow things down.
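
                For the curious, here is a rough sketch of what a single-pass Mata version might look like (an illustration only, not the code from the article cited above; the new variable name pop_sk_mata is made up). It sorts once and uses panelsetup() rather than looping with if qualifiers, applying the same moment formula summarize uses:

                Code:
                sort panel_id
                mata:
                    x    = st_data(., "pop")
                    id   = st_data(., "panel_id")
                    info = panelsetup(id, 1)            // one row per group
                    sk   = J(rows(x), 1, .)
                    for (g = 1; g <= rows(info); g++) {
                        xg = panelsubmatrix(x, g, info)
                        d  = xg :- mean(xg)
                        // skewness = m3 / m2^(3/2), moments about the group mean
                        sk[|info[g,1] \ info[g,2]|] = J(rows(xg), 1, mean(d:^3) / mean(d:^2)^1.5)
                    }
                    st_store(., st_addvar("double", "pop_sk_mata"), sk)
                end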



                • #9
                  Originally posted by Nick Cox View Post
                  Jesse is right. Indeed archaeology shows that the official function code is not much different from my 1999 original.

                  Some experiments suggest that substituting Mata code by itself doesn't give massive improvements. (See http://www.stata-journal.com/sjpdf.h...iclenum=st0204 for some sample code.)

                  I guess that the real culprits here may include how many groups you have. At worst you're looping over many, many groups and using lots of if. Both can spend time extravagantly.

                  I got timings of ~10 s for about 1e7 observations and 5 groups on a rather old PC, whatever way I did it, so I can't reproduce or explain "days", but I have not tried hard to slow things down.
                  Thanks a lot, Nick - you have addressed some of the questions in my head. You are right - I have many groups (around 1 million). But what surprised me is the difference in computation time between skew and the others (e.g. mean).



                  • #10
                    Woosh. I wouldn't much trust skewness for samples of order 10. The paper cited in #8 says much more. (mean - median) / SD often works quite well as a measure of skewness. It's nicely bounded by [-1, 1].
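
                    A minimal sketch of that alternative measure by group, assuming the pop_mean and pop_sd variables from #1 already exist (pop_med and pop_sk_alt are illustrative names):

                    Code:
                    * (mean - median) / SD as a bounded skewness measure, per group
                    by panel_id: egen double pop_med = median(pop)
                    gen double pop_sk_alt = (pop_mean - pop_med) / pop_sd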



                    • #11
                      I was going to say that mean/sd use the plain summarize command instead of sum, d and that's why they are much faster, but actually this is not true. They use the sum() function and take the last value per group (because sum() is a running, "incremental" sum). I would expect a version based on summarize (the command) to be much faster, but I am too lazy to test it, especially because sum, meanonly is really, really fast.

                      Code:
                      clear all
                      set obs 10000000
                      gen x = rnormal()
                      timeit 1: sum x
                      timeit 2: sum x, d
                      timeit 3: sum x, meanonly
                      timer list
                      gives

                      Code:
                         1:      0.20 /        1 =       0.2010
                         2:     13.93 /        1 =      13.9290
                         3:      0.11 /        1 =       0.1070
                      0.11 seconds to calculate the mean across 10 million observations is really mad when you think about it. Anyway, the code used for mean() and sd() dates from October 2004, so that might be why they use sum() rather than sum, meanonly. In fact, this might very well still be Nick's original code ...
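
                      For illustration, this is roughly what that sum()-based approach looks like for a grouped mean (a sketch that ignores missing values; the names running and pop_mean2 are made up):

                      Code:
                      * running sum: the last value within each group is the group total
                      bysort panel_id: gen double running = sum(pop)
                      by panel_id: gen double pop_mean2 = running[_N] / _N
                      drop running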



                      • #12
                        Originally posted by Nick Cox View Post
                        Woosh. I wouldn't much trust skewness for samples of order 10. The paper cited in #8 says much more. (mean - median) / SD often works quite well as a measure of skewness. It's nicely bounded by [-1, 1].
                        I guess you are probably right, Nick. I will see whether skewness is of any use in my model.



                        • #13
                          Originally posted by Jesse Wursten View Post
                          I was going to say that mean/sd use the plain summarize command instead of sum, d and that's why they are much faster, but actually this is not true. They use the sum() function and take the last value per group (because sum() is a running, "incremental" sum). I would expect a version based on summarize (the command) to be much faster, but I am too lazy to test it, especially because sum, meanonly is really, really fast.

                          Code:
                          clear all
                          set obs 10000000
                          gen x = rnormal()
                          timeit 1: sum x
                          timeit 2: sum x, d
                          timeit 3: sum x, meanonly
                          timer list
                          gives

                          Code:
                          1: 0.20 / 1 = 0.2010
                          2: 13.93 / 1 = 13.9290
                          3: 0.11 / 1 = 0.1070
                          0.11 seconds to calculate the mean across 10 million observations is really mad when you think about it. Anyway, the code used for mean() and sd() dates from October 2004, so that might be why they use sum() rather than sum, meanonly. In fact, this might very well still be Nick's original code ...
                          Thanks, Jesse. I ran your code and got:

                          Code:
                          . timer list
                             1:      0.14 /        1 =       0.1400
                             2:      7.03 /        1 =       7.0310
                             3:      0.07 /        1 =       0.0680

                          Sorry, I am not quite following you - does the 7.03 vs. 0.07 here shed light on the difference in time required for calculating skew versus mean?



                          • #14
                            With the detail option, summarize needs to sort the data. If there are lots of by() groups, the data will need to be sorted repeatedly when using the egen skew() function. No doubt this could be sped up. The rangestat solution is available now and plenty fast.

                            Code:
                            clear all
                            set seed 3214
                            set obs 10000
                            gen long panel_id = _n
                            expand 100
                            bysort panel_id: gen time = _n
                            gen pop = runiform()
                            
                            timer on 1
                            by panel_id: egen pop_sk0=skew(pop)
                            timer off 1
                            
                            timer on 2
                            rangestat (skewness) pop, interval(time . .) by(panel_id)
                            timer off 2
                            
                            timer list
                            
                            assert pop_sk0 == float(pop_skewness)
                            and the results:
                            Code:
                            . timer list
                               1:    472.91 /        1 =     472.9100
                               2:      2.27 /        1 =       2.2650



                            • #15
                              Originally posted by Robert Picard View Post
                              With the detail option, summarize needs to sort the data. If there are lots of by() groups, the data will need to be sorted repeatedly when using the egen skew() function. No doubt this could be sped up. The rangestat solution is available now and plenty fast.

                              Code:
                              clear all
                              set seed 3214
                              set obs 10000
                              gen long panel_id = _n
                              expand 100
                              bysort panel_id: gen time = _n
                              gen pop = runiform()
                              
                              timer on 1
                              by panel_id: egen pop_sk0=skew(pop)
                              timer off 1
                              
                              timer on 2
                              rangestat (skewness) pop, interval(time . .) by(panel_id)
                              timer off 2
                              
                              timer list
                              
                              assert pop_sk0 == float(pop_skewness)
                              and the results:
                              Code:
                              . timer list
                              1: 472.91 / 1 = 472.9100
                              2: 2.27 / 1 = 2.2650
                              Robert, I was just about to do this comparison... thanks.

