Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • A faster summarize command.

    Hello everyone, I have a do file that basically does this:

    clear all
    set obs 10000
    gen x = rnormal()
    gen y = .
    forval x =1/`=_N'{
    qui summ x if _n>= `x'-2 & _n <= `x'
    qui replace y = r(sum) in `x'
    }

    I'm trying to run this code for huge datasets and is really slow. I think that perhaps if I write a faster program this problem would be reduced, so I wrote the following mata code:

    cap mata mata drop csum()
    mata
    void csum(string scalar varlist, string scalar touse)
    {
    real scalar sum
    sum = sum(tokens(varlist))
    st_numscalar("sum", sum)
    }
    end
    cap program drop csum
    program define csum, rclass
    syntax varlist [if] [in]
    marksample touse
    mata: csum("`varlist'", "`touse'")
    return scalar sum = sum
    end

    clear all
    set obs 10000
    gen x = rnormal()
    gen y = .
    forval x =1/`=_N'{
    csum x if _n>= `x'-2 & _n <= `x'
    qui replace y = r(sum) in `x'
    }


    Which is actually slower than STATA summarize program. I have two questions:
    1) How could I access to the summarize ado file? (I've already search in C:\ado and \stata\ado)
    2) Do you have any idea of how could I make the first or the second code faster?

    Thank you everyone

  • #2
    You don't have to use summarize.
    Code:
    clear all
    set obs 10000
    gen x = rnormal()
    gen y = .
    replace y = x in 1
    replace y = x[1] + x[2] in 2
    replace y = x[_n-2] + x[_n-1] + x[_n] if _n > 2
    Edit: The problem with your code is the unnecessary loop.
    Last edited by Friedrich Huebler; 25 Nov 2015, 21:45.

    Comment


    • #3
      Dear Friedrich, you have my respect. This problem was cracking my head for days. Thank you.

      Comment


      • #4
        Friedrich's suggestion is excellent. But those attracted to this thread by the title should perhaps check out

        Code:
         
        summarize, meanonly
        It is certainly documented, but nevertheless often overlooked. See the puff at http://www.stata-journal.com/sjpdf.h...iclenum=st0135

        My guess is that people learn summarize early in their Stata career and feel they know what it does. They know about summarize, detail but argue quite rightly that can't be a faster solution as it slows down summarize by calculating extra stuff you usually don't need or want. Either way, people think they know what summarize does rather well, and overlook the meanonly option. Or they have heard of the meanonly option but think it calculates the mean, only. That one really is down to StataCorp, as unusually, it is a bad name choice, as pointed out in the article cited.

        Comment


        • #5
          Thank you Nick, your solution is very useful. But it's possible to get only the standard deviation or the sum? Also, how can I see the ado-file of the summarize command? Because starting for there you can optimize the algorithm.

          Comment


          • #6
            You can't see the ado file of summarize, as there isn't one: it is built-in code.

            Code:
            . which summarize
            built-in command:  summarize
            The documentation tells you what you can do. As explained already in the linked document, and in the help, the sum is provided by summarize, meanonly. For the SD, you must specify summarize. As getting the variance (equivalently the SD) is much of the work, it's hard to see how you could go much faster, but you could try your own Mata code, as there are Mata functions for getting the variance (and, naturally, the square root too).

            Comment


            • #7
              I'm a bit late to the party but it appears that Diego is trying to calculate statistics over a rolling window. That's something that's easy to do with tsegen (from SSC). Here's a quick adaptation to the original example, including the standard deviation:

              Code:
              clear all
              set obs 10000
              gen x = rnormal()
              
              gen long time = _n
              tsset time
              tsegen ysum = rowtotal(L(0/2).x)
              tsegen ysd = rowsd(L(0/2).x)

              Comment


              • #8
                Thank you Robert, your solution is really useful for my for my purposes (I'm building a program), this simplifies the code. Also, thank you Nick, you're right, summarize is a built in option, in that case, what can I do? I can't modify or understand the command algorithm? Because I'd like to get only one or two descriptive statistics. I've already written a MATA function and a STATA command that invoke that function, as you can see in the first post of this discussion, that command is actually a little slower than the STATA command summarize. Actually (this is one of the main reasons I started this post) summarize command is faster than MATA as you can see in the following code:
                clear all
                set seed 1234

                set obs 100000
                gen x = rnormal()

                mata: st_view(x=.,.,x)

                timer on 1
                forval x =1/1000{
                mata: s = mean(x)
                mata: st_numscalar("s", s)
                }
                timer off 1

                timer on 2
                forval x =1/1000{
                qui summ x
                }
                timer off 2


                timer on 3
                forval x =1/1000{
                qui summ x, meanonly
                }
                timer off 3

                timer list
                Here is the output generated in my pc (Intel i7 - 2670, 8gb ram):

                1: 3.29 / 1 = 3.2910
                2: 2.23 / 1 = 2.2280
                3: 1.20 / 1 = 1.1960

                As you can see the summarize , meanonly is the fastest option, followed by summarize and the slowest one is the MATA option. This results worry me, because MATA isn't fast enough, for example if there isn't other option than summarize you're stock in your pc capacity. Finally that's the reason because I wanted to see the summarize code (I'd like to add options to it).

                Comment


                • #9
                  Bottom line: To see the summarize source code, which is C code, you need to work at StataCorp. There is no other way.

                  Mata is fast, but I can't see that you could beat the compiled C code for the same task.

                  Comment


                  • #10
                    1. mean is based on quadcross() which is built-in, hence C-code.
                    2. You can reduce the execution time, by avoiding the loop. Select all the variables and put them in a single matrix. and then use the mean function. It will be faster than alternative 3.
                    Code:
                    mata: x = st_data(.,"x1-x1000") // or x = J(1,1000,st_data(.,"x"))
                    mata: s = mean(x)
                    mata: st_matrix("s",s")
                    3. I guess that doing 1000 times st_numscalar("s", s) is what slows down execution time.

                    Comment


                    • #11
                      Thank you Nick, I think I can't go faster, all the posts were very useful, I had already finished my little program. My only wish is that STATA add a few options to summarize.

                      Comment


                      • #12
                        Thank you Christophe, your solution actually is faster than summarize but with a different data structure, which could be reliable under certain circumstances, the solution is actually useful.
                        In the other hand, you could also try without st_numscalar("s", s), summarize still is the fastest option.

                        Comment


                        • #13
                          Well, it was a wrong guess then
                          But if you change st_view() to st_data() then mean() becomes slightly faster. I think it is documented in the manual that the advantage of views over copies of the data is that they take less memory, but will slow down execution (probably because views are essentially pointers to the data).

                          Code:
                          clear all
                          set seed 1234
                          
                          set obs 100000
                          gen x = rnormal()
                          
                          // mata: st_view(x=.,.,"x")
                          mata: x = st_data(.,"x")
                          
                          timer on 1
                          forval x =1/1000{
                                  mata:s = mean(x)
                          } 
                          
                          timer off 1
                          
                          timer on 3
                          forval x =1/1000{
                          qui summ x, meanonly
                          }
                          timer off 3
                          
                          timer list

                          Code:
                           timer list
                             1:      2.11 /        1 =       2.1110
                             3:      2.70 /        1 =       2.6990

                          Comment


                          • #14
                            Here is a variant of your program where you can compare the impact of st_view() and st_data()
                            Code:
                            clear
                            set obs 10000
                            
                            drawnorm x1-x10
                            
                            mata: x = st_data(.,.)
                            ** mata: st_view(x=.,.,.)
                            timer clear
                            
                            timer on 1
                            forval i=1/1000 {
                            qui sum x* , meanonly
                            }
                            timer off 1
                            
                            timer on 2
                            forval i=1/1000 {
                            mata: s=mean(x)
                            }
                            timer off 2
                            With st_data()
                            Code:
                            . timer list
                               1:      1.78 /        1 =       1.7780
                               2:      0.76 /        1 =       0.7620
                            With st_view()

                            Code:
                            . timer list
                               1:      2.11 /        1 =       2.1130
                               2:      2.48 /        1 =       2.4770

                            Comment


                            • #15
                              In case the point is missed by those skimming this thread, the problem is not with summarize or the use of an equivalent function in mata. Stata is very efficient when you perform operations over all observations at the same time. The issue here is that you are repeating the summarize command over each observation. Even if you apply a condition, Stata has to cycle over each observation to determine which ones to use with the summarize command. So as the problem grows, so will the execution time (exponentially).

                              In contrast, tsegen will only perform the calculation once. The difference in execution time is dramatic. Here's how it compares with 10,000 obs

                              Code:
                              . clear all
                              
                              . set obs 10000
                              number of observations (_N) was 0, now 10,000
                              
                              . gen x = rnormal()
                              
                              . 
                              . timer clear
                              
                              . timer on 1
                              
                              . gen long time = _n
                              
                              . tsset time
                                      time variable:  time, 1 to 10000
                                              delta:  1 unit
                              
                              . tsegen ysum = rowtotal(L(0/2).x)
                              
                              . timer off 1
                              
                              . 
                              . timer on 2
                              
                              . gen y = .
                              (10,000 missing values generated)
                              
                              . forval x =1/`=_N'{
                                2. qui summ x if _n>= `x'-2 & _n <= `x'
                                3. qui replace y = r(sum) in `x'
                                4. }
                              
                              . timer off 2
                              
                              . timer list
                                 1:      0.01 /        1 =       0.0100
                                 2:      2.75 /        1 =       2.7460
                              Here's with 20,000 obs

                              Code:
                              . clear all
                              
                              . set obs 20000
                              number of observations (_N) was 0, now 20,000
                              
                              . gen x = rnormal()
                              
                              . 
                              . timer clear
                              
                              . timer on 1
                              
                              . gen long time = _n
                              
                              . tsset time
                                      time variable:  time, 1 to 20000
                                              delta:  1 unit
                              
                              . tsegen ysum = rowtotal(L(0/2).x)
                              
                              . timer off 1
                              
                              . 
                              . timer on 2
                              
                              . gen y = .
                              (20,000 missing values generated)
                              
                              . forval x =1/`=_N'{
                                2. qui summ x if _n>= `x'-2 & _n <= `x'
                                3. qui replace y = r(sum) in `x'
                                4. }
                              
                              . timer off 2
                              
                              . timer list
                                 1:      0.02 /        1 =       0.0170
                                 2:     10.15 /        1 =      10.1540
                              The moral of the story is to avoid looping over observations unless there's no other way to do it.

                              Comment

                              Working...
                              X