Fastest way to calculate the column sums by panels in Mata

Cyrus Levy

Join Date: Nov 2014

Posts: 99
#1

Fastest way to calculate the column sums by panels in Mata

29 Mar 2020, 15:09

What is the fastest way to calculate the column sums by panels (IDs) in Mata? I use this in a panel maximum likelihood estimation algorithm, and I currently do it as shown below. While it does the job, it is slow especially since the data is huge and the calculation loop runs over and over again for finding the maximum likelihood. Is there a faster way to do this?

This is a simplified single-run example of the algorithm:

Code:

clear set obs 1000000 gen id=_n expand 10 sort id gen var1 = runiform(0,1) gen var2 = runiform(0,1) gen var3 = runiform(0,1) timer clear 1 timer on 1 mata { id = st_data(.,"id") big = (st_data(.,"var1"), st_data(.,"var2"), st_data(.,"var3")) V = panelsetup(id, 1) bigtot = J(rows(id),3,.) // this takes too much time --> for (i=1; i<=rows(V); i++) { X1 = panelsubmatrix(big, i, V) bigtot[V[i,1]::V[i,2],.]=J(rows(X1),1,colsum(X1)) } //<-- } timer off 1 timer list 1
Tags: None
FernandoRios

Join Date: Apr 2014

Posts: 2470
#2

29 Mar 2020, 15:24

Having done a code in a similar spirit before I think the biggest bottle neck on your code is loading the data very single time.
perhaps you can load the data first and then do the subpanel process
alternatively perhaps st_view may be more efficient in updating the data (assuming the data changes every time)
hth
fernando
Comment
Cyrus Levy

Join Date: Nov 2014

Posts: 99
#3

29 Mar 2020, 17:37

I am thinking that the reason why it takes Mata too long to complete this job is that the for loop inside the Mata code is not parallelized. I use an MP version of Stata, and a function such as "colsum() by panel" should be parallelize-able, but when it is implemented within a for loop, the code just does it step-by-step. What do you think?
Comment

Niels Henrik Bruun

Join Date: Aug 2014
Posts: 555

30 Mar 2020, 00:54

What takes time is maintaining/updating the matrix bigtot (16 out off almost 20 sec.), not colsum.
Actually the colsum part is the fastest.
A way of speeding things up could be to do the necessary calculations within the loop without saving in bigtot.

Code:

. /*
> clear
> set obs 1000000
> gen id=_n
> expand 10
> sort id
> gen var1 = runiform(0,1)
> gen var2 = runiform(0,1)
> gen var3 = runiform(0,1)
> */
. mata mata clear
. mata:
------------------------------------------------- mata (type end to exit) ------------------------------------------------------------------------------------------------------------
:     id = st_data(.,"id")
:     big = (st_data(.,"var1"), st_data(.,"var2"), st_data(.,"var3"))
:     V = panelsetup(id, 1)
:     
:     bigtot = J(rows(id),3,.)
:     
:     timer_clear()
:     timer_on(1)
:     // this takes too much time -->
:     R = rows(V)
:     for (i=1; i<=R; i++) {
>         timer_on(11)
>         X1 = panelsubmatrix(big, i, V)
>         timer_off(11)
>         timer_on(12)
>         tmp=J(rows(X1),1,colsum(X1))
>         timer_off(12)
>         timer_on(13)
>         bigtot[V[i,1]::V[i,2],.]=tmp
>         timer_off(13)
>     } //<--
:     timer_off(1)
:     timer()
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
timer report
  1.       19.8 /        1 =    19.753
 11.       1.75 /  1000000 =  1.75e-06
 12.       1.04 /  1000000 =  1.04e-06
 13.         16 /  1000000 =   .000016
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
: end

kind regards
nhb

Kind regards

nhb

Comment

Mustafa Ugur Karakaplan

Join Date: Jun 2015
Posts: 26

30 Mar 2020, 16:56

Niels, you are right about that it is the bigtot matrix within the Mata for loop that takes a lot. But what Cyrus is asking for requires that matrix to be constructed somehow. I think what Cyrus wants is a Mata equivalent of "bys id: egen double bigtot=total(var1)". The problem is that in Stata, "bys id: egen ....=total( )" is parallelized, so if you use Stata MP, the process would be very fast. As far as I know, Mata does not have an equivalent of "bys id: egen ....=total( )" so the process cannot be parallelized in Mata. Technically, it should be possible to get the same outcome purely in Mata, but as far as I know (and I'll double-check about this with the Stata tech service), Mata does not directly provide functionality for parallelized panels or parallel for loops in Mata. Not yet. But below are some workarounds.

For simplicity, I dropped var2 and var3. When I run the code below on my computer with 6 processors Stata MP, it takes Stata's egen 3 second to generate bigtot. It takes Cyrus' Mata loop 34 seconds to generate bigtot. Using -parallel- with Cyrus' Mata loop decreases that time to 20 seconds. Calling Stata's egen from within Mata takes only 5.2 seconds. I think the last workaround is good enough especially if Cyrus wants to keep everything inside Mata.

Code:

. timer list
   1:      3.07 /        1 =       3.0670
   2:     34.07 /        1 =      34.0710
   3:     19.87 /        1 =      19.8710
   4:      5.20 /        1 =       5.2000

Code:

clear
set obs 1000000
gen id=_n
expand 20
sort id
gen var1 = runiform(0,1)

cap program drop coltot
program define coltot
{
    mata {
        id = st_data(.,"id")
        big = st_data(.,"var1")
        V = panelsetup(id, 1)
        
        bigtot = J(rows(id),1,.)
        
        for (i=1; i<=rows(V); i++) {
            X1 = panelsubmatrix(big, i, V)
            bigtot[V[i,1]::V[i,2],.]=J(rows(X1),1,colsum(X1))    
        }
        
        st_addvar("double", "bigtot")
        st_store(., "bigtot", bigtot)
    }
}
end

cap program drop coltot2
program define coltot2
{
    mata {
        id = st_data(.,"id")
        big = st_data(.,"var1")
        
        (void) st_addvar("double", bigcopy=st_tempname())
        st_store(., bigcopy, big)
        stata("bys id: egen double bigtot4=total(" + bigcopy + ")")
    }
}
end

timer clear
cap drop big*

timer on 1
bys id: egen double bigtot1=total(var1)
timer off 1

timer on 2
coltot
timer off 2
ren bigtot bigtot2

timer on 3
parallel setclusters 6 //number of processors you have
parallel clean, all force
sort id
parallel, by(id) prog(coltot): coltot
timer off 3
ren bigtot bigtot3

timer on 4
coltot2
timer off 4

count if bigtot1!=bigtot2
count if bigtot2!=bigtot3
count if bigtot3!=bigtot4

cap timer clear 97
cap timer clear 98
cap timer clear 99
timer list

Comment

Niels Henrik Bruun

Join Date: Aug 2014

Posts: 555
#6

31 Mar 2020, 23:44

Mustafa Ugur Karakaplan: This is quite interesting. Thank you very much for presenting this.

It is not that I would start a long discussion on this.
But, in my experience a matrix of that size is usually a mean to an end, not the goal itself.
If the handling later requires such a matrix, my first comment would be that the overall design might be bad.
Looking at the whole process, stepwise calculations and later aggregating might be a faster solution.

Parallel computing should never be the solution to bad coding! (Not that this necessarily is the case here)

Kind regards

nhb
Comment
JanDitzen

Join Date: Jan 2015

Posts: 350
#7

01 Apr 2020, 01:53

This is an interesting topic, thank you very much for this.

I am a bit surprised that the mata code takes so long, but I can confirm that egen is faster even on Stata 16 SE. I am surprised because my experience so far is that well written Mata code runs much faster than Stata code. Moving from Stata to Mata improved the speed of some of my programs by a factor 3 or 4, especially with large data.

In addition, the parallel solution is great, but usually there are some costs of invoking it. Thus for smaller tasks parallel computing becomes almost too expensive.
Comment

Attaullah Shah

Join Date: Aug 2014
Posts: 1669

01 Apr 2020, 11:04

I think the difference is made by the use of range subscripts. My code is faster than egen, though marginally

Code:

clear
set obs 500000
gen id=_n
expand 20
sort id
gen var1 = runiform(0,1)

cap program drop coltot
program define coltot
{
    mata {
        id = st_data(.,"id")
        big = st_data(.,"var1")
        panel = panelsetup(id, 1)
        
        bigtot = J(rows(id),1,.)
        rows = rows(panel)
        
        for (i=1; i<=rows; i++) {
            X1 = big[|panel[i,1],1 \ panel[i,2],1|]
            bigtot[|panel[i,1],1 \ panel[i,2],1|]=J(rows(X1),1,colsum(X1))    
        }
        
        st_addvar("double", "bigtot")
        st_store(., "bigtot", bigtot)
    }
}
end


timer clear
cap drop big*

timer on 1
bys id: egen double bigtot1=total(var1)
timer off 1

timer on 2
coltot
timer off 2


timer list

   1:      5.29 /        1 =       5.2870
   2:      4.50 /        1 =       4.5010

Last edited by Attaullah Shah; 01 Apr 2020, 11:12.

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

Comment

JanDitzen

Join Date: Jan 2015

Posts: 350
#9

01 Apr 2020, 11:21

Sorry for repeating myself, this solution is super interesting.

Is there a reason why the range subscripts are so much faster?
Comment

Niels Henrik Bruun

Join Date: Aug 2014
Posts: 555

#10

01 Apr 2020, 14:43

This is indeed interesting.
Looking at the function panelsubmatrix, it is clear that it is based on code similar to Attaullah Shah 's

Code:

*! version 1.0.1  01jun2013
version 9.0
mata:

/*      
        X = panelsubmatrix(V, i, info)
                return matrix containing i-th panel.
*/


matrix panelsubmatrix(matrix V, real scalar i, real matrix info)
{
        return(V[|info[i,1], 1 \ info[i,2], .|])
}
        
end

And that part was not slow in my previous code.

I did a bit more testing:

Code:

cls
/*
clear
set obs 1000000
gen id=_n
expand 10
sort id
gen var1 = runiform(0,1)
gen var2 = runiform(0,1)
gen var3 = runiform(0,1)
*/

timer clear

timer on 1
mata mata clear
mata:
    id = st_data(.,"id")
    big = st_data(.,"var?")
    panel = panelsetup(id, 1)
    
    bigtot = J(rows(id),3,.)
    rows = rows(panel)
    for (i=1; i<=rows; i++) {
        X1 = big[|panel[i,1],1 \ panel[i,2],3|]
        bigtot[|panel[i,1],1 \ panel[i,2],3|]=J(rows(X1),1,colsum(X1))    
    }
end
timer off 1

timer on 2
mata mata clear
mata:
    id = st_data(.,"id")
    big = st_data(.,"var?")
    panel = panelsetup(id, 1)
    
    bigtot = J(rows(id),3,.)
    rows = rows(panel)
    for (i=1; i<=rows; i++) {
        //X2 = big[|panel[i,1],1 \ panel[i,2],3|]
        X2 = panelsubmatrix(big, i, panel)
        bigtot[|panel[i,1],1 \ panel[i,2],3|]=J(rows(X2),1,colsum(X2))    
    }
end
timer off 2

timer on 3
mata mata clear
mata:
    id = st_data(.,"id")
    big = st_data(.,"var?")
    panel = panelsetup(id, 1)
    
    bigtot = J(rows(id),3,.)
    rows = rows(panel)
    for (i=1; i<=rows; i++) {
        //X3 = big[|panel[i,1],1 \ panel[i,2],3|]
        X3 = panelsubmatrix(big, i, panel)
        //bigtot[|panel[i,1],1 \ panel[i,2],3|]=J(rows(X2),1,colsum(X2))
        bigtot[panel[i,1]..panel[i,2],.] =J(rows(X3),1,colsum(X3))
    }
end
timer off 3
timer list

And the times I got was:

Code:

. timer list
   1:      7.77 /        1 =       7.7730
   2:      7.83 /        1 =       7.8290
   3:     47.12 /        1 =      47.1230

So JanDitzen is right, the range subscript, but mainly at the left side of =.
This is also described in https://www.stata-journal.com/sjpdf....iclenum=pr0028 , section 4 by W. Gould himself.

I forgot, so it is very nice, that someone remembers.

Still, I ran Attaullah Shah's code as well and got:

Code:

. timer list
   1:      3.01 /        1 =       3.0080
   2:      6.04 /        1 =       6.0420

So, on Stata MP version 16.1 Stata is still twice as fast as Mata

Last edited by Niels Henrik Bruun; 01 Apr 2020, 14:52.

Kind regards

nhb

Comment

Mustafa Ugur Karakaplan

Join Date: Jun 2015
Posts: 26

#11

02 Apr 2020, 01:08

I have been thinking about this topic. I emailed Ben Jann about it and I also asked Stata Tech Service about the issue. Ben emailed me some options and I edited them further. I also heard back from StataCorp. They mentioned the existence of an undocumented Mata function called panelsum. Below are several options to calculate column sums by panels. I also tried using them in a maximum likelihood estimation algorithm with big data -just like how Cyrus explained in the original post. Based on my tests, the Mata routine with the undocumented panelsum function is the fastest of all. The amount of time that it saves is substantial especially since the ML estimation calls it repeatedly who knows how many times and all the data to be evaluated change in every call. Note that the panelsum routine does not have a step-by-step for-loop here, but I haven't checked if there is one behind the scenes. Also, as Ben noted, looking at the timer list, the gen routine is actually the fastest (faster than egen and gegen) but since the routine requires going back and forth between Stata and Mata, that slows down the ML estimation algorithm written in Mata. All in all, I think the best solution to Cyrus' question is the panelsum routine here -if you guys cannot come up with a better one

Code:

. timer list
   1:     34.41 /        1 =      34.4140
   2:     14.37 /        1 =      14.3680
   3:     12.97 /        1 =      12.9680
   4:     14.96 /        1 =      14.9650
   5:      8.20 /        1 =       8.2020
   6:      1.03 /        1 =       1.0270
   7:      3.32 /        1 =       3.3200
   8:      3.16 /        1 =       3.1640
.
. // 1: sorting
. // 2: coltot2 using panelsetup() for loop
. // 3: coltot3 using panelsum()
. // 4: coltot4 using mm_panels() for loop
. // 5: egen
. // 6: gen
. // 7: gegen, by
. // 8: by: gegen

Code:

clear all

// programs
mata:
    void coltot2() {
        id = st_data(.,"id")
        big = st_data(.,"var1")
        V = panelsetup(id, 1)
        bigtot = J(rows(id),1,.)
        for (i=rows(V); i; i--) {
            X1 = panelsubmatrix(big, i, V)
            bigtot[|V[i,1],.\V[i,2],.|]=J(rows(X1),1,colsum(X1))    
        }
        st_addvar("double", "bigtot1")
        st_store(., "bigtot1", bigtot)
    }

    void coltot3() {
        ID = st_data(.,"ID") //just for ensuring that ID has no gap (1,2,3...)
        big = st_data(.,"var1")
        V = panelsetup(ID, 1)
        Xt = panelsum(big,V) //the undocumented panelsum function
        bigtot = Xt[ID,]
        st_addvar("double", "bigtot2")
        st_store(., "bigtot2", bigtot)
    }

    void coltot4() {
        id = st_data(.,"id")
        big = st_data(.,"var1")
        V = _mm_panels(id) //Ben Jann's moremata
        bigtot = J(rows(id),1,.)
        b = rows(id)
        for (i = rows(V); i; i--) {
            k = V[i]
            a = b - k + 1
            X1 = big[|a,.\b,.|]
            bigtot[|a,.\b,.|] = J(k, 1, colsum(X1))
            b = a - 1
        }
        st_addvar("double", "bigtot3")
        st_store(., "bigtot3", bigtot)
    }
end


// generate data
set obs 1000000
gen id=_n
expand 50
gegen ID=group(id) //this is for coltot2
gen var1 = runiform(0,1)

timer clear
 
// sort
timer on 1
sort id
timer off 1
 
// coltot2
timer on 2
mata: coltot2()
timer off 2

// coltot3
timer on 3
mata: coltot3()
timer off 3

// coltot4
timer on 4
mata: coltot4()
timer off 4

// egen
timer on 5
by id: egen double bigtot4=total(var1)
timer off 5
 
// gen
timer on 6
by id: gen double bigtot5=sum(var1)
by id: replace bigtot5=bigtot5[_N]
timer off 6

// gegen
timer on 7
gegen double bigtot6=total(var1), by(id) replace
timer off 7

// gegen
timer on 8
by id: gegen double bigtot7=total(var1), replace
timer off 8
 
// check results
assert (bigtot1==bigtot2)
assert (bigtot2==bigtot3)
assert (bigtot3==bigtot4)
assert (bigtot4==bigtot5)
assert (bigtot5==bigtot6)
assert (bigtot6==bigtot7)

 
// timer
timer list

// 1: sorting
// 2: coltot2 using panelsetup() for loop
// 3: coltot3 using panelsum()
// 4: coltot4 using mm_panels() for loop
// 5: egen
// 6: gen
// 7: gegen, by
// 8: by: gegen

Comment

Niels Henrik Bruun

Join Date: Aug 2014

Posts: 555
#12

02 Apr 2020, 03:01

Mustafa Ugur Karakaplan: Interesting. Nice work

Kind regards

nhb
Comment

JanDitzen

Join Date: Jan 2015
Posts: 350

#13

02 Apr 2020, 03:44

I am still fascinated by this topic and how much of a time difference different commands make. I am working on panel time series estimators and thus with a "large" number of observations over time time and units ("id"). I recently recoded one of my commands entirely in mata which made it much faster. During this work I tried to optimize the code so it runs fastest in Monte Carlo simulations as well. I realised that one of the bottle necks is actually often the data generating process too.

With this in mind I am interested in which role the number of observations of id and t play and I did the following small experiment. I vary the number of observations over ids and time between: 100, 200, 500, 1000 and 10000. Then I compare the timings of the egen, gen approach and the coltot2 and coltot3 programs.

The results for the timings are:

Code:

egen
             1         2         3         4         5         6
    +-------------------------------------------------------------+
  1 |      N\T       100       200       500      1000     10000  |
  2 |      100     0.017     0.028     0.066     0.109     0.957  |
  3 |      200     0.015     0.036     0.080     0.176     1.984  |
  4 |      500     0.044     0.080     0.210     0.435     5.203  |
  5 |     1000     0.080     0.171     0.448     0.943    10.960  |
  6 |    10000     0.969     2.000     5.262    11.068   125.830  |
    +-------------------------------------------------------------+
gen
           1       2       3       4       5       6
    +-------------------------------------------------+
  1 |    N\T     100     200     500    1000   10000  |
  2 |    100   0.002   0.002   0.007   0.013   0.098  |
  3 |    200   0.002   0.003   0.009   0.019   0.174  |
  4 |    500   0.004   0.008   0.022   0.039   0.437  |
  5 |   1000   0.010   0.015   0.038   0.077   0.872  |
  6 |  10000   0.074   0.152   0.376   0.759   7.500  |
    +-------------------------------------------------+
coltot2
            1        2        3        4        5        6
    +-------------------------------------------------------+
  1 |     N\T      100      200      500     1000    10000  |
  2 |     100    0.009    0.007    0.017    0.033    0.319  |
  3 |     200    0.010    0.013    0.031    0.067    0.623  |
  4 |     500    0.016    0.032    0.076    0.153    1.553  |
  5 |    1000    0.032    0.061    0.153    0.300    3.081  |
  6 |   10000    0.317    0.611    1.508    3.036   38.563  |
    +-------------------------------------------------------+
coltot3
            1        2        3        4        5        6
    +-------------------------------------------------------+
  1 |     N\T      100      200      500     1000    10000  |
  2 |     100    0.004    0.007    0.017    0.036    0.305  |
  3 |     200    0.008    0.012    0.032    0.068    0.595  |
  4 |     500    0.015    0.031    0.076    0.153    1.562  |
  5 |    1000    0.031    0.061    0.157    0.300    3.100  |
  6 |   10000    0.306    0.601    1.479    2.975   44.182  |
    +-------------------------------------------------------+

I think it is interesting - if not fascinating - how much faster the combination of by and gen is, especially in comparison to egen. The mata solution is slower in all cases as well, but still faster than egen. The speed of egen is not referred to in the manual - at least I was not able to find it.

This discussion is immensely helpful for me. As pointed out, I am working with simulations and often find Stata a bit on the slow side, especially with "large" data. It would be really helpful to have a guide on "fast" programming for large datasets (any volunteers to help me :p).

Appendix:
The code is the following (it is a bit ad-hoc....):

Code:

clear all

mata:
    void coltot2() {
        id = st_data(.,"id")
        big = st_data(.,"var1")
        V = panelsetup(id, 1)
        bigtot = J(rows(id),1,.)
        for (i=rows(V); i; i--) {
            X1 = panelsubmatrix(big, i, V)
            bigtot[|V[i,1],.\V[i,2],.|]=J(rows(X1),1,colsum(X1))    
        }
        st_addvar("double", "bigtot3")
        st_store(., "bigtot3", bigtot)
    }

     void coltot3() {
        ID = st_data(.,"id") //just for ensuring that ID has no gap (1,2,3...)
        big = st_data(.,"var1")
        V = panelsetup(ID, 1)
        Xt = panelsum(big,V) //the undocumented panelsum function
        bigtot = Xt[ID,]
        st_addvar("double", "bigtot4")
        st_store(., "bigtot4", bigtot)
    }
end

mata results = J(0,7,.)

qui{
    foreach i in 100 200 500 1000 10000  {
        foreach t in 100 200 500 1000 10000  {
            clear
            ** set seed in loop so drawn numbers are the same
            set seed 123
            timer clear
            timer on 99
            set obs `i'
            gen id = _n
            expand `t'
            by id, sort: gen t = _n

            drawnorm var1

            timer on 1
            by t (id), sort: egen bigtot1 = sum(var1)
            timer off 1

            timer on 2
            by t  (id), sort: gen double bigtot2=sum(var1)
            by t  (id), sort: replace bigtot2=bigtot2[_N]
            timer off 2
            
            sort id t
            
            timer on 3
            mata: coltot2()
            timer off 3
            
            timer on 4
            mata: coltot3()
            timer off 4
            
            timer off 99
            
            noi disp "Results for `i', `t'"
            noi timer list
            mata results = results \ (`i',`t',`r(t1)',`r(t2)',`r(t3)',`r(t4)',`r(t99)')
        }
    }
}

clear
getmata (res*)= results

rename res1 id
rename res2 t
rename res3 egen
rename res4 gen
rename res5 coltot2
rename res6 coltot3

qui{
    foreach type in egen gen coltot2 coltot3 {
        preserve
            keep id t `type'
            rename `type'* t_*
            noi disp "`type'"
            reshape wide t_ , i(id) j(t)
            putmata res = (id t_*), replace
            
            mata res2 = ("N\T","100","200","500","1000","10000") \ (strofreal(res[.,1]) , strofreal(res[.,2..6],"%9.3f"))
            
            noi mata res2
        
        restore
        
    }
}

Comment

Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#14

02 Apr 2020, 10:33

Cyrus Levy I think I should point out that one way to speed up to the loop without doing anything too complicated is to use range subscripts instead of V[i, 1]::V[i, 2], which creates a temporary array at each run of the loop. I will also point out that reading the data into mata takes a while (a few seconds in my system).

Code:

clear set obs 1000000 gen id = _n expand 10 sort id gen var1 = runiform(0,1) gen var2 = runiform(0,1) gen var3 = runiform(0,1) timer clear timer on 1 mata { id = st_data(.,"id") big = (st_data(.,"var1"), st_data(.,"var2"), st_data(.,"var3")) V = panelsetup(id, 1) } timer off 1 timer on 2 mata { ncol = 3 bigtot = J(rows(id), ncol, .) for (i = 1; i <= rows(V); i++) { X1 = panelsubmatrix(big, i, V) bigtot[|V[i,1], 1 \ V[i, 2], ncol|] = J(rows(X1), 1, colsum(X1)) } } timer off 2 timer on 3 mata { ncol = 3 bigtot = J(rows(id), ncol, .) for (i = 1; i <= rows(V); i++) { X1 = panelsubmatrix(big, i, V) bigtot[V[i, 1]::V[i, 2], .]=J(rows(X1), 1, colsum(X1)) } } timer off 3 timer list

gives

Code:

. timer list 1: 3.12 / 1 = 3.1190 2: 1.41 / 1 = 1.4130 3: 7.10 / 1 = 7.1020

So you can see it's 5.5x faster to use range subscripts, and it's not too complicated to implement.

PS To everyone else: If this is part of an MLE implementation in mata, reading the data into mata, setting up the panel, and back into Stata would be things that have to be done anyway. Hence timing the loop by itself is what really counts, and in that case the loop by itself actually runs faster than all the other solutions if you do 3 variables (comparing single variable solutions was not appropriate since adding variables scales roughly linearly when using gen, egen, etc. but in mata the marginal cost of adding variables is smaller).
Comment

JanDitzen

Join Date: Jan 2015
Posts: 350

#15

08 Jul 2020, 04:42

Bringing this topic back because I have something important to add.

I realised it makes a huge difference where to put the sort t id. I updated the code above so that the data is sorted before the egen and gen command. Otherwise the first command would sort the data, while the second command would not repeat this operation because it is already sorted. The updated results (run on a different machine!) still confirm that gen is faster than egen, but by a smaller magnitude (about 50%):

Code:

egen
            1        2        3        4        5        6
    +-------------------------------------------------------+
  1 |     N\T      100      200      500     1000    10000  |
  2 |     100    0.006    0.006    0.015    0.027    0.147  |
  3 |     200    0.003    0.006    0.014    0.025    0.254  |
  4 |     500    0.010    0.013    0.030    0.059    0.701  |
  5 |    1000    0.014    0.025    0.063    0.126    2.018  |
  6 |   10000    0.195    0.306    0.904    1.522   17.579  |
    +-------------------------------------------------------+
gen
            1        2        3        4        5        6
    +-------------------------------------------------------+
  1 |     N\T      100      200      500     1000    10000  |
  2 |     100    0.001    0.003    0.007    0.013    0.087  |
  3 |     200    0.002    0.003    0.008    0.016    0.154  |
  4 |     500    0.003    0.008    0.018    0.038    0.366  |
  5 |    1000    0.008    0.015    0.035    0.077    1.110  |
  6 |   10000    0.097    0.180    0.520    0.783   10.451  |
    +-------------------------------------------------------+
coltot2
            1        2        3        4        5        6
    +-------------------------------------------------------+
  1 |     N\T      100      200      500     1000    10000  |
  2 |     100    0.004    0.005    0.011    0.024    0.217  |
  3 |     200    0.004    0.008    0.019    0.038    0.390  |
  4 |     500    0.010    0.020    0.080    0.097    1.020  |
  5 |    1000    0.025    0.043    0.101    0.189    2.660  |
  6 |   10000    0.266    0.534    1.311    2.095   24.717  |
    +-------------------------------------------------------+
coltot3
            1        2        3        4        5        6
    +-------------------------------------------------------+
  1 |     N\T      100      200      500     1000    10000  |
  2 |     100    0.002    0.004    0.012    0.023    0.206  |
  3 |     200    0.006    0.011    0.020    0.040    0.372  |
  4 |     500    0.011    0.019    0.048    0.103    1.075  |
  5 |    1000    0.021    0.041    0.104    0.199    2.811  |
  6 |   10000    0.250    0.529    1.339    2.201   29.699  |
    +-------------------------------------------------------+

The revised code with the change in bold is:

Code:

clear all

mata:
    void coltot2() {
        id = st_data(.,"id")
        big = st_data(.,"var1")
        V = panelsetup(id, 1)
        bigtot = J(rows(id),1,.)
        for (i=rows(V); i; i--) {
            X1 = panelsubmatrix(big, i, V)
            bigtot[|V[i,1],.\V[i,2],.|]=J(rows(X1),1,colsum(X1))    
        }
        st_addvar("double", "bigtot3")
        st_store(., "bigtot3", bigtot)
    }

     void coltot3() {
        ID = st_data(.,"id") //just for ensuring that ID has no gap (1,2,3...)
        big = st_data(.,"var1")
        V = panelsetup(ID, 1)
        Xt = panelsum(big,V) //the undocumented panelsum function
        bigtot = Xt[ID,]
        st_addvar("double", "bigtot4")
        st_store(., "bigtot4", bigtot)
    }
end

mata results = J(0,7,.)

qui{
    foreach i in 100 200 500 1000 10000  {
        foreach t in 100 200 500 1000 10000  {
            clear
            ** set seed in loop so drawn numbers are the same
            set seed 123
            timer clear
            timer on 99
            set obs `i'
            gen id = _n
            expand `t'
            by id, sort: gen t = _n

            drawnorm var1
            
            sort t id
            
            timer on 1
            by t (id), sort: egen bigtot1 = sum(var1)
            timer off 1

            timer on 2
            by t  (id), sort: gen double bigtot2=sum(var1)
            by t  (id), sort: replace bigtot2=bigtot2[_N]
            timer off 2
            
            sort id t
            
            timer on 3
            mata: coltot2()
            timer off 3
            
            timer on 4
            mata: coltot3()
            timer off 4
            
            timer off 99
            
            noi disp "Results for `i', `t'"
            noi timer list
            mata results = results \ (`i',`t',`r(t1)',`r(t2)',`r(t3)',`r(t4)',`r(t99)')
        }
    }
}

clear
getmata (res*)= results

rename res1 id
rename res2 t
rename res3 egen
rename res4 gen
rename res5 coltot2
rename res6 coltot3

qui{
    foreach type in egen gen coltot2 coltot3 {
        preserve
            keep id t `type'
            rename `type'* t_*
            noi disp "`type'"
            reshape wide t_ , i(id) j(t)
            putmata res = (id t_*), replace
            
            mata res2 = ("N\T","100","200","500","1000","10000") \ (strofreal(res[.,1]) , strofreal(res[.,2..6],"%9.3f"))
            
            noi mata res2
        
        restore
        
    }
}

Announcement