Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to construct loops doing something by group, with a Mata function inside the loop instead of a command?

    Good evening,

    Can somebody please show me how to calculate stuff by groups, when I am calling a Mata function within my loop, instead of calling a Stata command?

    For concreteness, lets say that I want to calculate the weighted mean of price by groups defined by rep. In Stata I would set up my loop like this:

    Code:
    sysuse auto, clear
    keep price weight rep
    
    egen groupsofrep = group(rep), missing
    
    gen wtmean = .
    
    summ groupsofrep, meanonly
    
    forvalues i = 1/`r(max)' {
    summ price [aw=weight] if groupsofrep==`i', meanonly
    replace wtmean = r(mean) if groupsofrep==`i'
    }
    Now supposed that the Stata command -summarize- does not exist, but a very kind person has written the Mata function -mean()-.

    So how can I accomplish what I did above, but calling within my loop the Mata function -mean()- instead of the Stata command -summarize- ?

    And the code calling the Mata function should be as fast as possible.




  • #2
    So this is one way that you could do it. I took the approach of writing a byable program. Most of the preamble allows for convenient syntax of the form:

    Code:
    mymean [type] mean_var = var [aw or fw] [if] [in]
    A single Mata function handles the weighted or unweighted mean calculation which is nothing more than a wrapper for Mata's mean(), then pushes the result back out to the new variable. That said, I would aim to have all calculations handled within Mata functions so that I don't leave extraneous Mata objects behind once the command is finished.

    Code:
    version 17
    mata:
      mata set matastrict on
     
      void mymean(string scalar avar, string scalar targetvar, string scalar touse, string scalar wvar) {
          real colvector  X,   w
        real scalar     meanX
        real colvector  meanXvec
        
        if (wvar=="") {
            X = st_data(., avar, touse)
          meanX = mean(X)
        }
        else {
            X = st_data(., avar, touse)
          w = st_data(., wvar, touse)
          meanX = mean(X, w)
        }
        
        meanXvec = J(rows(X), 1, meanX)
        st_store(., targetvar, touse, meanXvec)
      }
    end
    
    
    cap program drop mymean
    program mymean , byable(recall, noheader)
      version 17
      syntax [anything(equalok)] [if] [in] [aw fw /]
      marksample touse
    
      gettoken type 0 : 0, parse(" =")
      gettoken name 0 : 0, parse(" =")
     
      if "`name'"=="=" {
          local name "`type'"
        local type `c(type)'
      }
      else {
        gettoken eqsign 0 : 0, parse(" =")
        if "`eqsign'" != "=" {
            di as err "Missing equals sign."
          exit 198
        }
      }
    
      gettoken avar 0 : 0, parse(" =")
     
      if "`weight'"=="fweight" {
          local ww "`exp'"
      }
      else if "`weight'"=="aweight" {
          tempvar wgt
        qui gen double `wgt' = `exp' if `touse'
        local ww `wgt'
      }
     
      cap gen double `name' = .
      mata: mymean("`avar'", "`name'", "`touse'", "`ww'")
    end
    and the test program passes with the given example.

    Code:
    sysuse auto, clear
    keep price weight rep
    
    egen groupsofrep = group(rep), missing
    sort groupsofrep price
    
    gen double wtmean = .
    summ groupsofrep, meanonly
    forvalues i = 1/`r(max)' {
      summ price [aw=weight] if groupsofrep==`i', meanonly
      qui replace wtmean = r(mean) if groupsofrep==`i'
    }
    
    bys rep78: mymean double wtmean1 = price [aw=weight]
    
    assert wtmean==wtmean1
    I can't speak to what is "fastest", but in this particular example where you are trying to substitute Mata code for -summarize-, you will note that -summarize- is built-in. I would therefore expect the Mata version to be slower (or at least no faster) than using summarize directly, and certainly much more succinct.

    Comment


    • #3
      Here is another approach that more closely resembles the original code

      Code:
      sysuse auto, clear
      keep price weight rep
      
      egen groupsofrep = group(rep), missing
      
      gen wtmean = .
      
      summ groupsofrep, meanonly
      
      forvalues i = 1/`r(max)' {
          // mark the subsample
          generate byte sample = groupsofrep==`i'
          mata : st_numscalar("mean", mean(st_data(., "price", "sample"), ///
                                           st_data(., "weight", "sample")))
          // fill in the mean
          replace wtmean = mean if groupsofrep==`i'
          // remove sample variable (could use temporary variables instead)
          drop sample
      }

      Edit: Regarding speed, it might be faster to implement the loop directly in Mata. Fast solutions might require quite some amount of coding, see, e.g., runby (Picard & Schechter, SSC).
      Last edited by daniel klein; 11 May 2021, 01:12.

      Comment


      • #4
        Thank you very much, Leonardo Guizzetti and daniel klein for showing two solutions to the problem at hand. The two approaches you show beautifully illustrate two distinct and very natural lines of solving the problem. By the way your two approaches are faster than the Stata loop in #1, and are of about the same speed compared to each other. Daniel's loop is a tad slower, but on the positive is super simple and transparent.

        However there is still one more major conceptual approach to the problem. It would involve loading the whole dataset as a Mata matrix, then doing a Mata loop over the groups, and applying Mata's -mean()- function on subscripted Mata matrices. Can somebody please show how this can be done?

        Any other lines of attack on the problem would also be appreciated.

        Comment


        • #5
          You might find

          Code:
          help mata panelsetup()
          helpful. Be careful: the dataset must be sorted by idcol to yield correct results. I deem this a bug; one could argue that it is just not well documented.

          Comment


          • #6
            Originally posted by Joro Kolev View Post
            Thank you very much, Leonardo Guizzetti and daniel klein for showing two solutions to the problem at hand. The two approaches you show beautifully illustrate two distinct and very natural lines of solving the problem. By the way your two approaches are faster than the Stata loop in #1, and are of about the same speed compared to each other. Daniel's loop is a tad slower, but on the positive is super simple and transparent.

            However there is still one more major conceptual approach to the problem. It would involve loading the whole dataset as a Mata matrix, then doing a Mata loop over the groups, and applying Mata's -mean()- function on subscripted Mata matrices. Can somebody please show how this can be done?

            Any other lines of attack on the problem would also be appreciated.
            You're welcome, Joro. I found the opposite trend in terms of timings, but it might be related to specific architecture/Stata flavour? The times were as below, with my method being the slowest, and your method being the fastest in all cases. The discrepancy is only noticeable in "large" datasets.
            Click image for larger version

Name:	speed_test.jpg
Views:	1
Size:	19.1 KB
ID:	1608930

            Comment


            • #7
              Leonardo Guizzetti you seem to have done the timing in more extensive way. What I did was very primitive, and is below. Could it be that you have pre-compiled your Mata function? On this thread here, https://www.statalist.org/forums/for...unction-s-code
              where you contributed as well, Daniel had the interesting theory that it might be in fact faster not to pre-compile the Mata functions, because according to him it takes more time to find the compiled function on the path as opposed to compile it on the fly.

              This is what I did, and I did it under Stata 15.1 SE.

              Code:
              . timer clear
              
              . mata: mata clear
              
              . sysuse auto, clear
              (1978 Automobile Data)
              
              . keep price weight rep
              
              . 
              . expand 100000
              (7,399,926 observations created)
              
              . 
              . timer on 1
              
              . egen groupsofrep = group(rep), missing
              
              . 
              . qui gen wtmean = .
              
              . 
              . summ groupsofrep, meanonly
              
              . 
              . qui forvalues i = 1/`r(max)' {
              
              . 
              . timer off 1
              
              . 
              . 
              . ******
              . timer on 2
              
              . version 15.1
              
              . mata:
              ------------------------------------------------- mata (type end to exit) --------------------------------
              :   mata set matastrict on
              
              :  
              :   void mymean(string scalar avar, string scalar targetvar, string scalar touse, string scalar wvar) {
              >       real colvector  X,   w
              >     real scalar     meanX
              >     real colvector  meanXvec
              >     
              >     if (wvar=="") {
              >         X = st_data(., avar, touse)
              >       meanX = mean(X)
              >     }
              >     else {
              >         X = st_data(., avar, touse)
              >       w = st_data(., wvar, touse)
              >       meanX = mean(X, w)
              >     }
              >     
              >     meanXvec = J(rows(X), 1, meanX)
              >     st_store(., targetvar, touse, meanXvec)
              >   }
              
              : end
              ----------------------------------------------------------------------------------------------------------
              
              . 
              . 
              . cap program drop mymean
              
              . program mymean , byable(recall, noheader)
                1.   version 15.1
                2.   syntax [anything(equalok)] [if] [in] [aw fw /]
                3.   marksample touse
                4. 
              .   gettoken type 0 : 0, parse(" =")
                5.   gettoken name 0 : 0, parse(" =")
                6.  
              .   if "`name'"=="=" {
                7.       local name "`type'"
                8.     local type `c(type)'
                9.   }
               10.   else {
               11.     gettoken eqsign 0 : 0, parse(" =")
               12.     if "`eqsign'" != "=" {
               13.         di as err "Missing equals sign."
               14.       exit 198
               15.     }
               16.   }
               17. 
              .   gettoken avar 0 : 0, parse(" =")
               18.  
              .   if "`weight'"=="fweight" {
               19.       local ww "`exp'"
               20.   }
               21.   else if "`weight'"=="aweight" {
               22.       tempvar wgt
               23.     qui gen double `wgt' = `exp' if `touse'
               24.     local ww `wgt'
               25.   }
               26.  
              .   cap gen double `name' = .
               27.   mata: mymean("`avar'", "`name'", "`touse'", "`ww'")
               28. end
              
              . 
              . bys rep78: mymean double wtmean1 = price [aw=weight]
              
              . timer off 2
              
              . 
              . 
              . timer on 3
              
              . egen groupsofrep2 = group(rep), missing
              
              . 
              . qui gen wtmean2 = .
              
              . 
              . summ groupsofrep2, meanonly
              
              . 
              . qui forvalues i = 1/`r(max)' {
              
              . 
              . timer off 3
              
              . 
              . timer list
                 1:     15.92 /        1 =      15.9220
                 2:     10.40 /        1 =      10.3970
                 3:     10.51 /        1 =      10.5070
              
              . 
              . summ wtmean wtmean1 wtmean2
              
                  Variable |        Obs        Mean    Std. Dev.       Min        Max
              -------------+---------------------------------------------------------
                    wtmean |  7,400,000    6544.843    481.7805   4608.602   7003.154
                   wtmean1 |  7,400,000    6544.843    481.7806   4608.602   7003.154
                   wtmean2 |  7,400,000    6544.843    481.7805   4608.602   7003.154

              Comment


              • #8
                I wrote a little do-file to test the timings, but did not precompile the Mata code. I can't explain the discrepancy here.

                My program, for those interested (uses frames so requires version 16+) and run under 17/MP:

                Code:
                // https://www.statalist.org/forums/forum/general-stata-discussion/general/1608770-how-to-construct-loops-doing-something-by-group-with-a-mata-function-inside-the-loop-instead-of-a-command
                clear *
                cls
                
                set seed 1608841
                
                version 17
                mata:
                  mata set matastrict on
                 
                  void mymean(string scalar avar, string scalar targetvar, string scalar touse, string scalar wvar) {
                      real colvector  X,   w
                    real scalar     meanX
                    real colvector  meanXvec
                    
                    if (wvar=="") {
                        X = st_data(., avar, touse)
                      meanX = mean(X)
                    }
                    else {
                        X = st_data(., avar, touse)
                      w = st_data(., wvar, touse)
                      meanX = mean(X, w)
                    }
                    
                    meanXvec = J(rows(X), 1, meanX)
                    st_store(., targetvar, touse, meanXvec)
                  }
                end
                
                cap progam drop dgm
                program dgm
                  version 17
                  syntax, [n(real 2000)]
                  drop _all
                  scalar nobs = max(round(`n'), 100)
                  set obs `=scalar(nobs)'
                 
                  gen int group = runiformint(1, 100)
                  gen double y = rnormal(0, 1)
                  gen double aw = runiform(0.2, 0.8)
                end
                
                
                cap program drop mymean
                program mymean , byable(recall, noheader)
                  version 17
                  syntax [anything(equalok)] [if] [in] [aw fw /]
                  marksample touse
                
                  gettoken type 0 : 0, parse(" =")
                  gettoken name 0 : 0, parse(" =")
                 
                  if "`name'"=="=" {
                      local name "`type'"
                    local type `c(type)'
                  }
                  else {
                    gettoken eqsign 0 : 0, parse(" =")
                    if "`eqsign'" != "=" {
                        di as err "Missing equals sign."
                      exit 198
                    }
                  }
                
                  gettoken avar 0 : 0, parse(" =")
                 
                  if "`weight'"=="fweight" {
                      local ww "`exp'"
                  }
                  else if "`weight'"=="aweight" {
                      tempvar wgt
                    qui gen double `wgt' = `exp' if `touse'
                    local ww `wgt'
                  }
                 
                  cap gen double `name' = .
                  mata: mymean("`avar'", "`name'", "`touse'", "`ww'")
                end
                
                
                tempname Res
                frame create `Res' str2(method) long(rep) long(n) double(time)
                
                local nreps 30
                local sizes 2000 5000 10000 50000 100000 500000 1000000 5000000
                foreach size of local sizes {
                  scalar nobs = `size'
                  forval rep = 1/`nreps' {
                      di "Running: n=`size', interation #`rep' of `nreps'"
                    
                    qui dgm, n(`=nobs')
                    gen id = _n
                    
                    // Method 1
                    sort group id
                    timer on 1
                    by group : mymean double wtmean1 = y [aw=aw]
                    timer off 1
                    qui timer list 1
                    frame post `Res' ("LG") (`rep') (`=nobs') (`r(t1)')
                    
                    // Method 2
                    egen newgroups = group(group), missing
                    timer on 2
                    qui gen double wtmean2 = .
                    summ newgroups, meanonly
                    forvalues i = 1/`r(max)' {
                      qui gen byte sample = newgroups==`i'
                      mata : st_numscalar("mean", mean(st_data(., "y", "sample"), ///
                                                       st_data(., "aw", "sample")))
                      qui replace wtmean2 = mean if newgroups==`i'
                      qui drop sample
                    }
                    timer off 2
                    qui timer list 2
                    frame post `Res' ("DK") (`rep') (`=nobs') (`r(t2)')
                    
                    // Method 3
                    timer on 3
                    qui gen double wtmean3 = .
                    summ newgroups, meanonly
                    forvalues i = 1/`r(max)' {
                      summ y [aw=aw] if newgroups==`i', meanonly
                      qui replace wtmean3 = r(mean) if newgroups==`i'
                    }
                    timer off 3
                    qui timer list 3
                    frame post `Res' ("JK") (`rep') (`=nobs') (`r(t3)')
                    
                    timer clear
                  }
                }
                
                frame copy `Res' Res
                frame change Res
                collapse (mean) t_mean=time (median) t_med=time, by(method n)
                list, sepby(method)
                
                gen ln_n = log(n)
                
                mylabels `sizes', myscale(ln(@)) local(myxla)
                sc t_mean ln_n if method=="LG", c(l) || sc t_mean ln_n if method=="DK", c(l) || sc t_mean ln_n if method=="JK", c(l) ///
                  xla(`myxla', labsize(small)) ylab(, angle(0) nogrid labsize(small)) ///
                  xti("Number of observations (log scale)", size(small)) yti("Mean time, seconds", size(small)) ///
                  legend(label(1 "LG") label(2 "DK") label(3 "JK") col(1) symxsize(small) size(vsmall) region(lstyle(none)) pos(11) ring(0) ) ///
                  graphregion(color(white)) name(tmeans, replace)

                Comment


                • #9
                  Here is another line of attack on the problem, using the user written command -rangestat-. This is the simplest solution, but also the slowest, it is about 8 times slower than the solutions by Leonardo and Daniel.

                  Code:
                  *** Output for methods 1,2,3 is omitted because is the same as the output shown in post #7 above
                  timer on 4
                          mata:  
                              mata clear
                              real rowvector mymean(real matrix X) {
                                  return(mean(X[,1], X[,2]))
                              }
                          end 
                   rangestat (mymean) price weight, interval(price . .) by(rep) 
                   
                  timer off 4
                  
                  . timer list
                     1:     15.60 /        1 =      15.6040
                     2:     10.17 /        1 =      10.1710
                     3:     10.40 /        1 =      10.4010
                     4:     85.76 /        1 =      85.7600
                  
                  . 
                  . summ wtmean wtmean1 wtmean2 mymean
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                        wtmean |  7,400,000    6544.843    481.7805   4608.602   7003.154
                       wtmean1 |  7,400,000    6544.843    481.7806   4608.602   7003.154
                       wtmean2 |  7,400,000    6544.843    481.7805   4608.602   7003.154
                       mymean1 |  7,400,000    6544.843    481.7806   4608.602   7003.154

                  Comment


                  • #10
                    Mata is fast, but not necessarily in all cases. I had written asgen program (available on the SSC) first in Mata to find weighted average mean. Later, I discovered that the sum() function is faster. Therefore, I went back to Stata
                    Code:
                    *! Attaullah Shah, [email protected]
                    *! Version 2.0, July 29, 2020
                    
                    /* This program is mostly similar to _gwmean.ado program, 
                    except that it does not use egen and hence offers some speed 
                    efficiency by avoiding the egen's overhead */
                    
                    * Version 1.0, 30September2017
                    cap prog drop asgen
                    prog asgen, sortpreserve byable(onecall)
                        syntax namelist =/ exp [if] [in], [Weights(varname) by(varlist) XFocal]
                        
                        marksample touse
                        
                        
                        if "`_byvars'"!="" {
                            local by "`_byvars'"
                        }
                        if "`by'"=="" {
                            tempvar by
                            qui gen `by' = 1
                        }
                        if "`weights'"    != "" {    
                            qui bys `by' `touse': gen double `namelist' = sum((`exp') * `weights') / ///
                            sum(`weights' * !missing(`exp')) if `touse' 
                        }
                        else {
                            qui bys `by' `touse': gen double `namelist' = sum(`exp') / ///
                            sum(!missing(`exp')) if `touse' 
                        }
                        qui bys `by' `touse' : replace `namelist' = `namelist'[_N]
                    
                    end
                    Regards
                    --------------------------------------------------
                    Attaullah Shah, PhD.
                    Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
                    FinTechProfessor.com
                    https://asdocx.com
                    Check out my asdoc program, which sends outputs to MS Word.
                    For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

                    Comment

                    Working...
                    X