runby: a new command on SSC that runs Stata commands on by-groups of observations

Robert Picard

Join Date: Mar 2014

Posts: 1536
#1

13 Oct 2017, 15:06

Thanks to Kit Baum, a new command called runby (with Clyde Schechter) is now available on SSC. To install it, type in Stata's Command window:

Code:

ssc install runby

runby loops over data by-groups. A by-group is a subset of the initial data in memory and includes all observations with the same value for the variables specified in the by(varlist) option.

You can run as many Stata commands as you want on each by-group. All you need to do is to wrap these commands in a generic Stata program.

With each loop iteration, runby replaces the data in memory with the by-group's data and runs your program. What's left in memory when your program terminates is considered results and is stored. When runby finishes, the data in memory contains the combined results from all by-groups. runby does not care about what's left in memory: it will grab it all and save it all.

runby is a more efficient alternative to commands like statsby and loop-based solutions (via levelsof and foreach ...). Because the commands run on data subsets, there is no need to use if or in qualifiers to target by-group observations.

runby will be useful if you need to run estimations by groups (see the panel-specific regressions example in the help file). It will also be useful with some matching problems when the number of possible pairwise combinations is too large to handle in one pass. There's a great example of case-control pairing in the help file.
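
As a rough illustration of the estimation use case (a sketch of my own, not the help file's example), the following assumes the Grunfeld panel available through webuse and a hypothetical program name, one_reg. It fits one regression per company and leaves a single observation of results for each by-group.

```stata
clear all

* one_reg is a hypothetical name; runby calls it once per by-group
program one_reg
    regress invest mvalue kstock
    * store the estimates as variables so they survive as results
    gen b_mvalue = _b[mvalue]
    gen b_kstock = _b[kstock]
    gen n_obs    = e(N)
    keep company b_mvalue b_kstock n_obs
    keep in 1
end

webuse grunfeld
runby one_reg, by(company)
list
```

Whatever the program leaves in memory becomes that by-group's contribution, so the final dataset here should hold one row per company.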

For large problems, there's a status option that will trigger progress reports to print in the Results window. These show the elapsed time, how many by-groups have been processed so far (with how many that end with program errors or no data), how many results observations have been saved, and finally an estimated time to completion. The frequency of reports is 1 per second initially and gradually slows down to every 5 minutes after 1 hour of running time.

For those who like to think outside the box, runby can be useful for some data management tasks. You can easily partition a large dataset into separate datasets, one for each by-group. You can even use runby to automate the import of a bunch of files into Stata. You use runby to loop over a list of files and let your program handle all the steps needed to import each file. There are examples for each of these uses in the help file. Here's an example from today that shows how to import problematic Excel files using runby.
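
To make the partitioning idea concrete, here is a minimal sketch using the auto data. The program name save_group and the file-name pattern are assumptions chosen for illustration only; the help file has its own examples.

```stata
clear all

* save_group is a hypothetical name; it writes the current by-group to disk
program save_group
    * foreign is constant within the by-group, so the first value identifies it
    local g = foreign[1]
    save "auto_foreign_`g'.dta", replace
end

sysuse auto
runby save_group, by(foreign)
```

Because the program leaves each by-group's data untouched in memory, the combined results after runby finishes should contain the same observations as the original dataset.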

Here's a quick example that shows the basic functionality:

Code:

clear all

program try_this
    summarize rep78, meanonly
    replace rep78 = r(mean)
    gen mrep78_N = r(N)
    keep foreign rep78 mrep78_N
    keep in 1
end

sysuse auto
runby try_this, by(foreign)
list

and the results:

Code:

. list

     +-------------------------------+
     |   rep78    foreign   mrep78_N |
     |-------------------------------|
  1. | 3.02083   Domestic         48 |
  2. | 4.28571    Foreign         21 |
     +-------------------------------+

By default, runby uses Mata to do its thing because it is very fast at moving data around. The downside is that it requires extra memory to store a copy of the initial data and to store results. There's an option to use Stata-only commands (use, save, and append) if you are tight on memory, with a definite impact on execution times.
Tags: export, group, import, loop

Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

21 Oct 2017, 08:32

Thanks again to Kit Baum, runby has been updated on SSC. The new version fixes an issue that prevented runby from running on older versions of Stata (from version 11 to 13).

To update, type in Stata's Command window:

Code:

ssc install runby, replace

or go the adoupdate route using:

Code:

adoupdate runby

nishtha ruhil

Join Date: Jan 2019
Posts: 3

07 Jan 2019, 02:39

Hi, I tried using this command but keep getting an error. My data has a variable cashratio and a business activity code running from A1, A2, A3, ... to C13. Another variable holds the state, with 37 categories such as AN, TZ, and so on. For each state, I want the median cash ratio for each activity code, laid out like this:

STATE		cash ratio
AP	A1	median value of all fields which are in state AP and have business activity code A1
	A2
	A3
	A4
	B1
	B2
	B3
	B4
	B5
	C1
	C2
	C3
	C4
	C5
	C6
	C7
	C8
	C9
	C10
	C11
	C12
	C13

Please suggest.


Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

07 Jan 2019, 03:00

nishtha ruhil You don't show any data, code or error message, but

Code:

egen wanted = median(cashratio), by(activity state)

should be enough of a hint for a direct solution. If that isn't a good answer, you do please need to read and act on https://www.statalist.org/forums/help#stata
nishtha ruhil

Join Date: Jan 2019

Posts: 3
#5

07 Jan 2019, 22:34

Nick Cox I apologise. The data is as follows:
input str3 business_act_cod float cashratio str2 state
"A1" .08 "AN"
"B1" .01 "TG"
"A1" .34 "TG"
"A1" .7 "RJ"
"A1" 5.3 "TG"
"A1" .55 "TG"
"A4" 0 "TG"
"A4" .05 "TG"
"A4" 1.02 "TG"
"A4" 0 "TG"
"A4" 0 "TG"
"A1" 59.63 "TG"
"A4" .06 "TG"
"A4" .06 "TG"
"A4" .09 "TG"
"A4" 0 "TG"
"A4" 0 "TG"
"A1" .37 "TG"
"A4" 0 "TG"
"A1" .62 "TG"
"A1" .11 "TG"
"A1" .06 "TG"
"A4" 0 "TG"
"A4" 0 "TG"
"A4" 1.91 "TG"
"C1" .02 "TG"
"A4" .01 "TG"
"A4" 0 "TG"
"A4" 0 "TG"
"A4" 0 "TG"
"A1" .01 "TG"
"A4" 62.7 "TG"
"A4" 40.72 "TG"
"C13" 0 "TG"
"A4" .21 "TG"
"A4" 8.34 "TG"

business_act_cod takes 22 distinct values and state takes 37 distinct values.

Now I had run the following command on my data above:

program try_this
1. sumarize cashratio,medianonly
2. replace cashratio = r(median)
3. gen mcash_N = r(N)
4. keep state business_act_cod cashratio mcash_N
5. keep in 1
6. end

runby try_this, by(state business_act_cod)

This is what I get after running this:
--------------------------------------
Number of by-groups = 641
by-groups with errors = 641
by-groups with no data = 0
Observations processed = 118,884
Observations saved = 0
--------------------------------------

Last edited by nishtha ruhil; 07 Jan 2019, 22:37.

David Benson

Join Date: Oct 2018
Posts: 489

07 Jan 2019, 23:23

So probably the easiest way to get what you want is to use collapse (note: save your data before doing this, as collapse replaces the data in memory with a new dataset of summary statistics):

Code:

sort business_act_cod state cashratio
table business_act_cod state, c(median cashratio) row col

--------------------------------------
business_ |           state          
act_cod   |    AN     RJ     TG  Total
----------+---------------------------
       A1 |   .08     .7    .37    .37
       A4 |                .005   .005
       B1 |                 .01    .01
       C1 |                 .02    .02
      C13 |                   0      0
          |
    Total |   .08     .7   .055    .06
--------------------------------------


collapse (count) cash_count = cashratio (mean) cash_mean = cashratio (median) cash_median = cashratio, by( business_act_cod state)
rename business_act_cod act_code  // just shortened to make it easier to list

. list, sepby(act_code ) noobs abbrev(16)

  +---------------------------------------------------------+
  | act_code   state   cash_count   cash_mean   cash_median |
  |---------------------------------------------------------|
  |       A1      AN            1         .08           .08 |
  |       A1      RJ            1          .7            .7 |
  |       A1      TG            9    7.443334           .37 |
  |---------------------------------------------------------|
  |       A4      TG           22       5.235          .005 |
  |---------------------------------------------------------|
  |       B1      TG            1         .01           .01 |
  |---------------------------------------------------------|
  |       C1      TG            1         .02           .02 |
  |---------------------------------------------------------|
  |      C13      TG            1           0             0 |
  +---------------------------------------------------------+

* Or if you wanted this by state and then by activity code
sort state act_code
order state, first
list, sepby(state ) noobs abbrev(16)

  +---------------------------------------------------------+
  | state   act_code   cash_count   cash_mean   cash_median |
  |---------------------------------------------------------|
  |    AN         A1            1         .08           .08 |
  |---------------------------------------------------------|
  |    RJ         A1            1          .7            .7 |
  |---------------------------------------------------------|
  |    TG         A1            9    7.443334           .37 |
  |    TG         A4           22       5.235          .005 |
  |    TG         B1            1         .01           .01 |
  |    TG         C1            1         .02           .02 |
  |    TG        C13            1           0             0 |
  +---------------------------------------------------------+

Last edited by David Benson; 07 Jan 2019, 23:28.


nishtha ruhil

Join Date: Jan 2019

Posts: 3
#7

08 Jan 2019, 01:31

David Benson Thank you so much, got the result I needed.
Nick Cox

Join Date: Mar 2014

Posts: 35696
#8

08 Jan 2019, 02:20

There are several bugs in your original code.

Code:

sumarize cashratio,medianonly
replace cashratio = r(median)

There is no command called sumarize that I know of; presumably you mean summarize, but that doesn't have an option medianonly. To get medians, you need summarize, detail after which the saved result you want is r(p50).

Where on Earth did that syntax come from? Perhaps guessing wildly, which is a poorer programming strategy than reading documentation.
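
For reference, a corrected version of the program from #5 might look like the sketch below. It only swaps in summarize, detail and r(p50) as described above; the variable names are taken from the data example in #5, and nothing else about the original program is changed.

```stata
capture program drop try_this
program try_this
    * summarize, detail is needed to get percentiles; the median is r(p50)
    quietly summarize cashratio, detail
    replace cashratio = r(p50)
    gen mcash_N = r(N)
    keep state business_act_cod cashratio mcash_N
    keep in 1
end

runby try_this, by(state business_act_cod)
```

With the syntax errors gone, the by-groups should no longer all be counted as errors, and the results should hold one observation per state and activity code.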
Jen Walker

Join Date: Mar 2019

Posts: 22
#9

21 Mar 2019, 19:34

Hi

I am using runby for a pca where we have 3 respondents and 2 cohorts (code below).

capture program drop one_group
program define one_group
display "Respondent #" Respondent " Cohort #" cohort
pca Item1- Item25
fapara, pca reps(10)
exit
end

runby one_group, by(Respondent cohort) verbose

The fapara analyses produce a graph (6 in total), and I can see them as the code is running, but I can only see the last graph once the program has finished running. Is there some way I can keep all 6 graphs or preferably embed the graphs into the Stata output window?

I am using Stata 15.1.

Thanks in advance for your time
Jen
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#10

21 Mar 2019, 22:09

-fapara- is not part of official Stata and I know nothing about it. I presume that it is the command that produces the graphs you refer to, as nothing else does. Official Stata programs that produce graphs usually allow pass through of options to the -graph- command so you can customize it. In this case, what you need to do is use the -name()- option and give each graph a new name. So assuming that fapara allows this, change that command to:

Code:

fapara, pca reps(10) name(`"R`=respondent[1]'C`=cohort[1]'"', replace)

After that, the graph for Respondent 1 and Cohort 2 will be in Graph window R1C2, etc.

If -fapara- does not accept the name() option, then I think the best you can do, kludgy though it is, is to -graph save `"R`=respondent[1]'C`=cohort[1]'"', replace- at the end of program one_group. Then after you come out of -runby- you can -graph use- each one.
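
Put together, that fallback could look something like the sketch below. It reuses the program from #9 and assumes the variables are named exactly as there (Respondent with a capital R, since Stata names are case-sensitive). I have not run it, because I do not have -fapara-.

```stata
capture program drop one_group
program define one_group
    display "Respondent #" Respondent " Cohort #" cohort
    pca Item1-Item25
    fapara, pca reps(10)    // call copied from #9; not an official command
    * save the graph currently in memory under a by-group-specific name
    graph save `"R`=Respondent[1]'C`=cohort[1]'"', replace
end

runby one_group, by(Respondent cohort) verbose

* afterwards, reopen any saved graph, for example:
graph use "R1C2.gph"
```

The file name R1C2.gph assumes a by-group with Respondent equal to 1 and cohort equal to 2 exists in the data.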

As far as I know, there is no way, ever, to embed Stata graphs into the Results window.
Jen Walker

Join Date: Mar 2019

Posts: 22
#11

26 Mar 2019, 00:18

Thanks heaps Clyde, the additional code ran perfectly. Also thanks for the additional information regarding official Stata programs.
David Sosa

Join Date: Jul 2022

Posts: 3
#12

22 Aug 2022, 20:42

Hello everyone!

I'm currently trying to replace statsby with runby for some panel-specific regressions. I want to save the results of each regression to a separate dataset, but I haven't managed to do so. My code is:

Code:

program define hitsch
    local vars ""
    foreach number of numlist 1/5 {
        qui sum lp_comp`number', detail
        local percentile=r(p25)
        if `percentile'!=8888 {
            local vars "`vars'" + " lp_comp`number' "
        }
    }
    *Regression with interaction
    reghdfe `quantity' `price' i.`promo' c.`price'#1.`promo' `vars', absorb(month_year seller) cluster(seller)
    gen obs=e(N) lp=_b[lp_variable]
    save "${dir}reg_`promo'.dta", replace
end

parallel, by(brand) programs(hitsch): runby hitsch, by(brand city)

While the parallel process seems to run fine, immediately after the child processes end I get the error: "No dataset for instance 0001, r(601)"

Before I would use the saving option of statsby, but I don't know what the equivalent in runby is.

Thanks!
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#13

23 Aug 2022, 09:43

First, know that I am not familiar with -parallel- and I cannot be sure whether it is compatible with -runby-. But let me assume that it is.

The problem that stands out looking at this is that a bunch of undefined local macros appear in program hitsch: quantity, price, promo. Even if you have assigned values to these macros in the parent process, they are undefined within program hitsch. This is a defining principle of local macros: their scope is limited to the block in which they are defined. You can see this most transparently with:

Code:

. clear*

. 
. capture program drop demo

. program define demo
  1.     display `"`my_macro'"'
  2.     exit
  3. end

. 
. local my_macro ABCDE

. 
. demo


. 
. display `"`my_macro'"'
ABCDE

In particular, since local macro promo is not defined within program hitsch, the -save- command at the end is going to overwrite the same file, ${dir}reg_.dta, repeatedly, if it even reaches that command. I don't think it reaches that command, because with promo undefined, it seems to me that the -reghdfe- command is expanded as -reghdfe i. c.#1. expansion_of_vars...-, which is a syntax error.

Passing information from the calling program to the called program in -runby- is difficult because -runby- does not allow the calling program to have arguments or options, and locals can't cross into the program. So you are left with more problematic ways only: global macros, creating "variables" in the data set that contain the values, or saving them in a text file that the called program reads. There may be other ways that I haven't thought of--these are the ones that I have used myself. Of these, I generally prefer creating variables in the data set.
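
Here is a minimal sketch of the variables-in-the-data-set approach, using the auto data. The names demo_reg, depvar, and indepvar are made up for the illustration and have nothing to do with the code in #12.

```stata
clear all

* demo_reg is a hypothetical name; it reads its "arguments" from the data
program define demo_reg
    * recover the parameter values from the first observation of the by-group
    local y = depvar[1]
    local x = indepvar[1]
    regress `y' `x'
    gen b = _b[`x']
    keep foreign b
    keep in 1
end

sysuse auto

* carry the parameters into the called program as string variables
gen str32 depvar   = "price"
gen str32 indepvar = "weight"

runby demo_reg, by(foreign)
list
```

Because the variables travel with every by-group, the called program can rebuild its own local macros from them; the cost is some extra memory for the constant columns.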

Before I would use the saving option of statsby, but I don't know what the equivalent in runby is.

There is no equivalent. To accomplish saving some interim or final results of the called program at each iteration you just use a -save- command in the called program. That's what you attempted to do: you just got it wrong because of the illicit use of undefined local macros.
Justin Niakamal

Join Date: Aug 2017

Posts: 760
#14

06 Apr 2023, 17:17

In a future update could you store the results of runby? For example,

Code:

--------------------------------------
Number of by-groups    =             4
by-groups with errors  =             1
by-groups with no data =             0
Observations processed =            50
Observations saved     =            35
--------------------------------------

I would like to have something like

Code:

assert r(by_group_errors) == 0

I use runby a lot and I would like to halt my code if something has gone wrong. Or does that functionality exist and I’m not aware of it?
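
In the meantime, a workaround I can imagine looks like the sketch below. It assumes that the program leaves exactly one observation per by-group and that a by-group ending in error contributes no observations to the results (as the report in #5 suggests); both are assumptions, not documented guarantees.

```stata
clear all

program try_this
    summarize rep78, meanonly
    replace rep78 = r(mean)
    gen mrep78_N = r(N)
    keep foreign rep78 mrep78_N
    keep in 1
end

sysuse auto

* count the by-groups before the run
egen long grp = group(foreign)
summarize grp, meanonly
local n_groups = r(max)
drop grp

runby try_this, by(foreign)

* one observation per successful by-group is expected; halt otherwise
assert _N == `n_groups'
```

This check cannot tell by-groups with errors apart from by-groups with no data, so it is cruder than a dedicated r() result would be.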

Justin Niakamal

Join Date: Aug 2017
Posts: 760

#15

28 Jan 2025, 18:32

As an update to #14, the following modification works: the st_numscalar() calls added at the end of final_report() store the counts in r().

Code:

/*
-------------------------------------------------------------------------------

final_report()
==============

-------------------------------------------------------------------------------
*/

void final_report(

        real scalar g,
        real scalar gerrors,
        real scalar gnodata,
        real scalar mN,
        real scalar rN

)
{

        printf("\n{hline 38}\n")
        printf("Number of by-groups    = {res}%13.0fc{txt}\n", g)
        printf("by-groups with errors  = ")
        if (gerrors) printf("{err}%13.0fc{txt}\n", gerrors)
        else printf("{res}%13.0fc{txt}\n", 0)
        printf("by-groups with no data = {res}%13.0fc{txt}\n", gnodata)
        printf("Observations processed = {res}%13.0fc{txt}\n", mN)
        printf("Observations saved     = {res}%13.0fc{txt}\n", rN)
        printf("{hline 38}\n")
        displayflush()
        
         // Store values in r() objects
        st_numscalar("r(by_groups)", g)
        st_numscalar("r(by_group_errors)", gerrors)
        st_numscalar("r(by_groups_no_data)", gnodata)
        st_numscalar("r(obs_processed)", mN)
        st_numscalar("r(obs_saved)", rN)
        
}

Example output:

Code:

  elapsed ----------- by-groups ----------    ------- observations ------       time
     time      count     errors    no-data        processed         saved  remaining
------------------------------------------------------------------------------------
 00:00:01          2          0          0              360           360   00:00:37
 00:00:02          4          0          0              720           720   00:00:36
 00:00:03          6          0          0            1,080         1,080   00:00:34
 00:00:04          8          0          0            1,440         1,440   00:00:33
(now reporting every 5 seconds)
 00:00:10         18          0          0            3,240         3,240   00:00:28
 00:00:15         28          0          0            5,040         5,040   00:00:23
 00:00:20         38          0          0            6,840         6,840   00:00:18
 00:00:25         48          0          0            8,640         8,640   00:00:12
 00:00:31         58          0          0           10,440        10,440   00:00:07
 00:00:36         68          0          0           12,240        12,240   00:00:02
 00:00:37         71          0          0           12,780        12,780   00:00:00

--------------------------------------
Number of by-groups    =            71
by-groups with errors  =             0
by-groups with no data =             0
Observations processed =        12,780
Observations saved     =        12,780
--------------------------------------

. 
. return list 

scalars:
             r(k_drop) =  1
          r(by_groups) =  71
    r(by_group_errors) =  0
  r(by_groups_no_data) =  0
      r(obs_processed) =  12780
          r(obs_saved) =  12780

. assert r(by_group_errors) == 0
