  • runby: a new command on SSC that runs Stata commands on by-groups of observations

    Thanks to Kit Baum, a new command called runby (with Clyde Schechter) is now available on SSC. To install it, type in Stata's Command window:
    Code:
    ssc install runby
    runby loops over data by-groups. A by-group is a subset of the initial data in memory and includes all observations with the same value for the variables specified in the by(varlist) option.

    You can run as many Stata commands as you want on each by-group. All you need to do is to wrap these commands in a generic Stata program.

    With each loop iteration, runby replaces the data in memory with the by-group's data and runs your program. Whatever is left in memory when your program terminates is treated as results and is stored. When runby finishes, the data in memory contains the combined results from all by-groups. runby does not care what is left in memory: it grabs it all and saves it all.

    runby is a more efficient alternative to commands like statsby and to loop-based solutions (via levelsof and foreach ...). Because the commands run on data subsets, there is no need to use if or in qualifiers to target by-group observations.
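
    For comparison, here is a minimal sketch of the loop-based pattern that runby is meant to replace. It uses the auto dataset shipped with Stata and is only an illustration: every command inside the loop has to carry its own if qualifier.

    ```stata
    * Loop-based approach, for comparison only
    sysuse auto, clear
    levelsof foreign, local(groups)
    foreach g of local groups {
        * each command must target the by-group with an -if- qualifier
        summarize rep78 if foreign == `g', meanonly
        display "foreign = `g': mean rep78 = " %7.5f r(mean) " (N = " r(N) ")"
    }
    ```

    With runby, the same summarize command would be written without the qualifier, because only the by-group's observations are in memory when it runs.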

    runby will be useful if you need to run estimations by groups (see the panel-specific regressions example in the help file). It will also be useful with some matching problems when the number of possible pairwise combinations is too large to handle in one pass. There's a great example of case-control pairing in the help file.

    For large problems, there's a status option that will trigger progress reports to print in the Results window. These show the elapsed time, how many by-groups have been processed so far (with how many that end with program errors or no data), how many results observations have been saved, and finally an estimated time to completion. The frequency of reports is 1 per second initially and gradually slows down to every 5 minutes after 1 hour of running time.

    For those who like to think outside the box, runby can be useful for some data management tasks. You can easily partition a large dataset into separate datasets, one for each by-group. You can even use runby to automate the import of a bunch of files into Stata. You use runby to loop over a list of files and let your program handle all the steps needed to import each file. There are examples for each of these uses in the help file. Here's an example from today that shows how to import problematic Excel files using runby.
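
    As a rough sketch of the partitioning idea (the program name save_group and the file names are made up for illustration, and the help file's own example may differ), each by-group can be written to its own dataset like this:

    ```stata
    capture program drop save_group
    program save_group
        * all observations in a by-group share the same value of foreign
        local g = foreign[1]
        * hypothetical file name, one file per by-group
        save "auto_foreign_`g'.dta", replace
    end

    sysuse auto, clear
    runby save_group, by(foreign)
    ```

    Because the program leaves each by-group's data untouched, the data in memory after runby finishes should again contain every observation.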

    Here's a quick example that shows the basic functionality:
    Code:
    clear all
    program try_this
      summarize rep78, meanonly
      replace rep78 = r(mean)
      gen mrep78_N  = r(N)
      keep foreign rep78 mrep78_N
      keep in 1
    end
    
    sysuse auto
    runby try_this, by(foreign)
    
    list
    and the results:
    Code:
    . list
    
         +-------------------------------+
         |   rep78    foreign   mrep78_N |
         |-------------------------------|
      1. | 3.02083   Domestic         48 |
      2. | 4.28571    Foreign         21 |
         +-------------------------------+
    By default, runby uses Mata to move data around because Mata is very fast at it. The downside is that it requires extra memory to store a copy of the initial data and to store results. If you are tight on memory, there's an option to use Stata-only commands (use, save, and append) instead, with a definite cost in execution time.

  • #2
    Thanks again to Kit Baum, runby has been updated on SSC. The new version fixes an issue that prevented runby from running on older versions of Stata (from version 11 to 13).

    To update, type in Stata's Command window:
    Code:
    ssc install runby, replace
    or go the adoupdate route using:
    Code:
    adoupdate runby



    • #3
      Hi, I tried using this command but keep getting an error. My data has a variable cashratio and a business activity code that runs from A1, A2, A3, ... to C13. Another variable holds the state, with 37 categories (AN, TZ, and so on). For each state, I want the median cash ratio for each activity code, laid out like this:
      STATE   activity code   cash ratio
      AP      A1              median value of all fields which are in state AP and have business activity code A1
              A2
              A3
              A4
              B1
              B2
              B3
              B4
              B5
              C1
              C2
              C3
              C4
              C5
              C6
              C7
              C8
              C9
              C10
              C11
              C12
              C13
      Please suggest.



      • #4
        nishtha ruhil You don't show any data, code or error message, but

        Code:
        egen wanted = median(cashratio), by(activity state) 
        should be enough of a hint for a direct solution. If that isn't a good answer, you do please need to read and act on https://www.statalist.org/forums/help#stata
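
        To make the hint concrete, here is a small self-contained sketch. The six observations are made up for illustration, and the variable names activity and state are the same guesses used above. The egen tag() step is one way, among others, to show a single row per group.

        ```stata
        clear
        input str3 activity float cashratio str2 state
        "A1" 1 "AN"
        "A1" 2 "TG"
        "A1" 4 "TG"
        "A1" 9 "TG"
        "A4" 0 "TG"
        "A4" 6 "TG"
        end

        * attach the group median to every observation
        egen wanted = median(cashratio), by(activity state)

        * flag one observation per group, then list one row per state/activity pair
        egen tag = tag(activity state)
        sort state activity
        list state activity wanted if tag, noobs sepby(state)
        ```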



        • #5
          Nick Cox I apologise. The data is as follows:
          input str3 business_act_cod float cashratio str2 state
          "A1" .08 "AN"
          "B1" .01 "TG"
          "A1" .34 "TG"
          "A1" .7 "RJ"
          "A1" 5.3 "TG"
          "A1" .55 "TG"
          "A4" 0 "TG"
          "A4" .05 "TG"
          "A4" 1.02 "TG"
          "A4" 0 "TG"
          "A4" 0 "TG"
          "A1" 59.63 "TG"
          "A4" .06 "TG"
          "A4" .06 "TG"
          "A4" .09 "TG"
          "A4" 0 "TG"
          "A4" 0 "TG"
          "A1" .37 "TG"
          "A4" 0 "TG"
          "A1" .62 "TG"
          "A1" .11 "TG"
          "A1" .06 "TG"
          "A4" 0 "TG"
          "A4" 0 "TG"
          "A4" 1.91 "TG"
          "C1" .02 "TG"
          "A4" .01 "TG"
          "A4" 0 "TG"
          "A4" 0 "TG"
          "A4" 0 "TG"
          "A1" .01 "TG"
          "A4" 62.7 "TG"
          "A4" 40.72 "TG"
          "C13" 0 "TG"
          "A4" .21 "TG"
          "A4" 8.34 "TG"
          end

          business_act_cod has 22 distinct values and state has 37 distinct values.

          I then ran the following on the data above:

          program try_this
            sumarize cashratio,medianonly
            replace cashratio = r(median)
            gen mcash_N = r(N)
            keep state business_act_cod cashratio mcash_N
            keep in 1
          end

          runby try_this, by(state business_act_cod)

          This is what I get after running this:
          --------------------------------------
          Number of by-groups = 641
          by-groups with errors = 641
          by-groups with no data = 0
          Observations processed = 118,884
          Observations saved = 0
          --------------------------------------
          Last edited by nishtha ruhil; 07 Jan 2019, 22:37.



          • #6
            The easiest way to get what you want is probably collapse, shown here after a quick table of the medians (note: save your data first, because collapse replaces the data in memory with a new dataset of summary statistics):

            Code:
            sort business_act_cod state cashratio
            table business_act_cod state, c(median cashratio) row col
            
            --------------------------------------
            business_ |           state          
            act_cod   |    AN     RJ     TG  Total
            ----------+---------------------------
                   A1 |   .08     .7    .37    .37
                   A4 |                .005   .005
                   B1 |                 .01    .01
                   C1 |                 .02    .02
                  C13 |                   0      0
                      |
                Total |   .08     .7   .055    .06
            --------------------------------------
            
            
            collapse (count) cash_count = cashratio (mean) cash_mean = cashratio (median) cash_median = cashratio, by( business_act_cod state)
            rename business_act_cod act_code  // just shortened to make it easier to list
            
            . list, sepby(act_code ) noobs abbrev(16)
            
              +---------------------------------------------------------+
              | act_code   state   cash_count   cash_mean   cash_median |
              |---------------------------------------------------------|
              |       A1      AN            1         .08           .08 |
              |       A1      RJ            1          .7            .7 |
              |       A1      TG            9    7.443334           .37 |
              |---------------------------------------------------------|
              |       A4      TG           22       5.235          .005 |
              |---------------------------------------------------------|
              |       B1      TG            1         .01           .01 |
              |---------------------------------------------------------|
              |       C1      TG            1         .02           .02 |
              |---------------------------------------------------------|
              |      C13      TG            1           0             0 |
              +---------------------------------------------------------+
            
            * Or if you wanted this by state and then by activity code
            sort state act_code
            order state, first
            list, sepby(state ) noobs abbrev(16)
            
              +---------------------------------------------------------+
              | state   act_code   cash_count   cash_mean   cash_median |
              |---------------------------------------------------------|
              |    AN         A1            1         .08           .08 |
              |---------------------------------------------------------|
              |    RJ         A1            1          .7            .7 |
              |---------------------------------------------------------|
              |    TG         A1            9    7.443334           .37 |
              |    TG         A4           22       5.235          .005 |
              |    TG         B1            1         .01           .01 |
              |    TG         C1            1         .02           .02 |
              |    TG        C13            1           0             0 |
              +---------------------------------------------------------+
            Last edited by David Benson; 07 Jan 2019, 23:28.



            • #7
              David Benson Thank you so much, got the result I needed.



              • #8
                There are several bugs in your original code.

                Code:
                sumarize cashratio,medianonly
                replace cashratio = r(median)
                There is no command called sumarize that I know of; presumably you mean summarize, but that doesn't have an option medianonly. To get medians, you need summarize, detail after which the saved result you want is r(p50).
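
                Putting that together, a corrected version of the program might look like the sketch below. The handful of observations is a small subset of the data posted in #5, included only so that the example runs on its own.

                ```stata
                clear
                input str3 business_act_cod float cashratio str2 state
                "A1" .08 "AN"
                "A1" .34 "TG"
                "A1" .55 "TG"
                "A1" .37 "TG"
                "A4" 0 "TG"
                "A4" .05 "TG"
                "A4" 1.02 "TG"
                end

                capture program drop try_this
                program try_this
                    * detail is needed to get percentiles; r(p50) is the median
                    summarize cashratio, detail
                    replace cashratio = r(p50)
                    gen mcash_N = r(N)
                    keep state business_act_cod cashratio mcash_N
                    keep in 1
                end

                runby try_this, by(state business_act_cod)
                list
                ```

                Each by-group should contribute one observation holding its median and its count.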

                Where on Earth did that syntax come from? Perhaps guessing wildly, which is a poorer programming strategy than reading documentation.



                • #9
                  Hi

                  I am using runby for a pca where we have 3 respondents and 2 cohorts (code below).

                  capture program drop one_group
                  program define one_group
                  display "Respondent #" Respondent " Cohort #" cohort
                  pca Item1- Item25
                  fapara, pca reps(10)
                  exit
                  end

                  runby one_group, by(Respondent cohort) verbose


                  fapara produces one graph per by-group (6 in total). I can see each graph while the code is running, but only the last one remains once the program has finished. Is there some way I can keep all 6 graphs or, preferably, embed the graphs in the Stata output window?

                  I am using Stata 15.1.

                  Thanks in advance for your time
                  Jen



                  • #10
                    -fapara- is not part of official Stata and I know nothing about it. I presume that it is the command that produces the graphs you refer to, as nothing else does. Official Stata programs that produce graphs usually allow pass through of options to the -graph- command so you can customize it. In this case, what you need to do is use the -name()- option and give each graph a new name. So assuming that fapara allows this, change that command to:

                    Code:
                    fapara, pca reps(10) name(`"R`=Respondent[1]'C`=cohort[1]'"', replace)
                    After that, the graph for Respondent 1 and Cohort 2 will be in Graph window R1C2, etc. (Variable names are case-sensitive, so the macro expression must use Respondent exactly as it is spelled in your data.)

                    If -fapara- does not accept the name() option, then I think the best you can do, kludgy though it is, is to -graph save `"R`=Respondent[1]'C`=cohort[1]'"', replace- at the end of program one_group. Then after you come out of -runby- you can -graph use- each one.
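
                    Since I cannot test with -fapara-, here is a sketch of both approaches using the auto data, with -scatter- standing in for whatever command draws the graph. The graph names G0 and G1 and the file names are made up for illustration.

                    ```stata
                    capture program drop one_group
                    program one_group
                        local g = foreign[1]
                        * stand-in for any graph-producing command that accepts name()
                        scatter price mpg, name(G`g', replace) title("foreign = `g'")
                        * fallback: write the current graph to disk for a later -graph use-
                        graph save "G`g'.gph", replace
                    end

                    sysuse auto, clear
                    runby one_group, by(foreign)

                    * list the named graphs still held in memory
                    graph dir, memory
                    ```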

                    As far as I know, there is no way, ever, to embed Stata graphs into the Results window.



                    • #11
                      Thanks heaps Clyde, the additional code ran perfectly. Also thanks for the additional information regarding official Stata programs.



                      • #12
                        Hello everyone!

                        Currently, I'm trying to replace statsby with runby for some panel-specific regressions. I want to save the results of each regression to a separate dataset, but I haven't been able to do so. My code is:

                        Code:
                        program define hitsch
                            local vars ""
                            foreach number of numlist 1/5 {
                                qui sum lp_comp`number', detail
                                local percentile=r(p25)
                                if `percentile'!=8888 {
                                    local vars "`vars'" + " lp_comp`number' "
                                }
                            }
                            *Regression with interaction
                            reghdfe `quantity' `price' i.`promo' c.`price'#1.`promo' `vars', absorb(month_year seller) cluster(seller)
                        
                        gen obs=e(N)
                        lp=_b[lp_variable]
                        save "${dir}reg_`promo'.dta", replace
                            end
                        
                        parallel, by(brand) programs(hitsch): runby hitsch, by(brand city)
                        The parallel process seems to run fine, but immediately after the child processes end I get the error: "No dataset for instance 0001, r(601)"

                        Before I would use the saving option of statsby, but I don't know what the equivalent in runby is.


                        Thanks!



                        • #13
                          First, know that I am not familiar with -parallel- and I cannot be sure whether it is compatible with -runby-. But let me assume that it is.

                          The problem that stands out looking at this is that a bunch of undefined local macros appear in program hitsch: quantity, price, promo. Even if you have assigned values to these macros in the parent process, they are undefined within program hitsch. This is a defining principle of local macros: their scope is limited to the block in which they are defined. You can see this most transparently with:
                          Code:
                          . clear*
                          
                          .
                          . capture program drop demo
                          
                          . program define demo
                            1.     display `"`my_macro'"'
                            2.     exit
                            3. end
                          
                          .
                          . local my_macro ABCDE
                          
                          .
                          . demo
                          
                          
                          .
                          . display `"`my_macro'"'
                          ABCDE
                          In particular, since local macro promo is not defined within program hitsch, the -save- command at the end is going to overwrite the same file, ${dir}reg_.dta, repeatedly, if it even reaches that command. I don't think it reaches that command, because with promo undefined, it seems to me that the -reghdfe- command is expanded as -reghdfe i. c.#1. expansion_of_vars...-, which is a syntax error.

                          Passing information from the calling program to the called program in -runby- is difficult because -runby- does not allow the called program to have arguments or options, and locals can't cross into the program. So you are left with more problematic ways only: global macros, creating "variables" in the data set that contain the values, or saving them in a text file that the called program reads. There may be other ways that I haven't thought of; these are the ones that I have used myself. Of these, I generally prefer creating variables in the data set.
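
                          To illustrate the variables-in-the-data approach, here is a minimal sketch on the auto data. The program name one_reg and the carrier variables yvar and xvar are hypothetical names chosen for this example. A global macro set before the -runby- call and referenced inside the program would be the alternative.

                          ```stata
                          capture program drop one_reg
                          program one_reg
                              * the settings travel with the data as constant string variables
                              local y = yvar[1]
                              local x = xvar[1]
                              regress `y' `x'
                              gen b = _b[`x']
                              gen n = e(N)
                              keep foreign b n
                              keep in 1
                          end

                          sysuse auto, clear
                          * every observation carries the same values, so each by-group sees them
                          gen yvar = "price"
                          gen xvar = "mpg"
                          runby one_reg, by(foreign)
                          list
                          ```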

                          "Before I would use the saving option of statsby, but I don't know what the equivalent in runby is."
                          There is no equivalent. To accomplish saving some interim or final results of the called program at each iteration you just use a -save- command in the called program. That's what you attempted to do: you just got it wrong because of the illicit use of undefined local macros.



                          • #14
                            In a future update could you store the results of runby? For example,

                            Code:
                            --------------------------------------
                            Number of by-groups    =             4
                            by-groups with errors  =             1
                            by-groups with no data =             0
                            Observations processed =            50
                            Observations saved     =            35
                            --------------------------------------
                            I would like to have something like

                            Code:
                            assert r(by_group_errors) == 0
                            I use runby a lot and I would like to halt my code if something has gone wrong. Or does that functionality exist and I’m not aware of it?
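
                            In the meantime, the workaround I have been considering is sketched below. It does not rely on runby returning anything. It assumes that the called program leaves exactly one observation per by-group and that a by-group ending in error contributes no observations. The program try_this here is just a stand-in.

                            ```stata
                            capture program drop try_this
                            program try_this
                                summarize rep78, meanonly
                                replace rep78 = r(mean)
                                keep foreign rep78
                                keep in 1
                            end

                            sysuse auto, clear

                            * count the by-groups going in
                            bysort foreign: gen byte first = (_n == 1)
                            count if first
                            local n_groups = r(N)
                            drop first

                            runby try_this, by(foreign)

                            * one observation per successful by-group is expected,
                            * so fewer observations than groups means something failed
                            assert _N == `n_groups'
                            ```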
