
  • New dobatch command available

    Thanks to Kit Baum, my new command, dobatch, is now available on SSC. dobatch runs a do-file as a background batch process, allowing multiple do-files to execute in parallel. It requires Stata MP and either the macOS terminal or Linux.

    dobatch checks system resources to ensure sufficient CPU availability and to limit the number of active Stata processes. There are two related use cases.

    1. Running a large number of scripts in parallel without overloading your server

    Suppose you are running a large number of Stata scripts that are independent of each other:
    Code:
    do script1.do
    do script2.do
    do script3.do
    …
    On a Linux server, one could run each of these in parallel by launching them as separate jobs from the terminal:
    Code:
    nohup stata-mp -b do script1.do &
    nohup stata-mp -b do script2.do &
    nohup stata-mp -b do script3.do &
    …
    This approach allows faster execution by leveraging multiple processors. However, the user must be cautious not to overload the server. Each background process consumes CPU and memory. You can use dobatch to manage this safely and efficiently. dobatch launches only a limited number of jobs at once and automatically starts new ones as earlier ones finish. All you need to do is replace do with dobatch:
    Code:
    dobatch script1.do
    dobatch script2.do
    dobatch script3.do
    …
    By default, on a server with 64 processors running Stata MP 8, dobatch will wait until at least 7 CPUs are free and fewer than 8 Stata MP processes are running. If no other processes are running on the server, this allows up to 8 do-files to run in parallel in the background.
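For comparison, here is a minimal terminal-only sketch of the "limit the number of active jobs" idea, assuming an xargs that supports -P (GNU and BSD/macOS versions do; it is not part of POSIX). The function name run_capped and the variable STATA_CMD are hypothetical, chosen only for this illustration. Unlike dobatch, the sketch does not check how many CPUs are free; it only caps how many jobs it starts itself.

```shell
# Hypothetical helper, not part of dobatch.
# Usage: run_capped MAX_JOBS file1.do file2.do ...
run_capped() {
    max_jobs="$1"
    shift
    # One do-file per input line; xargs keeps at most max_jobs processes
    # alive and starts the next one as soon as an earlier one exits.
    # STATA_CMD is an assumed override; it defaults to stata-mp.
    printf '%s\n' "$@" |
        xargs -P "$max_jobs" -I{} "${STATA_CMD:-stata-mp}" -b do {}
}
```

Calling run_capped 8 script1.do script2.do script3.do would then run at most eight batch jobs at a time, provided stata-mp is on the PATH.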

    2. Parallelizing a for loop

    Suppose you have the following script:
    Code:
    * mydofile.do
    forval x = 1/100 {
        [...]
    }
    If each iteration of this loop runs independently, meaning it doesn’t rely on previous iterations, the loop can be parallelized. To do this, first modify the beginning of the script as follows:
    Code:
    * mydofile.do
    local lower `1'
    local upper `2'
    forval x = `lower'/`upper' {
        [...]
    }
    Then, create a master script that uses dobatch to run the modified do-file multiple times, distributing the workload across parallel jobs. The example below splits the loop into four Stata jobs, each handling one-quarter of the iterations:
    Code:
    * master.do
    dobatch mydofile.do 1 25
    dobatch mydofile.do 26 50
    dobatch mydofile.do 51 75
    dobatch mydofile.do 76 100
    In this example, dobatch mydofile.do 1 25 passes the values 1 and 25 as arguments to mydofile.do, which stores them in the local macros lower and upper, respectively. To log the output of each job, include a log command in the do-file:
    Code:
    * mydofile.do
    log query
    if mi("`r(name)'") log using "mydofile_`1'_`2'.log", text replace
    local lower `1'
    local upper `2'
    forval x = `lower'/`upper' {
        [...]
    }
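The four dobatch lines in master.do above were written by hand. If the number of iterations or jobs changes often, the ranges could be generated instead. The sketch below is one way to do that from the terminal; print_master is a hypothetical helper name, and it assumes iterations are numbered from 1 as in the example.

```shell
# Hypothetical helper: print one dobatch line per chunk of the loop.
# Usage: print_master DOFILE TOTAL_ITERATIONS NUMBER_OF_JOBS
print_master() {
    dofile="$1"
    total="$2"
    njobs="$3"
    # Ceiling division, so every iteration is covered when TOTAL is not
    # a multiple of NUMBER_OF_JOBS.
    size=$(( (total + njobs - 1) / njobs ))
    lower=1
    while [ "$lower" -le "$total" ]; do
        upper=$(( lower + size - 1 ))
        # The last chunk may be shorter than the others.
        if [ "$upper" -gt "$total" ]; then
            upper="$total"
        fi
        printf 'dobatch %s %s %s\n' "$dofile" "$lower" "$upper"
        lower=$(( upper + 1 ))
    done
}
```

For example, print_master mydofile.do 100 4 > master.do writes the same four lines as the hand-written master script.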
    Additional information is available in the Stata help file and on GitHub. The command can be installed from SSC (ssc install dobatch, replace) or GitHub (net install dobatch, from("https://raw.githubusercontent.com/reifjulian/dobatch/master") replace).
    Associate Professor of Finance and Economics
    University of Illinois
    www.julianreif.com

  • #2
    Many thanks for the command, Julian. It is a much-needed command.



    • #3
      dobatch has been updated to provide support for Windows. To install the latest version from GitHub, type the following at the Stata prompt: net install dobatch, from("https://raw.githubusercontent.com/reifjulian/dobatch/master") replace



      • #4
        Thanks much, Julian Reif. This should be handy. I was mostly batching them in a master do-file, although my datasets are fairly small and not memory intensive. Will test it out.



        • #5
          Julian Reif ... how does this command compare to the parallel command? What would dobatch be better suited to, for example? I've used parallel for cases where I have to run the same model over many endpoints ('omics datasets), and it works quite well. It also speeds things up when working with many simulated datasets. Thanks for your efforts!



          • #6
            I’m not an expert on parallel, but my understanding is that dobatch and parallel solve related but somewhat different problems.

            From what I can tell, parallel is a full parallelization framework inside Stata: it splits work across multiple Stata instances, handles distributing data/tasks, and then recombines results. It's well suited for things like simulations, bootstraps, or running the same model over many variables or datasets, which are exactly the use cases you mention.

            dobatch is much simpler. It just launches multiple independent Stata batch jobs and manages how many run at once based on available CPU. It doesn’t split datasets or coordinate results; each job is just a separate do-file. That makes it useful when you already have a set of independent scripts (e.g., a pipeline with separate stages) and just want them to run concurrently. In my own work, this has decreased runtime on large projects by roughly an order of magnitude, simply by changing do to dobatch in my master script.

