  • -batcher- now on SSC: A simple way to parallelise tasks

    Dear all

    The -batcher- command is now finally mature enough to announce it here. It is a very low-level command to parallelise tasks. It works as follows:
    1. You code your dofiles in such a way that they accept an iteration argument that decides what gets executed (this is very easy).
    2. You tell batcher which iterations it should supply to the dofile (this is a simple numlist option).
    3. Batcher then starts Stata instances to run those iterations (an option lets you limit the number of simultaneous sessions).
      1. There's a tracker to keep you up to date on what has worked and what hasn't
      2. It is integrated with sendtoslack if you want to get updates on your smartphone or some other device
      3. Separate logfiles are generated for each iteration
    What batcher lacks in functionality, it makes up for in ease of use (in my opinion). A minimal example could be:

    exampleDofile
    Code:
    di `1'
    batcher code
    Code:
    batcher path_to_exampleDofile, i(1/4) tempfolder("C:/temp")

    For very large tasks on small computers, you might want to limit the number of parallel sessions:
    batcher code
    Code:
    batcher path_to_exampleDofile, i(1/20) tempfolder("C:/temp") maxparallel(4)
    • Batcher differs from -parallel- because, if I understand correctly, -parallel- works by running the same command on different slices of your dataset, whereas batcher allows any kind of differentiation.
    • If you require more functionality or want to work on a real cluster rather than your own PC, consider JanDitzen's -multishell-.

    I am very grateful to JanDitzen, Sergiy Radyakin, Oscar Ozfidan, Guglielmo Ventura, and of course Kit Baum for their assistance, comments, and bug reports.

  • #2
    Apparently -parallel- is a bit more advanced than the last time I checked, and with some finessing of the command it also includes this functionality; see this tweet by Sergio Correia:
    https://twitter.com/Ogoun/status/1499411206733508610



    • #3
      Let me make sure I'm understanding this. I've heard of parallel and other related commands, but I never understood why I'd want them or what they'd practically be used for.


      Let's say I'm estimating some command that involves me looping over different panel units. Say, we have 50 units, and my toy code is

      Code:
      forv i = 1/50 {
          reg y x if unit == `i'
      }
      The main benefit here is that Stata will open... 50 times, and execute the loop?



      • #4
        Say that the regression you cited takes 5 minutes for each unit. Thus to run all regressions sequentially you would need to wait 5*50 = 250 minutes. The idea of parallelising this task is to use the X cores of your computer to run the regressions in parallel. If you have, for example, 5 cores, the time would now be 5*50/5 = 50 minutes, so a huge time improvement. Starting Stata 50 times at once would be a bad idea, because the different Stata instances would compete for the available processing power of the cores. Thus it is often important to find the optimal number of parallel instances of Stata.
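        To make this concrete, the loop from #3 could be adapted for batcher by moving the regression into a dofile that receives the unit number as its first argument, as described in the announcement above (the filenames regression_unit.do and mydata.dta here are hypothetical placeholders, not files from this thread):

        Code:
        * regression_unit.do -- runs the regression for a single unit
        * batcher passes the unit number as the dofile's argument `1'
        use mydata, clear
        reg y x if unit == `1'
        and then, with 5 cores, something like:
        Code:
        batcher regression_unit.do, i(1/50) tempfolder("C:/temp") maxparallel(5)
        Batcher would keep at most 5 Stata instances running at a time, writing a separate logfile per unit to the tempfolder.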

        What batcher, multishell, psimulate2 or parallel do is run those simulations in parallel on a specific number of Stata instances at the same time and then collect the results.

        Personal note: multishell literally saved me months when I was doing monte carlo simulations for my PhD....



        • #5
          JanDitzen Okay.... earlier, I ran a program that does pretty much what I said, but worse because it uses LASSO and cross validation=lots of computing time.

          I think I had 19 treated units (let's round to 20) to loop through, and each one took 20 minutes-ish. Started around 4pm. Finished around 8:30pm. By this calculation, with 4 cores, I'm pretty much compressing what would be 400 minutes of work to like, 100 minutes?


          Okay I'm convinced now. If I can get batcher or multishell to turn 400 minutes into 100 within the context of my program, I'll credit the authors of said programs as coauthors at this point, because what you're saying honestly sounds miraculous.



          • #6
            Yes, that is what you can achieve with this type of program and what they are designed for. You have some overhead costs, but those are on the order of seconds.

            If you need help with Multishell, let me know.
            Last edited by JanDitzen; 05 Mar 2022, 02:04. Reason: Font size of last sentence was too small (no idea how that happened)



            • #7
              Originally posted by Jared Greathouse View Post
              JanDitzen Okay.... earlier, I ran a program that does pretty much what I said, but worse because it uses LASSO and cross validation=lots of computing time.

              I think I had 19 treated units (let's round to 20) to loop through, and each one took 20 minutes-ish. Started around 4pm. Finished around 8:30pm. By this calculation, with 4 cores, I'm pretty much compressing what would be 400 minutes of work to like, 100 minutes?


              Okay I'm convinced now. If I can get batcher or multishell to turn 400 minutes into 100 within the context of my program, I'll credit the authors of said programs as coauthors at this point, because what you're saying honestly sounds miraculous.
              Yes, exactly. The crux of the matter is that it is very difficult for a computer to know which code can be run in parallel and which only makes sense when run sequentially. Therefore, almost all code is just run line by line, one after the other (you wouldn't want your regressions to run before the data-cleaning steps were executed). However, sometimes we as programmers know that a particular iteration is "independent", in the sense that you don't need iteration 5 to run iteration 6. By using parallel/batcher/multishell, you basically supply that information to the computer, allowing it to run iterations on all cores simultaneously, rather than just using one repeatedly.
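              A minimal sketch of that distinction, using the toy regression from earlier in the thread (variable names are illustrative only):

              Code:
              * independent: no iteration reads results from another iteration,
              * so the 50 regressions could run in any order, even simultaneously
              forv i = 1/50 {
                  reg y x if unit == `i'
              }

              * dependent: as written, each pass needs the previous pass's value
              * of total, so the iterations must run one after the other
              scalar total = 0
              forv i = 1/50 {
                  quietly reg y x if unit == `i'
                  scalar total = total + e(r2)
              }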



              • #8
                Okay I see, and that was my main concern; much of my ado code runs in subroutines, where I have a main program that does basically all its work in sub-programs. So, provided I can use multishell/batcher on my subroutines within the context of an ado file, I'll look at both batcher and multishell, and I'll email you with any questions.



                • #9
                  Hi Jesse, I was wondering what the “temp folder” here refers to? Is it the output directory or input directory?



                  • #10
                    In the output of help batcher we see
                    Code:
                    tempfolder(string)   folder to store logs (used to track progress)
                    and further down in the detailed descriptions of the options
                    Code:
                    tempfolder(string)  This option is required. You can simply point this to
                    your Stata working directory, or to some Dropbox folder, it really
                    does not matter.

