  • -batcher- now on SSC: A simple way to parallelise tasks

    Dear all

    The -batcher- command is now finally mature enough to announce it here. It is a very low-level command to parallelise tasks. It works as follows:
    1. You code your dofiles in such a way that they accept an iteration argument that decides what gets executed (this is very easy).
    2. You tell batcher which iterations it should supply to the dofile (this is a simple numlist option).
    3. Batcher then starts Stata instances to run those iterations (an option lets you limit the number of simultaneous sessions).
      1. There's a tracker to keep you up to date on what has worked and what hasn't
      2. It is integrated with sendtoslack if you want to get updates on your smartphone or some other device
      3. Separate logfiles are generated for each iteration
    What batcher lacks in functionality, it makes up for in ease of use (in my opinion). A minimal example could be:

    exampleDofile
    Code:
    di `1'
    batcher code
    Code:
    batcher path_to_exampleDofile, i(1/4) tempfolder("C:/temp")

    For very large tasks on small computers, you might want to limit the number of parallel sessions:
    batcher code
    Code:
    batcher path_to_exampleDofile, i(1/20) tempfolder("C:/temp") maxparallel(4)
    • Batcher differs from -parallel- because, if I understand correctly, -parallel- works by running the same command on different slices of your dataset, whereas batcher allows any kind of differentiation.
    • If you require more functionality or want to work on a real cluster rather than your own PC, consider JanDitzen's -multishell-.

    I am very grateful to JanDitzen, Sergiy Radyakin, Oscar Ozfidan, Guglielmo Ventura, and of course Kit Baum for their assistance, comments, and bug reports.

  • #2
    Apparently -parallel- is a bit more advanced than the last time I checked, and with some finessing of the command it also includes this functionality; see this tweet by Sergio Correia:
    https://twitter.com/Ogoun/status/1499411206733508610



    • #3
      Let me make sure I'm understanding this. I've heard of parallel and other related commands, but I never understood why I'd want them or what they'd practically be used for.


      Let's say I'm estimating some command that involves me looping over different panel units. Say, we have 50 units, and my toy code is

      Code:
      forv i = 1/50 {
          reg y x if unit == `i'
      }
      The main benefit here is that Stata will open... 50 times, and execute the loop?



      • #4
        Say that the regression you cited takes 5 minutes for each unit. Thus to run all regressions sequentially you would need to wait 5*50 = 250 minutes. The idea of parallelising this task is to use the X cores of your computer to run the regressions in parallel. If you have, for example, 5 cores, the time would now be 5*50/5 = 50 minutes, so a huge time improvement. Starting Stata 50 times at once would be a bad idea, because the different Stata instances would compete for the available processing power of the cores. Thus it is often important to find the optimal number of parallel instances of Stata.
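        To make this concrete, the loop from #3 could be adapted for batcher by moving the regression into a dofile that receives the unit number as its first argument, as described in the announcement above (the filenames regression_unit.do and mydata.dta here are hypothetical placeholders, not files from this thread):

        Code:
        * regression_unit.do -- runs the regression for a single unit
        * batcher passes the unit number as the dofile's argument `1'
        use mydata, clear
        reg y x if unit == `1'
        and then, with 5 cores, something like:
        Code:
        batcher regression_unit.do, i(1/50) tempfolder("C:/temp") maxparallel(5)
        Batcher would keep at most 5 Stata instances running at a time, writing a separate logfile per unit to the tempfolder.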

        What batcher, multishell, psimulate2 or parallel do is run those simulations in parallel on a specific number of Stata instances at the same time and then collect the results.

        Personal note: multishell literally saved me months when I was doing monte carlo simulations for my PhD....



        • #5
          JanDitzen Okay.... earlier, I ran a program that does pretty much what I said, but worse because it uses LASSO and cross validation=lots of computing time.

          I think I had 19 treated units (let's round to 20) to loop through, and each one took 20 minutes-ish. Started around 4pm. Finished around 8:30pm. By this calculation, with 4 cores, I'm pretty much compressing what would be 400 minutes of work to like, 100 minutes?


          Okay I'm convinced now. If I can get batcher or multishell to turn 400 minutes into 100 within the context of my program, I'll credit the authors of said programs as coauthors at this point, because what you're saying honestly sounds miraculous.



          • #6
            Yes, that is what you can achieve with this type of program and what they are designed for. You have some overhead costs, but those are on the order of seconds.

            If you need help with Multishell, let me know.
            Last edited by JanDitzen; 05 Mar 2022, 02:04. Reason: Font size of last sentence was too small (no idea how that happened)



            • #7
              Originally posted by Jared Greathouse View Post
              JanDitzen Okay.... earlier, I ran a program that does pretty much what I said, but worse because it uses LASSO and cross validation=lots of computing time.

              I think I had 19 treated units (let's round to 20) to loop through, and each one took 20 minutes-ish. Started around 4pm. Finished around 8:30pm. By this calculation, with 4 cores, I'm pretty much compressing what would be 400 minutes of work to like, 100 minutes?


              Okay I'm convinced now. If I can get batcher or multishell to turn 400 minutes into 100 within the context of my program, I'll credit the authors of said programs as coauthors at this point, because what you're saying honestly sounds miraculous.
              Yes, exactly. The crux of the matter is that it is very difficult for a computer to know which code can be run in parallel and which only makes sense when run sequentially. Therefore, almost all code is just run line by line, one after the other (you wouldn't want your regressions to run before the data-cleaning steps were executed). However, sometimes we as programmers know that a particular iteration is "independent", in the sense that you don't need iteration 5 to run iteration 6. By using parallel/batcher/multishell, you basically supply that information to the computer, allowing it to run iterations on all cores simultaneously, rather than just using one repeatedly.
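              A minimal sketch of that distinction, using the toy regression from earlier in the thread (variable names are illustrative only):

              Code:
              * independent: no iteration reads results from another iteration,
              * so the 50 regressions could run in any order, even simultaneously
              forv i = 1/50 {
                  reg y x if unit == `i'
              }

              * dependent: as written, each pass needs the previous pass's value
              * of total, so the iterations must run one after the other
              scalar total = 0
              forv i = 1/50 {
                  quietly reg y x if unit == `i'
                  scalar total = total + e(r2)
              }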



              • #8
                Okay I see, and that was my main concern; much of my ado code runs in subroutines, where I have a main program that does basically all its work in sub-programs. So, provided I can use multishell/batcher on my subroutines within the context of an ado file, I'll look at both batcher and multishell, and I'll email you with any questions.



                • #9
                  Hi Jesse, I was wondering what the “temp folder” here refers to? Is it the output directory or input directory?



                  • #10
                    In the output of help batcher we see
                    Code:
                    tempfolder(string)   folder to store logs (used to track progress)
                    and further down in the detailed descriptions of the options
                    Code:
                    tempfolder(string)  This option is required. You can simply point this to
                    your Stata working directory, or to some Dropbox folder, it really
                    does not matter.

