  • New package on SSC: multishell

    Thanks to Kit Baum I am happy to announce that multishell is now available on SSC. As usual it can be installed by typing
    Code:
    ssc install multishell
    in the command bar.

    What is the purpose of the program? multishell allows efficient processing of loops and multiple do files on a single computer or across multiple computers. It dissects forvalues and foreach loops and creates a separate do file and batch file for each variation of the loop (a task). Stata's built-in winexec command is used to start a new instance of Stata using the .bat file. The instance is closed as soon as the task is completed (or has failed, in which case the failure is reported), and a new instance processing the next task is started. One instance is reserved to organise the tasks and start the other instances. Multiple instances can be run in parallel on the same computer or across computers, mimicking a cluster. The computer which acts as the server distributes the tasks to the different machines, respecting the maximum number of instances allowed on each machine.

    How to run it? For example, it is common to use Monte Carlo simulations to assess the bias of an estimator. This is done by varying the number of observations, let's say from n = 10 in steps of 10 to n = 130. Assume the DGP and the regression are part of the program MonteCarloSim. The number of observations is the only argument of the program, and the estimated coefficient of variable x is returned as r(x). The program is saved in a do file called example_MC.do, together with the code containing a forvalues loop over the different values of n:
    Code:
    forvalues n = 10 (10) 130 {
         simulate bx = r(x), reps(1000) : MonteCarloSim `n'
    }
    multishell creates a do file and a .bat file for each variation of n (n=10, n=20, ..., n=130). The files are then queued and processed consecutively by multiple instances of Stata on a single computer or on multiple computers.
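    As a minimal sketch of the setup described above, the program and a single task could look as follows. The DGP (y = 1 + 2x + e with standard normal x and e) is purely an assumption for illustration, and the do file multishell actually writes for a task may look different; the last two lines only show what the task for n = 10 amounts to.
    Code:
    * Hypothetical MonteCarloSim: the DGP below is an assumption for illustration
    capture program drop MonteCarloSim
    program define MonteCarloSim, rclass
        args n
        drop _all
        set obs `n'
        generate x = rnormal()
        generate y = 1 + 2*x + rnormal()
        regress y x
        * return the estimated coefficient on x as r(x)
        return scalar x = _b[x]
    end
    * What the task for n = 10 boils down to (sketch, not multishell's actual output)
    local n = 10
    simulate bx = r(x), reps(1000) : MonteCarloSim `n'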
    To start multishell, a second do file is required. It contains the commands for the multishell environment: it sets a temporary path, points to the Stata executable and an additional ado path, adds do files to the queue and runs multishell:
    Code:
    clear
    adopath ++ "C:\documents\multishell\ado\"
    multishell path "C:\documents\multishell\test\output\", clear
    multishell exepath "C:\Program Files (x86)\Stata14\StataSE-64.exe"
    multishell adopath "C:\documents\multishell\ado\"
    multishell add "C:\documents\multishell\simulation\example_MC.do"
    multishell run, threads(6) sleep(2000)
    An output window will appear and show the name of the do file, the number of tasks and a breakdown of all variations. At most 6 instances of Stata will be started in parallel (set by the option threads).

    It is possible to include another computer which has access to the folder set by multishell path and run multishell on both. On the second computer there is no need to add the do file(s) again, as they are created and managed by the first computer, the server. The command lines for the client are:
    Code:
    clear
    adopath ++ "C:\documents\multishell\ado\"
    multishell path "C:\documents\multishell\test\output\", clear
    multishell exepath "C:\Program Files (x86)\Stata14\StataSE-64.exe"
    multishell adopath "C:\documents\multishell\ado\"
    multishell run client, threads(6) sleep(2000)
    In total, up to 12 instances of Stata run in parallel on the two machines, which speeds up the processing of the loops.

    More examples can be found in the help file, and example do files are available as well.

    At the moment only Microsoft Windows is supported.


  • #2
    Jan: thanks for this. A point of clarification please, related to your simulate example. We need to be able to set the seed in order to make results reproducible. One could add a seed() option to the simulate call in your example, but I suspect that multishell would then apply the same seed in each batch (which is not what we want). Alternatively, one could set the seed before the forvalues call, but would we still get the behaviour we want if we use multishell?


    • #3
      Hi Stephen, thanks for pointing this out. multishell has a function which writes a seed at the very beginning of a do file. Seeds can be saved in a separate dataset, so that there is a specific seed for each variation of the loops. This makes it possible to re-run a simulation with the same seeds.
      As an alternative, a "random" seed is allocated to each variation, which ensures that the seed and thus the random draws differ across variations. In this case too, a dataset with the seeds for each variation can be created.


      • #4
        The topic Stephen raised is important, and I thought a better explanation of Stata's seed, its implications and how multishell handles it might be useful.

        By default, Stata uses the same seed every time it is started or invoked by a batch file. This means that if multiple instances of Stata are started, the seed will be the same and the random numbers drawn will be the same as well. In the example above with n changing from 10 to 130, the first 10 observations will be identical in all runs. The next 10 observations will then be identical across the remaining runs, and so on. This is usually undesirable, as there is only little variation across the different runs. In addition, it is common to save the seed of Monte Carlo simulations to be able to replicate results (as further reading, Tim Morris gave a very good tutorial about this at the Stata User Group Meeting in 2016; see https://www.stata.com/meeting/uk16/slides/morris_uk16.pdf ).
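        A small sketch of this overlap, using only built-in commands: set seed 123 stands in here for the identical default seed of freshly started instances, and the variable names u10 and u20 are arbitrary. With the same seed, the 10 draws in the first dataset should coincide with the first 10 draws in the second.
        Code:
        * same seed, n = 10
        clear
        set seed 123
        set obs 10
        generate u10 = runiform()
        list in 1/3
        * same seed again, n = 20: the first 10 draws repeat
        clear
        set seed 123
        set obs 20
        generate u20 = runiform()
        list in 1/3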
        To make sure simulation results can be reproduced, a seed is set. If the example above is changed to:
        Code:
        set seed 123
        forvalues n = 10 (10) 130 {
            simulate bx = r(x), reps(1000) : MonteCarloSim `n'
        }
        Then it is easy to reproduce the simulation results because the seed is known. However, if multishell is used, then the seed for n=10, n=20,..., will always be "123" and the same random numbers are drawn. To avoid this behaviour, multishell has the function seed. The syntax is:
        Code:
        multishell seed type filename [, fill]
        where type is either save, load or create, and the option fill is only available with create. filename is the name of the dta file which is created, or which seeds are saved to or loaded from. The file is located in the folder set by multishell path. The dataset needs to contain a variable called id, which identifies the variation, and a variable called seed, which contains the seed for each variation. multishell seed needs to be called after multishell path and multishell add.

        Code:
        multishell seed create filename [, fill]
        creates a dataset with a row for each variation (i.e. using the example above, one row each for n = 10, n = 20, etc.). It adds a column (or variable) called seed. This variable is empty and can be edited later. If the option fill is used, the variable seed is filled with random numbers, which are then used to set the seed within each variation. At the end of the multishell run, the file is updated and the actual seed (c(rngstate)) is saved. multishell thus creates and saves a dataset with the seeds, which makes it possible to replicate the results for each variation.

        Code:
        multishell seed load filename
        loads the seeds for each variation from the dta file specified in filename. This is useful if a simulation is run and changes are made afterwards. For example, suppose we want to estimate the model above by OLS and then by ML (just as an example!). It is possible to first run the simulation estimating only OLS, and then re-run it with the same seeds using ML.

        Code:
        multishell seed save filename
        saves the seed for each variation. If this is used without any further specification of the seed, the saved seeds will be the same for all variations.

        The following two examples are equivalent and show how multishell seed create and multishell seed load are connected.

        Code:
        multishell seed create "seed file"
        use "C:\documents\multishell\test\output\seed file", clear
        replace seed = rnormal()
        save "C:\documents\multishell\test\output\seed file", replace
        multishell seed load "seed file"
        multishell run, threads(6) sleep(2000)
        and
        Code:
        multishell seed create "seed file", fill
        multishell run, threads(6) sleep(2000)
        Of course using rnormal() to create seeds is a bit "random", but it is the best I can think of. I would be very interested in any feedback or comments about it.
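        One possible alternative, offered only as a sketch: set seed takes a number between 0 and 2^31 - 1, whereas rnormal() also returns negative and fractional values, so integer seeds could instead be drawn with runiformint() (available from Stata 14). The upper bound of 1000000 is an arbitrary choice, and the sketch assumes that the seed variable created by multishell seed create is numeric.
        Code:
        use "C:\documents\multishell\test\output\seed file", clear
        * draw integer seeds; the range is an arbitrary assumption
        replace seed = runiformint(1, 1000000)
        save "C:\documents\multishell\test\output\seed file", replace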

        Hopefully this clarifies any questions.





        • #5
          Thanks to Kit Baum, a new version of multishell is available on SSC. Version 2.0 includes many improvements and benefited crucially from discussions at the Stata User Group meeting 2018 in London and in this forum, for which I am very grateful.

          The main novelties and improvements are:
          • Seed, seed streams and random numbers: If Stata 15 or later is used, multishell uses seed streams. Seed streams make sure that seed sequences do not overlap and therefore that sequences of random numbers are not repeated. It is possible to deactivate this option if necessary. For Stata 14, random numbers obtained from www.random.org are used as seeds, which makes sure that the seed differs for each instance. The random numbers for the seeds are obtained using an updated version of setrngseed. Sequences of random numbers may still overlap if no seed streams are used. A detailed discussion can be found in the help file.
          • Maximum running time for tasks and multishell: Two options control how long multishell runs. It is possible to restrict the time a task runs, and to stop multishell at a given date. multishell maintains a list of all running Stata instances in the background. If an instance runs over time, it is closed and marked as “stopped”.
          • Skipping loops: multishell can now ignore loops. If loops are marked by /* multishell loop */, then only those loops are processed and dissected by multishell. All other loops are ignored.
          • Log files: Log files for each task are created automatically. As soon as a task is completed, is stopped or aborts with an error, a clickable link to the log file appears in multishell’s output. Each log file begins with an overview, which includes information about the filenames, folders, the variation of the loop and the state, type and stream of the random number generator.
          Before using multishell, I strongly recommend reading the help file carefully. When using multishell to run simulations or any type of calculation involving random numbers, I strongly recommend using Stata 15 or later.
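          For readers unfamiliar with seed streams, a minimal sketch using Stata's own commands (Stata 15 or later) follows. It illustrates the general mechanism only, not how multishell assigns streams internally, and the stream numbers and the seed are arbitrary.
          Code:
          * stream random-number generator, available from Stata 15
          set rng mt64s
          * instance 1: stream 1
          set rngstream 1
          set seed 123
          display runiform()
          * instance 2: same seed, different stream, so the draws differ
          set rngstream 2
          set seed 123
          display runiform()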

          Please let me know if you have any feedback or comments.


          • #6
            Dear Jan, I found the following error message. Can you have a look?
            [Attached image: multishell-error.png]
            Ho-Chuan (River) Huang
            Stata 17.0, MP(4)


            • #7
              Dear River,
              I am afraid this is beyond my control, as Kit Baum maintains SSC. I will get in touch with him.

              In the meantime, you can find the file you are looking for in my own repository. To get it, type
              Code:
              net from http://www.ditzen.net/Stata
              and then select multishell.
              Best,
              Jan


              • #8
                Dear Jan, Thank you for the reply.
                Ho-Chuan (River) Huang
                Stata 17.0, MP(4)


                • #9
                  It is fixed and the files can now be downloaded from SSC as well.


                  • #10
                    Thanks to Kit Baum, a new version is available on SSC. The update fixes a bug in locating Stata.exe. See https://www.statalist.org/forums/for...d-my-stata-exe


                    • #11
                      Hi,
                      After running the multishell add command, I get the following error message. Might anyone know what is wrong?

                      Failed: mata mata matsave "\\multishell_overview" multishell_overview.
                      Failed: mata mata matuse "\multishell_overview".

                      I'm trying

                      multishell exepath "C:\Program Files\Stata16.exe"
                      multishell path "Y:\patient survey"
                      multishell add "Y:\patient survey\mymultishell.do"
                      multishell run, threads(3) sleep(2000)

                      The "mymultishell.do" file contains:

                      clear
                      forvalues n=172 100 75 52 {
                      simulate ageint=(myresults[4,113]<.05 | myresults[4,117]<.05 | myresults[4,119]<.05),reps(10): mysim10 `n'
                      save results_`n',replace
                      }

                      Thanks!
