  • cloud computing with Stata

    On a number of occasions, I have heard the issue of commercial cloud computing (think Amazon EC2 and the like) being raised for computationally intensive Stata commands, like the various mixed effects/multilevel models (currently, I am running one with bootstrap on top of it). Does the community have any expertise with it? I see several issues related to that.
    1. Licensing issue is unclear. Should it be my own license that I would need to run Stata with? Or should it be the cloud provider license? I don't think Stata Corp. has cloud computing as a license format.
    2. If the licensing issue is resolved as "Stas needs to supply his copy of Stata and install it on his instance of the cloud", what are the providers that allow for this? Stata is definitely an obscure title, as most of the time clouds run somebody's C or Python code that is being compiled on the cloud itself.
    3. What are other issues? I am pretty sure I am missing another three or so, at least
    Has anybody successfully implemented Stata computation on a cloud?
    Last edited by skolenik; 13 Jul 2014, 23:50.
    -- Stas Kolenikov || http://stas.kolenikov.name
    -- Principal Survey Scientist, Abt SRBI
    -- Opinions stated in this post are mine only


  • #2
    I currently use Dropbox (https://www.dropbox.com/) for some test data. A licensed copy of Stata is on each desktop/laptop, and I do all computations locally. That does not quite answer your question, since I use the cloud only to share data across multiple computers. However, one issue for anyone using the cloud to analyze or store healthcare data is that the environment must meet HIPAA requirements for security. Extending that issue: if you have identifiable and sensitive data that do not fall under HIPAA, how secure are your data in the cloud?

    HIPAA reference: http://privacyruleandresearch.nih.gov/



    • #3
      Of course no personally identifiable information (PII) or HIPAA-regulated information should ever leave the analyst's computer. I am assuming that the person behind the keyboard is reasonably informed about this.
      -- Stas Kolenikov || http://stas.kolenikov.name
      -- Principal Survey Scientist, Abt SRBI
      -- Opinions stated in this post are mine only



      • #4
        Originally posted by skolenik:
        On a number of occasions, I have heard the issue of commercial cloud computing (think Amazon EC2 and the like) being raised for computationally intensive Stata commands, like the various mixed effects/multilevel models (currently, I am running one with bootstrap on top of it). Does the community have any expertise with it? I see several issues related to that.
        1. Licensing issue is unclear. Should it be my own license that I would need to run Stata with? Or should it be the cloud provider license? I don't think Stata Corp. has cloud computing as a license format.
        You (or your organization) need to have a Stata license. For the purposes of the license agreement, you would treat a virtual machine in the cloud no differently from a second machine in your office or at your location. If you have a single-user license, you are welcome to install a copy of it for your own use on your own virtual machine in the cloud.

        2. If the licensing issue is resolved as "Stas needs to supply his copy of Stata and install it on his instance of the cloud", what are the providers that allow for this? Stata is definitely an obscure title, as most of the time clouds run somebody's C or Python code that is being compiled on the cloud itself.
        You want to look for a provider that allows you to run a complete "computer" in the cloud. You can then install any application that supports the operating system available on that computer. Both Google Compute Engine (https://developers.google.com/comput...rating-systems) and Amazon EC2 (search for 'operating system' on this page: http://aws.amazon.com/ec2/faqs/) support a variety of Linux and Windows operating systems on their cloud machine instances.


        3. What are other issues? I am pretty sure I am missing another three or so, at least
        Has anybody successfully implemented Stata computation on a cloud?
        I think one of the best things you can do is just start playing. Set up an account for yourself with Amazon's EC2 cloud (or Google's GCE cloud if you like). Learn how to create and start up a machine instance.

        Learn how to turn it back off! (This is very important, because if you just leave it running, an idle instance is still costing you money!)

        There is a learning curve associated with getting set up to work in one of these environments. But, once you have set up the appropriate security key pairs and figured out how to start, log in to, and stop instances, you will find that working on a cloud computer isn't much different from working on any other computer attached to your local network.



        • #5
          Thanks, Alan. This is very informative.
          -- Stas Kolenikov || http://stas.kolenikov.name
          -- Principal Survey Scientist, Abt SRBI
          -- Opinions stated in this post are mine only



          • #6
            Alan:

            Would it be necessary to purchase a Stata license optimized for the number of processes/cores the virtual machine supports? For example, if the cloud service can support up to 24 processes/cores, would it make sense to install a license of Stata/MP for 24 cores? In this example, a license of Stata/MP for anything smaller than 24 cores wouldn't be able to take advantage of all 24 cores on the virtual machine, correct?

            Second, does Stata (or more specifically, the procedure you're running) care if the virtual machine is indeed a "single machine" or if it's running in a distributed / cluster environment? For example, a computationally intensive bootstrap procedure on a single machine needs to be written differently if run in a distributed environment (i.e., on multiple machines, wherein the results from each are merged later). Does the same requirement hold true when running Stata programs in a virtual environment?

            Much thanks. This information is very helpful.

            Kurt
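The split-and-merge pattern raised in the second question can be sketched in Stata itself. This is only an illustration under stated assumptions, not an answer from StataCorp: the `auto` dataset, the regression, the seeds (1001, 1002), and the file names `boot_part1` and `boot_part2` are all hypothetical, and the loop stands in for two separate machines. Each machine runs a share of the replications with its own seed and saves them; the parts are then appended and summarized in one place.

```stata
* Step 1: run on each machine (the loop stands in for two machines here)
forvalues k = 1/2 {
    sysuse auto, clear
    * hypothetical per-machine seed, so the machines do not repeat the same draws
    local seed = 1000 + `k'
    bootstrap b_weight = _b[weight], reps(250) seed(`seed') ///
        saving(boot_part`k', replace): regress mpg weight
}

* Step 2: run once, after copying the part files to a single location
use boot_part1, clear
append using boot_part2
summarize b_weight
display "Pooled bootstrap SE of _b[weight]: " r(sd)
```

Using different seeds does not by itself guarantee non-overlapping random-number sequences across machines; whether that matters for a given job is a judgment call. Note also that the `///` line continuation works in do-files, not at the interactive prompt.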



            • #7
              One issue I have had is memory management. This is a project using Stata 12.1 in a VMware Windows 7 virtual machine; I find that even though my virtual machine has 24 GB of "RAM" allocated, Stata tends to shut down without notice when the data in memory reach about 8-10 GB.
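When chasing a problem like this, it can help to look at what Stata itself reports about memory before the crash point. The commands below are a minimal sketch using Stata's built-in memory tools (available from Stata 12 on); the 20g cap is a hypothetical value chosen for a 24 GB virtual machine, and nothing here is known to fix the VMware behavior described above.

```stata
* Show the current memory settings (max_memory, segmentsize, niceness, ...)
query memory

* Report how much memory the data and Stata's overhead are using right now
memory

* Hypothetical cap: make Stata stop with an error at 20 GB of data
* instead of continuing to request memory from the operating system
set max_memory 20g
```

If the `memory` report stays well below the allocated RAM when Stata exits, the limit is more likely being imposed by the virtual machine or the host than by Stata's own settings.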



              • #8
                One issue I have found is the multicore licensing of Stata/MP. The whole idea of cloud computing is to let users share the fixed cost of high-performance computing, using it on demand and paying only a variable cost. Amazon EC2, for example, lets you test applications on a computer that could cost you USD 20,000 or more by paying 2 dollars an hour.
                Unfortunately, to take full advantage of a virtual machine with 16 or 32 cores, you need a full Stata/MP license for 16 or 32 cores, even if you would only use it for a few days.
                This eliminates any advantage that using EC2 has.
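The hardware side of this trade-off can be put in numbers with the two figures quoted in this post (about USD 20,000 to buy versus about USD 2 per hour to rent). The sketch below covers hardware only; it deliberately leaves out the Stata/MP license cost, which is the point of the complaint and for which no figure is given here.

```stata
* Figures quoted above: hardware of about USD 20,000 vs. about USD 2 per hour
local hardware_usd = 20000
local hourly_usd   = 2

* Hours of rented time that cost as much as buying the machine
local breakeven_hours = `hardware_usd' / `hourly_usd'
display "Break-even: " `breakeven_hours' " hours"
display "That is about " %5.1f `breakeven_hours'/24 " days of continuous use"
```

On these figures, renting stays cheaper than buying until roughly 10,000 hours of use, so the hourly hardware cost is not the obstacle; a license sized for the full core count is.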



                • #9
                  I just came across this topic. Yes, it's definitely beneficial to run Stata in the cloud. The advantage is that you can program anywhere, on any device, as long as you can log in to the cloud remotely.
                  However, one thing I don't do is use XStata mode. While I can use X11 forwarding to "see" the data viewer, it is much slower because of limited internet bandwidth.
                  I used to use a virtual private server for execution, but I have now shifted to a dedicated server, which is much cheaper (only 20% of the cost of cloud computing).
                  I think the number of CPU cores is a very misleading concept. For a Stata/MP license, I consider the number of physical CPUs much more pertinent; the number of real processors is the critical factor.
                  Two physical CPUs on a dedicated server might be faster than 16 cores in the cloud (seriously!). In that case, buying a 16-core Stata/MP license for a 16-core cloud instance might not be the right choice; an 8-core Stata/MP license on a dedicated server with two physical CPUs on the motherboard might be faster.
                  My operation is a daily cron job. If you just want to run Stata for a short time, it's a different scenario.
                  For memory management, not only can I use the total RAM of the server (in batch mode, running several Stata processes at the same time), I even use swap to create more memory, because RAM is more expensive than hard disk. I can tolerate the slower speed because I run it 24 hours a day.
                  Another critical reason to run Stata on a server is how the original data are gathered. If the source is the internet, like real-time financial data, it's better to have at least 2Mb/s of guaranteed bandwidth, and only a server environment with computer facilities can afford that.
                  Stata/MP is a flavor of Stata that can perform symmetric multiprocessing (SMP) on a computer with multiple processors or cores. Although Stata/MP licenses are not platform specific, certain operating systems are supported.
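To see how the license and the machine line up on a particular server or cloud instance, Stata's `c()` system values can be displayed directly. This is a minimal sketch; to my understanding the machine and license counts are only populated under Stata/MP, so other flavors may show missing values for them.

```stata
* 1 if this is Stata/MP, 0 otherwise
display "Stata/MP:                " c(MP)

* Cores Stata is currently set to use, and the most it could use here
display "Cores in use:            " c(processors)
display "Maximum usable cores:    " c(processors_max)

* Cores allowed by the license vs. cores the operating system reports
display "Cores licensed:          " c(processors_lic)
display "Cores on this machine:   " c(processors_mach)
```

If the licensed count is below the machine count, the extra cores on the instance go unused by a single Stata/MP session, which is the mismatch discussed in this thread. Under Stata/MP, `set processors #` lowers the number in use.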



                  • #10
                    skolenik If your IT guys are able to hook you up with some server space you can get the same effect by installing Stata on that server and providing permissions for authorized users to connect via SSH or other secure communications protocol. I've used the X server capabilities and while it may eat up some bandwidth, it wasn't too bad. Additionally, I think you may need X server available if you're generating graphs. In the cloud computing environment (e.g., someone else's servers) the cost could start to become prohibitive depending on runtime and the power of the server being used. If you go with one of the minimal instances, it will cost you about $4/month for the server to be up and idle (I had set up something for my wife to do some test design work and it doesn't require many cycles).

                    alejoforero I would disagree that all utility from cloud computing environments is removed, but would agree that it is less than optimal for the typical use case.

                    Kurt Heisler the second question is pretty interesting. I know Sergiy Radyakin has done some work with distributed computing environments and Stata. Maybe Alan Riley (StataCorp) or someone else from the development team could share their thoughts on the distributed computing vs single machine question you raised.



                    • #11
                      As William has mentioned, I have done some prototyping on splitting large Stata jobs across multiple machines. A demonstration was given in Boston in 2014.
                      Here is the presentation link:
                      https://ideas.repec.org/p/boc/scon14/5.html

                      The system allows jobs to be submitted manually (via a GUI) and programmatically, by running Stata do-files that create jobs.

                      The demonstration was performed using 1 coordinating node, 2 computational nodes, and 1 client machine all running on a local WiFi network (no internet required). StataCorp has graciously supplied temporary Stata licenses for this demonstration, for which I am very thankful.

                      StataCorp has since worked out a licensing strategy for this kind of use, so please consult with them on the licenses. The technical capabilities exist, and I find it a very attractive instrument for various engines (not just Stata). For example, I often need to convert and aggregate 20k-30k separate tab-delimited files into a single Stata file. Having a distributed converter could be one way of doing this.

                      Best, Sergiy Radyakin



                      • #12
                        I just set up Google Compute Engine to speed up my synthetic control estimation. Because synthetic control is embarrassingly parallel, I was expecting a massive improvement. However, the results were extremely disappointing. It turns out that an 8-core Ryzen 7 2700X is faster than a 48-core Google Compute Engine Xeon.

                        Here are my benchmarks and Stata code:

                        48 core Xeon, using 38 threads: 619.11 seconds
                        8 core Ryzen 7 2700X, using 8 threads: 608.17 seconds
                        8 core Ryzen 7 2700X, using 16 threads: 499.29 seconds
                        4 core Core i7-7700, using 4 threads: 1011.97 seconds
                        4 core Core i7-7700, using 8 threads: 763.14 seconds

                        (The 48-core Xeon used only 38 threads because the example synthetic control dataset has 1 treatment group and 37 control groups, so the maximum number of threads is 38.)

                        Xeon = Google Compute Engine, Ubuntu 18.04 LTS, Stata 15/SE for Linux
                        Core = Core i7 7700, Windows 10 Enterprise, Stata 14/SE for Windows
                        Ryzen = Ryzen 7 2700X, Windows 10 Pro, Stata 15/SE for Windows
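The timings above are easier to compare as ratios. The sketch below simply re-enters the five reported runs and derives two illustrative quantities that are not in the original post: the speedup relative to the slowest run, and thread-seconds (threads times wall-clock seconds) as a rough measure of how much each thread contributed. The variable names are hypothetical.

```stata
clear
input str16 machine threads seconds
"Xeon (GCE)"      38  619.11
"Ryzen 7 2700X"    8  608.17
"Ryzen 7 2700X"   16  499.29
"Core i7-7700"     4 1011.97
"Core i7-7700"     8  763.14
end

* Speedup relative to the slowest run in the table
summarize seconds, meanonly
generate speedup = r(max) / seconds

* Total thread-seconds consumed: higher means less work done per thread
generate thread_seconds = threads * seconds

format speedup %5.2f
format thread_seconds %9.0f
list, noobs clean
```

Because the runs mix Stata 14 and 15, Windows and Linux, and physical and virtual CPUs, these ratios describe this one benchmark rather than the hardware in general.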

                        Code:
                        * Report execution time
                        set rmsg on, permanently
                        
                        * Install synth
                        ssc install synth, replace all
                        
                        * Install synth_runner
                        cap ado uninstall synth_runner //in-case already installed
                        net install synth_runner, from(https://raw.github.com/bquistorff/synth_runner/master/) replace
                        
                        * Install parallel
                        net install parallel, from(https://raw.github.com/gvegayon/parallel/stable/) replace
                        mata mata mlib index
                        
                        * Use the synth_smoking example dataset that comes with synth_runner
                        sysuse synth_smoking
                        tsset state year
                        
                        /*
                        To test synth_runner, use the example smoking synthetic control regression used in the synth_runner
                        help file, but add the "nested allopt" parameters to increase the precision and execution time.
                        And add the "parallel" parameter to utilize multithreading.
                        And remove the "gen_vars" parameter because there's no need to generate output.
                        
                        Thus, the synth_runner help file gives this as an example:
                        synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) gen_vars
                        
                        I use this instead:
                        synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt
                        */
                        
                        * maximum of 96 threads, with nested allopt
                        parallel setclusters 96
                        synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt
                        
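                        * 4 threads, with nested allopt
                        parallel setclusters 4
                        synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt
                        
                        * 8 threads, with nested allopt
                        parallel setclusters 8
                        synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt
                        
                        * 16 threads, with nested allopt
                        parallel setclusters 16
                        synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt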
                        
                        parallel clean
                        Last edited by Michael Makovi; 29 Mar 2019, 12:29.



                        • #13
                          Are you using a dedicated host? If your instance is VM-based, the problem could be the overall load on the underlying hardware. Additionally, I think there may be some issues with the way that parallel operates under different operating systems.



                          • #14
                            I was using the "n1-highcpu-96," described here: "High-CPU machine types are ideal for tasks that require more vCPUs relative to memory. High-CPU machine types have 0.90 GB of memory per vCPU."

                            Down below that, describing a different set of machines, it says, "Shared-core machine types provide one vCPU that is allowed to run for a portion of the time on a single hardware hyper-thread on the host CPU running your instance. Shared-core instances can be more cost-effective for running small, non-resource intensive applications than standard, high-memory or high-CPU machine types."

                            So I think this means that the n1-highcpu-96 machine is not a shared-core machine.

                            As for the different system environments, good point. My Ryzen system is a dual-boot, so yeah, I should boot into Linux, install Stata, and benchmark it there too.
                            Last edited by Michael Makovi; 31 Mar 2019, 14:41.
