
  • Stata server hardware for modeling work: cores vs clock speed

    We’re in the enviable position of looking to add another Stata server instance, and I am in need of a little advice. We use Stata for a range of purposes, but we are finding that our current hardware is starting to struggle with more complex statistical modeling work as we’ve started to throw larger datasets at it. The modeling work is a mix of robust regressions of various types and Bayesian analysis. If you had to trade off between number of cores and clock speed (on a Windows-based server), which would you invest more in?

    Thanks for any help or tips!

    Chris


    P.S. I am a first-time poster, long-time lurker. I did try searching the archives (both new and old) first but didn't quite find the information I was looking for; if I missed something, please let me know! Thanks!

  • #2
    Someone asked the same question a few months ago: Choosing a computer for Stata: Cores or CPU speed?

    There is also a Stata FAQ: What are the hardware requirements to run Stata?

    Lastly, StataCorp should be able to help you make a decision.



    • #3
      Thanks for the "choosing a computer" link! In the few years we've been running Stata on our existing server, we've frequently seen that our models are CPU bound and that the work doesn't always get spread across multiple cores. At this point we're maxing out the server because of the sheer volume of jobs being submitted, not because individual jobs are distributing their work across CPUs, so many of our jobs take upwards of a day.

      I should have also mentioned that I tracked down all the official documentation from Stata's site that I could before heading over here, and one of our IT staff is also reaching out to StataCorp, but neither of us really understands in depth what Stata is doing behind the scenes.

      In any case, talking to StataCorp sounds like the next step. Thanks again for the help!

      Chris



      • #4
        I would like to share some personal experience on this subject. Today I got a new machine with a Xeon E5-2650 v3 (10 cores at 2.3 GHz) and 64 GB of 2133 MHz DDR4 RAM, replacing a Xeon E5-1650 (6 cores at 3.2 GHz) with 64 GB of 1600 MHz DDR3.

        The data set has 22,467,230 observations organized as a strongly balanced panel of 211,949 groups with 106 observations each. The commands used are xtregar and xthtaylor, and in both cases the old machine produces results faster than the new machine. My initial guess is that individual core speed matters more than the number of cores, even with slower RAM.

        According to the Stata/MP Performance Report, xtregar is 70% parallelized and xthtaylor is 83% parallelized; nevertheless, clock speed seems to win over core count even for these highly parallelized commands.



        • #5
          Hola Farid!

          To add to Farid's comments, I had a very similar experience where a beefy Xeon with hundreds of GB of RAM and 32 cores underperformed a 2-year-old desktop. I don't know if it's the fault of the Xeon itself (which I suspect), or maybe the error-correcting RAM is slower, or the lack of an SSD hurts a lot during all those preserves, but nevertheless my advice would be: i) get an SSD, and ii) don't worry about Xeons or about having many cores.

          Best,
          Sergio



          • #6
            Originally posted by Farid Matuk View Post
            I would like to share some personal experience on this subject. Today I got a new machine with a Xeon E5-2650 v3 (10 cores at 2.3 GHz) and 64 GB of 2133 MHz DDR4 RAM, replacing a Xeon E5-1650 (6 cores at 3.2 GHz) with 64 GB of 1600 MHz DDR3.

            The data set has 22,467,230 observations organized as a strongly balanced panel of 211,949 groups with 106 observations each. The commands used are xtregar and xthtaylor, and in both cases the old machine produces results faster than the new machine. My initial guess is that individual core speed matters more than the number of cores, even with slower RAM.

            According to the Stata/MP Performance Report, xtregar is 70% parallelized and xthtaylor is 83% parallelized; nevertheless, clock speed seems to win over core count even for these highly parallelized commands.

            Both speed and cores matter, and as you discovered, both relative speed of processors and the non-parallelized vs. parallelized regions of commands matter.

            Before I go on, it is important to realize that clock speed alone isn't really an indication of how fast a particular computing core can accomplish a given task. Lots of other factors matter, including memory cache, pipelining, and the like. Nevertheless, clock speed is an easy-to-understand indicator of how much work a particular core can do in a given amount of time.

            Let's take your first example, xtregar, which is 70% parallelized. You have two computers -- one with 10 cores @ 2.3 GHz each, and another with 6 cores @ 3.2 GHz each.

            Let's assume that it takes 1000 units of work (what those units of work are doesn't actually matter -- let's just call them 'work units') to run your analysis with xtregar. If you were using a single 3.2 GHz core, those 1000 work units could be done in 1000/3.2 = 312.5 time units (again, what those time units might be doesn't actually matter -- let's just call them 'time units'). A single 3.2 GHz core can get 1000 work units done in 312.5 time units. A single 2.3 GHz core can get 1000 work units done in 434.78 time units.

            But, you have 6 cores on that 3.2 GHz machine. We know that xtregar is 70% parallelized, which means that 30% of it is not. 300 of the original 1000 work units must be handled by a single core, so that can be done in 300 / 3.2 = 93.75 time units. The remaining 700 work units are handled by all 6 cores running in parallel, so they can be done in 700 / (3.2 * 6) = 36.46 time units. The total time for the task is 93.75 + 36.46 = 130.21 time units.

            Let's see how long it will take to do the same task on a machine with 10 2.3 GHz cores: (300/2.3) + (700 / (2.3 * 10)) = 160.87 time units.

            If we perform the same type of calculations with your second example, xthtaylor, which is 83% parallelized, we find that the 6-core 3.2 GHz machine takes (170/3.2) + (830/(3.2*6)) = 96.35 time units, while the 10-core 2.3 GHz machine takes (170/2.3) + (830/(2.3*10)) = 110 time units.

            Note that in all cases, parallelization made a huge difference in execution time. Recall that our 1000 "work unit" task would take 312.5 "time units" on a single 3.2 GHz core, and would take 434.78 "time units" on a single 2.3 GHz core. By using Stata/MP, that was cut down for xtregar to 130.21 time units on the 3.2 GHz machine and 160.87 time units on the 2.3 GHz machine. Likewise, the time it took to run xthtaylor, which is even more parallelized, was cut from 312.5 to 96.35 on the 3.2 GHz machine by using 6 cores, and from 434.78 to 110 on the 2.3 GHz machine by using 10 cores. Stata/MP on either machine offers huge time savings over what would be possible using only a single core.
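
            The arithmetic above is essentially Amdahl's law. A small sketch (in Python rather than Stata, purely to make the calculation concrete) reproduces the numbers for both commands and both machines:

            ```python
            # Timing model from the worked example above: a task of `work` units,
            # a fraction `p` of which is parallelized, run on `cores` cores that
            # each complete `speed` (GHz) work units per time unit.
            def run_time(work, p, cores, speed):
                serial = (1 - p) * work / speed        # serial part runs on one core
                parallel = p * work / (speed * cores)  # parallel part uses all cores
                return serial + parallel

            work = 1000.0

            # xtregar: 70% parallelized
            print(round(run_time(work, 0.70, 6, 3.2), 2))   # 6 cores @ 3.2 GHz -> 130.21
            print(round(run_time(work, 0.70, 10, 2.3), 2))  # 10 cores @ 2.3 GHz -> 160.87

            # xthtaylor: 83% parallelized
            print(round(run_time(work, 0.83, 6, 3.2), 2))   # -> 96.35
            print(round(run_time(work, 0.83, 10, 2.3), 2))  # -> 110.0
            ```

            The model also makes the trade-off easy to explore: as the parallelized fraction approaches 100%, the 10-core machine eventually overtakes the 6-core one, but at 70-83% the faster cores still win.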



            • #7
              Thanks for your prompt answer, Alan. I agree with your "work unit" concept, but I notice RAM speed is not included in it. I am highlighting this because we assumed (maybe wrongly) that faster RAM will produce faster results in an otherwise identical configuration; by that logic, 2133 MHz RAM would produce 33% faster results than 1600 MHz RAM.



              • #8
                Originally posted by Farid Matuk View Post
                Thanks for your prompt answer, Alan. I agree with your "work unit" concept, but I notice RAM speed is not included in it. I am highlighting this because we assumed (maybe wrongly) that faster RAM will produce faster results in an otherwise identical configuration; by that logic, 2133 MHz RAM would produce 33% faster results than 1600 MHz RAM.

                All other factors being equal, faster RAM will indeed produce faster results. However, 2133 MHz RAM won't produce 33% faster results than 1600 MHz RAM. All faster RAM does is speed up access to your data -- it doesn't speed up computations on the chip. It is hard to say how much difference faster RAM can make because it depends on what you are doing. If you are doing something that is not very computationally intense, such as summarize-ing your data, it will make a bigger difference than if you are doing something very computationally intense. Even so, it won't make as big a difference as the ratio of the RAM speeds.
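
                A toy model makes this concrete. Suppose we split a job's time into a compute share and a memory-access share, and let only the memory share scale with RAM speed (the 80/20 split below is a made-up illustration, not a measurement):

                ```python
                # Toy model: total time = compute time + memory-access time.
                # Only the memory-access share scales with RAM clock, so the
                # overall speedup from 1600 -> 2133 MHz RAM is far less than
                # the raw ratio 2133/1600 = 1.33x.
                def total_time(compute, memory, ram_mhz, base_mhz=1600):
                    return compute + memory * base_mhz / ram_mhz

                # Suppose a job spends 80 time units computing and 20 touching RAM.
                old = total_time(80, 20, 1600)   # 100.0 time units
                new = total_time(80, 20, 2133)   # ~95.0 time units
                print(round(old / new, 3))       # overall speedup ~1.05x, not 1.33x
                ```

                The more memory-bound the job (a bigger second argument), the closer the speedup gets to the RAM ratio; for compute-bound modeling work it stays close to 1.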



                • #9
                  Thanks again, Alan; we will maximize RAM size on the new machine instead of RAM speed. By the way, I have found that the merge command creates temporary files on the hard disk. If I want to create a RAM disk in order to speed up merging files, what rule of thumb should I follow to size it?
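
                  I'm not aware of any official sizing formula from StataCorp, so the sketch below is only a back-of-the-envelope starting point: it assumes (and this is an assumption, not documented Stata behavior) that you should budget roughly for the master dataset, the using dataset, and a result about their combined size, plus a safety margin. The record widths are hypothetical:

                  ```python
                  # Rough RAM-disk sizing for a merge. Assumption: temp space on the
                  # order of master + using + merged result, times a safety factor.
                  def dataset_bytes(obs, width_bytes):
                      return obs * width_bytes

                  def ramdisk_gb(master, using, safety=1.5):
                      result = master + using                    # worst case for the merged file
                      total = (master + using + result) * safety
                      return total / 1024**3

                  # Hypothetical widths: 200-byte master records, 50-byte using records.
                  master = dataset_bytes(22_467_230, 200)
                  using = dataset_bytes(22_467_230, 50)
                  print(round(ramdisk_gb(master, using), 1))     # -> 15.7 (GB)
                  ```

                  Checking the actual on-disk size of both .dta files (describe reports the width per observation) would give better inputs than guessed widths.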



                  • #10
                    I am now at an Oracle conference in Brasil, and I have been offered a machine from the Xeon E5-2600 v3 family. On Intel's site, they say:
                    • Up to 1.9X higher performance gains for enterprise workloads with Intel® AVX2 – Intel AVX2 with new Fused Multiply-Add (FMA) instructions in Intel Xeon Processor E5-2600 v3 product family doubles the floating point operations (Flops) from first generation Intel AVX, and doubles the width of vector integer instructions to 256 bits, expanding the benefits of Intel AVX2 into enterprise computing.
                    Does anyone have experience migrating from the v2 to the v3 CPU family? What I am curious about is whether doubling the floating-point operations relative to first-generation AVX and doubling the vector integer width will have any impact on Stata code.

                    Thanks

                    http://www.intel.com.br/content/dam/...n-e5-brief.pdf
                    Last edited by Farid Matuk; 24 Jun 2015, 20:56. Reason: Link is wrongly modified when copied

                    Last edited by Farid Matuk; 24 Jun 2015, 20:56. Reason: Link is wrongly modified when copied