  • Benchmarking Stata to Test Speed Across Computers and Versions

    After helping my son upgrade his gaming machine, I decided it was time to upgrade my 8-year-old desktop. I wanted to see how my new computer compares to my existing one, so I wrote a program that runs 10 commands, 10 times each. I'm presently running Stata 15 MP(4), so for comparison purposes I chose 5 commands that scale 1:1 with the number of cores and 5 commands that do not scale at all with cores (according to Stata's analysis). I'm curious whether anyone has better ideas.

    It was an interesting project, but unfortunately I don't have Stata 15 SE running (though I'm thinking about installing it, since I have it, just for the test). I have Stata 13 SE on my desktop, so I ran the program in that version as well. Here's the code and the results.

    I first create 4 random variables (the X's; 500,000 obs) and then create a Y based on the X's and a random disturbance. I also create a Z variable of 50 observations to fit a time-series model, and a 500x500 matrix to run a matrix command. I run each command 10 times to get a distribution, since at times I saw outliers for some commands (though not so much now). I think the median is probably the best indicator, since the mean can be affected by the outliers.

    COMPUTER: i7-870 (4 cores; no overclock); SanDisk 512GB SSD; 32GB RAM (DDR3); Video AMD HD 5770 (a pair in SLI for a 4-monitor rig).

    My upgrade includes a Ryzen 2700X processor (8 cores), an Adata NVMe drive, and 32GB RAM (DDR4). Same video cards (I don't game).

    Thoughts welcome.

    ***********************************************
    clear all
    set obs 500000

    /******************************************************************************/
    /***** CREATE A DATASET *******************************************/
    /******************************************************************************/
    forv i = 1(1)4 {
        qui g x`i' = rnormal()
    }

    *For Time Series Operations
    qui g t = _n
    tsset t
    qui g z = 0.9*l.x2 + rnormal() if t<=51

    *For matrix calculations
    mata
    M = rnormal(500,500,0,1)
    st_matrix("M2",M)
    end

    /******************************************************************************/
    /***** CREATE A STORE MATRIX *******************************************/
    /******************************************************************************/
    mata: st_matrix("R" , J(10,10,0))
    matrix colnames R = replace correl regress predict bootstrap ///
        mvtest xtile expand_drop arfima eigenv

    /******************************************************************************/
    /***** TIMER PROGRAMS *******************************************/
    /******************************************************************************/
    capture program drop tstart
    program tstart, rclass
        timer clear 1
        timer on 1
    end

    capture program drop tend
    program tend, rclass
        timer off 1
        qui timer list 1
        scalar r = r(t1)
    end

    /******************************************************************************/
    /***** PROPORTIONAL TO CORES *******************************************/
    /******************************************************************************/
    qui g y = 0
    qui g yt = 0

    forv i = 1(1)10 {
        * draw a fresh outcome for this iteration (not timed)
        qui replace yt = 1 + 0.1*x1 -0.25*x2 + 0.4*x3 -0.15*x4 + rnormal()
        *replace
        tstart
        qui replace y = yt
        tend
        matrix R[`i',1] = scalar(r)
        *correl
        tstart
        qui correl y x1 x2 x3 x4
        tend
        matrix R[`i',2] = scalar(r)
        *regress
        qui reg y x1 x2 x3 x4 // first time is always very slow
        tstart
        qui reg y x1 x2 x3 x4
        tend
        matrix R[`i',3] = scalar(r)
        *predict
        tstart
        qui predict cooksd, cooksd
        tend
        drop cooksd
        matrix R[`i',4] = scalar(r)
        *bootstrap
        tstart
        qui bootstrap , reps(25): reg y x1 x2 x3 x4
        tend
        matrix R[`i',5] = scalar(r)
    }

    /******************************************************************************/
    /***** MULTICORE HAS NO EFFECT *****************************************/
    /******************************************************************************/

    forv i = 1(1)10 {
        *mvtest normality
        tstart
        qui mvtest normality y x1
        tend
        matrix R[`i',6] = scalar(r)
        *xtile
        tstart
        qui xtile tempv = y , nq(4)
        tend
        capture drop tempv
        matrix R[`i',7] = scalar(r)
        * expand, drop if
        tstart
        qui expand 2 , g(ex)
        qui drop if ex==1
        tend
        capture drop ex
        matrix R[`i',8] = scalar(r)
        *arfima
        tstart
        qui arfima z
        tend
        matrix R[`i',9] = scalar(r)
        *matrix eigenvalues
        tstart
        qui matrix eigenvalues re im = M2
        tend
        matrix R[`i',10] = scalar(r)
    }

    /******************************************************************************/
    /***** SUMMARIZE RESULTS *****************************************/
    /******************************************************************************/
    svmat R , names(col)
    summ replace regress predict correl bootstrap mvtest xtile expand_drop arfima eigenv
    tabstat replace regress predict correl bootstrap ///
        mvtest xtile expand_drop arfima eigenv , stats(median mean sd min max) columns(s)

    [Image attachment: STATABENCHDEC18003.jpg]


  • #2
    "I don't have Stata 15 SE running"
    I once reported a performance problem with Stata/SE to Stata Technical Services, and in the back-and-forth I learned that the person helping me was able to reproduce my Stata/SE problem on their Stata/MP system with
    Code:
    set processors 1
    You might find this adequate for performance comparisons. My guess is that there is a common code base, and the difference between Stata/SE and Stata/MP is that the license key for Stata/MP sets processors_lic to whatever the license allows, while Stata/SE sets it to 1.
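
A minimal sketch of that approach, assuming a Stata/MP installation. The c() system values report the current processor settings, and the c(MP) guard is there only so the snippet does not error in other flavors of Stata.

```stata
* Inspect the processor settings Stata reports (c-class system values)
display "Stata/MP flavor (1 = yes): " c(MP)
display "Processors in use:         " c(processors)
display "Processors licensed:       " c(processors_lic)
display "Processors on the machine: " c(processors_mach)

* Restrict Stata/MP to one processor to approximate Stata/SE timings
if c(MP) {
    set processors 1
}
display "Processors now in use:     " c(processors)
```

Running the benchmark once after this and once after restoring the licensed number of processors gives a same-machine, same-version comparison.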



    • #3
      Here are the results with Stata 15 MP (set processors 1).
      [Image attachment: STATABENCHDEC18005.jpg]


      • #4
        Very interesting.

        Am I correct in assuming that the Stata/MP version 15 results represent your new computer, and the Stata/SE 13 results represent your old computer?

        I'd think it would be interesting to fit a model to your performance data with log(time) as the dependent variable and categorical variables for the computer (old v. new), the cpus (1 v. 4), and the command, and an indicator of whether the command is expected to scale. That would allow you to estimate percentage improvement from the upgrade and that from increasing the number of processors.
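
A rough sketch of what such a model could look like. The timings here are simulated, not taken from this thread, and every variable name (newpc, mp4, cmd, scales, lntime) is hypothetical; the point is only the layout of the data and the regression.

```stata
* Simulated layout (NOT real timings): 2 machines x 2 processor settings
* x 10 commands x 10 runs = 400 timing records
clear
set seed 12345
set obs 400
generate byte newpc  = mod(_n - 1, 2)                  // 0 = old, 1 = new machine
generate byte mp4    = mod(floor((_n - 1)/2), 2)       // 0 = 1 processor, 1 = 4
generate byte cmd    = mod(floor((_n - 1)/4), 10) + 1  // command id 1..10
generate byte scales = cmd <= 5                        // commands 1-5 expected to scale

* Processor effect split by whether the command is expected to scale
generate byte mp4_scale   = mp4 * scales
generate byte mp4_noscale = mp4 * (1 - scales)

* Made-up data-generating process, used only so the model has something to fit
generate double lntime = -2 + 0.3*cmd - 0.5*newpc - 1.2*mp4_scale + rnormal(0, 0.05)

regress lntime newpc mp4_scale mp4_noscale i.cmd

* Coefficients on log(time) convert to percentage changes in time
display "New machine:        " %6.1f 100*(exp(_b[newpc]) - 1) "% change in time"
display "4 procs (scalable): " %6.1f 100*(exp(_b[mp4_scale]) - 1) "% change in time"
```

With real data, the 10 timings per command from each machine and processor setting would be stacked into the same long layout before running regress.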

        While it doesn't matter for the purposes of this topic, for future topics you may well improve the likelihood of a response with improved presentation. Take a few moments to review the Statalist FAQ linked to from the top of the page, and note especially sections 9-12 on how best to pose your question. Screenshots and other pictures are discouraged; it's recommended to copy commands and output from the Stata Results window and paste them into the Statalist post using the code delimiters [CODE] and [/CODE], and to use the dataex command when providing sample data. You'll find that the more you help others understand your problem, the more likely others are to be able to help you solve it.




        • #5
          If you haven't read it, you will probably find this page interesting.

          Stata FAQ: Hardware requirements to run Stata

          In particular, after RAM, "the next greatest effect on the performance of Stata is the processor. The faster the clock speed and the more cache a processor has, the faster Stata will run."

          In some simple tests I found that performance was more strongly associated with the clock speed of the processor than with its cost. A 3-year-old i3 running at 3 GHz outperformed a new i7 running at 2 GHz, despite the latter costing considerably more. The relative performance obviously depends on the extent to which the commands make use of multiple cores/threads, but I found that even with commands that scale with the number of cores, it was the clock speed that mattered more than the number of cores (exactly as written in the FAQ).



          • #6
            William:
            I haven't yet run it on the updated machine because I was going to perform that upgrade next week. This allowed me to get some comments first to improve the program.
            Thanks for the thoughts on presentation. I'll look that over and do better in the future.

            Paul:
            That's good information. I found that 13 SE used much more memory than 15 MP, but I'm not sure whether that's due to the cores or the newer version (I suspect the cores). The i7-870 is a 2.95 GHz processor. The Ryzen 7 2700X runs at 3.7 GHz and can be overclocked to 4.3 GHz (I plan to overclock).
            Last edited by George Ford; 19 Dec 2018, 19:39.



            • #7
              Update is complete. Ryzen 7 2700X, 32GB 2666 DDR4 RAM. Adata 1TB M.2 NVMe drive (not used during the program). Overclocking is in "Auto" mode on the ASUS B450-F motherboard (averaging 4.01 GHz on a base clock of 3.7 GHz for 8 cores). Average CPU usage for the program was about 17% and memory usage about 25%. Drive usage 0%. Stata MP4.

              The former CPU had a base clock of 2.9 GHz and 4 cores. I suspect Stata will run fine on 16GB RAM, so the 32GB was probably overkill (though maybe not for really large data sets).

              On average, processing time was 50% of the pre-upgrade level. The range of improvements was 30% to 70%. MP-affected commands improved more than the others. Bootstrap was only about 32% faster, which again was less than hoped for and consistent with the CPU clock-speed increase. The Adata NVMe drive is super fast, but apparently played no role (but might with less RAM).

              Another project I was working on had a two-way FE regression with about 300,000 observations. Before upgrade it took 165 seconds and now it takes 97 seconds (40% improvement). I was hoping for a little more, but I'll take it. Probably $800 in the upgrade since everything had to be changed but the video card (new CPU required new Motherboard which required new RAM). Drive was $200, so $600 for the speed bump.
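
For reference, a quick check of the arithmetic using only the figures quoted in this post. "Improvement" can mean the reduction in elapsed time or the speed-up factor, and the two give different numbers. The scalar names are just illustrative, and comparing an average boosted clock against a base clock is a rough yardstick at best.

```stata
* Elapsed times reported above for the two-way FE regression (seconds)
scalar t_old = 165
scalar t_new = 97

* Reduction in elapsed time versus speed-up factor
scalar pct_less = 100*(scalar(t_old) - scalar(t_new))/scalar(t_old)
scalar speedup  = scalar(t_old)/scalar(t_new)
display "Reduction in run time: " %4.1f scalar(pct_less) "%"
display "Speed-up factor:       " %4.2f scalar(speedup) "x"

* Clock-speed ratio for comparison (4.01 GHz average vs. 2.9 GHz base)
scalar clock_ratio = 4.01/2.9
display "Clock ratio:           " %4.2f scalar(clock_ratio) "x"
```

The run-time reduction works out to roughly 41%, which corresponds to a speed-up of about 1.7x, against a clock ratio of about 1.38x.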


              [Image attachment: benchnew.jpg]


              • #8
                Properly formatted code (the same program as in post #1).


                • #9
                  I also ran the Benchmark program on Amazon's Cloud Service (AWS, EC2, 4 cores, 16GB storage). It ran much slower than my desktop, but still faster than my old machine. Run with MP4.

                  Also, a correction to post #7: the dataset used for the FE regression had 1.7 million observations, not 300,000.

                  [Image attachment: benchmarkAWS.jpg]


                  • #10
                    I guess that in post #1 you were using the Stata/MP Performance Report to determine which commands could be expected to scale with the number of cores. In that document, Appendix E tells us:

                    "Replication-based prefix commands, such as bootstrap, fracpoly, jackknife, mfp, permute, rolling, simulate, statsby, and stepwise, were not explicitly assessed. These commands run another target command repeatedly; to the extent the target command’s performance is improved for a particular problem size, a similar improvement will be obtained when it is run repeatedly by the prefix command."
                    Coupled with your experience, that suggests to me that the bootstrap command itself is not parallelized. That is, to take your case for an example, I think Stata/MP does not run multiple regress commands in parallel, but each regress command will be parallelized.

                    It is not clear to me whether you upgraded your Stata/MP license to support 8 cores. I expect that doing so would have added a substantial bump to the costs you report.



                    • #11
                      I used the Performance Report to pick the commands. There's a 40% improvement in bootstrap speed when going from 1 to 4 cores (the 1-core run used: set processors 1). That is not as big as for the other commands, which more closely track the 4:1 ratio.
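
A sketch of how that 1-versus-many comparison can be scripted in a single run, assuming Stata/MP. The data set and reps() are deliberately smaller than in the benchmark so it finishes quickly, and the variable names are made up for the illustration.

```stata
* Time the same bootstrap with 1 processor and with the maximum available.
* Outside Stata/MP only the single-processor timing is produced.
clear all
set seed 2018
set obs 100000
generate double x = rnormal()
generate double y = 1 + 0.5*x + rnormal()

local plist 1
if c(MP) {
    local plist 1 `c(processors_max)'
}

foreach p of local plist {
    if c(MP) {
        set processors `p'
    }
    timer clear 1
    timer on 1
    quietly bootstrap, reps(10): regress y x
    timer off 1
    quietly timer list 1
    display "processors = `p': " %6.3f r(t1) " seconds"
}
```

Because both timings come from the same session and the same data, the ratio between them isolates the effect of the processor setting for this one command.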

                      I have not upgraded to MP8. I am thinking about it now that I have 8 cores, but it is often the bootstrap calculations that I'm waiting on.



                      • #12
                        I thought I'd chime in here. I ran George's code on my rig--I've had this setup for about 9 months I believe.

                        I have Stata/MP 15.1, 8-core license

                        CPU is an AMD Threadripper 1950x

                        128 GB of DDR4-2933 RAM (set at stock)

                        All solid state drives (.do file placed in and run from C:\data folder, the same SSD as the Stata install)


                        1st run with procs set at 4, and CPU set at stock 3.4 GHz...
                        variable           p50      mean        sd       min       max
                        replace          .0045     .0045   .000527      .004      .005
                        regress           .034     .0343   .000483      .034      .035
                        predict          .0205     .0212  .0056529      .015      .035
                        correl            .019     .0206  .0064842      .018      .039
                        bootstrap      13.3065   13.3003   .051586    13.232    13.384
                        mvtest           .1275     .1276  .0040056      .121      .135
                        xtile             .999     .9958  .0206763      .951     1.028
                        expand_drop      .0845     .0844  .0030623       .08      .089
                        arfima          4.5875    4.6083  .0842048     4.547     4.837
                        eigenv            .301     .3024  .0070585      .293      .317



                        2nd run with procs set at 8, and CPU set at stock 3.4 GHz...
                        variable           p50      mean        sd       min       max
                        replace           .003     .0026  .0005164      .002      .003
                        regress           .025     .0253  .0016364      .023      .029
                        predict           .014     .0144  .0025033      .011      .018
                        correl             .01     .0108  .0013166       .01      .014
                        bootstrap       11.914   11.9006  .1679069    11.694    12.116
                        mvtest            .109     .1095  .0043269      .105      .117
                        xtile             .992     .9832  .0218724       .93     1.003
                        expand_drop      .0865     .0866   .004274       .08      .092
                        arfima           4.369    4.3912  .0656147      4.34     4.563
                        eigenv            .318     .3169  .0136663      .299      .342



                        3rd run with procs set at 8, and CPU overclocked to 3.8 GHz...
                        variable           p50      mean        sd       min       max
                        replace           .002     .0021  .0003162      .002      .003
                        regress           .024     .0249   .001792      .023      .028
                        predict           .017     .0158  .0027809      .011      .019
                        correl            .011     .0111  .0015239       .01      .015
                        bootstrap       11.816   11.8176  .1770593      11.6    12.066
                        mvtest           .1105     .1102  .0035528      .103      .115
                        xtile            .9685     .9649  .0189294      .914      .982
                        expand_drop       .086     .0857   .003653      .079      .091
                        arfima          4.4595    4.4783  .0634999     4.432     4.649
                        eigenv           .3055     .3091  .0137957      .292      .331


                        4th run with procs set at 8, and CPU overclocked to 4.0 GHz (it's getting hot in here...)
                        variable           p50      mean        sd       min       max
                        replace           .002      .002         0      .002      .002
                        regress           .024     .0236  .0005164      .023      .024
                        predict           .013     .0133   .003335      .008      .017
                        correl             .01     .0098  .0004216      .009       .01
                        bootstrap       11.262   11.2414  .1593279    11.057    11.425
                        mvtest            .107     .1063  .0035292        .1      .112
                        xtile             .932       .93  .0187498      .884      .958
                        expand_drop       .082     .0821  .0033813      .075      .087
                        arfima          4.4255    4.4484  .0729325     4.404     4.648
                        eigenv           .2875     .2877  .0082334      .278      .304


                        I tried a 5th run to push to 4.1 GHz, but it froze up... as most TR owners know, it's pretty hard to move higher without higher voltage and better cooling (I'm using an AIO).

                        I am hoping this is useful to someone. One day I may splurge and upgrade to MP16 since I technically have the cores for it!

                        Cheers!



                        • #13
                          Looks interesting. For me, it's the bootstrap I'm always waiting for. I'm surprised the speed didn't improve that much over the set.
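
One way to put numbers on that is to take the ratio of the medians from the first two runs in post #12 (4 vs. 8 processors, both at the stock 3.4 GHz). The values below are copied from those two tables; the variable names are only illustrative.

```stata
* Medians (p50) from post #12: run 1 (4 processors) and run 2 (8 processors)
clear
input str12 cmd double p4 double p8
"replace"      .0045    .003
"regress"      .034     .025
"predict"      .0205    .014
"correl"       .019     .01
"bootstrap"    13.3065  11.914
"mvtest"       .1275    .109
"xtile"        .999     .992
"expand_drop"  .0845    .0865
"arfima"       4.5875   4.369
"eigenv"       .301     .318
end

* Speed-up from doubling the processors: values near 1 mean little gain
generate double speedup = p4/p8
format speedup %5.2f
list cmd p4 p8 speedup, noobs clean
```

On these figures replace gains about 1.5x from the extra processors, while bootstrap gains only about 1.12x.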



                          • #14
                            Some updated numbers...

                            Stata v15 with 8 cores, Threadripper 1950x OC'd to 3.9 GHz (nice and cool).
                            variable           p50      mean        sd       min       max
                            replace          0.002    0.0022   0.00103     0.001     0.005
                            regress          0.025     0.025   0.00125     0.023     0.027
                            predict         0.0125    0.0119   0.00173     0.009     0.014
                            correl            0.01    0.0102   0.00063     0.009     0.011
                            bootstrap       11.733   11.7148    0.2648    11.407    12.014
                            mvtest           0.106    0.1064    0.0047     0.095     0.112
                            xtile            0.979    0.9756   0.01403     0.942     0.992
                            expand_drop      0.084    0.0829   0.00208     0.078     0.085
                            arfima           7.358    7.4134   0.18192     7.318     7.926
                            eigenv           0.264    0.2636   0.00143     0.261     0.266

                            Stata v16 with 8 cores, Threadripper 1950x OC'd to 3.9 GHz (nice and cool).
                            variable           p50      mean        sd       min       max
                            replace          0.002     0.002   0.00047     0.001     0.003
                            regress          0.026    0.0265   0.00227     0.025     0.032
                            predict          0.012    0.0118   0.00132     0.009     0.013
                            correl           0.011    0.0113   0.00149      0.01     0.015
                            bootstrap      10.8605   10.8442   0.05976    10.748    10.927
                            mvtest          0.1055     0.105   0.00189     0.103     0.108
                            xtile           0.9445     0.941   0.01277     0.912     0.954
                            expand_drop      0.083    0.0831   0.00328     0.077     0.088
                            arfima          7.1685    7.2483   0.25245     7.141     7.964
                            eigenv          0.2675    0.2679   0.00179     0.266     0.271

                            Stata v16 with 16 cores, Threadripper 1950x OC'd to 3.9 GHz (nice and cool).
                            variable           p50      mean        sd       min       max
                            replace          0.002    0.0017   0.00048     0.001     0.002
                            regress          0.022    0.0221    0.0011      0.02     0.024
                            predict          0.007    0.0071   0.00088     0.006     0.009
                            correl           0.008    0.0082   0.00079     0.007     0.009
                            bootstrap       10.289   10.3043   0.06268    10.231    10.445
                            mvtest           0.107     0.106   0.00343       0.1     0.111
                            xtile           0.9525    0.9497   0.00959     0.927      0.96
                            expand_drop      0.086    0.0857   0.00254     0.079     0.088
                            arfima           7.521    7.5972   0.24834     7.499     8.303
                            eigenv          0.2645    0.2648   0.00123     0.264     0.268
                            --Chris



                            • #15
                              I had contemplated the Threadripper, but it does not appear to have much advantage for longer processes.
