Benchmarking Stata to Test Speed Across Computers and Versions

George Ford

Join Date: Aug 2014

Posts: 3148
#1

18 Dec 2018, 15:16

After helping my son upgrade his gaming machine, I decided it was time to upgrade my 8-year-old desktop. I wanted to see how my new computer compares to my existing one, so I wrote a program that runs 10 commands, 10 times each. I'm presently running Stata 15 MP(4), so for comparison purposes I chose 5 commands that scale 1:1 with the number of cores and 5 commands that do not scale at all with cores (according to Stata's analysis). I'm curious whether anyone has any better ideas.

It was an interesting project. Unfortunately, I don't have Stata 15 SE running (though I'm thinking about installing it, since I have it, just for the test). I do have Stata 13 SE on my desktop, so I ran the program in that version as well. Here are the code and the results.

I first create 4 random variables (the X's; 500,000 obs) and then create a Y based on the X's and a random disturbance. I also create a Z variable with 50 observations to fit a time-series model, and a 500x500 matrix to run a matrix command. I run each command 10 times to get a distribution, since at times I saw outliers for some commands (though not so much now). I think the median is probably the best indicator, since the mean can be affected by the outliers.

COMPUTER: i7-870 (4 cores; no overclock); SanDisk 512GB SSD; 32GB RAM (DDR3); video: AMD HD 5770 (a pair in SLI for a 4-monitor rig).

My upgrade includes a Ryzen 2700X processor (8 cores), an Adata NVMe drive, and 32GB RAM (DDR4). Same video cards (I don't game).

Thoughts welcome.

***********************************************
clear all
set obs 500000

/*****************************************************************************/
/***** CREATE A DATASET *******************************************/
/*****************************************************************************/
forv i = 1(1)4 {
qui g x`i' = rnormal()
}

*For Time Series Operations
qui g t = _n
tsset t
qui g z = 0.9*l.x2 + rnormal() if t<=51

*For matrix calculations
mata
M = rnormal(500,500,0,1)
st_matrix("M2",M)
end

/*****************************************************************************/
/***** CREATE A STORE MATRIX *******************************************/
/*****************************************************************************/
mata: st_matrix("R" , J(10,10,0))
matrix colnames R = replace correl regress predict bootstrap ///
mvtest xtile expand_drop arfima eigenv
/*****************************************************************************/
/***** TIMER PROGRAMS *******************************************/
/*****************************************************************************/
capture program drop tstart
program tstart, rclass
timer clear 1
timer on 1
end

capture program drop tend
program tend, rclass
timer off 1
qui timer list 1
scalar r = r(t1)
end
/*****************************************************************************/
/***** PROPORTIONAL TO CORES *******************************************/
/*****************************************************************************/
qui g y = 0
qui g yt = 0

*replace
forv i = 1(1)10 {
qui replace yt = 1 + 0.1*x1 -0.25*x2 + 0.4*x3 -0.15*x4 + rnormal()
tstart
qui replace y = yt
tend
matrix R[`i',1] = scalar(r)
tstart
qui correl y x1 x2 x3 x4
tend
matrix R[`i',2] = scalar(r)
*regress
qui reg y x1 x2 x3 x4 // first time is always very slow
tstart
qui reg y x1 x2 x3 x4
tend
matrix R[`i',3] = scalar(r)
*predict
tstart
qui predict cooksd, cooksd
tend
drop cooksd
matrix R[`i',4] = scalar(r)
*bootstrap
tstart
qui bootstrap , reps(25): reg y x1 x2 x3 x4
tend
matrix R[`i',5] = scalar(r)
}

/*****************************************************************************/
/***** MULTICORE HAS NO EFFECT *****************************************/
/*****************************************************************************/

forv i = 1(1)10 {
*mvtest normality
tstart
qui mvtest normality y x1
tend
matrix R[`i',6] = scalar(r)
*xtile
tstart
qui xtile tempv = y , nq(4)
tend
capture drop tempv
matrix R[`i',7] = scalar(r)
* expand, drop if
tstart
qui expand 2 , g(ex)
qui drop if ex==1
tend
capture drop ex
matrix R[`i',8] = scalar(r)
*arfima
tstart
qui arfima z
tend
matrix R[`i',9] = scalar(r)
*matrix eigenvalues
tstart
qui matrix eigenvalues re im = M2
tend
matrix R[`i',10] = scalar(r)
}

/*****************************************************************************/
/***** SUMMARIZE RESULTS *****************************************/
/*****************************************************************************/
svmat R , names(col)
summ replace regress predict correl bootstrap mvtest xtile expand_drop arfima eigenv
tabstat replace regress predict correl bootstrap ///
mvtest xtile expand_drop arfima eigenv , stats(median mean sd min max) columns(s)

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

18 Dec 2018, 19:04

I don't have Stata 15 SE running

I once reported a performance problem with Stata/SE to Stata Technical Services, and in the back-and-forth I learned that the person helping me was able to reproduce my Stata/SE problem on their Stata/MP system with

Code:

set processors 1

You might find this adequate for performance comparisons. My guess is that there is a common code base, and the difference between Stata/SE and Stata/MP is that the license key for Stata/MP sets processors_lic to whatever the license allows, while Stata/SE sets it to 1.
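
A minimal sketch of that check, assuming Stata/MP. The c() values below are the ones reported by creturn list; on other flavors set processors is not expected to apply, hence the guard.

```stata
* Report how many cores Stata is using, is licensed for, and can see
display "Stata/MP (1 = yes):  " c(MP)
display "processors in use:   " c(processors)
display "licensed processors: " c(processors_lic)
display "machine processors:  " c(processors_mach)
display "usable maximum:      " c(processors_max)

* Emulate a single-core run for timing comparisons (Stata/MP only)
if c(MP) {
    set processors 1
    display "now using " c(processors) " processor(s)"
}
```

To go back to the full allowance afterwards, set processors to the value reported by c(processors_max).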
George Ford

Join Date: Aug 2014

Posts: 3148
#3

19 Dec 2018, 07:04

Here are the results with Stata 15 MP (set processors 1).
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

19 Dec 2018, 08:03

Very interesting.

Am I correct in assuming that the Stata/MP version 15 results represent your new computer, and the Stata/SE 13 results represent your old computer?

I'd think it would be interesting to fit a model to your performance data with log(time) as the dependent variable and categorical variables for the computer (old v. new), the cpus (1 v. 4), and the command, and an indicator of whether the command is expected to scale. That would allow you to estimate percentage improvement from the upgrade and that from increasing the number of processors.
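
One way such a model could be set up, as a sketch only. The variable names (time, computer, cpus, cmd, scales) and the long data layout are assumptions, and the timings generated here are random placeholders so that the code runs on its own; they are not measurements from this thread.

```stata
* Hypothetical layout: one row per timing; times are simulated placeholders
clear
set seed 12345
set obs 400
gen byte computer = mod(_n, 2)                     // 0 = old, 1 = new
gen byte cpus     = cond(mod(ceil(_n/2), 2), 1, 4) // 1 or 4 processors
gen byte cmd      = 1 + mod(ceil(_n/4), 10)        // 10 commands, coded 1-10
gen byte scales   = cmd <= 5                       // 1 = expected to scale with cores
gen double time   = exp(rnormal())                 // placeholder, not real timings
gen double lntime = ln(time)

* log(time) on computer, processors, scaling indicator, and command dummies
regress lntime i.computer i.cpus##i.scales i.cmd

* Convert the computer coefficient into a percentage change in run time
display "new vs. old computer: " %6.1f 100*(exp(_b[1.computer]) - 1) " percent change in time"
```

Because the dependent variable is in logs, 100*(exp(b) - 1) turns a coefficient into a percentage change in run time. The main effect of scales is collinear with the command dummies, so Stata will report one term as omitted; the cpus-by-scales interaction is the term that separates the commands expected to scale from the rest.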

While it doesn't matter for the purposes of this topic, for future topics you may well improve the likelihood of response with improved presentation. Take a few moments to review the Statalist FAQ linked to from the top of the page, and note especially sections 9-12 on how to best pose your question. Screenshots and other pictures are discouraged, and it's recommended to copy commands and output from the Stata Results window and paste them into the Statalist post using code delimiters [CODE] and [/CODE], and also to use the dataex command when providing sample data. You'll find that the more you help others understand your problem, the more likely others are to be able to help you solve your problem.
Paul Dickman

Join Date: Apr 2014

Posts: 294
#5

19 Dec 2018, 12:07

If you haven't read it, you will probably find this page interesting.

Stata FAQ: Hardware requirements to run Stata

In particular, after RAM, "the next greatest effect on the performance of Stata is the processor. The faster the clock speed and the more cache a processor has, the faster Stata will run."

In some simple tests I found that performance was more strongly associated with the clock speed of the processor than with its cost. A 3-year-old i3 running at 3 GHz outperformed a new i7 running at 2 GHz, despite the latter costing considerably more. The relative performance obviously depends on the extent to which the commands make use of multiple cores/threads, but I found that even with commands that scale with the number of cores, the clock speed mattered more than the number of cores (exactly as written in the FAQ).
George Ford

Join Date: Aug 2014

Posts: 3148
#6

19 Dec 2018, 19:27

William:
I haven't yet run it on the updated machine because I was going to perform that upgrade next week. This allowed me to get some comments first to improve the program.
Thanks for the thoughts on presentation. I'll look that over and do better in the future.

Paul:
That's good information. I found that 13SE used much more memory than 15MP, but I'm not sure whether that's due to the cores or the newer version (I suspect the cores). The i7-870 is a 2.93 GHz processor. The Ryzen 7 2700X runs at 3.7 GHz, with overclock to 4.3 GHz (I plan to overclock).

Last edited by George Ford; 19 Dec 2018, 19:39.
George Ford

Join Date: Aug 2014

Posts: 3148
#7

03 Jan 2019, 14:40

Update is complete. Ryzen 7 2700X, 32GB 2666 DDR4 RAM. Adata 1TB M.2 NVMe drive (not used during the program). Overclocking is in "Auto" mode on the ASUS B450-F motherboard (averaging 4.01 GHz on a base clock of 3.7 GHz for 8 cores). Average CPU usage for the program was about 17% and memory usage about 25%. Drive usage 0%. Stata MP4.

The former CPU had a base clock of 2.9 GHz and 4 cores. I suspect Stata will run fine on 16GB RAM, so the 32GB was probably overkill (though maybe not for really large data sets).

On average, processing time was 50% of the pre-upgrade level. The range of improvements was 30% to 70%. MP-affected commands improved more than the others. Bootstrap was only about 32% faster, which again was less than hoped for and consistent with the CPU clock-speed increase. The Adata NVMe drive is super fast, but apparently played no role (but might with less RAM).

Another project I was working on had a two-way FE regression with about 300,000 observations. Before upgrade it took 165 seconds and now it takes 97 seconds (40% improvement). I was hoping for a little more, but I'll take it. Probably $800 in the upgrade since everything had to be changed but the video card (new CPU required new Motherboard which required new RAM). Drive was $200, so $600 for the speed bump.
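
The share of time saved and the speed-up ratio are easy to mix up, so here is a quick check using the 165 and 97 seconds quoted above (the scalar names are arbitrary):

```stata
* Before/after timings quoted in the post, in seconds
scalar t_before = 165
scalar t_after  = 97

* Share of run time saved, and the equivalent speed-up ratio
display "time saved: " %4.1f 100*(t_before - t_after)/t_before " percent"
display "speed-up:   " %4.2f t_before/t_after "x"
```

The two lines come to about 41 percent of the run time saved, or equivalently a speed-up of roughly 1.70x.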

George Ford

Join Date: Aug 2014
Posts: 3148
#8

03 Jan 2019, 14:42

Properly formatted code.

Code:

***********************************************
clear all
set obs 500000

/*****************************************************************************/
/***** CREATE A DATASET *******************************************/
/*****************************************************************************/
forv i = 1(1)4 {
    qui g x`i' = rnormal()
}

*For Time Series Operations
qui g t = _n
tsset t
qui g z = 0.9*l.x2 + rnormal() if t<=51

*For matrix calculations
mata
M = rnormal(500,500,0,1)
st_matrix("M2",M)
end

/*****************************************************************************/
/***** CREATE A STORE MATRIX *******************************************/
/*****************************************************************************/
mata: st_matrix("R" , J(10,10,0))
matrix colnames R = replace correl regress predict bootstrap ///
    mvtest xtile expand_drop arfima eigenv

/*****************************************************************************/
/***** TIMER PROGRAMS *******************************************/
/*****************************************************************************/
capture program drop tstart
program tstart, rclass
    timer clear 1
    timer on 1
end

capture program drop tend
program tend, rclass
    timer off 1
    qui timer list 1
    scalar r = r(t1)
end

/*****************************************************************************/
/***** PROPORTIONAL TO CORES *******************************************/
/*****************************************************************************/
qui g y = 0
qui g yt = 0

forv i = 1(1)10 {
    *replace
    qui replace yt = 1 + 0.1*x1 -0.25*x2 + 0.4*x3 -0.15*x4 + rnormal()
    tstart
    qui replace y = yt
    tend
    matrix R[`i',1] = scalar(r)
    *correl
    tstart
    qui correl y x1 x2 x3 x4
    tend
    matrix R[`i',2] = scalar(r)
    *regress
    qui reg y x1 x2 x3 x4 // first time is always very slow
    tstart
    qui reg y x1 x2 x3 x4
    tend
    matrix R[`i',3] = scalar(r)
    *predict
    tstart
    qui predict cooksd, cooksd
    tend
    drop cooksd
    matrix R[`i',4] = scalar(r)
    *bootstrap
    tstart
    qui bootstrap , reps(25): reg y x1 x2 x3 x4
    tend
    matrix R[`i',5] = scalar(r)
}

/*****************************************************************************/
/***** MULTICORE HAS NO EFFECT *****************************************/
/*****************************************************************************/

forv i = 1(1)10 {
    *mvtest normality
    tstart
    qui mvtest normality y x1
    tend
    matrix R[`i',6] = scalar(r)
    *xtile
    tstart
    qui xtile tempv = y , nq(4)
    tend
    capture drop tempv
    matrix R[`i',7] = scalar(r)
    * expand, drop if
    tstart
    qui expand 2 , g(ex)
    qui drop if ex==1
    tend
    capture drop ex
    matrix R[`i',8] = scalar(r)
    *arfima
    tstart
    qui arfima z
    tend
    matrix R[`i',9] = scalar(r)
    *matrix eigenvalues
    tstart
    qui matrix eigenvalues re im = M2
    tend
    matrix R[`i',10] = scalar(r)
}

/*****************************************************************************/
/***** SUMMARIZE RESULTS *****************************************/
/*****************************************************************************/
svmat R , names(col)
summ replace regress predict correl bootstrap mvtest xtile expand_drop arfima eigenv
tabstat replace regress predict correl bootstrap ///
    mvtest xtile expand_drop arfima eigenv , stats(median mean sd min max) columns(s)

George Ford

Join Date: Aug 2014

Posts: 3148
#9

03 Jan 2019, 17:52

I also ran the Benchmark program on Amazon's Cloud Service (AWS, EC2, 4 cores, 16GB storage). It ran much slower than my desktop, but still faster than my old machine. Run with MP4.

Also, a correction to post #7: the dataset for the FE regression had 1.7 million observations, not 300,000.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

03 Jan 2019, 19:06

I guess that in post #1 you were using the Stata/MP Performance Report to determine which commands could be expected to scale with the number of cores. In that document, Appendix E tells us

Replication-based prefix commands, such as bootstrap, fracpoly, jackknife, mfp, permute, rolling, simulate, statsby, and stepwise, were not explicitly assessed. These commands run another target command repeatedly; to the extent the target command’s performance is improved for a particular problem size, a similar improvement will be obtained when it is run repeatedly by the prefix command.

Coupled with your experience, that suggests to me that the bootstrap command itself is not parallelized. That is, to take your case for an example, I think Stata/MP does not run multiple regress commands in parallel, but each regress command will be parallelized.
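
A rough way to probe this on a given machine is to time the same bootstrap with one processor and then with the full allowance. This is a sketch under assumptions: it requires Stata/MP, the data are simulated, and the variable names and sample size are arbitrary.

```stata
* Simulated data for a small regression (placeholder, not the thread's dataset)
clear
set seed 2019
set obs 100000
gen double x = rnormal()
gen double y = 1 + 0.5*x + rnormal()

* Time an identical bootstrap with 1 processor, then with the usable maximum
local maxp = c(processors_max)
foreach p in 1 `maxp' {
    set processors `p'
    timer clear 1
    timer on 1
    quietly bootstrap, reps(25) seed(2019): regress y x
    timer off 1
    quietly timer list 1
    display "processors = `p': " %6.3f r(t1) " seconds"
}
```

If the two timings differ by much less than the ratio of processors, that would be consistent with the replications running one after another, with only the work inside each regress call being parallelized.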

It is not clear to me, did you upgrade your Stata/MP to support 8 cores? I expect that doing so would have added a substantial bump to the costs you report.
George Ford

Join Date: Aug 2014

Posts: 3148
#11

04 Jan 2019, 01:56

I used the Performance Report to pick the commands. There's a 40% improvement in bootstrap speed when going from 1 to 4 cores (using: set processors 1). That is not as big as for the other commands, which more closely track the 4:1 ratio.

I have not upgraded to MP8. I am thinking about it now that I have 8 cores, but it is often the bootstrap calculations that I'm waiting on.

Christopher Roebuck

Join Date: May 2019
Posts: 2

#12

22 May 2019, 10:25

I thought I'd chime in here. I ran George's code on my rig--I've had this setup for about 9 months I believe.

I have Stata/MP 15.1, 8-core license

CPU is an AMD Threadripper 1950x

128 GB of DDR4-2933 RAM (set at stock)

All solid state drives (.do file placed in and run from C:\data folder, the same SSD as the Stata install)

1st run with procs set at 4, and CPU set at stock 3.4 GHz (all times in seconds)...

variable	p50	mean	sd	min	max

replace	.0045	.0045	.000527	.004	.005
regress	.034	.0343	.000483	.034	.035
predict	.0205	.0212	.0056529	.015	.035
correl	.019	.0206	.0064842	.018	.039
bootstrap	13.3065	13.3003	.051586	13.232	13.384
mvtest	.1275	.1276	.0040056	.121	.135
xtile	.999	.9958	.0206763	.951	1.028
expand_drop	.0845	.0844	.0030623	.08	.089
arfima	4.5875	4.6083	.0842048	4.547	4.837
eigenv	.301	.3024	.0070585	.293	.317

2nd run with procs set at 8, and CPU set at stock 3.4 GHz...

variable	p50	mean	sd	min	max

replace	.003	.0026	.0005164	.002	.003
regress	.025	.0253	.0016364	.023	.029
predict	.014	.0144	.0025033	.011	.018
correl	.01	.0108	.0013166	.01	.014
bootstrap	11.914	11.9006	.1679069	11.694	12.116
mvtest	.109	.1095	.0043269	.105	.117
xtile	.992	.9832	.0218724	.93	1.003
expand_drop	.0865	.0866	.004274	.08	.092
arfima	4.369	4.3912	.0656147	4.34	4.563
eigenv	.318	.3169	.0136663	.299	.342
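
To make the 4-processor versus 8-processor comparison easier to read, the medians (p50) from the two tables above can be turned into ratios. A small sketch; the variable names are arbitrary:

```stata
* Medians (seconds) copied from the 1st run (4 procs) and 2nd run (8 procs)
clear
input str12 command double p50_4 double p50_8
"replace"      .0045   .003
"regress"      .034    .025
"predict"      .0205   .014
"correl"       .019    .01
"bootstrap"  13.3065 11.914
"mvtest"       .1275   .109
"xtile"        .999    .992
"expand_drop"  .0845   .0865
"arfima"      4.5875  4.369
"eigenv"       .301    .318
end

* Ratio above 1 means the 8-processor run was faster
gen double speedup = p50_4 / p50_8
format speedup %5.2f
list command p50_4 p50_8 speedup, noobs clean
```

A ratio near 2 would indicate scaling in line with the doubling of processors; a ratio near 1 indicates little or no benefit.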

3rd run with procs set at 8, and CPU overclocked to 3.8 GHz...

variable	p50	mean	sd	min	max

replace	.002	.0021	.0003162	.002	.003
regress	.024	.0249	.001792	.023	.028
predict	.017	.0158	.0027809	.011	.019
correl	.011	.0111	.0015239	.01	.015
bootstrap	11.816	11.8176	.1770593	11.6	12.066
mvtest	.1105	.1102	.0035528	.103	.115
xtile	.9685	.9649	.0189294	.914	.982
expand_drop	.086	.0857	.003653	.079	.091
arfima	4.4595	4.4783	.0634999	4.432	4.649
eigenv	.3055	.3091	.0137957	.292	.331

4th run with procs set at 8, and CPU overclocked to 4.0 GHz (it's getting hot in here...)

variable	p50	mean	sd	min	max

replace	.002	.002	0	.002	.002
regress	.024	.0236	.0005164	.023	.024
predict	.013	.0133	.003335	.008	.017
correl	.01	.0098	.0004216	.009	.01
bootstrap	11.262	11.2414	.1593279	11.057	11.425
mvtest	.107	.1063	.0035292	.1	.112
xtile	.932	.93	.0187498	.884	.958
expand_drop	.082	.0821	.0033813	.075	.087
arfima	4.4255	4.4484	.0729325	4.404	4.648
eigenv	.2875	.2877	.0082334	.278	.304

I tried a 5th run to push to 4.1 GHz but it froze up... as most TR owners know, it's pretty hard to move higher without higher voltage and better cooling (I'm using an AIO).

I am hoping this is useful to someone. One day I may splurge and upgrade to MP16 since I technically have the cores for it!

Cheers!

George Ford

Join Date: Aug 2014

Posts: 3148
#13

07 Aug 2019, 10:59

Looks interesting. For me, it's the bootstrap I'm always waiting for. I'm surprised its speed didn't improve much across the set of runs.

Christopher Roebuck

Join Date: May 2019
Posts: 2

#14

05 Nov 2019, 13:28

Some updated numbers...

Stata v15 with 8 cores, Threadripper 1950x OC'd to 3.9 GHz (nice and cool).

variable	p50	mean	sd	min	max

replace	0.002	0.0022	0.00103	0.001	0.005
regress	0.025	0.025	0.00125	0.023	0.027
predict	0.0125	0.0119	0.00173	0.009	0.014
correl	0.01	0.0102	0.00063	0.009	0.011
bootstrap	11.733	11.7148	0.2648	11.407	12.014
mvtest	0.106	0.1064	0.0047	0.095	0.112
xtile	0.979	0.9756	0.01403	0.942	0.992
expand_drop	0.084	0.0829	0.00208	0.078	0.085
arfima	7.358	7.4134	0.18192	7.318	7.926
eigenv	0.264	0.2636	0.00143	0.261	0.266

Stata v16 with 8 cores, Threadripper 1950x OC'd to 3.9 GHz (nice and cool).

variable	p50	mean	sd	min	max

replace	0.002	0.002	0.00047	0.001	0.003
regress	0.026	0.0265	0.00227	0.025	0.032
predict	0.012	0.0118	0.00132	0.009	0.013
correl	0.011	0.0113	0.00149	0.01	0.015
bootstrap	10.8605	10.8442	0.05976	10.748	10.927
mvtest	0.1055	0.105	0.00189	0.103	0.108
xtile	0.9445	0.941	0.01277	0.912	0.954
expand_drop	0.083	0.0831	0.00328	0.077	0.088
arfima	7.1685	7.2483	0.25245	7.141	7.964
eigenv	0.2675	0.2679	0.00179	0.266	0.271

Stata v16 with 16 cores, Threadripper 1950x OC'd to 3.9 GHz (nice and cool).

variable	p50	mean	sd	min	max

replace	0.002	0.0017	0.00048	0.001	0.002
regress	0.022	0.0221	0.0011	0.02	0.024
predict	0.007	0.0071	0.00088	0.006	0.009
correl	0.008	0.0082	0.00079	0.007	0.009
bootstrap	10.289	10.3043	0.06268	10.231	10.445
mvtest	0.107	0.106	0.00343	0.1	0.111
xtile	0.9525	0.9497	0.00959	0.927	0.96
expand_drop	0.086	0.0857	0.00254	0.079	0.088
arfima	7.521	7.5972	0.24834	7.499	8.303
eigenv	0.2645	0.2648	0.00123	0.264	0.268

--Chris

George Ford

Join Date: Aug 2014

Posts: 3148
#15

21 Feb 2020, 13:08

I had contemplated the Threadripper, but it does not appear to have much advantage for longer processes.