Terrible parallelization performance in Mata

David Roodman

Join Date: Jul 2014

Posts: 477
#1

30 Nov 2020, 19:09

I just ran an example from the SJ paper about boottest ("Fast and Wild") and was surprised at its poor performance. I'm fortunate enough to be working in Stata/MP, and I discovered that the more processors I enabled it to use, the slower it got. I'm using a Dell XPS 17 9700, which has a pretty good cooling system. Its CPU is an Intel i7-10875H, which has 8 cores and hyperthreading. I'm running Stata/MP 12-core 16.1. It's got 64GB of RAM and Windows 10 Pro.

Here is the log from a distilled demonstration. It sets the number of cores to 1, 2, ..., 12. On each iteration it calls a program that creates a 2500 x 1 matrix X and then computes X + X :* X 10,000 times. Simplifying that calculation to X + X or X :* X makes the problem go away.

I'm wondering if anyone else with access to Stata/MP gets similar results, or has insights. Possibly it doesn't happen on all computers. I understand that implementing invisible parallelization in a compiler is a tricky business. But Stata/MP doesn't come cheap!

Code:

cap mata mata drop demo()

mata
mata set matastrict on
mata set matalnum off
mata set mataoptimize on

void demo() {
    real matrix X; real scalar i
    X = runiform(2500,1)
    for (i=10000; i; i--)
        (void) X + X :* X
}
end

timer clear
forvalues p=1/12 {
  qui set processors `p'
  set seed 1202938431
  timer on `p'
  mata demo()
  timer off `p'
}
timer list

Output:

Code:

. timer list
   1:      0.14 /        1 =       0.1390
   2:      0.16 /        1 =       0.1640
   3:      1.63 /        1 =       1.6330
   4:      2.02 /        1 =       2.0150
   5:      2.47 /        1 =       2.4680
   6:      2.92 /        1 =       2.9210
   7:      3.38 /        1 =       3.3780
   8:      3.84 /        1 =       3.8370
   9:      4.26 /        1 =       4.2640
  10:      4.70 /        1 =       4.7040
  11:      5.21 /        1 =       5.2100
  12:      5.63 /        1 =       5.6260

That's right: using 1 core takes 0.14 seconds. Using 8 cores takes 3.84 seconds. Using 12 (with hyperthreading) takes 5.63 seconds.

Here's output I get from Stata 15.0--it's actually better! But still bad:

Code:

   1:      0.13 /        1 =       0.1280
   2:      0.14 /        1 =       0.1440
   3:      1.09 /        1 =       1.0900
   4:      1.35 /        1 =       1.3460
   5:      1.55 /        1 =       1.5540
   6:      1.78 /        1 =       1.7810
   7:      2.10 /        1 =       2.1010
   8:      2.35 /        1 =       2.3540
   9:      2.59 /        1 =       2.5920
  10:      2.89 /        1 =       2.8890
  11:      3.16 /        1 =       3.1570
  12:      3.41 /        1 =       3.4120

I monitored CPU usage during these tests and saw no evidence of throttling.

I'm worried that my Mata-based programs are getting seriously slowed down.

If you've got MP and can run this test, I'd be interested in the results.

Last edited by David Roodman; 30 Nov 2020, 19:19.

Ben Jann

Join Date: Sep 2014
Posts: 269

01 Dec 2020, 06:23

Here are timings from Stata 16.1/MP-4 on a Windows server I have access to (don't ask me what kind exactly; I could find out, though, if necessary). I observe a similar pattern.

Code:

cap mata mata drop demo()

mata
mata set matastrict on
mata set matalnum off
mata set mataoptimize on

void demo() {
    real matrix X; real scalar i
    X = runiform(2500,1)
    for (i=10000; i; i--)
        (void) X + X :* X
}
end

timer clear
forvalues p=1/4 {
  qui set processors `p'
  set seed 1202938431
  timer on `p'
  mata demo()
  timer off `p'
}
timer list

Output:

Code:

. timer list
   1:      0.15 /        1 =       0.1480
   2:      0.20 /        1 =       0.1980
   3:      1.32 /        1 =       1.3230
   4:      1.71 /        1 =       1.7140

I also observe that the problem goes away when the computation is simplified to X + X or X :* X. However, doing the computation in two steps does not seem to help:

Code:

cap mata mata drop demo()

mata
mata set matastrict on
mata set matalnum off
mata set mataoptimize on

void demo() {
    real matrix X, Y; real scalar i
    X = runiform(2500,1)
    for (i=10000; i; i--) {
        Y = X :* X
        Y = X + Y
    }
}
end

timer clear
forvalues p=1/4 {
  qui set processors `p'
  set seed 1202938431
  timer on `p'
  mata demo()
  timer off `p'
}
timer list

Output:

Code:

. timer list
   1:      0.15 /        1 =       0.1550
   2:      0.20 /        1 =       0.1970
   3:      1.27 /        1 =       1.2720
   4:      1.73 /        1 =       1.7280

ben

Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014
Posts: 348

01 Dec 2020, 08:05

I am able to reproduce on my Windows machine.

Code:

. timer list
   1:      0.16 /        1 =       0.1630
   2:      0.46 /        1 =       0.4560
   3:      3.22 /        1 =       3.2200
   4:      4.14 /        1 =       4.1410

And as Ben Jann observed, changing the code to

Code:

     for (i=10000; i; i--) {        
          Y = X :* X        
          Y = X + Y    
     }

does not help. But if you separate them into two loops, the problem goes away:

Code:

mata:
void demo1() {
    real matrix X; real scalar i
    X = runiform(2500,1)
    for (i=10000; i; i--) {
        (void) X :* X
    }

    for (i=10000; i; i--) {
        (void) X + X
    }    
}
end

. timer list
   1:      0.15 /        1 =       0.1460
   2:      0.47 /        1 =       0.4710
   3:      0.48 /        1 =       0.4780
   4:      0.47 /        1 =       0.4650

Note that the problem size is tiny. Hence the overhead can overpower the benefit of parallelization, and adding more cores can make performance worse. But something is definitely going on, given that X + X and X :* X on their own do not appear to have the issue. Anyway, we will investigate and report back.
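One way to probe the overhead-versus-benefit trade-off described here is to repeat the timing at several problem sizes. The following is only a sketch, assuming Stata/MP; sizedemo() is a hypothetical helper that is not part of the thread, and the sizes and repetition count are arbitrary choices.

```stata
cap mata mata drop sizedemo()

mata
// sizedemo() is a hypothetical helper, not from the thread:
// it repeats X + X :* X 'reps' times on an n x 1 vector
void sizedemo(real scalar n, real scalar reps) {
    real matrix X; real scalar i
    X = runiform(n, 1)
    for (i = reps; i; i--)
        (void) X + X :* X
}
end

timer clear
local t = 0
local pold = c(processors)        // remember the current setting
local pmax = c(processors_max)    // most processors this license/machine allows
foreach n in 2500 25000 250000 {
    foreach p in 1 `pmax' {
        local ++t
        qui set processors `p'
        timer on `t'
        mata: sizedemo(`n', 1000)
        timer off `t'
        qui timer list `t'
        display "n = `n', processors = `p': " r(t`t') " seconds"
    }
}
qui set processors `pold'         // restore the original setting
```

If the slowdown comes from a roughly fixed per-operation cost, the gap between the two processor settings should shrink, in relative terms, as n grows.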

Last edited by Hua Peng (StataCorp); 01 Dec 2020, 08:35.

David Roodman

Join Date: Jul 2014

Posts: 477
#4

01 Dec 2020, 09:20

Thank you, Hua. I agree the individual calculations are small. Still, I think the use case is realistic: one may repeat the same calculation many times on small data sets for Monte Carlo or bootstrapping purposes. So I'm glad you're investigating. This example is derived from the wild2() program in "Fast and Wild," which I think is pretty realistic and might provide a good test bed. I think there are at least 3 lines in that little program exhibiting the same behavior.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

01 Dec 2020, 11:12

This is not the first time this issue has arisen on Statalist. I was able to locate a previous topic I participated in (which did not draw the attention of anyone from StataCorp) at

https://www.statalist.org/forums/for...-stata-mp-15-1

and this topic also involved Mata's performance on multiprocessor systems.

In post #16 at the top of the second page of this earlier topic I summarized my conclusions as "multiprocessing is slow in long loops of small calculations" due to the overhead of setting up the multiple processes. Certainly the Stata 16 experience reported in post #1 of today's topic suggests each additional processor utilized beyond the third requires roughly 0.4 seconds of setup time; all this for a calculation that takes but 0.14 seconds on a single processor. My conclusion in the previous topic was that it "suggests that when using Stata/MP an early step in evaluating poor performance should be to run it on a single processor and see if your code is perhaps spending too much time preparing small tasks for multiprocessing."
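The single-processor check described above could be wrapped in a small helper. This is a minimal sketch, assuming Stata/MP; mpcompare is a hypothetical program name, not something shipped with Stata, and it borrows timers 98 and 99 for its own use.

```stata
* mpcompare is a hypothetical helper: it runs the command it is given
* once on 1 processor and once on the number of processors currently set
cap program drop mpcompare
program define mpcompare
    local pcur = c(processors)    // setting in effect when called
    timer clear 98
    timer clear 99

    qui set processors 1
    timer on 98
    capture noisily `0'
    timer off 98

    qui set processors `pcur'     // restore before the second run
    timer on 99
    capture noisily `0'
    timer off 99

    qui timer list
    display as text "1 processor: " as result r(t98) as text " seconds"
    display as text "`pcur' processors: " as result r(t99) as text " seconds"
end

* example call, using demo() as defined in post #1
mpcompare mata: demo()
```

If the single-processor run is the faster of the two, that points to the code spending its time on multiprocessing setup rather than on the calculation itself.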

From today's topic I learn that, based on the experience with the simplifications, Stata seems to apply heuristics to try to determine if the gain in performance is likely to be worth the pain of multiprocessing. It gets the right answer for the simplified expressions, and the wrong answer for the slightly-more-complex expressions.

I'm looking forward to what light StataCorp can shed on this. Certainly, the lesson is that Stata/MP does not, and realistically can not, guarantee that run times on multiple processors are bounded above by single processor performance. What we can hope is that StataCorp is able to incrementally improve the heuristics that make the choice to utilize additional processors.

Added in edit: Another earlier topic can be found at

https://www.statalist.org/forums/for...timing-mystery

which I mention only because it seems to support a recollection of mine, which I have not been able to track down on Statalist, of a performance issue of some sort that affected only Stata for Windows, because the task required OS support, and Linux and macOS handle that task more efficiently than Windows does.

Last edited by William Lisowski; 01 Dec 2020, 11:20.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 348
#6

01 Dec 2020, 12:29

OK, I think I found the problem. For Mata's colon operators (help m2_op_colon) in Stata/MP, we have a bug in the setup routine that chooses the number of threads to use when the number of cores/processors available is larger than or equal to 4. In the case David found, the matrix is 2500 x 1, so Stata/MP should use only 2 threads even if more than 2 cores/processors are available. But due to the bug, Stata/MP launches all 12 threads (the number allowed by David's machine and license). Only the first two are used for the calculation; the rest are launched and then exit without doing any work. Hence the large overhead.

Note: the bug only degrades the performance of Mata's colon operators for small problems on Stata/MP with more than 2 cores/processors. The numeric results are not affected, i.e., the results are correct.

We will get this fixed in a future Stata update.
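For anyone who cannot install the update, one possible stopgap (an assumption based on the timings reported earlier in this thread, not something StataCorp recommends here) is to cap the number of processors around the affected Mata code and restore the setting afterward:

```stata
* stopgap sketch, assuming Stata/MP; demo() is the function from post #1
local pold = c(processors)    // remember the current setting
qui set processors 2          // the timings in this thread were fine at 1 and 2
mata: demo()
qui set processors `pold'     // restore for the rest of the session
```

The cap applies to everything that runs while it is in effect, so it is best kept around just the small, repeated colon-operator calculations.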

Last edited by Hua Peng (StataCorp); 01 Dec 2020, 12:35.
David Roodman

Join Date: Jul 2014

Posts: 477
#7

01 Dec 2020, 12:39

Hua Peng (StataCorp), in my case, would it always be launching 12 threads, or p threads, where p is set by the loop in the demo? If it always launched 12, then I wouldn't expect the time cost to rise steadily with p as it does on my computer.
David Roodman

Join Date: Jul 2014

Posts: 477
#8

01 Dec 2020, 12:46

Thanks, William Lisowski, for the links to the old posts, including one of mine that I had completely forgotten! I found it a very interesting read...

Last edited by David Roodman; 01 Dec 2020, 13:36.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 348
#9

01 Dec 2020, 12:51

David Roodman, no, it will always launch as many threads as there are processors available. After -set processors 4-, the number of processors available becomes 4 instead of 12.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 348
#10

15 Dec 2020, 09:54

The issue is fixed in today's update. Type:

Code:

update all

to apply the update.
David Roodman

Join Date: Jul 2014

Posts: 477
#11

15 Dec 2020, 10:17

Excellent, thanks!