mean() is very slow in long loops (Stata MP 15.1)

Mauricio Caceres

Join Date: Sep 2015
Posts: 130

mean() is very slow in long loops (Stata MP 15.1)

30 May 2020, 14:36

I recently found that Mata's mean function is very slow when called on a long loop, and prohibitively so on Windows. I initially thought this might be due to mean() doing computations in quad precision, but quadsum() doesn't seem to have this issue. Here is an example of what I mean:

Code:

mata
real scalar function my_mean (real vector x)
{
    return(sum(x) / (length(x) - missing(x)))
}

real scalar function my_quadmean (real vector x)
{
    return(quadsum(x) / (length(x) - missing(x)))
}

n = 500000
x = runiform(n, 1)
x[selectindex(x :< 0.1)] = J(sum(x :< 0.1), 1, missingof(x))
ixl = (1::n) :- 100
ixu = (1::n) :+ 100
ixl[selectindex(ixl :< 1)] = J(sum(ixl :< 1), 1, 1)
ixu[selectindex(ixu :> n)] = J(sum(ixu :> n), 1, n)
y1 = J(n, 1, .)
y2 = J(n, 1, .)
y3 = J(n, 1, .)
end

timer clear
timer on 1
mata
for (i = 1; i <= n; i++) {
    y1[i] = mean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 1

timer on 2
mata
for (i = 1; i <= n; i++) {
    y2[i] = my_mean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 2

timer on 3
mata
for (i = 1; i <= n; i++) {
    y3[i] = my_quadmean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 3

mata assert(all(y1 :== y3))

On Windows

Code:

. timer list
   1:     90.75 /        1 =      90.7530
   2:      4.57 /        1 =       4.5670
   3:      4.85 /        1 =       4.8490

On Linux

Code:

. timer list
   1:      6.11 /        1 =       6.1070
   2:      1.76 /        1 =       1.7610
   3:      1.94 /        1 =       1.9430

I use Stata MP 15.1 on both systems. You can see that doing the mean "by hand" with either sum() or quadsum() takes significantly less time. The difference in Linux is 3x whereas on WIndows it's close to 20x. Does anyone have an insight on why this is happening?

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

30 May 2020, 15:35

There are unfortunately three differences between your setup and mine.
You are running Stata 15.1 and I am running Stata 16.1

You are running Stata/MP and I am running Stata/SE

You are running Stata for Windows and Stata for Linux and I am running Stata for Mac (on a MacBook Air, not the fastest horse in that stable)

With that said, I copied your example code into the do-file editor, preceded it with clear all and followed it with time list, and ran it.

Code:

. timer list 1: 3.47 / 1 = 3.4730 2: 1.25 / 1 = 1.2460 3: 1.50 / 1 = 1.5040

Again, Mata's mean() was slower than than the two custom versions. What astonishes me is how much faster everything I ran was than it was on your systems. I scaled up the problem by a factor of 10 and got the following results.

Code:

. timer list 1: 33.62 / 1 = 33.6160 2: 13.22 / 1 = 13.2250 3: 18.19 / 1 = 18.1860

These are consistent with my first set. But the speed disadvantage of mean() is more like 2.5x on my system, rather than the 3.5x on your Linux system or the whatever-in-the-world was going on in your Windows system.

I hope that other readers will weigh in with additional data. It's easy to run.

For what it's worth, these are the salient details of my system. To that, I will add that Activity Monitor tells me there's no pressure on memory while Stata is processing the do-file. And were there memory pressure, paging would be to an SSD drive, not a spinning platter.

Code:

Model Name: MacBook Air Model Identifier: MacBookAir8,1 Processor Name: Dual-Core Intel Core i5 Processor Speed: 1.6 GHz Number of Processors: 1 Total Number of Cores: 2 L2 Cache (per Core): 256 KB L3 Cache: 4 MB Hyper-Threading Technology: Enabled Memory: 8 GB
Comment
Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#3

31 May 2020, 05:49

My results are the following for Stata 14.2 IC on Windows 7 on a 10 year old Samsumg Laptop

Code:

. timer list 1: 6.69 / 1 = 6.6870 2: 2.89 / 1 = 2.8930 3: 3.25 / 1 = 3.2470

and if I increase n by a factor of 10 like William

Code:

. timer list 1: 66.34 / 1 = 66.3430 2: 30.23 / 1 = 30.2350 3: 30.26 / 1 = 30.2590

So for me the difference between the first approach and the other two is only two times.
At least in Stata 14, Mata's mean()-function uses the quadcross()-function to sum up the values in the input vector. Maybe quadcross() is slower than quadsum(). I have not tested it further yet, but later I could do more benchmarking, if needed.
1 like
Comment

Stephen Jenkins

Join Date: Apr 2014
Posts: 1435

31 May 2020, 12:04

FWIW running Mauricio's code:

Code:


. timer list
   1:     18.14 /        1 =      18.1410
   2:      1.87 /        1 =       1.8730
   3:      2.01 /        1 =       2.0080

. 
end of do-file

. about

Stata/MP 16.1 for Windows (64-bit x86-64)
Revision 20 May 2020
Copyright 1985-2019 StataCorp LLC

Total physical memory:       16.00 GB
Available physical memory:    9.70 GB

Single-user 4-core Stata perpetual license:

Run on a 6 year old Dell Precision PC

Comment

Sven-Kristjan Bormann

Join Date: Jul 2018

Posts: 310
#5

31 May 2020, 14:00

After some further benchmarking, my explanation for the results is that quadcross() take at least two times longer to calculate the result than quadsum(). We have to remember that quadcross() is X'*Z . For the mean, X is equal to 1 and Z is the x-vector is in the demo-code. Therefore, the mean()-function makes not only n additions but also n multiplications.
This seems to be a bit overkill at the first look, but the mean()-function allows weights to be applied, whereas the demo code with quadsum() does not deal with weights.
Comment

Mauricio Caceres

Join Date: Sep 2015
Posts: 130

31 May 2020, 14:13

Thanks everyone for the data points!

Sven-Kristjan Bormann I think you hit the nail on the head. I didn't realize "quadcross" was used even if no weights were specified. If I implement this in terms of "quadcross" (or "cross" for that matter) I get basically the same running time. The help page for "cross" has an example of computing the mean, which is what I used. I guess the mystery now is why "cross" is slow in this case. If I implement "X' Z" by literally typing that, it runs a lot faster.

Code:

mata
y4 = J(n, 1, .)
real scalar function my_quadcrossmean(real vector x)
{
    cp = quadcross(1, 0, x, 1)
    return(cp[1] / cp[2])
}

y5 = J(n, 1, .)
real scalar function my_crossmean(real vector x)
{
    cp = cross(1, 0, x, 1)
    return(cp[1] / cp[2])
}

y6 = J(n, 1, .)
real scalar function my_mean2(real vector x)
{
    xsel = x[selectindex(!rowmissing(x))]
    cp   = J(length(xsel), 1, 1)' * (xsel, J(length(xsel), 1, 1))
    return(cp[1] / cp[2])
}
end

timer on 4
mata
for (i = 1; i <= n; i++) {
    y4[i] = my_quadcrossmean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 4

timer on 5
mata
for (i = 1; i <= n; i++) {
    y5[i] = my_crossmean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 5

timer on 6
mata
for (i = 1; i <= n; i++) {
    y6[i] = my_mean2(x[|ixl[i] \ ixu[i]|])
}
end
timer off 6

mata assert(all(y1 :== y4))
mata assert(all(y5 :== y6))

On my Windows machine I now get

Code:

. timer list
   1:     90.75 /        1 =      90.7530
   2:      4.85 /        1 =       4.8490
   3:      4.57 /        1 =       4.5670
   4:     89.39 /        1 =      89.3850
   5:     86.29 /        1 =      86.2900
   6:     11.80 /        1 =      11.7980

Comment

Mauricio Caceres

Join Date: Sep 2015
Posts: 130

31 May 2020, 14:57

Sven-Kristjan Bormann I think even with weights it's a bit of a mystery why cross is slower:

Code:

mata
y7 = J(n, 1, .)
real vector function my_wmean(real matrix x, | real vector w)
{
    if ( args() == 1 ) w = J(rows(x), 1, 1)
    xwsel = selectindex(!rowmissing((x, w)))
    cp    = w[xwsel]' * (x[xwsel, .], J(rows(xwsel), 1, 1))
    return(cp[1::cols(x)] :/ cp[cols(x) + 1])
}
end

timer on 7
mata
for (i = 1; i <= n; i++) {
    y7[i] = my_wmean(x[|ixl[i] \ ixu[i]|])
}
end
timer off 7

mata max(reldif(y1, y7))

On my Windows machine,

Code:

. timer list
   1:     90.75 /        1 =      90.7530
   2:      4.85 /        1 =       4.8490
   3:      4.57 /        1 =       4.5670
   4:     89.39 /        1 =      89.3850
   5:     86.29 /        1 =      86.2900
   6:     11.80 /        1 =      11.7980
   7:     16.24 /        1 =      16.2420

While I don't really understand why there isn't a weighted version and an unweighted version (cross seems unnecessary to use without weights), even accounting for weights "mean()" seems fairly slow on my system.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

31 May 2020, 15:06

I note that the examples of extreme speed differences are on Windows Systems running Stata/MP. Those of us running Stata/SE or Stata/IC do not see the dramatic differences.

This suggests to me the possibility that the use of multiple processors produces a much greater gain for sum() and quadsum() than for mean() via quadcross(). However that explanation seems to falter in the face of the Stata/MP Performance Report, the latest version of which states that quadcross() and cross() are fully parallelized while saying nothing about quadsum() and sum().

I would find it interesting if Mauricio Caceres and Stephen Jenkins were to replicate their results with the following commands added to the front of the do-file.

Code:

set processors 1

My expectation is that the speed penalty for mean() via quadcross() compared to sum() on MP systems running on a single processor will look more like the penalty (2x-3x) for those of us with single processor licenses. What I'm really interested in seeing is the absolute speed comparison to the results under MP.
Comment

Stephen Jenkins

Join Date: Apr 2014
Posts: 1435

31 May 2020, 16:37

I reran the original code except with -set processors 1-

Code:

. timer list
   1:      4.69 /        1 =       4.6900
   2:      1.88 /        1 =       1.8770
   3:      2.02 /        1 =       2.0230

[/CODE]

I then did -cscript- and -set processors 4- and reran the code again, as follows. Note how timer #1 now longer!

Code:

. timer list
   1:     16.49 /        1 =      16.4900
   2:      1.80 /        1 =       1.7970
   3:      2.03 /        1 =       2.0340

Comment

Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#10

31 May 2020, 17:19

William Lisowski Stephen Jenkins I think the multiprocessing is exactly the issue! Thanks. If I run "set processors 1" and then run the code in my first post, I get this on Windows:

Code:

. timer list 1: 9.93 / 1 = 9.9300 2: 4.15 / 1 = 4.1520 3: 4.35 / 1 = 4.3470

No change in sum/quadsum but mean is 10x faster. Even if I just run "set processors 2" I go back to the slow performance of mean. I think that the conclusion is that cross/quadcross is very slow in parallel on Windows in Stata MP.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

31 May 2020, 18:33

Mauricio Caceres Stephen Jenkins -

First, many thanks to both of you for running this. With a single processor system, I've "got no dog in this hunt" as it were, but curiosity was killing me.

I think that the conclusion is that cross/quadcross is very slow in parallel on Windows in Stata MP.

That is exactly what I hoped we'd see. One of my earliest questions on Statalist revolved around something I was doing to avoid really slow performance of some command. The answers I got were "not slow for me!". I sent it along to Stata Tech Support, who discovered the reverse of the current problem - something that worked well with processors>1 did not work so well on a single processor. So this is the mirror of that. Glad I remembered that discussion.

Mauricio, I would recommend - since you originated this question - that you submit it to Stata Technical Services and point them to this topic for details.

As a side note, I was disappointed that the Stata/MP Performance Report gave detailed multiprocessor comparisons for various Stata commands, but limit their discussion of Mata functions to a single page summary. I've got to think that someone would have noticed a graph where the ratio of multiprocessor performance to single processor performance has a negative slope moving up from 1 processor.
1 like
Comment
Attaullah Shah

Join Date: Aug 2014

Posts: 1669
#12

01 Jun 2020, 02:12

Function mean() is written in the Mata language and has and additional if statements to take care of the weights. The if statements is evaluated each time when the mean function is called. Further, as pointed out by Sven-Kristjan Bormann, quddcross is computationally more complex that quadsum.

Code:

*! version 1.0.2 10feb2016 version 14.0 mata: real rowvector mean(real matrix X, |real colvector w) { real rowvector CP real scalar n if (cols(X)==0) return(J(1, 0, .)) if (rows(X)==0) return(J(1, cols(X), .)) if (args()==1) w = 1 CP = quadcross(w,0, X,1) n = cols(CP) return(CP[|1\n-1|] :/ CP[n]) } end

Last edited by Attaullah Shah; 01 Jun 2020, 02:29.

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

01 Jun 2020, 06:06

Attaullah Shah -

Does your experience with Mata development on asrol and asreg shed any light on the source of the major problem on Windows/MP systems shown in post #1 and discussed in posts #8-11: the fact that means() (and presumably quadcross()) is in absolute terms slower when run in Stata/MP with multiple processors enabled than when run with just a single processor enabled? Since those two programs focus in part on speed improvements, did you experienced this performance degradation in your development work?
Comment
Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#14

01 Jun 2020, 11:07

Attaullah Shah As noted in this post above the speed issue goes away if I run "set processors 1" so I don't think it's the added computational complexity of weights and the if statement. I think cross and quadcross are slow with multiprocessing but normal on single-core for some reason.

PS: There is no reason to use quadcross if weights are not specified; since the function needs if statements anyway I don't understand why quadcross is used even without weights (unless whomever wrote this function did not expect quadcross to ever had this speed issue).
Comment
Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#15

01 Jun 2020, 12:33

William Lisowski I reported the issue to Stata tech support, but they're saying I should upgrade to Stata 16 for support.
Comment

Announcement

mean() is very slow in long loops (Stata MP 15.1)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment