Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • mean() is very slow in long loops (Stata MP 15.1)

    I recently found that Mata's mean function is very slow when called on a long loop, and prohibitively so on Windows. I initially thought this might be due to mean() doing computations in quad precision, but quadsum() doesn't seem to have this issue. Here is an example of what I mean:

    Code:
    mata
    real scalar function my_mean (real vector x)
    {
        return(sum(x) / (length(x) - missing(x)))
    }
    
    real scalar function my_quadmean (real vector x)
    {
        return(quadsum(x) / (length(x) - missing(x)))
    }
    
    n = 500000
    x = runiform(n, 1)
    x[selectindex(x :< 0.1)] = J(sum(x :< 0.1), 1, missingof(x))
    ixl = (1::n) :- 100
    ixu = (1::n) :+ 100
    ixl[selectindex(ixl :< 1)] = J(sum(ixl :< 1), 1, 1)
    ixu[selectindex(ixu :> n)] = J(sum(ixu :> n), 1, n)
    y1 = J(n, 1, .)
    y2 = J(n, 1, .)
    y3 = J(n, 1, .)
    end
    
    timer clear
    timer on 1
    mata
    for (i = 1; i <= n; i++) {
        y1[i] = mean(x[|ixl[i] \ ixu[i]|])
    }
    end
    timer off 1
    
    timer on 2
    mata
    for (i = 1; i <= n; i++) {
        y2[i] = my_mean(x[|ixl[i] \ ixu[i]|])
    }
    end
    timer off 2
    
    timer on 3
    mata
    for (i = 1; i <= n; i++) {
        y3[i] = my_quadmean(x[|ixl[i] \ ixu[i]|])
    }
    end
    timer off 3
    
    mata assert(all(y1 :== y3))
    On Windows
    Code:
    . timer list
       1:     90.75 /        1 =      90.7530
       2:      4.57 /        1 =       4.5670
       3:      4.85 /        1 =       4.8490
    On Linux
    Code:
    . timer list
       1:      6.11 /        1 =       6.1070
       2:      1.76 /        1 =       1.7610
       3:      1.94 /        1 =       1.9430
    I use Stata MP 15.1 on both systems. You can see that doing the mean "by hand" with either sum() or quadsum() takes significantly less time. The difference in Linux is 3x whereas on WIndows it's close to 20x. Does anyone have an insight on why this is happening?

  • #2
    There are unfortunately three differences between your setup and mine.
    • You are running Stata 15.1 and I am running Stata 16.1
    • You are running Stata/MP and I am running Stata/SE
    • You are running Stata for Windows and Stata for Linux and I am running Stata for Mac (on a MacBook Air, not the fastest horse in that stable)
    With that said, I copied your example code into the do-file editor, preceded it with clear all and followed it with time list, and ran it.

    Code:
    . timer list
       1:      3.47 /        1 =       3.4730
       2:      1.25 /        1 =       1.2460
       3:      1.50 /        1 =       1.5040
    Again, Mata's mean() was slower than than the two custom versions. What astonishes me is how much faster everything I ran was than it was on your systems. I scaled up the problem by a factor of 10 and got the following results.

    Code:
    . timer list
       1:     33.62 /        1 =      33.6160
       2:     13.22 /        1 =      13.2250
       3:     18.19 /        1 =      18.1860
    These are consistent with my first set. But the speed disadvantage of mean() is more like 2.5x on my system, rather than the 3.5x on your Linux system or the whatever-in-the-world was going on in your Windows system.

    I hope that other readers will weigh in with additional data. It's easy to run.

    For what it's worth, these are the salient details of my system. To that, I will add that Activity Monitor tells me there's no pressure on memory while Stata is processing the do-file. And were there memory pressure, paging would be to an SSD drive, not a spinning platter.

    Code:
      Model Name:    MacBook Air
      Model Identifier:    MacBookAir8,1
      Processor Name:    Dual-Core Intel Core i5
      Processor Speed:    1.6 GHz
      Number of Processors:    1
      Total Number of Cores:    2
      L2 Cache (per Core):    256 KB
      L3 Cache:    4 MB
      Hyper-Threading Technology:    Enabled
      Memory:    8 GB

    Comment


    • #3
      My results are the following for Stata 14.2 IC on Windows 7 on a 10 year old Samsumg Laptop
      Code:
      . timer list
         1:      6.69 /        1 =       6.6870
         2:      2.89 /        1 =       2.8930
         3:      3.25 /        1 =       3.2470
      and if I increase n by a factor of 10 like William
      Code:
      . timer list
         1:     66.34 /        1 =      66.3430
         2:     30.23 /        1 =      30.2350
         3:     30.26 /        1 =      30.2590
      So for me the difference between the first approach and the other two is only two times.
      At least in Stata 14, Mata's mean()-function uses the quadcross()-function to sum up the values in the input vector. Maybe quadcross() is slower than quadsum(). I have not tested it further yet, but later I could do more benchmarking, if needed.

      Comment


      • #4
        FWIW running Mauricio's code:

        Code:
        
        . timer list
           1:     18.14 /        1 =      18.1410
           2:      1.87 /        1 =       1.8730
           3:      2.01 /        1 =       2.0080
        
        . 
        end of do-file
        
        . about
        
        Stata/MP 16.1 for Windows (64-bit x86-64)
        Revision 20 May 2020
        Copyright 1985-2019 StataCorp LLC
        
        Total physical memory:       16.00 GB
        Available physical memory:    9.70 GB
        
        Single-user 4-core Stata perpetual license:
        Run on a 6 year old Dell Precision PC

        Comment


        • #5
          After some further benchmarking, my explanation for the results is that quadcross() take at least two times longer to calculate the result than quadsum(). We have to remember that quadcross() is X'*Z . For the mean, X is equal to 1 and Z is the x-vector is in the demo-code. Therefore, the mean()-function makes not only n additions but also n multiplications.
          This seems to be a bit overkill at the first look, but the mean()-function allows weights to be applied, whereas the demo code with quadsum() does not deal with weights.

          Comment


          • #6
            Thanks everyone for the data points!

            Sven-Kristjan Bormann I think you hit the nail on the head. I didn't realize "quadcross" was used even if no weights were specified. If I implement this in terms of "quadcross" (or "cross" for that matter) I get basically the same running time. The help page for "cross" has an example of computing the mean, which is what I used. I guess the mystery now is why "cross" is slow in this case. If I implement "X' Z" by literally typing that, it runs a lot faster.

            Code:
            mata
            y4 = J(n, 1, .)
            real scalar function my_quadcrossmean(real vector x)
            {
                cp = quadcross(1, 0, x, 1)
                return(cp[1] / cp[2])
            }
            
            y5 = J(n, 1, .)
            real scalar function my_crossmean(real vector x)
            {
                cp = cross(1, 0, x, 1)
                return(cp[1] / cp[2])
            }
            
            y6 = J(n, 1, .)
            real scalar function my_mean2(real vector x)
            {
                xsel = x[selectindex(!rowmissing(x))]
                cp   = J(length(xsel), 1, 1)' * (xsel, J(length(xsel), 1, 1))
                return(cp[1] / cp[2])
            }
            end
            
            timer on 4
            mata
            for (i = 1; i <= n; i++) {
                y4[i] = my_quadcrossmean(x[|ixl[i] \ ixu[i]|])
            }
            end
            timer off 4
            
            timer on 5
            mata
            for (i = 1; i <= n; i++) {
                y5[i] = my_crossmean(x[|ixl[i] \ ixu[i]|])
            }
            end
            timer off 5
            
            timer on 6
            mata
            for (i = 1; i <= n; i++) {
                y6[i] = my_mean2(x[|ixl[i] \ ixu[i]|])
            }
            end
            timer off 6
            
            mata assert(all(y1 :== y4))
            mata assert(all(y5 :== y6))
            On my Windows machine I now get
            Code:
            . timer list
               1:     90.75 /        1 =      90.7530
               2:      4.85 /        1 =       4.8490
               3:      4.57 /        1 =       4.5670
               4:     89.39 /        1 =      89.3850
               5:     86.29 /        1 =      86.2900
               6:     11.80 /        1 =      11.7980

            Comment


            • #7
              Sven-Kristjan Bormann I think even with weights it's a bit of a mystery why cross is slower:

              Code:
              mata
              y7 = J(n, 1, .)
              real vector function my_wmean(real matrix x, | real vector w)
              {
                  if ( args() == 1 ) w = J(rows(x), 1, 1)
                  xwsel = selectindex(!rowmissing((x, w)))
                  cp    = w[xwsel]' * (x[xwsel, .], J(rows(xwsel), 1, 1))
                  return(cp[1::cols(x)] :/ cp[cols(x) + 1])
              }
              end
              
              timer on 7
              mata
              for (i = 1; i <= n; i++) {
                  y7[i] = my_wmean(x[|ixl[i] \ ixu[i]|])
              }
              end
              timer off 7
              
              mata max(reldif(y1, y7))
              On my Windows machine,

              Code:
              . timer list
                 1:     90.75 /        1 =      90.7530
                 2:      4.85 /        1 =       4.8490
                 3:      4.57 /        1 =       4.5670
                 4:     89.39 /        1 =      89.3850
                 5:     86.29 /        1 =      86.2900
                 6:     11.80 /        1 =      11.7980
                 7:     16.24 /        1 =      16.2420
              While I don't really understand why there isn't a weighted version and an unweighted version (cross seems unnecessary to use without weights), even accounting for weights "mean()" seems fairly slow on my system.

              Comment


              • #8
                I note that the examples of extreme speed differences are on Windows Systems running Stata/MP. Those of us running Stata/SE or Stata/IC do not see the dramatic differences.

                This suggests to me the possibility that the use of multiple processors produces a much greater gain for sum() and quadsum() than for mean() via quadcross(). However that explanation seems to falter in the face of the Stata/MP Performance Report, the latest version of which states that quadcross() and cross() are fully parallelized while saying nothing about quadsum() and sum().

                I would find it interesting if Mauricio Caceres and Stephen Jenkins were to replicate their results with the following commands added to the front of the do-file.
                Code:
                set processors 1
                My expectation is that the speed penalty for mean() via quadcross() compared to sum() on MP systems running on a single processor will look more like the penalty (2x-3x) for those of us with single processor licenses. What I'm really interested in seeing is the absolute speed comparison to the results under MP.

                Comment


                • #9
                  I reran the original code except with -set processors 1-

                  Code:
                  . timer list
                     1:      4.69 /        1 =       4.6900
                     2:      1.88 /        1 =       1.8770
                     3:      2.02 /        1 =       2.0230
                  [/CODE]

                  I then did -cscript- and -set processors 4- and reran the code again, as follows. Note how timer #1 now longer!

                  Code:
                  . timer list
                     1:     16.49 /        1 =      16.4900
                     2:      1.80 /        1 =       1.7970
                     3:      2.03 /        1 =       2.0340

                  Comment


                  • #10
                    William Lisowski Stephen Jenkins I think the multiprocessing is exactly the issue! Thanks. If I run "set processors 1" and then run the code in my first post, I get this on Windows:

                    Code:
                    . timer list
                       1:      9.93 /        1 =       9.9300
                       2:      4.15 /        1 =       4.1520
                       3:      4.35 /        1 =       4.3470
                    No change in sum/quadsum but mean is 10x faster. Even if I just run "set processors 2" I go back to the slow performance of mean. I think that the conclusion is that cross/quadcross is very slow in parallel on Windows in Stata MP.

                    Comment


                    • #11
                      Mauricio Caceres Stephen Jenkins -

                      First, many thanks to both of you for running this. With a single processor system, I've "got no dog in this hunt" as it were, but curiosity was killing me.

                      I think that the conclusion is that cross/quadcross is very slow in parallel on Windows in Stata MP.
                      That is exactly what I hoped we'd see. One of my earliest questions on Statalist revolved around something I was doing to avoid really slow performance of some command. The answers I got were "not slow for me!". I sent it along to Stata Tech Support, who discovered the reverse of the current problem - something that worked well with processors>1 did not work so well on a single processor. So this is the mirror of that. Glad I remembered that discussion.

                      Mauricio, I would recommend - since you originated this question - that you submit it to Stata Technical Services and point them to this topic for details.

                      As a side note, I was disappointed that the Stata/MP Performance Report gave detailed multiprocessor comparisons for various Stata commands, but limit their discussion of Mata functions to a single page summary. I've got to think that someone would have noticed a graph where the ratio of multiprocessor performance to single processor performance has a negative slope moving up from 1 processor.

                      Comment


                      • #12
                        Function mean() is written in the Mata language and has and additional if statements to take care of the weights. The if statements is evaluated each time when the mean function is called. Further, as pointed out by Sven-Kristjan Bormann, quddcross is computationally more complex that quadsum.

                        Code:
                        *! version 1.0.2  10feb2016
                        version 14.0
                        mata:
                        
                        real rowvector mean(real matrix X, |real colvector w)
                        {
                                real rowvector  CP
                                real scalar     n
                        
                                if (cols(X)==0) return(J(1, 0, .))
                                if (rows(X)==0) return(J(1, cols(X), .))
                        
                                if (args()==1) w = 1
                        
                                CP = quadcross(w,0, X,1)
                                n  = cols(CP)
                                return(CP[|1\n-1|] :/ CP[n])
                        }
                        
                        end
                        Last edited by Attaullah Shah; 01 Jun 2020, 02:29.
                        Regards
                        --------------------------------------------------
                        Attaullah Shah, PhD.
                        Associate Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
                        www.FinTechProfessor.com
                        If you use MS Word, do check my asdoc program that easily sends Stata output to MS Word

                        Comment


                        • #13
                          Attaullah Shah -

                          Does your experience with Mata development on asrol and asreg shed any light on the source of the major problem on Windows/MP systems shown in post #1 and discussed in posts #8-11: the fact that means() (and presumably quadcross()) is in absolute terms slower when run in Stata/MP with multiple processors enabled than when run with just a single processor enabled? Since those two programs focus in part on speed improvements, did you experienced this performance degradation in your development work?

                          Comment


                          • #14
                            Attaullah Shah As noted in this post above the speed issue goes away if I run "set processors 1" so I don't think it's the added computational complexity of weights and the if statement. I think cross and quadcross are slow with multiprocessing but normal on single-core for some reason.

                            PS: There is no reason to use quadcross if weights are not specified; since the function needs if statements anyway I don't understand why quadcross is used even without weights (unless whomever wrote this function did not expect quadcross to ever had this speed issue).

                            Comment


                            • #15
                              William Lisowski I reported the issue to Stata tech support, but they're saying I should upgrade to Stata 16 for support.

                              Comment

                              Working...
                              X