Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using mata operators efficiently

    Dear Statalist,

    I am using stata and mata for data manipulation and analysis using big data (~5-40gb). I am trying to understand at a relatively low level what is going on behind the scenes when programs are run, since minor changes to syntax can make a large difference in processing time and memory usage.
    1. One issue that I've been running into is having to have two copies of data in memory at once. See testcfcn1(), testfcn2(), and testfcn3() in the code block below. I ran each function while looking at task manager and found that memory usage from testfcn 1 and 2 was going up to 1.1gb to generate data that would ultimately only take up 377mb! (100mb from 10^8 bytes = 277mb overhead). The only alternate I can think of that would use less memory is to loop through the rows of the view (see testfcn3()), but this takes 15 seconds, relative to the 3 seconds that testfcn 1 and 2 take. A conceptual alternative would be to do something like V[.,.] := 1, but this syntax doesn't exist.
    2. This question is similar. What is the most efficient syntax for incriminating a vector? I have some code where I have an integer offset and use b[A:+offset] = J(length(A),1,1), where b is a vector and A:+offset is a vector that refers to indices of b. Timing the different parts of this line, I found that a significant amount of time was being taken by the A:+offset operation. Is there a better way to do this?
    3. This is pretty much the same as 2. but from a different angle. While I don't know C, I was looking at some C code and saw the += operator. Eg if A = (3,4,5)' and B = (1,3,6)', --A += B-- would be a more efficient version of writing --A = A + B--. Does mata support this kind of operator? In my current case, I have data, where for example, A = (1,1,3,4,4,4)'. I can very efficiently determine the levels of A using --B = J(4,1,0)--, then -- B[A] = J(rows(A),1,1)--, then the levels of A are --selectindex(B)) // returns (1,3,4)--. However, I would like to count the number of 1s, the number of 2s, etc (like "tabulate"), using syntax like B[A] += 1 without having to loop through the data.
    Code:
    clear all
    
    mata
    void testfcn1() {
        
        real colvector V
        
        st_addobs(10^8)
        st_view(V, ., st_addvar("byte", "var1", 1))
        V[.] = J(rows(V),1,1)
        // memory goes up to 1.1gb, then down to 377mb, takes 3 seconds
        
    }
    void testfcn2() {
            
        st_addobs(10^8)
        st_store(.,st_addvar("byte", "var1", 1),J(st_nobs(),1,1))
        // like testfcn(), memory goes up to 1.1gb, then down to 377mb, takes 3 seconds
        
    }
    void testfcn3() {
            
        st_addobs(10^8)
        st_view(V, ., st_addvar("byte", "var1", 1))
        for (i=1;i<=rows(V);i++) V[i] = 1
        // memory only ever goes to 377mb, but takes 15 seconds
        
    }
    
    end
    
    timer on 1
    mata testfcn1()
    timer off 1
    
    timer list

    Thank you,

    Andrew Maurer

  • #2
    Regarding question #2, does your offset change? Does the size of A change?

    If both are constant, it seems that you can get faster incrementing with matrix addition if the offset matrix is cached.

    Here is a simple test to support that.

    3 ways to increment are used in these Mata functions:
    Code:
    void inc1(n) {
        V = J(8, 1, 0)
        for (i = 1; i <= n; i++) {
            V = V :+ 1
        }
    }
    
    void inc2(n) {
        V = J(8, 1, 0)
        for (i = 1; i <= n; i++) {
            V = V + J(8, 1, 1)
        }
    }
    
    void inc3(n) {
        V = J(8, 1, 0)
        j8 = J(8, 1, 1)
        for (i = 1; i <= n; i++) {
            V = V + j8
        }
    }
    The above functions are timed with temp.do
    Code:
    args n
    
    timer on 1
    mata: inc1(`n')
    mata: inc1(`n')
    mata: inc1(`n')
    timer off 1
    
    timer on 2
    mata: inc2(`n')
    mata: inc2(`n')
    mata: inc2(`n')
    timer off 2
    
    timer on 3
    mata: inc3(`n')
    mata: inc3(`n')
    mata: inc3(`n')
    timer off 3
    
    noi timer list 1
    noi timer list 2
    noi timer list 3
    
    timer clear 1
    timer clear 2
    timer clear 3
    Here is Stata output:
    Code:
    . run temp.do 1e5
       1:      0.09 /        1 =       0.0850
       2:      0.08 /        1 =       0.0840
       3:      0.06 /        1 =       0.0550
    . run temp.do 1e6
       1:      0.77 /        1 =       0.7700
       2:      0.84 /        1 =       0.8370
       3:      0.54 /        1 =       0.5400

    Comment


    • #3
      Regarding question #3: If I understand what you're saying, you can do what you want without using a += operator.

      Code:
      : b = J(6, 1, 0)
      
      : a = (1, 1, 3, 4, 4, 4)'
      
      : b[a] = b[a] + J(6, 1, 1)
      
      : b
             1
          +-----+
        1 |  1  |
        2 |  0  |
        3 |  1  |
        4 |  1  |
        5 |  0  |
        6 |  0  |
          +-----+

      However, if what you're trying to get back is
      Code:
      : b
             1
          +-----+
        1 |  2  |
        2 |  0  |
        3 |  1  |
        4 |  3  |
        5 |  0  |
        6 |  0  |
          +-----+
      then I don't think a += operator will help you. Typically x += y is equivalent to x = x + y.

      And, to answer the stated question, it doesn't seem that the += operator exists in Mata:
      Code:
      : b[a] += J(6, 1, 1)
      invalid lval
      r(3000);

      Comment


      • #4
        Thanks for these responses, James.

        To clarify regarding question #2, neither the offset nor the size of A change. There are 2 issues - memory usage and processing time. The issue with inc2() and inc3() is that if the length of the vectors is large (eg 10^9), then V takes up 8gb and j8 takes up another 8gb. I would think that intuitively there should be a more efficient way of incrementing a vector than having to create a vector of ones, a billion long, and adding the vector of ones to V. Internally, the goal is to replace V[1] = V[1] + 1, V[2] = V[2] + 1, V[3] = V[3] + 1, etc. I would think that at a low level, it would be more work to do V[1] = V[1] + j8[1], V[2] = V[2]+j8[2], V[3] = V[3] + j8[3], etc. The overall idea would be something like V++, where V is a vector. (ie, i++ is more efficient that i=i+1. Can the same kind of syntax be done for matrices?)

        Regarding question 3, I'll give a bit more context to explain why something like += would help. Below is a function I wrote that returns the levels of a vector, A \footnote{Thanks to Aljar Meesters for the brilliant idea of the key line b[A:+offset,1] = J(length(A),1,1)}. I would like to do something similar that tabulates A using similarly efficient syntax.

        intlevelsof: a much quicker mata alternate to stata's levelsof when the range of variable A is "small" (memory usage is around (rangeA + 2*lengthA)*8 bytes.)
        Code:
        clear all
        mata
        real vector intlevelsof(real vector A)
        {
        /*
        - A must only contain integers.
        - The idea here is that if A = (4,5,5,3,3,1)', then b is created such that
        1s are placed in indices corresponding to the values of A. Ie:
        b = (1,0,1,1,1)'. The levels are then returned with selectindex(b).
        - offset is used to support negative values in A and to more efficiently
        ​store the b vector if the minimum value in A is much greater than 1.
        */
        
        real scalar maxA, minA, rangeA, offset
        real vector minmaxA, b
        
        minmaxA = minmax(A)
        minA = minmaxA[1,1]
        maxA = minmaxA[1,2]
        rangeA = maxA-minA+1
        offset = -minA+1
        
        if (rangeA > 10^9) _error(9,"range of vector must be less than 1 billion")
        // 10^9 is 8GB
        
        b = J(rangeA, 1, 0)
        b[A:+offset,1] = J(length(A),1,1)
        
        return(selectindex(b):-offset)
        
        }
        end
        intlevelsof() usage:
        Code:
        mata {
        /*
        Generate some data: N integers that range between 1 and s
        */
        N = 10^3
        s = 10^3
        R = ceil(s*runiform(N,1))
        }
        
        // count number of distinct integers in R
        timer on 1
        mata intlevelsof(R)
        timer off 1
        
        timer list
        Just to hammer out the idea here, the key line is very fast, relative to looping over elements of A and I think that the efficiency is a result of how subscripting in mata is optimized. If b = J(5,1,0), then --b[(1,3,2,2,2,1)'] = J(6,1,1)-- will change b from (0,0,0,0,0)' to (1,1,1,0,0)'. It's essentially doing: b[1]=1, then b[3]=1, then b[2]=1, then b[2]=1, then b[2]=1, then b[1]=1. I'd like to use similarly efficient syntax to tabulate A (eg, return (2,3,1,0,0)'). The reason I asked about the += operator is because tabulating A could be achieved by doing b[A:+offset,1] += J(length(A),1,1) if it were possible. Is there another way to do this without looping through the elements of A?

        Comment


        • #5
          I see. In #2 you are wanting most time-efficient incrementing but within some memory constraints. That makes sense. I suppose I focused on time-efficiency since you talked about timing in that question. In any case, I only know of three ways to increment a vector in Mata:
          1. with the :+ syntax you are using
          2. by adding a constant vector
          3. by incrementing each element in a loop
          Maybe someone else knows of another way.


          The reason I asked about the += operator is because tabulating A could be achieved by doing b[A:+offset,1] += J(length(A),1,1) if it were possible.
          It seems to me that there are two wishes behind this statement.
          1. That the += operator exist in Mata.
          2. That the += behave in such a way that b[A:+offset,1] += J(length(A),1,1) would tabulate A.
          The operator seems to not exist in Mata. If it did, it probably would not behave the way you want. Typically, x += y is equivalent to x = x + y, and b[A:+offset] = b[A:+offset] + J(length(A),1,1) does not do what you want.

          Comment


          • #6
            1. with the :+ syntax you are using
            2. by adding a constant vector
            3. by incrementing each element in a loop
            Maybe someone else knows of another way.
            My suggestion would be to use a hybrid form of 2 and 3. In that way you can speed up the loop with only a little bit of extra memory usage. As an example I have created testfcn4 (see below). As a matter of fact, on my machine it turns out to even run faster than testfcn1 and testfcn2.

            Code:
            void testfcn4() {
                st_addobs(10^8)
                st_view(V, ., st_addvar("byte", "var1", 1))
                stepsize = 1000
                ones = J(stepsize, 1, 1)
                
                i = 1
                if(rows(V) > stepsize)
                    for (;i <= (rows(V) - stepsize); i = i + stepsize) V[|i, 1\ i + stepsize - 1, 1|] = ones
                for (;i<=(rows(V));++i) V[i] = 1
                
            }

            Comment


            • #7
              Originally posted by Aljar Meesters View Post
              My suggestion would be to use a hybrid form of 2 and 3. In that way you can speed up the loop with only a little bit of extra memory usage. As an example I have created testfcn4 (see below). As a matter of fact, on my machine it turns out to even run faster than testfcn1 and testfcn2.
              Thanks very much for this response, Aljar. This looks like a good way to ensure that memory usage doesn't get out of control.

              I still don't fully understand how some simple built-in functions, like J(), can be so much faster than mata-programmed equivalents. In the introduction to StataCorp writes, "[mata] is fast enough to work at the element level". However, it seems like looping through elements in mata is still significantly slower than underlying C functions. Eg, in the examples below, jtest2() is on the order of 20x slower than jtest0() and jtest1() Even after "set processors 1" to remove any effects of parallelization, jtest2 is still 10x slower than jtest1 and 20x slower than jtest0.

              What factors contribute to this discrepancy? Is the primary factor that the built-in C functions have been compiled into machine code using a more efficient/optimized compiler than how the mata functions get compiled? Eg, still on the example of J(), is there any more efficient way of coding it at the C-level than something like -for (i=0;i<N;V[++i] = 1);-?

              Code:
              clear all
              
              mata
              void jtest0(real scalar t) {
                  real scalar N
                  real vector V
                  
                  N = 10^7
                  V = J(N,1,0) // initialize V for fair comparison
                  
                  timer_on(t)
                  V = J(N,1,1)
                  timer_off(t)
              }
              
              void jtest1(real scalar t) {
                  real scalar N
                  real vector V
                  
                  N = 10^7
                  V = J(N,1,0) // initialize V for fair comparison
                  
                  timer_on(t)
                  V = V:+1
                  timer_off(t)
              }
              
              void jtest2(real scalar t) {
                  real scalar N, i
                  real vector V
                  
                  N = 10^7
                  V = J(N,1,0) // initialize V for fair comparison
                  
                  timer_on(t)
                  for (i=0;i<N;V[++i] = 1);
                  timer_off(t)
              }
              end
              
              local reps 5
              forval i = 1/`reps' {
                  mata jtest0(1)
                  mata jtest1(11)
                  mata jtest2(21)
              }
              
              timer list
              returns:
              Code:
              . timer list
                 1:      0.19 /        5 =       0.0376
                11:      0.27 /        5 =       0.0534
                21:      5.80 /        5 =       1.1600

              Comment


              • #8
                There are three big issues.

                First, you are comparing to J(), and J() is written in very efficiently. Andrew asked if there is a more efficient way to write something like for (i=0; i<N; V[++i]=1). Yes there is.

                A double-precision vector of 8 elements is an 8*N area of memory. So begin by copying 1 to the first two elements (16 bytes) of memory.

                Then copy the first 16 bytes to next 16 bytes using C's (and the hardware's) memcpy().

                Then copy the first 32 bytes to the next 32 bytes.

                Then copy the first 64 bytes to the next 64 bytes.

                Then copy the first 128 bytes to the next 128 bytes.

                Then copy the first 256 bytes to the next 256 bytes.

                Should I keep going? The powers of 2 are 512, 1024, 2048, 4096, ... They grow impressively. It is true that copying 4096 bytes takes longer than copying 8 bytes, but it doesn't take anything like copying 8 bytes 512 times, which is what would otherwise be required, and we haven't even accounted for all the ++i operations we would have performed.

                In terms of our double precision numbers, we copied two numbers both at once after the set up then we copied 4 numbers at once, then 8, then 16, then 32, then 64, then 128, then 256, then 512.

                We can copy 10,000 elements in just 12 block copies, although we'll have to to perform a another "short" copy to fill in elements 8,192 to 10,000.

                The next big issue is that Mata does not have true integers. When we write for (i=0; i<N; V[++i]=1) in Mata, i is a double. C has true integers. Someday we plan for Mata to have true integers, but not until all Statas are 64-bit. Coprocessors make doubles fast, but integers are many times faster.

                And the third issue is that Mata compiles into P-code, not into machine language. P-code has to be executed in Mata's virtual machine. The theoretical minimum overhead of roughly a factor of 2 to 4 in a threaded-code approach (op codes are subroutine addresses). Mata actually achieves that in a few special cases. Those special cases were selected by us for "up front in-line" implementation. These special cases include loading a scalar and incrementing it. In general, however, Stata is running with an overhead around 5 to 6. Which is to say, to perform one machine-language instruction on your behalf, Stata has to execute 4 or 5 other machine-language instructions.

                These three factors explain the results, and each is roughly equally important.

                Offsetting all that is that Mata has built-in subroutines than run not only at the full speed C is capable of, but are written to be especially fast. In real-world statistical problems (say a routine to run a regression from start to finish), Mata runs at about half the speed C code. If we are lucky and have the right built-in subroutines, Mata runs even faster than that. If we unlucky, execution will take longer.

                At StataCorp, we do all statistical development in Mata. After the feature is implemented, we test performance. In some cases, we get lucky and performance is good enough. In other cases, it's not. In those cases, we analyze the performance to find the bottle necks, and we write a relatively small amount of C code as built-in Mata functions for our developed-and-working Mata solution to use.

                Interpretors like Stata's ado language, meantime, run with overheads of 50 to 200. Mata runs 10 to 40 times faster than ado code.

                -- BIll




                Comment

                Working...
                X