Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Common elements in a list?

    Dear Listers --

    I am trying to create a fast function that outputs all of the the common elements in two vectors. Currently, I have code that works as follows:
    Code:
    mata:
    real colvector CommonVals(real colvector Z, real colvector Y) {
        List=J(0,1,.)
        for (i=1;i<=rows(Z);i++) if (any(Y:==Z[i])) List=List \ Z[i]
        return(List)
    }
    end
    Running:
    Code:
    X=1\2\3\4\5\6
    Y=2\4\5\7
    CommonVals(X,Y)
    Produces the numbers 2,4, and 5. While the function thus does exactly what I would like it to - it returns the common elements of the vectors Z and Y - it requires looping over all the elements of Z. My problem is that in settings where Z has thousands or even hundreds of thousands of elements, looping over all of the entries in Z can be extremely costly in terms of computational time. In fact, in my application, this turns what might be an n^2 running time algorithm into more or less an n^3 one. So, I was wondering: does anyone know how to get the common values without looping over all the entries of one or the other vector?

    Best,

    Matt Baker




  • #2
    You can use associative arrays to decrease the searching time. You may also want to look at an old list server thread that had a similar question (http://www.stata.com/statalist/archi.../msg00811.html).
    Best,

    Aljar

    Comment


    • #3
      I'll just follow up and say that Aljar's method in the thread he linked to is very good.

      Note the required conditions for the fastest method (vec_inlist()):
      • All elements in your list are integers
      • You have sufficient memory (memory required is proportional to the range of the list. Eg if your list is {1,2,2,2,2,2,3}, you'll need 3*8 bytes = 24bytes + some ((range of 3) * 8 bytes per element). If your list is {1,1billion} you'll need 8gb of memory + some ((range of 1billion) * 8 bytes per element).
      If those two conditions are satisfied, Aljar's mata function vec_inlist() is very efficient relative to other methods. I use this kind of method in a lot of big data applications to avoid sorting (one other application is in intlevelsof in this thread: http://www.statalist.org/forums/foru...rs-efficiently)

      One fairly quick change that would require 1/8th the memory, but more time would be to store the list as a stata byte variable and st_view() it. (It would be nice to see byte format available in mata in Stata 14!)

      Comment


      • #4
        Andrew and Aljar --

        A very nice thread that works through the issues quite thoroughly. I've experimented a bit with Aljar's function and it really is quite quick!

        Thanks!

        Matt Baker

        Comment


        • #5
          I saw under -help mata any()- that:

          "anyof(P, s) returns 1 if any element of P equals s and returns 0 otherwise. anyof(P, s) is faster and consumes less memory than the equivalent any(P:==s)."

          When I ran your code above on two vectors of length 40e3, but using anyof(Y, Z[i]) instead of any(Y:==Z[i]), the code ran about 7X faster, and appeared to give the same results.

          I didn't compare this to the time for Aljar's code, so I can't speak to that.

          Regards, Mike


          Comment


          • #6
            Aljar, Andrew, and Mike (and anyone else who might be interested!) --

            Perhaps file this under the heading of "waste of time," but I put together a little do file that builds Aljar's function, my naive function, and my function with Mike's suggested improvement. I create a little test example for timing that involves finding matching values of vectors with 50000 integer entries (sorted).

            On my computer, I find that Mike's improvement makes the naive function go about 10 times faster. However, Aljar's function is a very clear winner - it runs about 100 times faster than Mike's on my computer. do-file is attached.

            Best,

            Matt Baker
            Attached Files

            Comment


            • #7
              The previous post didn't work, so here is my code:
              Code:
              /* Tests of functions finding common vals */
              /* Test vectors */
              clear all
              set seed 5150
              mata: 
              /* First Aljar's function - note it uses mm_which and returns values, unlike Aljar's */
              real colvector vec_inlist(real colvector B, real colvector L)
              {
                  real colvector b, l
                  real scalar minrows, answer
              
                  b = J(max(B), 1, 0)
                  b[B] = J(rows(B), 1, 1)
              
                  l = J(max(L), 1, 0)
                  l[L] = J(rows(L), 1, 1)
              
                  minrows = min((rows(b), rows(l)))
                  answer = J(rows(b), 1, 0)
                  answer[|1, 1 \ minrows, 1 |] = b[|1, 1 \ minrows, 1 |] :* l[|1, 1 \ minrows, 1 |]
              
                  answer= answer[B]
                  return(answer)
              }
              /* Second my naive function */
              real colvector vec_entries1(real colvector B, real colvector L)
              {
                  real matrix List
                  real scalar i
                  
                  List=J(0,1,.)
                  for (i=1;i<=rows(L);i++) if (anyof(B,L[i])) List=List \ L[i]
                  return(List)
              }    
              /* My naive function with Mike's improvement */
              real colvector vec_entries2(real colvector B, real colvector L)
              {
                  real matrix List
                  real scalar i
                  List=J(0,1,.)
                  for (i=1;i<=rows(L);i++) if (any(B:==L[i])) List=List \ L[i]
                  return(List)
              }    
              /* A wrapper for the above three cases */
              real colvector pos_inlist(X,Y,method)
              {
                  real matrix answer
                  if (method==1) {
                      answer=vec_inlist(X,Y)
                      return(X[mm_which(answer)])
                  }
                  else if (method==2) return(vec_entries1(X,Y))
                  else return(vec_entries2(X,Y))
              }
              end
              
              /* Some "big" test vectors - ordered vectors with integer entries */
              
              mata: 
              B=round(10*runiform(50000,1))
              B=uniqrows(runningsum(B))
              
              L=round(10*runiform(50000,1))
              L=uniqrows(runningsum(L))
              
              for (i=1;i<=5;i++) {
                  timer_on(i)
                  Check=pos_inlist(B,L,1)
                  timer_off(i)
              }
              for (i=6;i<=10;i++) {
                  timer_on(i)
                  Check=pos_inlist(L,B,2)
                  timer_off(i)
              }
              for (i=11;i<=15;i++) {
                  timer_on(i)
                  Check=pos_inlist(B,L,3)
                  timer_off(i)
              }
              timer()
              
              
              end

              Comment


              • #8
                Hi Matthew,

                Thank you for sharing the code. I'm relatively new to Mata and attempting to create a function like the one you were interested in, but for string values in two column vectors. Any advice as to how I might modify your code for that?

                Thank you!

                Comment


                • #9
                  My elabel package (SSC) has a function, aandb() (read: a and b), that works for both real and string row vectors. It appears to be about 10 times slower than Aljar's code; it could be slightly faster if you recompile the Mata source code (it comes pre-compiled under Stata 11.2). Here is an example:

                  Code:
                  * ssc install elabel
                  
                  mata :
                  
                  a = "foo"\ "bar"\ "foobar"
                  b = "FOOBAR"\ "foo"\ "Bar"
                  
                  _aandb(a', b')'
                  aandb(a', b')'
                  
                  end
                  and output

                  Code:
                  : a = "foo"\ "bar"\ "foobar"
                  
                  : b = "FOOBAR"\ "foo"\ "Bar"
                  
                  : 
                  : _aandb(a', b')'
                         1
                      +-----+
                    1 |  1  |
                    2 |  0  |
                    3 |  0  |
                      +-----+
                  
                  : aandb(a', b')'
                    foo
                  Best
                  Daniel

                  Comment

                  Working...
                  X