Common elements in a list?

Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#1


24 Aug 2014, 17:00

Dear Listers --

I am trying to create a fast function that outputs all of the common elements in two vectors. Currently, I have code that works as follows:

Code:

mata:
real colvector CommonVals(real colvector Z, real colvector Y)
{
    List=J(0,1,.)
    for (i=1;i<=rows(Z);i++) if (any(Y:==Z[i])) List=List \ Z[i]
    return(List)
}
end

Running:

Code:

X=1\2\3\4\5\6
Y=2\4\5\7
CommonVals(X,Y)

This produces the numbers 2, 4, and 5. While the function thus does exactly what I would like it to do (it returns the common elements of the vectors Z and Y), it requires looping over all the elements of Z. My problem is that in settings where Z has thousands or even hundreds of thousands of elements, looping over all of the entries in Z can be extremely costly in terms of computational time. In fact, in my application, this turns what might be an n^2 running time algorithm into more or less an n^3 one. So, I was wondering: does anyone know how to get the common values without looping over all the entries of one or the other vector?

Best,

Matt Baker
Aljar Meesters

Join Date: Apr 2014

Posts: 30
#2

25 Aug 2014, 03:47

You can use associative arrays to decrease the searching time. You may also want to look at an old list server thread that had a similar question (http://www.stata.com/statalist/archi.../msg00811.html).
Best,

Aljar
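
A minimal sketch of the associative-array idea mentioned above, assuming real-valued keys. The function name common_asarray() is hypothetical and is not taken from the linked thread. It still loops over both vectors, but each lookup is a hash lookup rather than a scan of the whole other vector.

```mata
mata:
// Hypothetical helper (illustrative name): returns the elements of Z that
// also occur in Y, using an associative array as a lookup table.
real colvector common_asarray(real colvector Z, real colvector Y)
{
    transmorphic   A
    real colvector keep
    real scalar    i

    A = asarray_create("real")               // real scalar keys
    for (i = 1; i <= rows(Y); i++) asarray(A, Y[i], 1)

    keep = J(rows(Z), 1, 0)                  // 0/1 flag per element of Z
    for (i = 1; i <= rows(Z); i++) keep[i] = asarray_contains(A, Z[i])

    return(select(Z, keep))
}

X = 1\2\3\4\5\6
Y = 2\4\5\7
common_asarray(X, Y)
end
```

Unlike a lookup vector indexed by value, this should not require the values to be positive integers, at the cost of some hashing overhead.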
Andrew Maurer

Join Date: Apr 2014

Posts: 28
#3

25 Aug 2014, 11:13

I'll just follow up and say that Aljar's method in the thread he linked to is very good.

Note the conditions required for the fastest method (vec_inlist()):
All elements in your list are integers.

You have sufficient memory. The memory required is proportional to the range of the list, at 8 bytes per element of that range, plus some overhead. E.g., if your list is {1,2,2,2,2,2,3}, the range is 3 and you'll need 3*8 bytes = 24 bytes plus some. If your list is {1, 1 billion}, the range is 1 billion and you'll need about 8 GB of memory plus some.

If those two conditions are satisfied, Aljar's mata function vec_inlist() is very efficient relative to other methods. I use this kind of method in a lot of big data applications to avoid sorting (one other application is in intlevelsof in this thread: http://www.statalist.org/forums/foru...rs-efficiently)

One fairly quick change that would require 1/8th the memory, but more time, would be to store the list as a Stata byte variable and st_view() it. (It would be nice to see byte format available in Mata in Stata 14!)
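
The memory arithmetic above can be sketched as a one-line helper. This is only a rough estimate under stated assumptions: the name mask_bytes() is hypothetical, values are positive integers used directly as subscripts, and each Mata real takes 8 bytes. Overhead is ignored.

```mata
mata:
// Hypothetical helper: approximate bytes needed for a 0/1 lookup vector
// that is indexed directly by value (one 8-byte real per possible value).
real scalar mask_bytes(real colvector list)
{
    return(8 * max(list))
}

mask_bytes(1\2\2\2\2\2\3)      // 3 values in range: 24 bytes
mask_bytes(1\1e9)              // 1 billion values in range: about 8 GB
end
```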
Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#4

25 Aug 2014, 11:56

Andrew and Aljar --

A very nice thread that works through the issues quite thoroughly. I've experimented a bit with Aljar's function and it really is quite quick!

Thanks!

Matt Baker
Mike Lacy

Join Date: Apr 2014

Posts: 2411
#5

26 Aug 2014, 14:13

I saw under -help mata any()- that:

"anyof(P, s) returns 1 if any element of P equals s and returns 0 otherwise. anyof(P, s) is faster and consumes less memory than the equivalent any(P:==s)."

When I ran your code above on two vectors of length 40e3, but using anyof(Y, Z[i]) instead of any(Y:==Z[i]), the code ran about 7X faster, and appeared to give the same results.

I didn't compare this to the time for Aljar's code, so I can't speak to that.

Regards, Mike
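
For reference, a sketch of the original function with anyof() swapped in. The name CommonVals2() is hypothetical. It also fills a preallocated 0/1 flag vector and calls select() once, instead of growing the result one row at a time; that second change is an assumption about what might help further and was not timed in this thread.

```mata
mata:
// Illustrative variant of CommonVals(): anyof() for the membership test,
// plus a preallocated flag vector in place of List = List \ Z[i].
real colvector CommonVals2(real colvector Z, real colvector Y)
{
    real colvector keep
    real scalar    i

    keep = J(rows(Z), 1, 0)
    for (i = 1; i <= rows(Z); i++) keep[i] = anyof(Y, Z[i])
    return(select(Z, keep))
}

X = 1\2\3\4\5\6
Y = 2\4\5\7
CommonVals2(X, Y)
end
```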
Matthew J. Baker

Join Date: Mar 2014

Posts: 126
#6

27 Aug 2014, 08:11

Aljar, Andrew, and Mike (and anyone else who might be interested!) --

Perhaps file this under the heading of "waste of time," but I put together a little do-file that builds Aljar's function, my naive function, and my function with Mike's suggested improvement. I create a little test example for timing that involves finding matching values of vectors with 50000 integer entries (sorted).

On my computer, I find that Mike's improvement makes the naive function go about 10 times faster. However, Aljar's function is a very clear winner: it runs about 100 times faster than Mike's on my computer. The do-file is attached.

Best,

Matt Baker

Attached Files

testsOfFinders.do (1.8 KB, 1 view)

Matthew J. Baker

Join Date: Mar 2014
Posts: 126
#7

27 Aug 2014, 08:20

The previous post didn't work, so here is my code:

Code:

/* Tests of functions finding common vals */
/* Test vectors */
clear all
set seed 5150
mata: 
/* First Aljar's function - it returns a 0/1 indicator for each row of B; */
/* the wrapper below uses mm_which() to return values, unlike Aljar's     */
real colvector vec_inlist(real colvector B, real colvector L)
{
    real colvector b, l, answer
    real scalar minrows

    b = J(max(B), 1, 0)
    b[B] = J(rows(B), 1, 1)

    l = J(max(L), 1, 0)
    l[L] = J(rows(L), 1, 1)

    minrows = min((rows(b), rows(l)))
    answer = J(rows(b), 1, 0)
    answer[|1, 1 \ minrows, 1 |] = b[|1, 1 \ minrows, 1 |] :* l[|1, 1 \ minrows, 1 |]

    answer= answer[B]
    return(answer)
}
/* Second: my function with Mike's anyof() improvement */
real colvector vec_entries1(real colvector B, real colvector L)
{
    real matrix List
    real scalar i

    List=J(0,1,.)
    for (i=1;i<=rows(L);i++) if (anyof(B,L[i])) List=List \ L[i]
    return(List)
}
/* Third: my original naive function, using any(B:==L[i]) */
real colvector vec_entries2(real colvector B, real colvector L)
{
    real matrix List
    real scalar i
    List=J(0,1,.)
    for (i=1;i<=rows(L);i++) if (any(B:==L[i])) List=List \ L[i]
    return(List)
}
/* A wrapper for the above three cases */
real colvector pos_inlist(X,Y,method)
{
    real matrix answer
    if (method==1) {
        answer=vec_inlist(X,Y)
        return(X[mm_which(answer)])
    }
    else if (method==2) return(vec_entries1(X,Y))
    else return(vec_entries2(X,Y))
}
end

/* Some "big" test vectors - ordered vectors with integer entries */

mata: 
B=round(10*runiform(50000,1))
B=uniqrows(runningsum(B))

L=round(10*runiform(50000,1))
L=uniqrows(runningsum(L))

for (i=1;i<=5;i++) {
    timer_on(i)
    Check=pos_inlist(B,L,1)
    timer_off(i)
}
for (i=6;i<=10;i++) {
    timer_on(i)
    Check=pos_inlist(L,B,2)
    timer_off(i)
}
for (i=11;i<=15;i++) {
    timer_on(i)
    Check=pos_inlist(B,L,3)
    timer_off(i)
}
timer()


end


Andrew Kretz

Join Date: Sep 2019

Posts: 3
#8

09 Sep 2019, 13:21

Hi Matthew,

Thank you for sharing the code. I'm relatively new to Mata and attempting to create a function like the one you were interested in, but for string values in two column vectors. Any advice as to how I might modify your code for that?

Thank you!
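
One possible adaptation for string column vectors, sketched under the assumption that exact, case-sensitive matches are wanted. The name common_strings() is hypothetical. A lookup vector indexed by value cannot be used with strings, so this sketch relies on Mata's associative arrays, whose default key type is a string scalar.

```mata
mata:
// Hypothetical helper: elements of Z that also occur in Y (exact match).
string colvector common_strings(string colvector Z, string colvector Y)
{
    transmorphic   A
    real colvector keep
    real scalar    i

    A = asarray_create()                     // default: string scalar keys
    for (i = 1; i <= rows(Y); i++) asarray(A, Y[i], 1)

    keep = J(rows(Z), 1, 0)
    for (i = 1; i <= rows(Z); i++) keep[i] = asarray_contains(A, Z[i])
    return(select(Z, keep))
}

a = "foo"\ "bar"\ "foobar"
b = "FOOBAR"\ "foo"\ "Bar"
common_strings(a, b)
end
```

With these inputs only "foo" matches, because "bar" and "Bar" differ in case.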
daniel klein

Join Date: Mar 2014

Posts: 3842
#9

10 Sep 2019, 00:31

My elabel package (SSC) has a function, aandb() (read: a and b), that works for both real and string row vectors. It appears to be about 10 times slower than Aljar's code; it could be slightly faster if you recompile the Mata source code (it comes pre-compiled under Stata 11.2). Here is an example:

Code:

* ssc install elabel
mata :
a = "foo"\ "bar"\ "foobar"
b = "FOOBAR"\ "foo"\ "Bar"
_aandb(a', b')'
aandb(a', b')'
end

and output

Code:

: a = "foo"\ "bar"\ "foobar"
: b = "FOOBAR"\ "foo"\ "Bar"
:
: _aandb(a', b')'
       1
    +-----+
  1 |  1  |
  2 |  0  |
  3 |  0  |
    +-----+

: aandb(a', b')'
  foo

Best
Daniel

Nicolas Paris

Join Date: Apr 2024
Posts: 1

#10

19 Apr 2024, 16:25

Thank you very much, Matthew.

I incorporated Aljar's function into one of my programs and encountered an issue: it fails when one of the vectors being compared contains a zero. Here's some example code:

Code:

clear all
mata:
real colvector vec_inlist(real colvector B, real colvector L)
{
    real colvector b, l, answer
    real scalar minrows

    b = J(max(B), 1, 0)
    b[B] = J(rows(B), 1, 1)

    l = J(max(L), 1, 0)
    l[L] = J(rows(L), 1, 1)

    minrows = min((rows(b), rows(l)))
    answer = J(rows(b), 1, 0)
    answer[|1, 1 \ minrows, 1 |] = b[|1, 1 \ minrows, 1 |] :* l[|1, 1 \ minrows, 1 |]

    answer= answer[B]
    return(answer)
}

A=(1\2\3\4\5\6)
B=(1\2\3\4\5)
C=(0\2\3\4\5\6)

vec_inlist(A,B)
vec_inlist(C,A)

end

The function returns an error when one of the column vectors contains a zero:

Code:

: vec_inlist(A,B)
       1
    +-----+
  1 |  1  |
  2 |  1  |
  3 |  1  |
  4 |  1  |
  5 |  1  |
  6 |  0  |
    +-----+

: vec_inlist(C,A)
            vec_inlist():  3301  subscript invalid
                 <istmt>:     -  function returned error
(1 line skipped)

I tried to fix it but couldn't figure out how to do so while keeping Aljar's approach. Can anyone help me figure out how to change the code?

Best,

Nicolás


Brian Bradfield

Join Date: Feb 2020
Posts: 3

#11

22 Apr 2024, 10:24

Hey Nicolas,

Here is a version that would work for that instance:

Code:

real colvector inlist_numcol(real colvector haystack,
                             real colvector needle)
{
  real scalar    minnum              // Minimum value of haystack and needle
  real colvector new_hay , new_need  // Shifted haystack & needle if minimum is below 1
  real colvector mask_hay, mask_need // Mask vectors
  real scalar    minrows             // Minimum number of rows
  real colvector output              // Returned result

  minnum = min((min(haystack), min(needle)))

  if (minnum >= 1)   // all values are already valid subscripts
  {
    mask_hay           = J(max(haystack) , 1, 0)
    mask_hay[haystack] = J(rows(haystack), 1, 1)

    mask_need         = J(max(needle) , 1, 0)
    mask_need[needle] = J(rows(needle), 1, 1)

    minrows = min((rows(mask_hay), rows(mask_need)))
    output  = J(rows(mask_hay), 1, 0)

    output[|1, 1 \ minrows, 1 |] = mask_hay[|1, 1 \ minrows, 1 |] :* mask_need[|1, 1 \ minrows, 1 |]

    return(output[haystack])
  }
  else
  {
    new_hay  = haystack :+ (1 - minnum)
    new_need = needle   :+ (1 - minnum)

    mask_hay          = J(max(new_hay) , 1, 0)
    mask_hay[new_hay] = J(rows(new_hay), 1, 1)

    mask_need           = J(max(new_need) , 1, 0)
    mask_need[new_need] = J(rows(new_need), 1, 1)

    minrows = min((rows(mask_hay), rows(mask_need)))
    output  = J(rows(mask_hay), 1, 0)

    output[|1, 1 \ minrows, 1 |] = mask_hay[|1, 1 \ minrows, 1 |] :* mask_need[|1, 1 \ minrows, 1 |]

    return(output[new_hay])
  }
}
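
The error arises because the values are used as subscripts, and Mata subscripts start at 1, so a 0 is not a valid index. A condensed sketch of the same shifting idea follows, assuming integer inputs; the name inlist_shift() is hypothetical. It always shifts both vectors by one minus the overall minimum, so zeros and negative integers map to valid subscripts, and it sizes a single mask to the largest shifted value.

```mata
mata:
// Hypothetical condensed variant: returns a 0/1 indicator for each row of
// haystack, equal to 1 if that value also occurs in needle.
real colvector inlist_shift(real colvector haystack, real colvector needle)
{
    real scalar    shift, top
    real colvector hay_idx, need_idx, mask

    shift    = 1 - min((min(haystack), min(needle)))
    hay_idx  = haystack :+ shift             // smallest value maps to 1
    need_idx = needle   :+ shift
    top      = max((max(hay_idx), max(need_idx)))

    mask           = J(top, 1, 0)
    mask[need_idx] = J(rows(need_idx), 1, 1) // mark values found in needle
    return(mask[hay_idx])                    // look up each haystack value
}

A = (1\2\3\4\5\6)
C = (0\2\3\4\5\6)
inlist_shift(C, A)
end
```

For these inputs the first row is 0 (the value 0 is not in A) and the remaining five rows are 1.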

Last edited by Brian Bradfield; 22 Apr 2024, 10:27.
