Construct matrix counting pairwise agreements

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#1

Construct matrix counting pairwise agreements

31 Aug 2019, 14:18

I'm interested in a set of problems in which the data from each individual is a vector of integer-valued responses, and I want to know the number of times that each distinct pair of individuals gives responses that agree with each other. Thus, I want an agreement matrix A, where A[i,j] contains the number of responses on which subject i agrees with (is identical to) subject j. It's like a similarity matrix, where similarity is the number of agreements across the response vector. I could not figure out a way to do this with - matrix dissimilarity-, which seems to implement a matching option only for binary data. I found no reasonable way to create this agreement matrix in Stata per se, and I need to work with it in Mata anyway.

The Mata code below is the best I came up with for speed. I'm interested in any suggestions or critique. Avoiding looping over measurements certainly helps, and it would be nice to avoid explicitly looping over observations as well.

Code:

// Simulate data set seed 1243 mata mata clear local nmeas = 15 // number of measurements local top = 4 // measurements go from 1, ... top local nobs = 500 local nreps = 50 // for timing only mata: X = ceil(`top ' * runiform(`nobs', `nmeas')) // // Count up agreements mata: N = rows(X) A = J(N, N, .) timer_clear(1) timer_on(1) for (reps = 1; reps <= `nreps' ; reps++) { // outer loop just for timing for (i = 1; i <= N - 1; i++) { for (j = i+1; j <= N; j++) { A[i,j] = rowsum(X[i,.] :== X[j,.]) // avoid loop over measurements } } } timer_off(1) timer() end
Tags: None

1 like
John Mullahy

Join Date: Dec 2016

Posts: 752
#2

31 Aug 2019, 18:32

Mike: This is an interesting question. Here's something I came up with that's based on the idea that agreement means element x[i,j] times the reciprocal of element x[m,j] equals 1.

Code:

mata function agree(z) { real matrix az,zz,iz,zzz az=z:+(abs(min(z))+1) zz=az#(1:/az) iz=range(1,cols(zz),cols(z)+1) zzz=rowshape(rowsum(zz[.,iz]:==1),rows(z)) return(zzz) } end

The key is using the Kronecker product and then selecting out its appropriate elements. (Note: The line in the function definition that defines az is meant to take care of non-positive entries.)

I only road-tested this a couple times so expect there are glitches, but maybe it suggests a start. When you come up with something more elegant (or that works correctly) I hope you'll post it.
Comment
John Mullahy

Join Date: Dec 2016

Posts: 752
#3

31 Aug 2019, 18:45

P.S. If the dimensions of your matrix are large, then the computation of the Kronecker product may take a long time (or consume all available memory), so your approach could well be superior.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

31 Aug 2019, 18:54

This is a thought-provoking question. What you need - and what I would argue Mata could and should offer - is what the programming language APL (and I assume its successor J) calls a "generalized inner product" operator that computed something akin to the inner product of an m by p matrix and an p by n matrix yielding an m by n result.

The syntax consisted of two operators separated by a dot. The usual inner product of two matrices A and B was written as A +•x B which indicated that the i,j element of the result was calculated by taking each of the p pairs of elements from row i of A and column j of B, multiplying the pair together, and then summing the p products.

So what you need is the m by m matrix X +•== X' where the i,j element is calculated by taking the corresponding elements of rows i and j of X, calculating a 1 if the pair are equal and a zero otherwise, and then summing these 1's and 0's.

My APL manual dated 1976 tells me "in present implementations of APL there are 441 distinct inner products available".

I'm sorry I can't come up wiith a Mata-specific solution to your problem. With an elegant solution liike generalized inner product, anything else seems like a hopeless hack.

Last edited by William Lisowski; 31 Aug 2019, 19:09.
1 like
Comment
John Mullahy

Join Date: Dec 2016

Posts: 752
#5

01 Sep 2019, 07:47

Mike: I just compared what I proposed with your original approach. Even with its looping your approach appears to be much faster. I'm not sure whether what I suggested is slowed by the Kronecker product computation, but some other calculation, or by an inefficiently written function.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#6

01 Sep 2019, 10:07

Thanks to both of you for chiming in here. I did some timing experiments, too, and found that both John's approach and mine take time O(N²), as might be expected, but my approach seems to be consistently faster. With number of measures = 15 and N = 500, my approach used about 1/6 the time of John's. As it turns out, my approach is not affected much by number of measurements, but John's is: My approach took about 40% more time with 40 measurements than it did with 10, but John's took about 1200% more time for the same comparison. I'm surprised because using purely matrix expressions as opposed to looping is generally much faster in Mata.

I do wonder, though, why 1) the matrix dissimilarity command in Stata doesn't have a Mata implementation; and 2) why the matrix dissimilarity command in Stata, which has a wide range of similarity and distance measure, only provides "matching" as a similarity measure for binary variables.
Comment
John Mullahy

Join Date: Dec 2016

Posts: 752
#7

01 Sep 2019, 10:50

Thanks Mike. This all makes sense. My idea involves multiplication then assessing equalities whereas yours directly considers equalities. It seems like the extra computational work involved in doing all those multiplications with the Kronecker product may explain much of the timing differences (this, and the fact that the Kronecker product is doing LOTS of irrelevant multiplications as well).

I share what I sense is your frustration that resorting to looping is like waving a white flag, but sometimes—maybe here—it's the best approach.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3860
#8

02 Sep 2019, 16:19

It is late in Germany so I might be missing something but I believe that the inner loop is not necessary. You can use subscripts more efficiently

Code:

mata: N = rows(X) A = J(N, N, .) timer_clear(1) timer_on(1) for (reps = 1; reps <= `nreps' ; reps++) { // outer loop just for timing for (i=1; i<N; ++i) A[i, (i+1)..N] = rowsum((X[i, ]:==X[(i+1)::N, ]))' } timer_off(1) timer() end

Best
Daniel
1 like
Comment

Sergio Correia

Join Date: Apr 2014
Posts: 420

02 Sep 2019, 18:09

Hi Mike, (John, William),

That's quite an interesting question. I think the trick to speeding this up lies in avoiding looping over N and instead loop over the answers.

For instance, in pseudo code it would be like this:

Create a matrix with the respondent id and the answers to the first question. Sort it by the first answer.
Call panelsetup() using the answer column as the ID column. This gives us the indices for the when the answer is 1, 2, etc.
Call panelsubmatrix() to get the list of respondents for each answer. Let's call this vector "idx"
Add 1 to those respondents. This can be done by working with A[idx, idx]

Now, in terms of code, the answer below uses the ftools package (ssc install ftools):

Code:

include ftools.mata, adopath // ssc install ftools

mata:
real matrix agreement_matrix(real matrix data) {
    class Factor scalar F
    real N, K, i, j
    matrix A
    vector index, idx, sorted_index
    
    N = rows(data)
    K = cols(data)
    index = 1::N
    A = -K * I(N)

    for (j=1; j<=cols(data); j++) {
        F = _factor(data[., j], 1, 0, "", 0, 1, ., 0) // Slighlty slower: F = _factor(data[., j])
        sorted_index = F.sort(index)
        for (i=1; i<=F.num_levels; i++) {
            idx = panelsubmatrix(sorted_index, i, F.info)
            A[idx,idx] = A[idx,idx] + J(rows(idx), rows(idx), 1)
        }
    }
    return(A)
}

It returns the same answer as your code, but takes 0.7s instead of 3.4s (20% runtime).

It can be improved further if we reshape the data so instead of having a loop for each question and another for each answer, we have one for each (question, answer) tuple:

Code:

real matrix agreement_matrix2(real matrix data) {
    class Factor scalar F
    real N, K, i
    matrix A, vec_data
    vector index, idx, sorted_index
    
    N = rows(data)
    K = cols(data)

    // Reshape the data to long format    
    vec_data = ((1::K) # J(N, 1, 1)), vec(data)
    index = J(K, 1, 1::N)
    // index, vec_data // 3 columns: respondent, question, answer

    A = -K * I(N)

    F = _factor(vec_data, 1, 0, "", 0, 1, ., 0) // Slighlty slower: F = _factor(data[., j])
    sorted_index = F.sort(index)
    for (i=1; i<=F.num_levels; i++) {
        idx = panelsubmatrix(sorted_index, i, F.info)
        A[idx,idx] = A[idx,idx] + J(rows(idx), rows(idx), 1)
    }
    return(A)
}

This speeds things up by 10% over my first attempt, bringing the total runtime to ~18%. I also played with the number of obs, and it seems that the code works in linear time wrt to it.

Comment

Sergio Correia

Join Date: Apr 2014

Posts: 420
#10

02 Sep 2019, 18:11

PS: Full do-file available here; it seems I can't upload attachments...
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#11

03 Sep 2019, 17:07

Thanks to all the contributors here. Both Daniel's and Sergio's approaches amazed me. The subscripting in Daniel's code is something I didn't realize was possible, and Sergio's algorithm is something I never would have intuited to be a faster approach. I can't imagine anyone is going to find something faster, but until now, that's what I thought about my own code!

For a task with N = 500, and n measurements = 15, my approach took about 4.2 sec for 50 reps., Daniel's about 1.2, and Sergio's about 0.9 sec. However, at least on my machine, doubling the N but leaving the number of measures and the number of categories the same increased the run time by a little less than 4X for *all* of our codes, as you'd expect for a problem that I'd think is inherently O(N²) . I wonder if there is something going on with parallelizing that is different on Sergio's machine vs. mine, since he seems to find a linear relation. (I have a two processor Windows machine, Stata 15.1.)

Paired comparisons on a 500 X 500 matrix in under 0.02 seconds (1 sec per 50 reps) must be close to what a straight C code could do.

This whole thing is a good lesson in how slow explicit looping is in Mata. I went back and found the original code I started with a few years ago, which I'd think of as the "natural" way to do this task (i.e., explicit loops over all pairs and all measurements). The code below took 107 sec. on my machine to do what took Sergio's code less than 1 sec. This is a good thing to know, especially for those among us (not me) who are smart enough to figure out how to avoid loops. <grin>

Code:

mata: N = rows(X) nmeas = cols(X) A = J(N, N, 0) for (i = 1; i <= N - 1; i++) { for (j = i+1; j <= N; j++) { for (k=1; k <= nmeas; k++) { A[i,j] = A[i,j] + (X[i,k] == X[j,k]) } } } end
1 like
Comment
Sergio Correia

Join Date: Apr 2014

Posts: 420
#12

03 Sep 2019, 21:49

Agree that the runtime should be linear, so I might have messed up something somewhere.

In terms of runtime, two things of note:
If you increase N from e.g. 500 to 2000 then Daniel's approach becomes faster. OTOH, if you increase the number of categories/questions it's the other way. So it seems there is no faster answer independently of these parameters.

The big O(N2) problem in my code is that at some point I have to subscript the A matrix and fill it in. That will be super slow with large N, due to Stata having to access cells that are increasingly more spaced out. Maybe a sparse matrix approach would have helped there, but I doubt it.

Best,
S
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#13

03 Sep 2019, 22:01

Sergio--Just to clarify: Wouldn't you think that the runtime would have to be at best quadratic in N? No matter how it's done there are fundamentally N*(N-1)/2 pairs of individuals to compare.
Comment
John Mullahy

Join Date: Dec 2016

Posts: 752
#14

03 Sep 2019, 22:08

Colleagues: I couldn't give up on the idea of a no-loop solution better than what I suggested in #2 and came up with this, which is fairly intuitive

Code:

mata function agreenew(z) { real matrix a,b,c a=J(rows(z),1,z) b=colshape(vec(J(1,rows(z),z)'),cols(z)) c=rowshape(rowsum((a-b):==0),rows(z)) return(c) } end

Yet when I time this compared with the super-fast solutions proposed in #8 and #9 it does not do very well, which surprised me since I assumed that J(...) and colshape/rowshape would be pretty efficient.

Maybe the lesson is that looping (over obs. or over outcomes) is really efficient compared with other matrix operations (e.g. J(...))? If anyone has insights I'd be grateful to know them. Thanks.
Comment
John Mullahy

Join Date: Dec 2016

Posts: 752
#15

04 Sep 2019, 08:10

For what it's worth I just realized that the expression that defines b in #14 can be simplified to

Code:

b=colshape(J(1,rows(z),z),cols(z))

This doesn't change the run time very much, however.
Comment

Announcement