Obviously, a 300,000 × 300,000 matrix exceeds memory limits (at 8 bytes per entry, that's roughly 720 GB), so the full matrix multiplication in #15 isn't feasible. However, we can set it up as a blocked matrix multiplication, which only ever holds one N × blocksize slice of the product in memory (thanks for the tip, ChatGPT).
Now, the time complexity is still \(O\left(n^2\right)\). But, on my machine, the blocked code below handles the example data in #12 instantly, as opposed to the 2 seconds the original code in #12 takes. Expanding from 1,000 to 5,000 observations, the blocked matrix multiplication still runs in less than a second, whereas the slightly revised code in #13 takes almost 10 seconds, and the original code takes more than half a minute. For 50,000 observations, I timed the blocked matrix multiplication at somewhere between 1.5 and 4 minutes, depending on the block size (I didn't time any of the other approaches on that dataset). I assume the 300k problem will still take much longer, but at least it might be feasible. Otherwise, someone should look into parallelization.
Code:
putmata X = (ins*)
mata:
N = rows(X)
n_competitors = J(N, 1, 0)
blocksize = 1000
Xt = X'
for (from=1; from<=N; from=from+blocksize) {
    to = min((from+blocksize-1, N))
    xt = Xt[., from..to]                       // one block of columns (note the "." row subscript)
    // count positive dot products per column, minus 1 for the observation itself
    n_competitors[from::to] = (colsum((X*xt) :> 0) :- 1)'
}
n_competitors = rowmax((n_competitors, J(N, 1, 0)))   // clip negatives (all-zero rows)
end
getmata n_competitors = n_competitors
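For readers outside Stata, here is a sketch of the same blocking idea in Python/NumPy. It is a translation, not the original code: the array `X`, the block size, and the random test data are my own illustrative choices; only the counting rule (positive dot product with another observation) mirrors the Mata code above.

```python
import numpy as np

# Illustrative data standing in for the real 300k-row problem
rng = np.random.default_rng(0)
N, k = 500, 3
X = rng.normal(size=(N, k))

blocksize = 100
n_competitors = np.zeros(N, dtype=np.int64)
for start in range(0, N, blocksize):
    stop = min(start + blocksize, N)
    # N x blocksize slice of the Gram matrix; the full N x N product
    # is never materialized, so memory stays at O(N * blocksize)
    block = X @ X[start:stop].T
    # count positive dot products per column, minus 1 for the row itself
    n_competitors[start:stop] = (block > 0).sum(axis=0) - 1
n_competitors = np.maximum(n_competitors, 0)  # clip negatives (all-zero rows)
```

The design point is the same as in the Mata version: each loop iteration computes one vertical slice of the Gram matrix, reduces it to a count, and discards it, so the peak memory is N × blocksize instead of N × N.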