Avoiding double-counting competitors when summing overlapping dummy variables

Giacomo Buzzao

Join Date: Nov 2020
Posts: 5

Avoiding double-counting competitors when summing overlapping dummy variables

09 Jul 2025, 15:21

Dear Statalist,

I am working on a measure of "competitive density" for each firm in a dataset, based on the number of other firms active in the same sectors. Each sector is represented as a dummy variable (1 if the firm is active in that sector, 0 otherwise). I have 105 such dummies and 300k observations (each one representing a firm).

Here's a simplified version of my approach:

For each dummy variable, I count how many firms are active in that sector.
I then generate new variables (numorg_*) where each firm that is active in a sector gets the total number of competitors in that sector (including itself).
I then sum all the numorg_* variables to get a total count of competitors across all sectors where a firm is active.
Finally, I divide by the number of sectors in which the firm is active to get an average competitiveness score.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(serv_mensa serv_ricovero serv_riab serv_master serv_coord_altrorg serv_supp_oper serv_segr_soc serv_camp_inf serv_prom_polit)
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0
0 0 0 0 1 1 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 1 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
end

Here is the code:

Code:

* Step 1: loop over dummies to create numorg_* variables

local dummies serv_a serv_b serv_c serv_d // [shortened for brevity]

foreach var of local dummies {
    qui count if `var' == 1
    local tot = r(N)
    gen numorg_`var' = cond(`var' == 1, `tot', 0)
}

* Step 2: sum over all numorg_* variables
gen comp_tot_numorg = numorg_serv_a + numorg_serv_b + numorg_serv_c + numorg_serv_d

* Step 3: average over number of sectors active
gen all_serv_att = serv_a + serv_b + serv_c + serv_d
gen comp_avg_numorg = comp_tot_numorg / all_serv_att

The issue: if two firms (say i and j) are active in the exact same sectors, then each will "see" the other multiple times in the total sum, once for each shared sector. This creates duplicate counting of competitors and inflates the competitiveness measure.

Question:
Is there a way to adjust this approach — ideally without reshaping or merging datasets— so that each competing firm is counted only once in the final competitiveness measure, regardless of how many sectors it overlaps with the focal firm?

I would greatly appreciate any suggestions or elegant solutions using Stata. Many thanks!

Best,
Giacomo

Last edited by Giacomo Buzzao; 09 Jul 2025, 15:25.

Tags: None

Andrew Musau

Join Date: Oct 2014

Posts: 10285
#2

10 Jul 2025, 03:18

Could you create a toy example with, say, 10 observations where calculations can be done by hand? Then, walk us through the calculation in #2 and explain what is required. In both cases, calculate the value of the competitiveness index obtained from #2 and what the expected value should be.
Comment
Giacomo Buzzao

Join Date: Nov 2020

Posts: 5
#3

10 Jul 2025, 04:54

Hi Andrew,

here it is the example.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte(firm_id serv_a serv_b serv_c) 1 1 0 1 2 1 1 0 3 0 1 1 4 1 0 0 5 0 1 0 6 0 0 1 7 1 1 1 8 0 0 0 9 1 1 0 10 0 1 1 end

Let's make a toy model for Firm 1:

Active in: serv_a = 1, serv_c = 1

Step 1: Count number of firms active in each of these sectors:
serv_a: firms 1, 2, 4, 7, 9 → 5 firms

serv_c: firms 1, 3, 6, 7, 10 → 5 firms

So using the current code above:
comp_tot_numorg = 5 + 5 = 10

comp_avg_numorg = 10 / 2 = 5

But now, how many unique competitors does Firm 1 actually have?

In serv_a: Firms 2, 4, 7, 9

In serv_c: Firms 3, 6, 7, 10

Set of competitors = {2, 3, 4, 6, 7, 9, 10}
→ 7 unique competitors (not 9 — Firm 7 appears in both sectors)
So the correct competitiveness index (comp_tot_numorg_unique) I would like to have should result in 7 competitors, which, divided by 2 sectors in which firm 1 is active, should be (com_avg_numorg_unique) = 3.5.

Also, since I will be running a super long do file (more than 6k lines of code) and I will be working in the environment of the national bureau of statistics where I can only upload a do-file and no other documents (e.g., a dataset), I'd rather use preserve/restore and a temporary file rather than using merge or reshape with keep etc., and having to re-run the do file from scratch (which take almost 15 minutes to run).

Hope now is clearer and thanks for dedicating time to help me.

Best Regards,

Giacomo
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10285
#4

10 Jul 2025, 07:25

Thanks for the example and explanation. If you only had a handful of sectors, it would be easy to write down the formula and calculate this, as you show in #3. For a firm active in all 3 sectors (\(a, b\), and \(c\)):

where

\(N_a =\) Number of firms in sector \(a\)
\(N_b =\) Number of firms in sector \(b \)
\(N_c =\) Number of firms in sector \(c \)
\(N_{ab} =\) Number of firms in both sectors \(a\) and \(b \)
\(N_{ac} =\) Number of firms in both sectors \(a\) and \(c \)
\(N_{bc} =\) Number of firms in both sectors \(b\) and \(c \)
\(N_{abc} =\) Number of firms in all three sectors

Then, the total number of unique firms in the union of sectors \(a\), \(b\), and \(c\) is:

\[
|U| = N_a + N_b + N_c
- N_{ab} - N_{ac} - N_{bc}
+ N_{abc}
\]

The number of unique competitors (excluding the firm itself) is:

\[
\text{UniqueCompetitors} = |U| - 1
\]

But even with up to 9 sectors, the number of combinations explodes. In #1, you mention having 105 sectors and over 300,000 firms, so this approach isn't feasible. I’ll have to think of a different way to tackle this — but in the meantime, perhaps others in the forum will have suggestions.

Last edited by Andrew Musau; 10 Jul 2025, 07:37.
Comment
Giacomo Buzzao

Join Date: Nov 2020

Posts: 5
#5

10 Jul 2025, 09:02

Thank you, Andrew...the problem, as you mention, is the number of sectors. Let's see if someone else has suggestions.
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10285

10 Jul 2025, 10:09

Just a quick question: even though you have 105 sectors, what is the maximum number of sectors that a single firm can be in? You could obtain this using:

Code:

egen total = rowtotal(serv_*)
qui sum total
display r(max)

I’m considering a brute-force approach that loops over firms. The idea is that for firms in the same set of industries, the count will be unique. So we just need to calculate the count for one firm per set (identified by the variable "pattern" below).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(serv_mensa serv_ricovero serv_riab serv_master serv_coord_altrorg serv_supp_oper serv_segr_soc serv_camp_inf serv_prom_polit)
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0
0 0 0 0 1 1 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 1 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1 1
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 0 0 1 0 0 0 0
0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0
end

egen pattern= concat(serv_*)
sort pattern
drop if !regexm(pattern, "1")
l, sepby(pattern)

Res.:

Code:

. 
. l, sepby(pattern)

     +--------------------------------------------------------------------------------------------------------------+
     | serv_m~a   serv_r~o   serv_r~b   serv_m~r   serv_c~g   serv_s~r   serv_s~c   serv_c~f   serv_p~t     pattern |
     |--------------------------------------------------------------------------------------------------------------|
  1. |        0          0          0          0          0          0          0          0          1   000000001 |
     |--------------------------------------------------------------------------------------------------------------|
  2. |        0          0          0          0          0          0          0          1          0   000000010 |
  3. |        0          0          0          0          0          0          0          1          0   000000010 |
  4. |        0          0          0          0          0          0          0          1          0   000000010 |
  5. |        0          0          0          0          0          0          0          1          0   000000010 |
  6. |        0          0          0          0          0          0          0          1          0   000000010 |
  7. |        0          0          0          0          0          0          0          1          0   000000010 |
  8. |        0          0          0          0          0          0          0          1          0   000000010 |
     |--------------------------------------------------------------------------------------------------------------|
  9. |        0          0          0          0          0          0          0          1          1   000000011 |
 10. |        0          0          0          0          0          0          0          1          1   000000011 |
     |--------------------------------------------------------------------------------------------------------------|
 11. |        0          0          0          0          0          0          1          0          0   000000100 |
     |--------------------------------------------------------------------------------------------------------------|
 12. |        0          0          0          0          0          0          1          1          0   000000110 |
     |--------------------------------------------------------------------------------------------------------------|
 13. |        0          0          0          0          0          1          0          0          0   000001000 |
 14. |        0          0          0          0          0          1          0          0          0   000001000 |
 15. |        0          0          0          0          0          1          0          0          0   000001000 |
 16. |        0          0          0          0          0          1          0          0          0   000001000 |
     |--------------------------------------------------------------------------------------------------------------|
 17. |        0          0          0          0          0          1          0          1          1   000001011 |
     |--------------------------------------------------------------------------------------------------------------|
 18. |        0          0          0          0          1          0          0          0          0   000010000 |
     |--------------------------------------------------------------------------------------------------------------|
 19. |        0          0          0          0          1          1          0          0          0   000011000 |
 20. |        0          0          0          0          1          1          0          0          0   000011000 |
 21. |        0          0          0          0          1          1          0          0          0   000011000 |
 22. |        0          0          0          0          1          1          0          0          0   000011000 |
     |--------------------------------------------------------------------------------------------------------------|
 23. |        0          0          0          0          1          1          0          1          0   000011010 |
     |--------------------------------------------------------------------------------------------------------------|
 24. |        0          0          1          0          0          0          0          0          0   001000000 |
     |--------------------------------------------------------------------------------------------------------------|
 25. |        0          1          0          0          0          0          1          1          0   010000110 |
     +--------------------------------------------------------------------------------------------------------------+

.

Last edited by Andrew Musau; 10 Jul 2025, 10:22.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#7

10 Jul 2025, 10:52

I do not understand O.P.'s motivation for avoiding the use of -reshape- and -merge- here. I do not see what they have to do with having to re-run the do-file. And, frankly, even if they do somehow entail that consequence, as a person who sometimes has analyses that run for over 2 weeks, it is hard for me to muster much concern about a do-file that will run for 15 minutes. And how many hours of coding and testing time is it worth investing to save those 15 minutes? So I'm going to offer an approach that ignores those constraints but which, as far as I can see, is workable.

Now, even with -reshape- and -merge- back on the table, given the size of the data set, (and for this approach it is the number of firms, not the number of sectors, that poses the biggest problem) we do have to worry about both speed and memory considerations. The following code is reasonably economical of both speed and RAM.

Code:

isid firm_id egen `c(obs_t)' firm_num = group(firm_id) tempfile copy save `copy' reshape long serv, i(firm_num) j(sector) string favor(speed) keep if serv drop serv preserve drop firm_id rename firm_num firm_num2 save active, replace restore capture program drop one_firm_num program define one_firm_num merge 1:m sector using active, nogenerate keep(master match) sort firm_num2 gen distinct_competitors = sum(firm_num2 != firm_num2[_n-1]) replace distinct_competitors = distinct_competitors[_N] - 1 // -1 EXCLUDES SELF keep firm_num distinct_competitors keep in 1 exit end runby one_firm_num, by(firm_num) verbose merge 1:1 firm_num using `copy' replace distinct_competitors = 0 if _merge == 2

-runby- is written by Robert Picard and me, and is available from SSC. This approach, using -runby- to -merge- process a single firm at a time avoids creating an intermediate data set that might explode to hundreds of gigabytes if a single -joinby- were used instead.

Note that the intermediate file created, active.dta, is necessarily a permanent file, not a temporary file because a temporary file would be inaccessible within -runby-. It can, however, be deleted after -runby- is finished if desired.

Added: I create and use the variable firm_num here because I'm guessing that the actual firm_id variable in O.P.'s data set is not consecutive integers starting at 1 but rather is some potentially long string variable. The purpose of using the psuedonymous firm_num here is to conserve memory and speed up -merge- operations as copies of the variable begin to proliferate. It is not strictly necessary to the logic of the program: all of the commands applied to firm_num would also work with a string variable firm_id, but would be slower and consume more RAM.

Last edited by Clyde Schechter; 10 Jul 2025, 10:57.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10285
#8

10 Jul 2025, 11:02

Originally posted by Clyde Schechter View Post

I do not understand O.P.'s motivation for avoiding the use of -reshape- and -merge- here. I do not see what they have to do with having to re-run the do-file.

I interpret the following to mean that the OP cannot use any community-contributed commands, such as runby, but I may be wrong. I also don't see the point with avoiding reshape and merge.

I will be working in the environment of the national bureau of statistics where I can only upload a do-file and no other documents (e.g., a dataset)
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30169

10 Jul 2025, 11:57

...OP cannot use any community-contributed commands, such as runby,...

Yes, that is a definite possibility in these contexts. In that case, -runby- can be emulated directly in the do file:

Code:

isid firm_id
egen `c(obs_t)' firm_num = group(firm_id)
summ firm_num, meanonly
local n_firms `r(max)'
tempfile copy
save `copy'
reshape long serv, i(firm_num) j(sector) string favor(speed)
keep if serv
drop serv
preserve
drop firm_id
rename firm_num firm_num2
save active, replace

frame create junk
tempfile building
frame junk: save `building', emptyok
frame drop junk

restore

capture program drop one_firm_num
program define one_firm_num
    merge 1:m sector using active, nogenerate keep(master match)
    sort firm_num2
    gen distinct_competitors = sum(firm_num2 != firm_num2[_n-1])
    replace distinct_competitors = distinct_competitors[_N] - 1 // -1 EXCLUDES SELF
    keep firm_num distinct_competitors
    keep in 1
    exit
end


// EMULATION OF -runby-
forvalues i = 1/`n_firms' {
    frame put _all if firm_num == `i', into(working)
    frame working {
        if _N > 0 {
            isid sector, sort
            one_firm_num
            append using `building'
            save `"`building'"', replace
        }
    }
    frame drop working
}
use `building', clear


merge 1:1 firm_num using `copy'
replace distinct_competitors = 0 if _merge == 2

To be clear, this will be a great deal slower than using -runby- itself, but it will be memory-sparing in pretty much the same way -runby- is. If there is a serious problem with run time as a result, it is also possible to emulate -runby- in a way that captures most of its speed-up as well, but it is more complicated and probably not worth the coding/testing effort to do that.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30169

#10

10 Jul 2025, 14:35

Actually, on further thought, in this particular situation, it is easy to emulate -runby- in a way that also captures its speed gains. Only the "EMULATION OF -runby-" paragraph needs to change:

Code:

isid firm_id
egen `c(obs_t)' firm_num = group(firm_id)
summ firm_num, meanonly
local n_firms `r(max)'
tempfile copy
save `copy'
reshape long serv, i(firm_num) j(sector) string favor(speed)
keep if serv
drop serv
preserve
drop firm_id
rename firm_num firm_num2
save active, replace

frame create junk
tempfile building
frame junk: save `building', emptyok
frame drop junk

restore

capture program drop one_firm_num
program define one_firm_num
    merge 1:m sector using active, nogenerate keep(master match)
    sort firm_num2
    gen distinct_competitors = sum(firm_num2 != firm_num2[_n-1])
    replace distinct_competitors = distinct_competitors[_N] - 1 // -1 EXCLUDES SELF
    keep firm_num distinct_competitors
    keep in 1
    exit
end


// EMULATION OF -runby-
by firm_num, sort: gen `c(obs_t)' obs_count = _N
local current = 1
while `current' <= _N {
    local last = `current' + obs_count[`current'] - 1
    frame put _all in `current'/`last', into(working)
    frame working {
        if _N > 0 {
            isid sector, sort
            one_firm_num
            append using `building'
            save `"`building'"', replace
        }
    }
    frame drop working
    local current = `last' + 1
}

use `building', clear


merge 1:1 firm_num using `copy'
replace distinct_competitors = 0 if _merge == 2

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10285

#11

10 Jul 2025, 14:46

Here is my brute-force implementation, which doesn't use reshape or merge, and loops over firms by selecting a representative firm for each sector set. The main overhead is in execution time, which is directly affected by the number of sector sets in the dataset (as mentioned in my question #6). The results, using the dataset from #1, match those from Clyde's solution—but I’d bet that Clyde’s approach performs better in terms of runtime. Anyway, see what works best for you.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(firm_id serv_mensa serv_ricovero serv_riab serv_master serv_coord_altrorg serv_supp_oper serv_segr_soc serv_camp_inf serv_prom_polit)
 1 0 0 0 0 0 0 1 0 0
 2 0 0 0 0 0 0 0 1 0
 3 0 0 0 0 0 0 0 0 0
 4 0 0 0 0 0 0 0 0 0
 5 0 0 0 0 0 0 0 0 0
 6 0 0 0 0 0 1 0 1 1
 7 0 0 0 0 0 0 0 0 0
 8 0 0 0 0 0 1 0 0 0
 9 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0
12 0 0 0 0 1 1 0 0 0
13 0 0 0 0 0 1 0 0 0
14 0 0 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 1 0 0 0
17 0 0 1 0 0 0 0 0 0
18 0 0 0 0 1 1 0 0 0
19 0 0 0 0 0 0 0 0 0
20 0 0 0 0 0 0 0 1 0
21 0 0 0 0 0 0 0 1 1
22 0 0 0 0 0 0 0 0 0
23 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 1 1 0
25 0 0 0 0 1 1 0 1 0
26 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 1
28 0 0 0 0 0 0 0 0 0
29 0 0 0 0 0 0 0 0 0
30 0 0 0 0 0 0 0 0 0
31 0 0 0 0 1 1 0 0 0
32 0 0 0 0 0 0 0 0 0
33 0 0 0 0 0 0 0 0 0
34 0 0 0 0 0 0 0 0 0
35 0 0 0 0 0 0 0 1 0
36 0 0 0 0 0 0 0 0 0
37 0 0 0 0 0 0 0 0 0
38 0 0 0 0 0 0 0 0 0
39 0 0 0 0 0 0 0 0 0
40 0 0 0 0 0 1 0 0 0
41 0 0 0 0 0 0 0 0 0
42 0 0 0 0 0 0 0 0 0
43 0 1 0 0 0 0 1 1 0
44 0 0 0 0 0 0 0 0 0
45 0 0 0 0 0 0 0 1 0
46 0 0 0 0 0 0 0 1 1
47 0 0 0 0 0 0 0 0 0
48 0 0 0 0 0 0 0 0 0
49 0 0 0 0 0 0 0 0 0
50 0 0 0 0 0 0 0 1 0
51 0 0 0 0 0 0 0 0 0
52 0 0 0 0 0 0 0 1 0
53 0 0 0 0 1 0 0 0 0
54 0 0 0 0 1 1 0 0 0
55 0 0 0 0 0 0 0 0 0
56 0 0 0 0 0 0 0 1 0
end

egen pattern= concat(serv_*)
egen tag= tag(pattern)
replace tag= sum(tag) if tag
qui sum tag
gen ncompetitors=.
forval i=1/`r(max)'{
    preserve
    drop if !regexm(pattern, "1")
    capture{
        foreach var of varlist serv_*{
            gen obsno=_n
            levelsof obsno if tag==`i', local(n)
            if !`var' & tag in `n' drop `var'
            drop obsno
        }
        egen count= rowtotal(serv_*)
        drop if !count
        drop count
        qui count
    }
    restore
    replace ncompetitors= max(r(N)-1, 0) if tag==`i'
}
bys pattern (ncompetitors): replace ncompetitors= ncompetitors[1]
sort pattern firm_id

Res.:

Code:

. l firm_id ncompetitors pattern , sepby(pattern)

     +--------------------------------+
     | firm_id   ncompe~s     pattern |
     |--------------------------------|
  1. |       3          0   000000000 |
  2. |       4          0   000000000 |
  3. |       5          0   000000000 |
  4. |       7          0   000000000 |
  5. |       9          0   000000000 |
  6. |      10          0   000000000 |
  7. |      11          0   000000000 |
  8. |      14          0   000000000 |
  9. |      15          0   000000000 |
 10. |      19          0   000000000 |
 11. |      22          0   000000000 |
 12. |      23          0   000000000 |
 13. |      26          0   000000000 |
 14. |      28          0   000000000 |
 15. |      29          0   000000000 |
 16. |      30          0   000000000 |
 17. |      32          0   000000000 |
 18. |      33          0   000000000 |
 19. |      34          0   000000000 |
 20. |      36          0   000000000 |
 21. |      37          0   000000000 |
 22. |      38          0   000000000 |
 23. |      39          0   000000000 |
 24. |      41          0   000000000 |
 25. |      42          0   000000000 |
 26. |      44          0   000000000 |
 27. |      47          0   000000000 |
 28. |      48          0   000000000 |
 29. |      49          0   000000000 |
 30. |      51          0   000000000 |
 31. |      55          0   000000000 |
     |--------------------------------|
 32. |      27          3   000000001 |
     |--------------------------------|
 33. |       2         12   000000010 |
 34. |      20         12   000000010 |
 35. |      35         12   000000010 |
 36. |      45         12   000000010 |
 37. |      50         12   000000010 |
 38. |      52         12   000000010 |
 39. |      56         12   000000010 |
     |--------------------------------|
 40. |      21         13   000000011 |
 41. |      46         13   000000011 |
     |--------------------------------|
 42. |       1          2   000000100 |
     |--------------------------------|
 43. |      24         13   000000110 |
     |--------------------------------|
 44. |       8          9   000001000 |
 45. |      13          9   000001000 |
 46. |      16          9   000001000 |
 47. |      40          9   000001000 |
     |--------------------------------|
 48. |       6         21   000001011 |
     |--------------------------------|
 49. |      53          5   000010000 |
     |--------------------------------|
 50. |      12         10   000011000 |
 51. |      18         10   000011000 |
 52. |      31         10   000011000 |
 53. |      54         10   000011000 |
     |--------------------------------|
 54. |      25         21   000011010 |
     |--------------------------------|
 55. |      17          0   001000000 |
     |--------------------------------|
 56. |      43         13   010000110 |
     +--------------------------------+

Last edited by Andrew Musau; 10 Jul 2025, 15:21.

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2423
#12

15 Jul 2025, 19:43

After thinking about Giacomo's problem on and off for a week, I realized that it interested me partly because it resembles a problem I posted about here back in 2019, and to which several thoughtful responses occurred. Like my earlier problem, the current one concerns counting up how often the elements in a vector of responses for each observation match or "agree" with those in each other observation's response vector.

Using some of that earlier material, I realized that there is a straightforward Mata approach to the current problem that requires only about 10 lines of code. Beyond simplicity, I'd say my approach is interesting because it's considerably faster than Clyde's offering here for smaller data sets, and only become slower with a large number of observations. (e.g. With 105 sectors, and with N = 5000 firms, my approach per below took about 25 sec and Clyde's about 150 sec. on my machine. ) My approach doesn't require -reshape-, and its memory use in Mata, as in Stata, is on the order of N * Number of Sectors. However, based on some experiments, I estimate that my approach would take about 25 hr. vs. about 12 hr. for Clyde's on for the original problem with N = 300k firms and 105 sectors, so Clyde "wins" in the end <grin>. My approach seems to be O(N^2), while Clyde's appears to be something less than O(N * ln(N)). I'd invite someone more adept in Mata than I am to suggest changes, as possible, to avoid explicit loops, which could make the code faster.

Here's what I came up with:

Code:

// Simulate example data on firms and sectors. For simplicity, firms are assumed to be // indexed by consecutive id values. I used "ins1, ins2, ..., ins105" ("in sector") // rather than serv* to indicate whether a firm participates in a given sector set seed 9876 set obs 1000 // for example local nsector = 105 local sectprob = 0.9 gen long id = _n forval i = 1/`nsector' { gen byte ins`i' = runiform() > `sectprob' } // // Work begins here // // Put all the sector indicators into Mata. Row index of the matrix corresponds to the firm id. timer clear 1 timer on 1 putmata X = (ins*) mata: NC = J(rows(X), 1, 0) // NC[i,1] is number of competitors firm i has across all other firms. // Examine each firm in relation to each other. for (i = 1; i<= rows(X); i++) { for (j = 1; j <= rows(X); j++) { // Count firm j if it competes with i in at least one sector. if (i != j) { // exclude self // "2" means both sector indicators for each of the given firms are 1s. AtLeast1Competitor = rowsum((X[i,.] :+ (X[j,.]) :== 2) ) > 0 NC[i,1] = NC[i,1] + AtLeast1Competitor } } } end // getmata NumCompetitors = NC timer off 1 timer list 1

Last edited by Mike Lacy; 15 Jul 2025, 19:56.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3886
#13

16 Jul 2025, 01:50

As requested by Mike Lacy per PM, here's his code from #12 speeded up by about 10x (but quickly reduces to about 5x with double sample size)

Code:

putmata X = (ins*) mata : N = i = rows(X) index = (1..N) n_competitors = J(N,1,0) while ( i ) n_competitors[i] = sum(rowsum(select(X[select(index,!e(i,N)),],X[i--,])):>0) end getmata n_competitors = n_competitors

Last edited by daniel klein; 16 Jul 2025, 02:42. Reason: first version was about 3x faster than the original
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2423
#14

16 Jul 2025, 08:17

Thanks to daniel klein for responding to my PM and contributing. On my machine, with a one-processor version of Stata, his version is about 3.5 times faster than mine, but it's still of course O(N^2). My estimate now is that on my machine, his version would take only about 8 hr. to complete the original problem with N = 300k.
Comment

daniel klein

Join Date: Mar 2014
Posts: 3886

#15

16 Jul 2025, 13:46

You can get \(O\left(n^2\right)\) as fast as

Code:

putmata X = (serv*)

mata : n_competitors = rowsum((X*X'):>0):-1

getmata n_competitors= n_competitors

assuming no inactive firms (otherwise, change negative competitor counts to 0 post-hoc) and dealing with the implied space complexity.

If the matrix is sparse, something along these lines might turn out to be faster for (very) large N and small k

Code:

putmata X = (serv*)

mata :

n_competitors = J(rows(X),1,0)

firms_in_sector = asarray_create("real")
asarray_notfound(firms_in_sector,J(0,1,.))

for (s=1; s<=cols(X); s++)
    asarray(firms_in_sector,s,selectindex(X[,s]))
    
for (i=1; i<=rows(X); i++) {
    
    sectors_i = selectindex(X[i,])
    
    competitors_i = J(0,1,.)
    
    for (s=1; s<=cols(sectors_i); s++)
        competitors_i = competitors_i\ asarray(firms_in_sector,sectors_i[s])
    
    n_competitors[i] = rows(uniqrows(competitors_i))-1
    
}
    
end

getmata n_competitors = n_competitors

Last edited by daniel klein; 16 Jul 2025, 13:48.

Announcement

Avoiding double-counting competitors when summing overlapping dummy variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment