Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Avoiding double-counting competitors when summing overlapping dummy variables


    Dear Statalist,

    I am working on a measure of "competitive density" for each firm in a dataset, based on the number of other firms active in the same sectors. Each sector is represented as a dummy variable (1 if the firm is active in that sector, 0 otherwise). I have 105 such dummies and 300k observations (each one representing a firm).

    Here's a simplified version of my approach:
    1. For each dummy variable, I count how many firms are active in that sector.
    2. I then generate new variables (numorg_*) where each firm that is active in a sector gets the total number of competitors in that sector (including itself).
    3. I then sum all the numorg_* variables to get a total count of competitors across all sectors where a firm is active.
    4. Finally, I divide by the number of sectors in which the firm is active to get an average competitiveness score.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(serv_mensa serv_ricovero serv_riab serv_master serv_coord_altrorg serv_supp_oper serv_segr_soc serv_camp_inf serv_prom_polit)
    0 0 0 0 0 0 1 0 0
    0 0 0 0 0 0 0 1 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 1 0 1 1
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 1 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 1 1 0 0 0
    0 0 0 0 0 1 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 1 0 0 0
    0 0 1 0 0 0 0 0 0
    0 0 0 0 1 1 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0
    0 0 0 0 0 0 0 1 1
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 1 1 0
    0 0 0 0 1 1 0 1 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 1
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 1 1 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 1 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 1 0 0 0 0 1 1 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0
    0 0 0 0 0 0 0 1 1
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0
    0 0 0 0 1 0 0 0 0
    0 0 0 0 1 1 0 0 0
    0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 1 0
    end


    Here is the code:

    Code:
    * Step 1: loop over dummies to create numorg_* variables
    
    local dummies serv_a serv_b serv_c serv_d // [shortened for brevity]
    
    foreach var of local dummies {
        qui count if `var' == 1
        local tot = r(N)
        gen numorg_`var' = cond(`var' == 1, `tot', 0)
    }
    
    * Step 2: sum over all numorg_* variables
    gen comp_tot_numorg = numorg_serv_a + numorg_serv_b + numorg_serv_c + numorg_serv_d
    
    * Step 3: average over number of sectors active
    gen all_serv_att = serv_a + serv_b + serv_c + serv_d
    gen comp_avg_numorg = comp_tot_numorg / all_serv_att

    The issue: if two firms (say i and j) are active in the exact same sectors, then each will "see" the other multiple times in the total sum, once for each shared sector. This creates duplicate counting of competitors and inflates the competitiveness measure.

    Question:
    Is there a way to adjust this approach — ideally without reshaping or merging datasets— so that each competing firm is counted only once in the final competitiveness measure, regardless of how many sectors it overlaps with the focal firm?

    I would greatly appreciate any suggestions or elegant solutions using Stata. Many thanks!

    Best,
    Giacomo
    Last edited by Giacomo Buzzao; 09 Jul 2025, 15:25.

  • #2
    Could you create a toy example with, say, 10 observations where calculations can be done by hand? Then, walk us through the calculation in #2 and explain what is required. In both cases, calculate the value of the competitiveness index obtained from #2 and what the expected value should be.

    Comment


    • #3
      Hi Andrew,

      here it is the example.


      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input byte(firm_id serv_a serv_b serv_c)
      1 1 0 1
      2 1 1 0
      3 0 1 1
      4 1 0 0
      5 0 1 0
      6 0 0 1
      7 1 1 1
      8 0 0 0
      9 1 1 0
      10 0 1 1
      end
      Let's make a toy model for Firm 1:

      • Active in: serv_a = 1, serv_c = 1
      • Step 1: Count number of firms active in each of these sectors:
        • serv_a: firms 1, 2, 4, 7, 9 → 5 firms
        • serv_c: firms 1, 3, 6, 7, 10 → 5 firms
      • So using the current code above:
        • comp_tot_numorg = 5 + 5 = 10
        • comp_avg_numorg = 10 / 2 = 5
      But now, how many unique competitors does Firm 1 actually have?

      • In serv_a: Firms 2, 4, 7, 9
      • In serv_c: Firms 3, 6, 7, 10
      Set of competitors = {2, 3, 4, 6, 7, 9, 10}
      → 7 unique competitors (not 9 — Firm 7 appears in both sectors)
      So the correct competitiveness index (comp_tot_numorg_unique) I would like to have should result in 7 competitors, which, divided by 2 sectors in which firm 1 is active, should be (com_avg_numorg_unique) = 3.5.

      Also, since I will be running a super long do file (more than 6k lines of code) and I will be working in the environment of the national bureau of statistics where I can only upload a do-file and no other documents (e.g., a dataset), I'd rather use preserve/restore and a temporary file rather than using merge or reshape with keep etc., and having to re-run the do file from scratch (which take almost 15 minutes to run).

      Hope now is clearer and thanks for dedicating time to help me.

      Best Regards,

      Giacomo

      Comment


      • #4
        Thanks for the example and explanation. If you only had a handful of sectors, it would be easy to write down the formula and calculate this, as you show in #3. For a firm active in all 3 sectors (\(a, b\), and \(c\)):

        where


        \(N_a =\) Number of firms in sector \(a\)
        \(N_b =\) Number of firms in sector \(b \)
        \(N_c =\) Number of firms in sector \(c \)
        \(N_{ab} =\) Number of firms in both sectors \(a\) and \(b \)
        \(N_{ac} =\) Number of firms in both sectors \(a\) and \(c \)
        \(N_{bc} =\) Number of firms in both sectors \(b\) and \(c \)
        \(N_{abc} =\) Number of firms in all three sectors


        Then, the total number of unique firms in the union of sectors \(a\), \(b\), and \(c\) is:

        \[
        |U| = N_a + N_b + N_c
        - N_{ab} - N_{ac} - N_{bc}
        + N_{abc}
        \]

        The number of unique competitors (excluding the firm itself) is:

        \[
        \text{UniqueCompetitors} = |U| - 1
        \]


        But even with up to 9 sectors, the number of combinations explodes. In #1, you mention having 105 sectors and over 300,000 firms, so this approach isn't feasible. I’ll have to think of a different way to tackle this — but in the meantime, perhaps others in the forum will have suggestions.
        Last edited by Andrew Musau; 10 Jul 2025, 07:37.

        Comment


        • #5
          Thank you, Andrew...the problem, as you mention, is the number of sectors. Let's see if someone else has suggestions.

          Comment


          • #6
            Just a quick question: even though you have 105 sectors, what is the maximum number of sectors that a single firm can be in? You could obtain this using:

            Code:
            egen total = rowtotal(serv_*)
            qui sum total
            display r(max)
            I’m considering a brute-force approach that loops over firms. The idea is that for firms in the same set of industries, the count will be unique. So we just need to calculate the count for one firm per set (identified by the variable "pattern" below).

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input float(serv_mensa serv_ricovero serv_riab serv_master serv_coord_altrorg serv_supp_oper serv_segr_soc serv_camp_inf serv_prom_polit)
            0 0 0 0 0 0 1 0 0
            0 0 0 0 0 0 0 1 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 1 0 1 1
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 1 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 1 1 0 0 0
            0 0 0 0 0 1 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 1 0 0 0
            0 0 1 0 0 0 0 0 0
            0 0 0 0 1 1 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 1 0
            0 0 0 0 0 0 0 1 1
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 1 1 0
            0 0 0 0 1 1 0 1 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 1
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 1 1 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 1 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 1 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 1 0 0 0 0 1 1 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 1 0
            0 0 0 0 0 0 0 1 1
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 1 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 1 0
            0 0 0 0 1 0 0 0 0
            0 0 0 0 1 1 0 0 0
            0 0 0 0 0 0 0 0 0
            0 0 0 0 0 0 0 1 0
            end
            
            egen pattern= concat(serv_*)
            sort pattern
            drop if !regexm(pattern, "1")
            l, sepby(pattern)
            Res.:

            Code:
            . 
            . l, sepby(pattern)
            
                 +--------------------------------------------------------------------------------------------------------------+
                 | serv_m~a   serv_r~o   serv_r~b   serv_m~r   serv_c~g   serv_s~r   serv_s~c   serv_c~f   serv_p~t     pattern |
                 |--------------------------------------------------------------------------------------------------------------|
              1. |        0          0          0          0          0          0          0          0          1   000000001 |
                 |--------------------------------------------------------------------------------------------------------------|
              2. |        0          0          0          0          0          0          0          1          0   000000010 |
              3. |        0          0          0          0          0          0          0          1          0   000000010 |
              4. |        0          0          0          0          0          0          0          1          0   000000010 |
              5. |        0          0          0          0          0          0          0          1          0   000000010 |
              6. |        0          0          0          0          0          0          0          1          0   000000010 |
              7. |        0          0          0          0          0          0          0          1          0   000000010 |
              8. |        0          0          0          0          0          0          0          1          0   000000010 |
                 |--------------------------------------------------------------------------------------------------------------|
              9. |        0          0          0          0          0          0          0          1          1   000000011 |
             10. |        0          0          0          0          0          0          0          1          1   000000011 |
                 |--------------------------------------------------------------------------------------------------------------|
             11. |        0          0          0          0          0          0          1          0          0   000000100 |
                 |--------------------------------------------------------------------------------------------------------------|
             12. |        0          0          0          0          0          0          1          1          0   000000110 |
                 |--------------------------------------------------------------------------------------------------------------|
             13. |        0          0          0          0          0          1          0          0          0   000001000 |
             14. |        0          0          0          0          0          1          0          0          0   000001000 |
             15. |        0          0          0          0          0          1          0          0          0   000001000 |
             16. |        0          0          0          0          0          1          0          0          0   000001000 |
                 |--------------------------------------------------------------------------------------------------------------|
             17. |        0          0          0          0          0          1          0          1          1   000001011 |
                 |--------------------------------------------------------------------------------------------------------------|
             18. |        0          0          0          0          1          0          0          0          0   000010000 |
                 |--------------------------------------------------------------------------------------------------------------|
             19. |        0          0          0          0          1          1          0          0          0   000011000 |
             20. |        0          0          0          0          1          1          0          0          0   000011000 |
             21. |        0          0          0          0          1          1          0          0          0   000011000 |
             22. |        0          0          0          0          1          1          0          0          0   000011000 |
                 |--------------------------------------------------------------------------------------------------------------|
             23. |        0          0          0          0          1          1          0          1          0   000011010 |
                 |--------------------------------------------------------------------------------------------------------------|
             24. |        0          0          1          0          0          0          0          0          0   001000000 |
                 |--------------------------------------------------------------------------------------------------------------|
             25. |        0          1          0          0          0          0          1          1          0   010000110 |
                 +--------------------------------------------------------------------------------------------------------------+
            
            .
            Last edited by Andrew Musau; 10 Jul 2025, 10:22.

            Comment


            • #7
              I do not understand O.P.'s motivation for avoiding the use of -reshape- and -merge- here. I do not see what they have to do with having to re-run the do-file. And, frankly, even if they do somehow entail that consequence, as a person who sometimes has analyses that run for over 2 weeks, it is hard for me to muster much concern about a do-file that will run for 15 minutes. And how many hours of coding and testing time is it worth investing to save those 15 minutes? So I'm going to offer an approach that ignores those constraints but which, as far as I can see, is workable.

              Now, even with -reshape- and -merge- back on the table, given the size of the data set, (and for this approach it is the number of firms, not the number of sectors, that poses the biggest problem) we do have to worry about both speed and memory considerations. The following code is reasonably economical of both speed and RAM.

              Code:
              isid firm_id
              egen `c(obs_t)' firm_num = group(firm_id)
              tempfile copy
              save `copy'
              reshape long serv, i(firm_num) j(sector) string favor(speed)
              keep if serv
              drop serv
              preserve
              drop firm_id
              rename firm_num firm_num2
              save active, replace
              
              restore
              capture program drop one_firm_num
              program define one_firm_num
                  merge 1:m sector using active, nogenerate keep(master match)
                  sort firm_num2
                  gen distinct_competitors = sum(firm_num2 != firm_num2[_n-1])
                  replace distinct_competitors = distinct_competitors[_N] - 1 // -1 EXCLUDES SELF
                  keep firm_num distinct_competitors
                  keep in 1
                  exit
              end
              
              runby one_firm_num, by(firm_num) verbose
              merge 1:1 firm_num using `copy'
              replace distinct_competitors = 0 if _merge == 2
              -runby- is written by Robert Picard and me, and is available from SSC. This approach, using -runby- to -merge- process a single firm at a time avoids creating an intermediate data set that might explode to hundreds of gigabytes if a single -joinby- were used instead.

              Note that the intermediate file created, active.dta, is necessarily a permanent file, not a temporary file because a temporary file would be inaccessible within -runby-. It can, however, be deleted after -runby- is finished if desired.

              Added: I create and use the variable firm_num here because I'm guessing that the actual firm_id variable in O.P.'s data set is not consecutive integers starting at 1 but rather is some potentially long string variable. The purpose of using the psuedonymous firm_num here is to conserve memory and speed up -merge- operations as copies of the variable begin to proliferate. It is not strictly necessary to the logic of the program: all of the commands applied to firm_num would also work with a string variable firm_id, but would be slower and consume more RAM.
              Last edited by Clyde Schechter; 10 Jul 2025, 10:57.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                I do not understand O.P.'s motivation for avoiding the use of -reshape- and -merge- here. I do not see what they have to do with having to re-run the do-file.
                I interpret the following to mean that the OP cannot use any community-contributed commands, such as runby, but I may be wrong. I also don't see the point with avoiding reshape and merge.

                I will be working in the environment of the national bureau of statistics where I can only upload a do-file and no other documents (e.g., a dataset)

                Comment


                • #9
                  ...OP cannot use any community-contributed commands, such as runby,...
                  Yes, that is a definite possibility in these contexts. In that case, -runby- can be emulated directly in the do file:
                  Code:
                  isid firm_id
                  egen `c(obs_t)' firm_num = group(firm_id)
                  summ firm_num, meanonly
                  local n_firms `r(max)'
                  tempfile copy
                  save `copy'
                  reshape long serv, i(firm_num) j(sector) string favor(speed)
                  keep if serv
                  drop serv
                  preserve
                  drop firm_id
                  rename firm_num firm_num2
                  save active, replace
                  
                  frame create junk
                  tempfile building
                  frame junk: save `building', emptyok
                  frame drop junk
                  
                  restore
                  
                  capture program drop one_firm_num
                  program define one_firm_num
                      merge 1:m sector using active, nogenerate keep(master match)
                      sort firm_num2
                      gen distinct_competitors = sum(firm_num2 != firm_num2[_n-1])
                      replace distinct_competitors = distinct_competitors[_N] - 1 // -1 EXCLUDES SELF
                      keep firm_num distinct_competitors
                      keep in 1
                      exit
                  end
                  
                  
                  // EMULATION OF -runby-
                  forvalues i = 1/`n_firms' {
                      frame put _all if firm_num == `i', into(working)
                      frame working {
                          if _N > 0 {
                              isid sector, sort
                              one_firm_num
                              append using `building'
                              save `"`building'"', replace
                          }
                      }
                      frame drop working
                  }
                  use `building', clear
                  
                  
                  merge 1:1 firm_num using `copy'
                  replace distinct_competitors = 0 if _merge == 2
                  To be clear, this will be a great deal slower than using -runby- itself, but it will be memory-sparing in pretty much the same way -runby- is. If there is a serious problem with run time as a result, it is also possible to emulate -runby- in a way that captures most of its speed-up as well, but it is more complicated and probably not worth the coding/testing effort to do that.

                  Comment


                  • #10
                    Actually, on further thought, in this particular situation, it is easy to emulate -runby- in a way that also captures its speed gains. Only the "EMULATION OF -runby-" paragraph needs to change:

                    Code:
                    isid firm_id
                    egen `c(obs_t)' firm_num = group(firm_id)
                    summ firm_num, meanonly
                    local n_firms `r(max)'
                    tempfile copy
                    save `copy'
                    reshape long serv, i(firm_num) j(sector) string favor(speed)
                    keep if serv
                    drop serv
                    preserve
                    drop firm_id
                    rename firm_num firm_num2
                    save active, replace
                    
                    frame create junk
                    tempfile building
                    frame junk: save `building', emptyok
                    frame drop junk
                    
                    restore
                    
                    capture program drop one_firm_num
                    program define one_firm_num
                        merge 1:m sector using active, nogenerate keep(master match)
                        sort firm_num2
                        gen distinct_competitors = sum(firm_num2 != firm_num2[_n-1])
                        replace distinct_competitors = distinct_competitors[_N] - 1 // -1 EXCLUDES SELF
                        keep firm_num distinct_competitors
                        keep in 1
                        exit
                    end
                    
                    
                    // EMULATION OF -runby-
                    by firm_num, sort: gen `c(obs_t)' obs_count = _N
                    local current = 1
                    while `current' <= _N {
                        local last = `current' + obs_count[`current'] - 1
                        frame put _all in `current'/`last', into(working)
                        frame working {
                            if _N > 0 {
                                isid sector, sort
                                one_firm_num
                                append using `building'
                                save `"`building'"', replace
                            }
                        }
                        frame drop working
                        local current = `last' + 1
                    }
                    
                    use `building', clear
                    
                    
                    merge 1:1 firm_num using `copy'
                    replace distinct_competitors = 0 if _merge == 2

                    Comment


                    • #11
                      Here is my brute-force implementation, which doesn't use reshape or merge, and loops over firms by selecting a representative firm for each sector set. The main overhead is in execution time, which is directly affected by the number of sector sets in the dataset (as mentioned in my question #6). The results, using the dataset from #1, match those from Clyde's solution—but I’d bet that Clyde’s approach performs better in terms of runtime. Anyway, see what works best for you.


                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input float(firm_id serv_mensa serv_ricovero serv_riab serv_master serv_coord_altrorg serv_supp_oper serv_segr_soc serv_camp_inf serv_prom_polit)
                       1 0 0 0 0 0 0 1 0 0
                       2 0 0 0 0 0 0 0 1 0
                       3 0 0 0 0 0 0 0 0 0
                       4 0 0 0 0 0 0 0 0 0
                       5 0 0 0 0 0 0 0 0 0
                       6 0 0 0 0 0 1 0 1 1
                       7 0 0 0 0 0 0 0 0 0
                       8 0 0 0 0 0 1 0 0 0
                       9 0 0 0 0 0 0 0 0 0
                      10 0 0 0 0 0 0 0 0 0
                      11 0 0 0 0 0 0 0 0 0
                      12 0 0 0 0 1 1 0 0 0
                      13 0 0 0 0 0 1 0 0 0
                      14 0 0 0 0 0 0 0 0 0
                      15 0 0 0 0 0 0 0 0 0
                      16 0 0 0 0 0 1 0 0 0
                      17 0 0 1 0 0 0 0 0 0
                      18 0 0 0 0 1 1 0 0 0
                      19 0 0 0 0 0 0 0 0 0
                      20 0 0 0 0 0 0 0 1 0
                      21 0 0 0 0 0 0 0 1 1
                      22 0 0 0 0 0 0 0 0 0
                      23 0 0 0 0 0 0 0 0 0
                      24 0 0 0 0 0 0 1 1 0
                      25 0 0 0 0 1 1 0 1 0
                      26 0 0 0 0 0 0 0 0 0
                      27 0 0 0 0 0 0 0 0 1
                      28 0 0 0 0 0 0 0 0 0
                      29 0 0 0 0 0 0 0 0 0
                      30 0 0 0 0 0 0 0 0 0
                      31 0 0 0 0 1 1 0 0 0
                      32 0 0 0 0 0 0 0 0 0
                      33 0 0 0 0 0 0 0 0 0
                      34 0 0 0 0 0 0 0 0 0
                      35 0 0 0 0 0 0 0 1 0
                      36 0 0 0 0 0 0 0 0 0
                      37 0 0 0 0 0 0 0 0 0
                      38 0 0 0 0 0 0 0 0 0
                      39 0 0 0 0 0 0 0 0 0
                      40 0 0 0 0 0 1 0 0 0
                      41 0 0 0 0 0 0 0 0 0
                      42 0 0 0 0 0 0 0 0 0
                      43 0 1 0 0 0 0 1 1 0
                      44 0 0 0 0 0 0 0 0 0
                      45 0 0 0 0 0 0 0 1 0
                      46 0 0 0 0 0 0 0 1 1
                      47 0 0 0 0 0 0 0 0 0
                      48 0 0 0 0 0 0 0 0 0
                      49 0 0 0 0 0 0 0 0 0
                      50 0 0 0 0 0 0 0 1 0
                      51 0 0 0 0 0 0 0 0 0
                      52 0 0 0 0 0 0 0 1 0
                      53 0 0 0 0 1 0 0 0 0
                      54 0 0 0 0 1 1 0 0 0
                      55 0 0 0 0 0 0 0 0 0
                      56 0 0 0 0 0 0 0 1 0
                      end
                      
                      egen pattern= concat(serv_*)
                      egen tag= tag(pattern)
                      replace tag= sum(tag) if tag
                      qui sum tag
                      gen ncompetitors=.
                      forval i=1/`r(max)'{
                          preserve
                          drop if !regexm(pattern, "1")
                          capture{
                              foreach var of varlist serv_*{
                                  gen obsno=_n
                                  levelsof obsno if tag==`i', local(n)
                                  if !`var' & tag in `n' drop `var'
                                  drop obsno
                              }
                              egen count= rowtotal(serv_*)
                              drop if !count
                              drop count
                              qui count
                          }
                          restore
                          replace ncompetitors= max(r(N)-1, 0) if tag==`i'
                      }
                      bys pattern (ncompetitors): replace ncompetitors= ncompetitors[1]
                      sort pattern firm_id
                      Res.:

                      Code:
                      . l firm_id ncompetitors pattern , sepby(pattern)
                      
                           +--------------------------------+
                           | firm_id   ncompe~s     pattern |
                           |--------------------------------|
                        1. |       3          0   000000000 |
                        2. |       4          0   000000000 |
                        3. |       5          0   000000000 |
                        4. |       7          0   000000000 |
                        5. |       9          0   000000000 |
                        6. |      10          0   000000000 |
                        7. |      11          0   000000000 |
                        8. |      14          0   000000000 |
                        9. |      15          0   000000000 |
                       10. |      19          0   000000000 |
                       11. |      22          0   000000000 |
                       12. |      23          0   000000000 |
                       13. |      26          0   000000000 |
                       14. |      28          0   000000000 |
                       15. |      29          0   000000000 |
                       16. |      30          0   000000000 |
                       17. |      32          0   000000000 |
                       18. |      33          0   000000000 |
                       19. |      34          0   000000000 |
                       20. |      36          0   000000000 |
                       21. |      37          0   000000000 |
                       22. |      38          0   000000000 |
                       23. |      39          0   000000000 |
                       24. |      41          0   000000000 |
                       25. |      42          0   000000000 |
                       26. |      44          0   000000000 |
                       27. |      47          0   000000000 |
                       28. |      48          0   000000000 |
                       29. |      49          0   000000000 |
                       30. |      51          0   000000000 |
                       31. |      55          0   000000000 |
                           |--------------------------------|
                       32. |      27          3   000000001 |
                           |--------------------------------|
                       33. |       2         12   000000010 |
                       34. |      20         12   000000010 |
                       35. |      35         12   000000010 |
                       36. |      45         12   000000010 |
                       37. |      50         12   000000010 |
                       38. |      52         12   000000010 |
                       39. |      56         12   000000010 |
                           |--------------------------------|
                       40. |      21         13   000000011 |
                       41. |      46         13   000000011 |
                           |--------------------------------|
                       42. |       1          2   000000100 |
                           |--------------------------------|
                       43. |      24         13   000000110 |
                           |--------------------------------|
                       44. |       8          9   000001000 |
                       45. |      13          9   000001000 |
                       46. |      16          9   000001000 |
                       47. |      40          9   000001000 |
                           |--------------------------------|
                       48. |       6         21   000001011 |
                           |--------------------------------|
                       49. |      53          5   000010000 |
                           |--------------------------------|
                       50. |      12         10   000011000 |
                       51. |      18         10   000011000 |
                       52. |      31         10   000011000 |
                       53. |      54         10   000011000 |
                           |--------------------------------|
                       54. |      25         21   000011010 |
                           |--------------------------------|
                       55. |      17          0   001000000 |
                           |--------------------------------|
                       56. |      43         13   010000110 |
                           +--------------------------------+
                      Last edited by Andrew Musau; 10 Jul 2025, 15:21.

                      Comment


                      • #12
                        After thinking about Giacomo's problem on and off for a week, I realized that it interested me partly because it resembles a problem I posted about here back in 2019, and to which several thoughtful responses occurred. Like my earlier problem, the current one concerns counting up how often the elements in a vector of responses for each observation match or "agree" with those in each other observation's response vector.

                        Using some of that earlier material, I realized that there is a straightforward Mata approach to the current problem that requires only about 10 lines of code. Beyond simplicity, I'd say my approach is interesting because it's considerably faster than Clyde's offering here for smaller data sets, and only become slower with a large number of observations. (e.g. With 105 sectors, and with N = 5000 firms, my approach per below took about 25 sec and Clyde's about 150 sec. on my machine. ) My approach doesn't require -reshape-, and its memory use in Mata, as in Stata, is on the order of N * Number of Sectors. However, based on some experiments, I estimate that my approach would take about 25 hr. vs. about 12 hr. for Clyde's on for the original problem with N = 300k firms and 105 sectors, so Clyde "wins" in the end <grin>. My approach seems to be O(N^2), while Clyde's appears to be something less than O(N * ln(N)). I'd invite someone more adept in Mata than I am to suggest changes, as possible, to avoid explicit loops, which could make the code faster.


                        Here's what I came up with:

                        Code:
                        // Simulate example data on firms and sectors. For simplicity, firms are assumed to be
                        //  indexed by consecutive id values. I used "ins1, ins2, ..., ins105" ("in sector")
                        // rather than serv* to indicate whether a firm participates in a given sector
                        set seed 9876
                        set obs 1000  // for example
                        local nsector = 105
                        local sectprob = 0.9
                        gen long id = _n
                        forval i = 1/`nsector' {
                           gen byte ins`i' = runiform() > `sectprob'
                        }
                        //
                        // Work begins here
                        //
                        // Put all the sector indicators into Mata.  Row index of the matrix corresponds to the firm id.
                        timer clear 1
                        timer on 1
                        putmata X = (ins*)
                        mata:
                        NC = J(rows(X), 1, 0)   // NC[i,1] is number of competitors firm i has across all other firms.
                        // Examine each firm in relation to each other.
                        for (i = 1; i<= rows(X); i++) {
                           for (j = 1; j <= rows(X); j++) {
                              // Count firm j if it competes with i in at least one sector.
                              if (i != j) {   // exclude self
                                 // "2" means both sector indicators for each of the given firms are 1s.
                                 AtLeast1Competitor = rowsum((X[i,.] :+ (X[j,.]) :== 2) ) > 0
                                 NC[i,1] = NC[i,1] + AtLeast1Competitor
                              }        
                           }  
                        }
                        end
                        //
                        getmata NumCompetitors = NC
                        timer off 1
                        timer list 1

                        Last edited by Mike Lacy; 15 Jul 2025, 19:56.

                        Comment


                        • #13
                          As requested by Mike Lacy per PM, here's his code from #12 speeded up by about 10x (but quickly reduces to about 5x with double sample size)
                          Code:
                          putmata X = (ins*)
                          
                          mata :
                          
                          N = i = rows(X)
                          index = (1..N)
                          
                          n_competitors = J(N,1,0)
                          
                          while ( i )
                              n_competitors[i] = sum(rowsum(select(X[select(index,!e(i,N)),],X[i--,])):>0)
                          
                          end
                          
                          getmata n_competitors = n_competitors
                          Last edited by daniel klein; 16 Jul 2025, 02:42. Reason: first version was about 3x faster than the original

                          Comment


                          • #14
                            Thanks to daniel klein for responding to my PM and contributing. On my machine, with a one-processor version of Stata, his version is about 3.5 times faster than mine, but it's still of course O(N^2). My estimate now is that on my machine, his version would take only about 8 hr. to complete the original problem with N = 300k.

                            Comment


                            • #15
                              You can get \(O\left(n^2\right)\) as fast as
                              Code:
                              putmata X = (serv*)
                              
                              mata : n_competitors = rowsum((X*X'):>0):-1
                              
                              getmata n_competitors= n_competitors
                              assuming no inactive firms (otherwise, change negative competitor counts to 0 post-hoc) and dealing with the implied space complexity.

                              If the matrix is sparse, something along these lines might turn out to be faster for (very) large N and small k
                              Code:
                              putmata X = (serv*)
                              
                              mata :
                              
                              n_competitors = J(rows(X),1,0)
                              
                              firms_in_sector = asarray_create("real")
                              asarray_notfound(firms_in_sector,J(0,1,.))
                              
                              for (s=1; s<=cols(X); s++)
                                  asarray(firms_in_sector,s,selectindex(X[,s]))
                                  
                              for (i=1; i<=rows(X); i++) {
                                  
                                  sectors_i = selectindex(X[i,])
                                  
                                  competitors_i = J(0,1,.)
                                  
                                  for (s=1; s<=cols(sectors_i); s++)
                                      competitors_i = competitors_i\ asarray(firms_in_sector,sectors_i[s])
                                  
                                  n_competitors[i] = rows(uniqrows(competitors_i))-1
                                  
                              }
                                  
                              end
                              
                              getmata n_competitors = n_competitors
                              Last edited by daniel klein; 16 Jul 2025, 13:48.

                              Comment

                              Working...
                              X