Dear Statalist,
I am working on a measure of "competitive density" for each firm in a dataset, based on the number of other firms active in the same sectors. Each sector is represented as a dummy variable (1 if the firm is active in that sector, 0 otherwise). I have 105 such dummies and 300k observations (each one representing a firm).
Here's a simplified version of my approach:
- For each dummy variable, I count how many firms are active in that sector.
- I then generate new variables (numorg_*) in which each firm that is active in a sector gets the total number of firms in that sector (including itself), and 0 otherwise.
- I then sum all the numorg_* variables to get a total count of competitors across all sectors where a firm is active.
- Finally, I divide by the number of sectors in which the firm is active to get an average competitiveness score.
Code:
* Example generated by -dataex-. For more info, type help dataex clear input float(serv_mensa serv_ricovero serv_riab serv_master serv_coord_altrorg serv_supp_oper serv_segr_soc serv_camp_inf serv_prom_polit) 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 end
Here is the code:
Code:
* Step 1: loop over dummies to create numorg_* variables
local dummies serv_a serv_b serv_c serv_d // [shortened for brevity]
foreach var of local dummies {
    qui count if `var' == 1
    local tot = r(N)
    gen numorg_`var' = cond(`var' == 1, `tot', 0)
}

* Step 2: sum over all numorg_* variables
gen comp_tot_numorg = numorg_serv_a + numorg_serv_b + numorg_serv_c + numorg_serv_d

* Step 3: average over number of sectors active
gen all_serv_att = serv_a + serv_b + serv_c + serv_d
gen comp_avg_numorg = comp_tot_numorg / all_serv_att
The issue: if two firms (say i and j) share more than one sector (in the extreme case, they are active in exactly the same sectors), then each will "see" the other multiple times in the total sum, once for each shared sector. The focal firm itself is also counted once per sector in which it is active. This double counts competitors and inflates the competitiveness measure.
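To make the double counting concrete, here is a minimal toy sketch with three hypothetical firms and two sectors (made-up data, not my real dataset; serv_a and serv_b are placeholder names). The first part repeats my current approach. The Mata part only illustrates the quantity I am after, and it builds a full firm-by-firm matrix, so I assume it would not be feasible as written with 300k observations.

```stata
* Toy data (hypothetical): firms 1 and 2 are in both sectors, firm 3 only in serv_a
clear
input byte(serv_a serv_b)
1 1
1 1
1 0
end

* Current approach: per-sector firm counts, summed over the firm's sectors
foreach var of varlist serv_a serv_b {
    quietly count if `var' == 1
    local tot = r(N)
    generate numorg_`var' = cond(`var' == 1, `tot', 0)
}
generate comp_tot_numorg = numorg_serv_a + numorg_serv_b

* comp_tot_numorg is 5, 5, 3 although there are only 3 firms in total:
* firms 1 and 2 each count the other (and themselves) twice
list, noobs

* Desired quantity, illustration only: distinct firms sharing at least one sector
mata:
X = st_data(., ("serv_a", "serv_b"))
overlap = (X * X') :> 0     // 1 if firms i and j share at least one sector
rowsum(overlap)             // 3, 3, 3: each firm counted once, including itself
end
```

In this toy case the measure I want would be 3 for every firm (or 2 if the focal firm is excluded), whereas my current sum gives 5 for the first two firms.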
Question:
Is there a way to adjust this approach, ideally without reshaping or merging datasets, so that each competing firm is counted only once in the final competitiveness measure, regardless of how many sectors it shares with the focal firm?
I would greatly appreciate any suggestions or elegant solutions using Stata. Many thanks!
Best,
Giacomo