Generating average number of market overlap with other firms

Kyeonggook Park

Join Date: Sep 2021

Posts: 9
#1

Generating average number of market overlap with other firms

21 Nov 2022, 00:20

Code:

clear input float firm_id str6 market 1 "a" 1 "b" 1 "c" 1 "d" 2 "b" 2 "c" 2 "f" 3 "a" 3 "b" 4 "c" end

Using the example dataset above, I am trying to count the number of overlapping markets that each firm has. In the above example dataset, a firm with firm_id "1" is operating in 4 markets: "a", "b", "c", and "d". A firm with firm_id "2" is operating in 3 markets: "b", "c", and "f". Between these two firms, "1" and "2", they overlap in 2 markets, "b" and "c".

What I would like to accomplish ultimately is generating a column variable that takes the average number of overlapping markets that each firm has with other firms in the dataset. Let's call this column avg_n_overlap. Using firm "1" as an example, firm "1" has 2 overlapping markets with firm "2", 2 overlapping markets with firm "3", and 1 overlapping market with firm "4". Therefore, after taking the average of these numbers, firm "1" will have the value 1.6666 (=(2+2+1)/3).

After calculating this avg_n_overlap, the final dataset that I am trying to create would look like below:

Code:

clear input float firm_id str6 market float avg_n_overlap 1 "a" 1.6666666 1 "b" 1.6666666 1 "c" 1.6666666 1 "d" 1.6666666 2 "b" 1.3333334 2 "c" 1.3333334 2 "f" 1.3333334 3 "a" 1 3 "b" 1 4 "c" .6666667 end

I hope my question makes sense. I would like to write a macro code that accomplishes the task explained above, but I am a bit stuck at this point, and I would really appreciate any help!
Tags: count, panel data, variable creation

Øyvind Snilsberg

Join Date: Oct 2021
Posts: 591

21 Nov 2022, 02:25

should avg_n_overlap = 1.5 for firm 3 (2 overlapping markets with firm 1 and 1 overlapping market with firm 2 = (2 + 1) / 2 = 1.5)?

Code:

tempfile x
save `x'
gen n=1
reshape wide n, i(market) j(firm_id)
joinby market using `x'
collapse (sum) n*, by(firm_id)
reshape long n, i(firm_id)
collapse (sum) a = n (count) b = n if n & firm_id != _j, by(firm_id)
gen avg_n_overlap = a / b
merge 1:m firm_id using `x'

Code:

       firm_id   market   avg_n_overlap  
  1.         1        a        1.666667  
  2.         1        b        1.666667  
  3.         1        c        1.666667  
  4.         1        d        1.666667  
  5.         2        b        1.333333  
  6.         2        c        1.333333  
  7.         2        f        1.333333  
  8.         3        a             1.5  
  9.         3        b             1.5  
 10.         4        c               1

Last edited by Øyvind Snilsberg; 21 Nov 2022, 02:27.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35632

21 Nov 2022, 02:36

This code assumes that -- as in your data example -- values of market are words in Stata's sense, meaning that they do not contain spaces. If not, then use strtoname() first, or just replace spaces by underscores or other punctuation.

The fractions 1/3 and 2/3 here round following the usual rules, not as in #1.

Code:

clear
input float firm_id str6 market
1 "a"
1 "b"
1 "c"
1 "d"
2 "b"
2 "c"
2 "f"
3 "a"
3 "b"
4 "c"
end

bysort firm_id : gen profile = market[1] if _n == 1
by firm_id : replace profile = profile[_n-1] + " " + market if _n > 1
by firm_id : replace profile = profile[_N]

save maindata

keep firm_id profile
by firm_id : keep if _n == 1

local max = _N
local maxM1 = _N - 1
gen overlap = 0
forval i = 1/`maxM1' {
    local I = `i' + 1
    forval j = `I'/`max' {
        local A = profile[`i']
        local B = profile[`j']
        local inter : list A & B
        local count : list sizeof inter
        replace overlap = overlap + `count' in `i'
        replace overlap = overlap + `count' in `j'
    }
}

replace overlap = overlap / `maxM1'

merge 1:m firm_id using maindata

drop _merge

sort firm_id market

list, sepby(firm_id)
    +---------------------------------------+
     | firm_id   profile    overlap   market |
     |---------------------------------------|
  1. |       1   a b c d   1.666667        a |
  2. |       1   a b c d   1.666667        b |
  3. |       1   a b c d   1.666667        c |
  4. |       1   a b c d   1.666667        d |
     |---------------------------------------|
  5. |       2     b c f   1.333333        b |
  6. |       2     b c f   1.333333        c |
  7. |       2     b c f   1.333333        f |
     |---------------------------------------|
  8. |       3       a b          1        a |
  9. |       3       a b          1        b |
     |---------------------------------------|
 10. |       4         c   .6666667        c |
     +---------------------------------------+

See also https://journals.sagepub.com/doi/epu...36867X20909698 for concatenating string values over observations.

EDIT: #2 got there first. My concern in writing mine was to avoid an explosion in file sizes that isn't really needed.

Also, does the real problem also involve data in time?

Last edited by Nick Cox; 21 Nov 2022, 02:43.

Comment

Kyeonggook Park

Join Date: Sep 2021

Posts: 9
#4

21 Nov 2022, 08:18

Thank you both for the answers! I will go through line by line to study what I should have done. (@Øyvind Snilsberg, yes you are right, I made an error in the example calculation.)

@Nick Cox: Thank you. "market" variable does actually contain spaces, and I will be sure to deal with this before using your code. The actual problem also does involve data in time. In other words, I want to calculate this overlap measure for each firm-year. Should I do something different in this case?
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35632

21 Nov 2022, 09:13

Surely you need to change the code if you want to monitor changes, as the code shown in #2 and #3 knows and says nothing about time.

This code includes instances of 0 overlap in the calculations.

Code:

clear
input float year firm_id  str6 market
2020 1 "a"
2020 1 "b"
2020 1 "c"
2020 1 "d"
2020 2 "b"
2020 2 "c"
2020 2 "f"
2020 3 "a"
2020 3 "b"
2020 4 "c"
2021 1 "a"
2021 1 "b"
2021 1 "c"
2021 1 "d"
2021 2 "b"
2021 2 "c"
2021 2 "f"
2021 3 "a"
2021 3 "b"
2021 4 "c"
2021 4 "d" 
end

bysort year firm_id (market): gen profile = market[1] if _n == 1
by year firm_id: replace profile = profile[_n-1] + " " + market if _n > 1
by year firm_id: replace profile = profile[_N]

save maindata, replace 

keep year firm_id profile
by year firm_id: keep if _n == 1
gen long obsno = _n 
gen overlap = 0 

egen y = group(year) 
su y, meanonly 


forval y = 1/`r(max)' { 

su obsno if y == `y', meanonly 
local maxM1 = r(N) - 1 
local first = r(min) 
local last = r(max) 
local lastM1 = r(max) - 1 

forval i = `first'/`lastM1' {
    local I = `i' + 1
    forval j = `I'/`last' {
        local A = profile[`i']
        local B = profile[`j']
        local inter : list A & B
        local count : list sizeof inter
        replace overlap = overlap + `count' in `i'
        replace overlap = overlap + `count' in `j'
    }
}

replace overlap = overlap / `maxM1' in `first'/`last' 

} 

merge 1:m firm_id year using maindata

drop _merge

sort year firm_id market

list, sepby(year firm_id)

     +----------------------------------------------------------+
     | year   firm_id   profile   obsno    overlap   y   market |
     |----------------------------------------------------------|
  1. | 2020         1   a b c d       1   1.666667   1        a |
  2. | 2020         1   a b c d       1   1.666667   1        b |
  3. | 2020         1   a b c d       1   1.666667   1        c |
  4. | 2020         1   a b c d       1   1.666667   1        d |
     |----------------------------------------------------------|
  5. | 2020         2     b c f       2   1.333333   1        b |
  6. | 2020         2     b c f       2   1.333333   1        c |
  7. | 2020         2     b c f       2   1.333333   1        f |
     |----------------------------------------------------------|
  8. | 2020         3       a b       3          1   1        a |
  9. | 2020         3       a b       3          1   1        b |
     |----------------------------------------------------------|
 10. | 2020         4         c       4   .6666667   1        c |
     |----------------------------------------------------------|
 11. | 2021         1   a b c d       5          2   2        a |
 12. | 2021         1   a b c d       5          2   2        b |
 13. | 2021         1   a b c d       5          2   2        c |
 14. | 2021         1   a b c d       5          2   2        d |
     |----------------------------------------------------------|
 15. | 2021         2     b c f       6   1.333333   2        b |
 16. | 2021         2     b c f       6   1.333333   2        c |
 17. | 2021         2     b c f       6   1.333333   2        f |
     |----------------------------------------------------------|
 18. | 2021         3       a b       7          1   2        a |
 19. | 2021         3       a b       7          1   2        b |
     |----------------------------------------------------------|
 20. | 2021         4       c d       8          1   2        c |
 21. | 2021         4       c d       8          1   2        d |
     +----------------------------------------------------------+

Comment

Kyeonggook Park

Join Date: Sep 2021

Posts: 9
#6

22 Nov 2022, 03:24

Nick Cox Of course, I was thinking about creating a "firm_id_year" column before using the code that you provided before. I am assuming that this approach should work similarly?

Thank you so much for this additional code that would take into account change over time!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35632
#7

22 Nov 2022, 04:33

Indeed, a joint variable (you say "column") that was a composite of identifier and year could be used. But then you'd need to make sure that you were only comparing within the same year. The code in #5 is fiddly but feel free to rewrite it.
Comment
Kyeonggook Park

Join Date: Sep 2021

Posts: 9
#8

22 Nov 2022, 11:13

Right, of course. Thank you for the insightful comment and all the help!
Comment

Announcement