How to pair a firm with its peers

Yayuan Ren

Join Date: Aug 2020

Posts: 9
#1

How to pair a firm with its peers

05 Oct 2023, 17:37

Dear fellow Stata users,

I am working on a panel data and wanted to pair a firm with its peers that share similar product lines and geographic markets. Below is an illustrative dataset. For each firm, it details the top 3 product lines and the top 3 geographic regions. My matching criteria are as follows: in a given year, for each focal firm, a competitive peer must share at least two out of three top product lines and at least two out of three top regions.

In this dataset, in year 2001, A and C are considered peers, and B & C are peers, but A and C are not peers.

I wonder how to conduct the pairing in STATA. I want all firms meeting the above criteria identified as peers of a target firm.

Thank you in advance for any help you can provide.

Last edited by Yayuan Ren; 05 Oct 2023, 17:51.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30161
#2

05 Oct 2023, 20:04

There are several ways to do this, with tradeoffs involving speed and memory usage. Here's one:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float year str1 firm str9 region1 str10 region2 str9 region3 str4 line1 str10 line2 str4 line3 2001 "A" "midwest" "southwest" "west" "auto" "home" "fire" 2001 "B" "west" "southwest" "northeast" "auto" "multiperil" "wc" 2001 "C" "northeast" "southwest" "midwest" "auto" "wc" "home" 2002 "A" "midwest" "southwest" "west" "auto" "home" "fire" 2002 "D" "midwest" "northeast" "midwest" "fire" "multiperil" "wc" 2002 "E" "northeast" "southwest" "midwest" "auto" "wc" "home" end tempfile original save `original' local regions forvalues i = 1/3 { levelsof region`i', local(temp) local regions: list regions | temp } foreach r of local regions { gen byte in_`r' = inlist(`"`r'"', region1, region2, region3) } local lines forvalues i = 1/3 { levelsof line`i', local(temp) local lines: list lines | temp } foreach l of local lines { gen byte has_`l' = inlist(`"`l'"', line1, line2, line3) } keep year firm in_* has_* preserve rename (firm in_* has_*) =_2 tempfile holding save `holding' restore rename (firm in_* has_*) =_1 joinby year using `holding' drop if firm_1 == firm_2 gen region_matches = 0 foreach v of varlist in_*_1 { local w: subinstr local v "_1" "_2" replace region_matches = region_matches + `v'*`w' } keep if region_matches >= 2 gen line_matches = 0 foreach v of varlist has_*_1 { local w: subinstr local v "_1" "_2" replace line_matches = line_matches + `v'* `w' } keep if line_matches >= 2 keep year firm* isid year firm_1 firm_2, sort save `"`holding'"', replace use `original', clear rename firm firm_1 merge 1:m year firm_1 using `holding', assert(master match) nogenerate isid year firm_1 firm_2, sort miss

The final result has the original data, with firm renamed to firm_1. The new variable firm_2 contains the name of a peer firm. If there is more than one peer for a given firm (e.g. firm C in 2001 is pper to both A and B, then there will be extra copies of that firm's original observation, one for each identified peer. If a firm has no peer, then the variable firm_2 will have missing value.

In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

I know you meant well showing a screenshot of your data. But that is without a doubt the least helpful way to show example data. There is no way to import a screenshot into Stata to work with and develop code. I was sufficiently intrigued by your problem that I typed the data into Stata by hand. But most people will not do that, at least not most of the time. So you decrease your chances of getting a timely and helpful response when you show a data screenshot.

By the way, in the course of manually entering your data, I discovered what I believe is a typo in your data set. For 2001, firm B, line2 "mutiperil" should, I think, be "multiperil." So, before you even start running my code to see how it works in your full data set, you should carefully clean your data to remove typos, and also to make sure that you don't have alternative spellings of words, or non-uniform case, or stray leading, trailing, or embedded blank spaces. Remember that Stata does not read like a person: it matches strings literally, and exactly, character-by-character. It will not think that "abc", "Abc", and "abc " are the same thing: it will consider them all different.

Last edited by Clyde Schechter; 05 Oct 2023, 20:08.
Comment
Yayuan Ren

Join Date: Aug 2020

Posts: 9
#3

06 Oct 2023, 10:10

Hello Clyde Schechter , I greatly appreciate your kindness and especially the patience to input the data. Your help is exceptional and will benefit my project a lot. I have another question: my panel dataset comprises 3000 firms spanning 15 years. Do you think the code will perform efficiently with a dataset of this size?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30161

06 Oct 2023, 10:27

It depends. The run time for this code will not depend very much on the size of the data set itself. It will depend more on how many different regions there are (which, I would imagine is pretty small) and on how many different products there are--which might be large or small. The only inherently slow commands in the code are the -use-, -save-, -preserve-, -restore-, -joinby- and -merge- commands, because they access the hard drive. But 3,000 firms * 15 years is only 45,000 observations. In today's terms that's a small data set and can be digested quickly by those commands. Well, except -joinby-. It's going to have to create a data set with 225 million observations. So here's a revised version that should do the same thing much faster.

Code:

tempfile original
save `original'

local regions
forvalues i = 1/3 {
    levelsof region`i', local(temp)
    local regions: list regions | temp
}
foreach r of local regions {
    gen byte in_`r' = inlist(`"`r'"', region1, region2, region3)
}

local lines
forvalues i = 1/3 {
    levelsof line`i', local(temp)
    local lines: list lines | temp
}

foreach l of local lines {
    gen byte has_`l' = inlist(`"`l'"', line1, line2, line3)
}

keep year firm in_* has_*
sort year firm
preserve
rename (firm in_* has_*) =_2
tempfile holding
save `holding'
restore
rename (in_* has_*) =_1
gen holding = "`holding'"

capture program drop one_year_firm
program define one_year_firm
    local holding = holding[1]
    sort year firm
    merge 1:m year using `holding', keep(master match) nogenerate sorted

    drop if firm == firm_2

    gen region_matches = 0
    foreach v of varlist in_*_1 {
        local w: subinstr local v "_1" "_2"
        replace region_matches = region_matches + `v'*`w'
    }
    keep if region_matches >= 2
    gen line_matches = 0
    foreach v of varlist has_*_1 {
        local w: subinstr local v "_1" "_2"
        replace line_matches = line_matches + `v'* `w'
    }
    keep if line_matches >= 2

    keep year firm*
    isid year firm firm_2, sort
    exit
end

runby one_year_firm, by(year firm) verbose

merge m:1 year firm using `original', assert(using match) nogenerate
isid year firm firm_2, sort miss

To use this, you must install -runby-, by Robert Picard and me, available from SSC.

Note: In this version, instead of firm_1 and firm_2 you end up with firm and firm_2. The original region and line information about firm is retained in each observation of firm.

Last edited by Clyde Schechter; 06 Oct 2023, 11:07.

Comment

Yayuan Ren

Join Date: Aug 2020

Posts: 9
#5

23 Oct 2023, 10:31

Hi Clyde Schechter,

I appreciate your assistance with my previous inquiries; the provided codes worked very well. I've encountered another problem in my project. This time, I need to pair a firm with another firm that belongs to Group 1 but is not part of Group 2.

Using the dataset below as an example, the first dataset pairs firms with those from Group 1, and the second dataset pairs firms with firms from Group 2. My objective is to pair a firm with its peers in Group 1 but not among Group 2. In this example, the pair of C1 and C1001 and the pair of C2 and C3821 are not the results I'm seeking, as C1001 and C3821 are both in Group1 and Group2.

* Example dataset: Group1

input float year str5 firm str5 group1

2001 "C1" "C1001"
2001 "C1" "C121"
2001 "C1" "C321"
2001 "C1" "C221"
2002 "C2" "C188"
2002 "C2" "C273"
2002 "C2" "C3821"
2002 "C2" "C176"
end

save Group1

*Example dataset: Group2
clear
input float year str5 firm str5 group2

2001"C1" "C1001"
2001"C1" "C234"
2002 "C2" "C271"
2002"C2" "C181"
2002"C2" "C3821"
2002"C2" "C283"
2002"C2" "C194"
2002"C2" "C272"
2002"C2" "C387"
end

save Group2

Thank you so much for your input.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30161
#6

23 Oct 2023, 11:27

Well, within the framework of the code in #4, what you need to do is eliminate the firms that are part of group2 from the `holding' file before you save it. But I can't advise you on how to do that because I cannot discern from any of the data examples you have shown, how to identify which firms those are.
Comment
Yayuan Ren

Join Date: Aug 2020

Posts: 9
#7

23 Oct 2023, 12:40

I have used the code in #4 to conduct pairing. Group 1 data are the results of pairing using pairing criteria #1, and similarly, the Group 2 data are pairing results using criteria #2. "Cxxx" are simply firm ID. My goal is to identify the pairings that meet criteria #1 but exclusive from criteria #2. In the above example, the pairing of 2001 "C1" "C1001" and 2002 "C2" "C3821" should be excluded from Group 1 as they also appear in Group 2 pairings.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30161
#8

23 Oct 2023, 15:19

So starting from two separate data sets that enumerate the different pairings, as exemplified in #7:

Code:

use Group1, clear rename group1 paired_firm frame create group2 frame group2 { use Group2 rename group2 paired_firm } frlink 1:1 firm paired_firm, frame(group2) keep if missing(group2) drop group2 frame drop group2

will leave in active memory the subset of the Group1 data set that are not also found in the Group2 data set.
Comment
Yayuan Ren

Join Date: Aug 2020

Posts: 9
#9

23 Oct 2023, 18:25

One more question, in the #8 codes, I don't see the variable "year." I wanted to make sure that the pairing is conducted on yearly basis. Thanks a lot!
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30161

#10

23 Oct 2023, 20:42

I don't see the variable "year." I wanted to make sure that the pairing is conducted on yearly basis.

Minor modification

Code:

use Group1, clear
rename group1 paired_firm
frame create group2
frame group2 {
    use Group2
    rename group2 paired_firm
}
frlink 1:1 firm year paired_firm, frame(group2)
keep if missing(group2)
drop group2
frame drop group2

Comment

Yayuan Ren

Join Date: Aug 2020

Posts: 9
#11

24 Oct 2023, 21:07

Perfect! Thank you so much for your help.
Comment

Announcement