Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to pair a firm with its peers


    Dear fellow Stata users,

    I am working on a panel data and wanted to pair a firm with its peers that share similar product lines and geographic markets. Below is an illustrative dataset. For each firm, it details the top 3 product lines and the top 3 geographic regions. My matching criteria are as follows: in a given year, for each focal firm, a competitive peer must share at least two out of three top product lines and at least two out of three top regions.

    Screen Shot 2023-10-05 at 6.48.24 PM.png




















    In this dataset, in year 2001, A and C are considered peers, and B & C are peers, but A and C are not peers.

    I wonder how to conduct the pairing in STATA. I want all firms meeting the above criteria identified as peers of a target firm.

    Thank you in advance for any help you can provide.
    Last edited by Yayuan Ren; 05 Oct 2023, 17:51.

  • #2
    There are several ways to do this, with tradeoffs involving speed and memory usage. Here's one:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float year str1 firm str9 region1 str10 region2 str9 region3 str4 line1 str10 line2 str4 line3
    2001 "A" "midwest"   "southwest" "west"      "auto" "home"       "fire"
    2001 "B" "west"      "southwest" "northeast" "auto" "multiperil" "wc"  
    2001 "C" "northeast" "southwest" "midwest"   "auto" "wc"         "home"
    2002 "A" "midwest"   "southwest" "west"      "auto" "home"       "fire"
    2002 "D" "midwest"   "northeast" "midwest"   "fire" "multiperil" "wc"  
    2002 "E" "northeast" "southwest" "midwest"   "auto" "wc"         "home"
    end
    
    tempfile original
    save `original'
    
    local regions
    forvalues i = 1/3 {
        levelsof region`i', local(temp)
        local regions: list regions | temp
    }
    foreach r of local regions {
        gen byte in_`r' = inlist(`"`r'"', region1, region2, region3)
    }
    
    local lines
    forvalues i = 1/3 {
        levelsof line`i', local(temp)
        local lines: list lines | temp
    }
    
    foreach l of local lines {
        gen byte has_`l' = inlist(`"`l'"', line1, line2, line3)
    }
    
    keep year firm in_* has_*
    preserve
    rename (firm in_* has_*) =_2
    tempfile holding
    save `holding'
    restore
    rename (firm in_* has_*) =_1
    joinby year using `holding'
    drop if firm_1 == firm_2
    
    gen region_matches = 0
    foreach v of varlist in_*_1 {
        local w: subinstr local v "_1" "_2"
        replace region_matches = region_matches + `v'*`w'
    }
    keep if region_matches >= 2
    gen line_matches = 0
    foreach v of varlist has_*_1 {
        local w: subinstr local v "_1" "_2"
        replace line_matches = line_matches + `v'* `w'
    }
    keep if line_matches >= 2
    
    keep year firm*
    isid year firm_1 firm_2, sort
    save `"`holding'"', replace
    
    use `original', clear
    rename firm firm_1
    merge 1:m year firm_1 using `holding', assert(master match) nogenerate
    isid year firm_1 firm_2, sort miss
    The final result has the original data, with firm renamed to firm_1. The new variable firm_2 contains the name of a peer firm. If there is more than one peer for a given firm (e.g. firm C in 2001 is pper to both A and B, then there will be extra copies of that firm's original observation, one for each identified peer. If a firm has no peer, then the variable firm_2 will have missing value.

    In the future, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    I know you meant well showing a screenshot of your data. But that is without a doubt the least helpful way to show example data. There is no way to import a screenshot into Stata to work with and develop code. I was sufficiently intrigued by your problem that I typed the data into Stata by hand. But most people will not do that, at least not most of the time. So you decrease your chances of getting a timely and helpful response when you show a data screenshot.

    By the way, in the course of manually entering your data, I discovered what I believe is a typo in your data set. For 2001, firm B, line2 "mutiperil" should, I think, be "multiperil." So, before you even start running my code to see how it works in your full data set, you should carefully clean your data to remove typos, and also to make sure that you don't have alternative spellings of words, or non-uniform case, or stray leading, trailing, or embedded blank spaces. Remember that Stata does not read like a person: it matches strings literally, and exactly, character-by-character. It will not think that "abc", "Abc", and "abc " are the same thing: it will consider them all different.
    Last edited by Clyde Schechter; 05 Oct 2023, 20:08.

    Comment


    • #3
      Hello Clyde Schechter , I greatly appreciate your kindness and especially the patience to input the data. Your help is exceptional and will benefit my project a lot. I have another question: my panel dataset comprises 3000 firms spanning 15 years. Do you think the code will perform efficiently with a dataset of this size?

      Comment


      • #4
        It depends. The run time for this code will not depend very much on the size of the data set itself. It will depend more on how many different regions there are (which, I would imagine is pretty small) and on how many different products there are--which might be large or small. The only inherently slow commands in the code are the -use-, -save-, -preserve-, -restore-, -joinby- and -merge- commands, because they access the hard drive. But 3,000 firms * 15 years is only 45,000 observations. In today's terms that's a small data set and can be digested quickly by those commands. Well, except -joinby-. It's going to have to create a data set with 225 million observations. So here's a revised version that should do the same thing much faster.

        Code:
        tempfile original
        save `original'
        
        local regions
        forvalues i = 1/3 {
            levelsof region`i', local(temp)
            local regions: list regions | temp
        }
        foreach r of local regions {
            gen byte in_`r' = inlist(`"`r'"', region1, region2, region3)
        }
        
        local lines
        forvalues i = 1/3 {
            levelsof line`i', local(temp)
            local lines: list lines | temp
        }
        
        foreach l of local lines {
            gen byte has_`l' = inlist(`"`l'"', line1, line2, line3)
        }
        
        keep year firm in_* has_*
        sort year firm
        preserve
        rename (firm in_* has_*) =_2
        tempfile holding
        save `holding'
        restore
        rename (in_* has_*) =_1
        gen holding = "`holding'"
        
        capture program drop one_year_firm
        program define one_year_firm
            local holding = holding[1]
            sort year firm
            merge 1:m year using `holding', keep(master match) nogenerate sorted
        
            drop if firm == firm_2
        
            gen region_matches = 0
            foreach v of varlist in_*_1 {
                local w: subinstr local v "_1" "_2"
                replace region_matches = region_matches + `v'*`w'
            }
            keep if region_matches >= 2
            gen line_matches = 0
            foreach v of varlist has_*_1 {
                local w: subinstr local v "_1" "_2"
                replace line_matches = line_matches + `v'* `w'
            }
            keep if line_matches >= 2
        
            keep year firm*
            isid year firm firm_2, sort
            exit
        end
        
        runby one_year_firm, by(year firm) verbose
        
        merge m:1 year firm using `original', assert(using match) nogenerate
        isid year firm firm_2, sort miss
        To use this, you must install -runby-, by Robert Picard and me, available from SSC.

        Note: In this version, instead of firm_1 and firm_2 you end up with firm and firm_2. The original region and line information about firm is retained in each observation of firm.
        Last edited by Clyde Schechter; 06 Oct 2023, 11:07.

        Comment


        • #5
          Hi Clyde Schechter,

          I appreciate your assistance with my previous inquiries; the provided codes worked very well. I've encountered another problem in my project. This time, I need to pair a firm with another firm that belongs to Group 1 but is not part of Group 2.

          Using the dataset below as an example, the first dataset pairs firms with those from Group 1, and the second dataset pairs firms with firms from Group 2. My objective is to pair a firm with its peers in Group 1 but not among Group 2. In this example, the pair of C1 and C1001 and the pair of C2 and C3821 are not the results I'm seeking, as C1001 and C3821 are both in Group1 and Group2.


          * Example dataset: Group1

          input float year str5 firm str5 group1

          2001 "C1" "C1001"
          2001 "C1" "C121"
          2001 "C1" "C321"
          2001 "C1" "C221"
          2002 "C2" "C188"
          2002 "C2" "C273"
          2002 "C2" "C3821"
          2002 "C2" "C176"
          end

          save Group1

          *Example dataset: Group2
          clear
          input float year str5 firm str5 group2

          2001"C1" "C1001"
          2001"C1" "C234"
          2002 "C2" "C271"
          2002"C2" "C181"
          2002"C2" "C3821"
          2002"C2" "C283"
          2002"C2" "C194"
          2002"C2" "C272"
          2002"C2" "C387"
          end

          save Group2

          Thank you so much for your input.

          Comment


          • #6
            Well, within the framework of the code in #4, what you need to do is eliminate the firms that are part of group2 from the `holding' file before you save it. But I can't advise you on how to do that because I cannot discern from any of the data examples you have shown, how to identify which firms those are.

            Comment


            • #7
              I have used the code in #4 to conduct pairing. Group 1 data are the results of pairing using pairing criteria #1, and similarly, the Group 2 data are pairing results using criteria #2. "Cxxx" are simply firm ID. My goal is to identify the pairings that meet criteria #1 but exclusive from criteria #2. In the above example, the pairing of 2001 "C1" "C1001" and 2002 "C2" "C3821" should be excluded from Group 1 as they also appear in Group 2 pairings.

              Comment


              • #8
                So starting from two separate data sets that enumerate the different pairings, as exemplified in #7:
                Code:
                use Group1, clear
                rename group1 paired_firm
                frame create group2
                frame group2 {
                    use Group2
                    rename group2 paired_firm
                }
                frlink 1:1 firm paired_firm, frame(group2)
                keep if missing(group2)
                drop group2
                frame drop group2
                will leave in active memory the subset of the Group1 data set that are not also found in the Group2 data set.

                Comment


                • #9
                  One more question, in the #8 codes, I don't see the variable "year." I wanted to make sure that the pairing is conducted on yearly basis. Thanks a lot!

                  Comment


                  • #10
                    I don't see the variable "year." I wanted to make sure that the pairing is conducted on yearly basis.
                    Minor modification
                    Code:
                    use Group1, clear
                    rename group1 paired_firm
                    frame create group2
                    frame group2 {
                        use Group2
                        rename group2 paired_firm
                    }
                    frlink 1:1 firm year paired_firm, frame(group2)
                    keep if missing(group2)
                    drop group2
                    frame drop group2

                    Comment


                    • #11
                      Perfect! Thank you so much for your help.

                      Comment

                      Working...
                      X