Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Removing groups with the same value occurring for any observation in the group

    I have a dataset of sales transactions where there are potentially multiple buyers and multiple sellers for a given sale transaction as illustrated below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long sale_id double(seller_id buyer_id)
    37990841 11020 88369
    37990841 39642 88369
    38070842 75912 56098
    38070842 56098 56098
    38080470 74588 74588
    38080470 86591 74588
    38090443 74895 74617
    38090443 74617 74617
    38140391 10216     .
    38140391 10216     .
    38410027 11703 11703
    38410027 11703 20458
    38490322 75034 75034
    38490322 75034 75034
    38550746 64231 64231
    38550746 64231 64231
    38650703     . 64231
    38650703     . 76655
    38650964 88786 48207
    38650964 48207 48207
    38650964 86021 48207
    38650964 59555 48207
    end
    I need to remove all observations within a sale_id group if any buyer is the same as any seller in the sale_id group and retain all other observations. For example, both observations for sale_id 37990841 should be included in my final dataset because the buyer_id is not the same as either of the seller_ids. However, both observations for sale_id 38070842 should be excluded because buyer_id 56098 also appears as a seller_id for the same sale_id. Here is what my final dataset should look like:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long sale_id double(seller_id buyer_id)
    37990841 11020 88369
    37990841 39642 88369
    38140391 10216     .
    38140391 10216     .
    38650703     . 64231
    38650703     . 76655
    end
    A simple drop command is obviously insufficient here, but I am unsure how to proceed.
    Code:
    drop if seller_id==buyer_id
    Thank you in advance for your help

  • #2
    I expect that someone will find a more efficient solution (e.g. without creating a new variable), but this will do what you want:
    Code:
    bys sale_id: gen sellbuy_id = seller_id == buyer_id
    by sale_id: replace sellbuy_id = sum(sellbuy_id)
    by sale_id: replace sellbuy_id = sellbuy_id[_N]
    keep if sellbuy_id==0
    drop sellbuy_id

    Comment


    • #3
      I interpreted O.P.'s request differently. The code in #3 looks for individual observations in which the buyer_id and seller_id are the same. My understanding is that O.P. also wants to identify situations where a buyer in one observation has the same id as a seller in any other observation with the same sale_id. The code in #2 does not capture that.

      In the example data shown, all sale_id groups with any such match include a single observation where buyer_id == seller_id. To illustrate the difference, however, I have modified O.P.'s example data to make the match for sale_id 38080470 occur only across different observations, not in the same one. The code in #2 does not pick this up and fails to drop sale_id 38080470.

      Here is my solution:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long sale_id double(seller_id buyer_id)
      37990841 11020 88369
      37990841 39642 88369
      38070842 75912 56098
      38070842 56098 56098
      38080470 74588 56098 // DATA CHANGED TO ILLUSTRATE CODE
      38080470 86591 74588
      38090443 74895 74617
      38090443 74617 74617
      38140391 10216     .
      38140391 10216     .
      38410027 11703 11703
      38410027 11703 20458
      38490322 75034 75034
      38490322 75034 75034
      38550746 64231 64231
      38550746 64231 64231
      38650703     . 64231
      38650703     . 76655
      38650964 88786 48207
      38650964 48207 48207
      38650964 86021 48207
      38650964 59555 48207
      end
      
      rangestat (count) matched = buyer_id, by(sale_id) interval(seller_id buyer_id buyer_id)
      replace matched = 0 if missing(matched)
      by sale_id (matched), sort: replace matched = matched[_N]
      drop if matched
      drop matched
      -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

      I think that both my approach and that in #2 are fair interpretations of what was asked for in #1. O.P. will have to decide which one she intended.

      Comment


      • #4
        Clyde Schechter : I believe the code in #2 does capture situations where a buyer in one observation has the same id as a seller in any other observation with the same sale_id. Note that I did use
        Code:
        by sale_id:

        Comment


        • #5
          No, they are different. With the modified data, it produces a different result that retains sale_id 38080470:
          Code:
          . * Example generated by -dataex-. To install: ssc install dataex
          . clear
          
          . input long sale_id double(seller_id buyer_id)
          
                    sale_id   seller_id    buyer_id
            1. 37990841 11020 88369
            2. 37990841 39642 88369
            3. 38070842 75912 56098
            4. 38070842 56098 56098
            5. 38080470 74588 56098 // DATA CHANGED TO ILLUSTRATE CODE
            6. 38080470 86591 74588
            7. 38090443 74895 74617
            8. 38090443 74617 74617
            9. 38140391 10216     .
           10. 38140391 10216     .
           11. 38410027 11703 11703
           12. 38410027 11703 20458
           13. 38490322 75034 75034
           14. 38490322 75034 75034
           15. 38550746 64231 64231
           16. 38550746 64231 64231
           17. 38650703     . 64231
           18. 38650703     . 76655
           19. 38650964 88786 48207
           20. 38650964 48207 48207
           21. 38650964 86021 48207
           22. 38650964 59555 48207
           23. end
          
          .
          .
          . bys sale_id: gen sellbuy_id = seller_id == buyer_id
          
          . by sale_id: replace sellbuy_id = sum(sellbuy_id)
          (5 real changes made)
          
          . by sale_id: replace sellbuy_id = sellbuy_id[_N]
          (5 real changes made)
          
          . keep if sellbuy_id==0
          (14 observations deleted)
          
          . drop sellbuy_id
          
          .
          . list, noobs clean
          
               sale_id   seller~d   buyer_id  
              37990841      11020      88369  
              37990841      39642      88369  
              38080470      74588      56098  
              38080470      86591      74588  
              38140391      10216          .  
              38140391      10216          .  
              38650703          .      64231  
              38650703          .      76655
          Last edited by Clyde Schechter; 15 Jul 2022, 17:26.

          Comment


          • #6
            Clyde Schechter : You are correct -- I missed the possibility that seller_id and buyer_id need not be the same in the same row.

            Comment


            • #7
              Of course, as I mentioned at the end of #3, it is unclear which of our solutions does what O.P. actually wants!

              Comment


              • #8
                I also interpreted the question of OP the way how Clyde interpreted it. And this is a nasty and hairy problem. So while I was scratching my head how to solve it, Dirk provided a solution to the simpler problem. I just waited for OP to show up and clear the air.

                As OP is not showing up, here is my solution to the hard problem interpretation. I think it does the same as Clyde code does, but without user contributted commands.

                Code:
                . sort sale_id
                
                . gen flag = 0
                
                . levelsof sale_id, local(sales)
                37990841 38070842 38080470 38090443 38140391 38410027 38490322 38550746 38650703 38650964
                
                . foreach l of local sales {
                  2. levelsof seller_id if sale_id == `l', local(sellers)
                  3. levelsof buyer_id if sale_id == `l', local(buyers)
                  4. foreach s of local sellers {
                  5. replace flag = 1 if strpos("`buyers'", "`s'")>0 & sale_id == `l'
                  6. }
                  7. }
                
                drop if flag
                I think it gives the same result as Clyde's code.

                Comment


                • #9
                  Thank you all. Clyde did understand the problem as I intended it. I appreciate him also catching that my sample data failed to include the case where the common buyer_id and seller_id do not appear in the same row for any observation in the sale_id group.

                  Comment

                  Working...
                  X