Removing groups with the same value occurring for any observation in the group

Kristen Valentine

Join Date: Jul 2014

Posts: 11
#1

Removing groups with the same value occurring for any observation in the group

15 Jul 2022, 14:47

I have a dataset of sales transactions where there are potentially multiple buyers and multiple sellers for a given sale transaction as illustrated below:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long sale_id double(seller_id buyer_id) 37990841 11020 88369 37990841 39642 88369 38070842 75912 56098 38070842 56098 56098 38080470 74588 74588 38080470 86591 74588 38090443 74895 74617 38090443 74617 74617 38140391 10216 . 38140391 10216 . 38410027 11703 11703 38410027 11703 20458 38490322 75034 75034 38490322 75034 75034 38550746 64231 64231 38550746 64231 64231 38650703 . 64231 38650703 . 76655 38650964 88786 48207 38650964 48207 48207 38650964 86021 48207 38650964 59555 48207 end

I need to remove all observations within a sale_id group if any buyer is the same as any seller in the sale_id group and retain all other observations. For example, both observations for sale_id 37990841 should be included in my final dataset because the buyer_id is not the same as either of the seller_ids. However, both observations for sale_id 38070842 should be excluded because buyer_id 56098 also appears as a seller_id for the same sale_id. Here is what my final dataset should look like:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long sale_id double(seller_id buyer_id) 37990841 11020 88369 37990841 39642 88369 38140391 10216 . 38140391 10216 . 38650703 . 64231 38650703 . 76655 end

A simple drop command is obviously insufficient here, but I am unsure how to proceed.

Code:

drop if seller_id==buyer_id

Thank you in advance for your help
Tags: None
Dirk Enzmann

Join Date: Apr 2014

Posts: 599
#2

15 Jul 2022, 15:28

I expect that someone will find a more efficient solution (e.g. without creating a new variable), but this will do what you want:

Code:

bys sale_id: gen sellbuy_id = seller_id == buyer_id by sale_id: replace sellbuy_id = sum(sellbuy_id) by sale_id: replace sellbuy_id = sellbuy_id[_N] keep if sellbuy_id==0 drop sellbuy_id
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#3

15 Jul 2022, 15:54

I interpreted O.P.'s request differently. The code in #3 looks for individual observations in which the buyer_id and seller_id are the same. My understanding is that O.P. also wants to identify situations where a buyer in one observation has the same id as a seller in any other observation with the same sale_id. The code in #2 does not capture that.

In the example data shown, all sale_id groups with any such match include a single observation where buyer_id == seller_id. To illustrate the difference, however, I have modified O.P.'s example data to make the match for sale_id 38080470 occur only across different observations, not in the same one. The code in #2 does not pick this up and fails to drop sale_id 38080470.

Here is my solution:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long sale_id double(seller_id buyer_id) 37990841 11020 88369 37990841 39642 88369 38070842 75912 56098 38070842 56098 56098 38080470 74588 56098 // DATA CHANGED TO ILLUSTRATE CODE 38080470 86591 74588 38090443 74895 74617 38090443 74617 74617 38140391 10216 . 38140391 10216 . 38410027 11703 11703 38410027 11703 20458 38490322 75034 75034 38490322 75034 75034 38550746 64231 64231 38550746 64231 64231 38650703 . 64231 38650703 . 76655 38650964 88786 48207 38650964 48207 48207 38650964 86021 48207 38650964 59555 48207 end rangestat (count) matched = buyer_id, by(sale_id) interval(seller_id buyer_id buyer_id) replace matched = 0 if missing(matched) by sale_id (matched), sort: replace matched = matched[_N] drop if matched drop matched

-rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.

I think that both my approach and that in #2 are fair interpretations of what was asked for in #1. O.P. will have to decide which one she intended.
Comment
Dirk Enzmann

Join Date: Apr 2014

Posts: 599
#4

15 Jul 2022, 16:15

Clyde Schechter : I believe the code in #2 does capture situations where a buyer in one observation has the same id as a seller in any other observation with the same sale_id. Note that I did use

Code:

by sale_id:
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30355

15 Jul 2022, 17:18

No, they are different. With the modified data, it produces a different result that retains sale_id 38080470:

Code:

. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input long sale_id double(seller_id buyer_id)

          sale_id   seller_id    buyer_id
  1. 37990841 11020 88369
  2. 37990841 39642 88369
  3. 38070842 75912 56098
  4. 38070842 56098 56098
  5. 38080470 74588 56098 // DATA CHANGED TO ILLUSTRATE CODE
  6. 38080470 86591 74588
  7. 38090443 74895 74617
  8. 38090443 74617 74617
  9. 38140391 10216     .
 10. 38140391 10216     .
 11. 38410027 11703 11703
 12. 38410027 11703 20458
 13. 38490322 75034 75034
 14. 38490322 75034 75034
 15. 38550746 64231 64231
 16. 38550746 64231 64231
 17. 38650703     . 64231
 18. 38650703     . 76655
 19. 38650964 88786 48207
 20. 38650964 48207 48207
 21. 38650964 86021 48207
 22. 38650964 59555 48207
 23. end

.
.
. bys sale_id: gen sellbuy_id = seller_id == buyer_id

. by sale_id: replace sellbuy_id = sum(sellbuy_id)
(5 real changes made)

. by sale_id: replace sellbuy_id = sellbuy_id[_N]
(5 real changes made)

. keep if sellbuy_id==0
(14 observations deleted)

. drop sellbuy_id

.
. list, noobs clean

     sale_id   seller~d   buyer_id  
    37990841      11020      88369  
    37990841      39642      88369  
    38080470      74588      56098  
    38080470      86591      74588  
    38140391      10216          .  
    38140391      10216          .  
    38650703          .      64231  
    38650703          .      76655

Last edited by Clyde Schechter; 15 Jul 2022, 17:26.

Comment

Dirk Enzmann

Join Date: Apr 2014

Posts: 599
#6

15 Jul 2022, 17:49

Clyde Schechter : You are correct -- I missed the possibility that seller_id and buyer_id need not be the same in the same row.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#7

15 Jul 2022, 18:10

Of course, as I mentioned at the end of #3, it is unclear which of our solutions does what O.P. actually wants!
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#8

16 Jul 2022, 01:59

I also interpreted the question of OP the way how Clyde interpreted it. And this is a nasty and hairy problem. So while I was scratching my head how to solve it, Dirk provided a solution to the simpler problem. I just waited for OP to show up and clear the air.

As OP is not showing up, here is my solution to the hard problem interpretation. I think it does the same as Clyde code does, but without user contributted commands.

Code:

. sort sale_id . gen flag = 0 . levelsof sale_id, local(sales) 37990841 38070842 38080470 38090443 38140391 38410027 38490322 38550746 38650703 38650964 . foreach l of local sales { 2. levelsof seller_id if sale_id == `l', local(sellers) 3. levelsof buyer_id if sale_id == `l', local(buyers) 4. foreach s of local sellers { 5. replace flag = 1 if strpos("`buyers'", "`s'")>0 & sale_id == `l' 6. } 7. } drop if flag

I think it gives the same result as Clyde's code.
Comment
Kristen Valentine

Join Date: Jul 2014

Posts: 11
#9

16 Jul 2022, 20:17

Thank you all. Clyde did understand the problem as I intended it. I appreciate him also catching that my sample data failed to include the case where the common buyer_id and seller_id do not appear in the same row for any observation in the sale_id group.
Comment

Announcement

Removing groups with the same value occurring for any observation in the group

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment