How to Keep Duplicated Variables based on the Value of another Column

Jaylin Spencer

Join Date: Apr 2022

Posts: 3
#1

How to Keep Duplicated Variables based on the Value of another Column

07 May 2022, 21:54

Hello all!

I am working on cleaning up a dataset, and I do not know how to do so. Here are some of the relevant parts of the data set that has approximately 400,000 observations, with duplicates based on an ID number. What I want to do is keep the case IDs that correspond to the highest outcome. So, for the following:
ID Number Date Outcome

3 2/2/22 4

3 2/2/22 3

3 2/2/22 3

3 2/2/22 2

I want to keep only the first row because it has the highest code. Some IDs have 5 corresponding values for outcomes; some have 2; I think one even has 10.

I have tried to google this, but got very confused by duplicates and dups. I'd really appreciate any suggestions anyone has.
Tags: None

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17749

08 May 2022, 03:47

Jaylin:
tweaking your example a bit:

Code:

. input ID_Number Date Outcome

     ID_Number       Date    Outcome
  1. 3 2022 4
  2. 3 2022 3
  3. 3 2022 3
  4. 3 2022 2
  5. end

. bysort ID_Number: egen wanted=max( Outcome)

. list

     +------------------------------------+
     | ID_Num~r   Date   Outcome   wanted |
     |------------------------------------|
  1. |        3   2022         4        4 |
  2. |        3   2022         3        4 |
  3. |        3   2022         3        4 |
  4. |        3   2022         2        4 |
     +------------------------------------+


. keep if Outcome==wanted


. list

     +------------------------------------+
     | ID_Num~r   Date   Outcome   wanted |
     |------------------------------------|
  1. |        3   2022         4        4 |
     +------------------------------------+

.

Kind regards,
Carlo
(Stata 19.0)

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35810
#3

08 May 2022, 04:20

What's unclear in #1 includes whether you want to do this separately by date. If you do, then code in the spirit of @Carlo Lazzaro's suggestion might be

Code:

bysort ID_number Date (Outcome) : keep if _n == _N

There is no need to calculate the maximum in a new variable -- unless you have missing values on Outcome. If you do, then

Code:

gen OK = !missing(Outcome) bysort OK ID_number Date (Outcome) : keep if OK & _n == _N

Leave Date out of the code if it is immaterial.

Note that the problem here is not one of duplicates, as the problem hinges on selecting observations with different values.

Last edited by Nick Cox; 08 May 2022, 04:27.
1 like
Comment
Jaylin Spencer

Join Date: Apr 2022

Posts: 3
#4

08 May 2022, 12:13

Thank you both!!! I had some missing values, so the combination of your suggestions worked perfectly. I am so grateful for your help. Thank you!
Comment

ID Number	Date	Outcome
3	2/2/22	4
3	2/2/22	3
3	2/2/22	3
3	2/2/22	2

Announcement

How to Keep Duplicated Variables based on the Value of another Column

Comment

Comment

Comment