How to keep duplicates and drop non-duplicates?

Marianne Sie

Join Date: Nov 2021

Posts: 4
#1

03 Nov 2021, 10:40

Hi all,

I am currently working on data quality and I would like to create a dataset that keeps the duplicate observations and drops the non-duplicate ones.
Is there any code that can help me to achieve this goal?

I have shared the dataset with the link below. I'm trying to keep the duplicates based on these variables: facilityid pickupdate uniqueno description.
Based on the duplicates report, I should end up with a dataset of 500 observations:

duplicates report  /* Duplicate is not case-sensitive */

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |        69763             0
        2 |          916           458
        3 |           63            42
--------------------------------------

Is it possible to keep the duplicates only?

Thank you very much!
https://drive.google.com/file/d/1x8I...ew?usp=sharing

Last edited by Marianne Sie; 03 Nov 2021, 11:05.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

03 Nov 2021, 10:44

Code:

duplicates tag, gen(flag)
keep if flag
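
For readers following along, here is a minimal sketch of what the `duplicates tag` approach does, run on made-up data. The values are hypothetical; only the four variable names come from the original question. With no variable list, `duplicates tag` compares observations on all variables; the sketch names the four key variables explicitly, which is an assumption about what is wanted here.

```stata
* Toy data (hypothetical values): one group of 3 copies, one of 2, one unique
clear
input facilityid pickupdate uniqueno str10 description
1 100 11 "box"
1 100 11 "box"
1 100 11 "box"
2 101 12 "bag"
2 101 12 "bag"
3 102 13 "crate"
end

* flag holds the number of other copies of each observation (0 if unique)
duplicates tag facilityid pickupdate uniqueno description, gen(flag)

* Keep every member of a duplicated group, including the first copy
keep if flag
count
```

Note that this keeps all copies in each duplicated group, not just the surplus ones.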
Marianne Sie

Join Date: Nov 2021

Posts: 4
#3

03 Nov 2021, 19:26

Hi Clyde Schechter, many thanks for your prompt reply.
I tried this code before, but it does not give me the final dataset I am after. Let me explain.

I would like to keep these 500 (458+42) observations in the column Surplus (see below) because these are the observations that will be dropped if I drop the duplicates.
I don't want to drop these 500 observations but rather keep them. Is it possible to do so?
duplicates report

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |        69763             0
        2 |          916           458
        3 |           63            42
--------------------------------------

When I run the command "keep if flag", I'm left with the observations (916 + 63) but that's not what I need. I would like to keep the 500 surplus only.
I hope that my explanation is clear.

keep if flag
duplicates report

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        2 |          916           458
        3 |           63            42
--------------------------------------

Thanks for your help

Last edited by Marianne Sie; 03 Nov 2021, 19:30.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#4

03 Nov 2021, 20:28

Oh, I see what you want.

Code:

by _all, sort: drop if _n == 1

will leave only the surplus observations.
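
To see why this matches the Surplus column: each group of k identical observations loses its first copy and keeps the other k - 1. The 916 observations in pairs form 458 groups and leave 458; the 63 observations in triples form 21 groups and leave 42; that gives 500 in total. A minimal sketch on made-up data (the values are hypothetical; only the variable names come from the original question):

```stata
* Toy data (hypothetical values): one group of 3 copies, one of 2, one unique
clear
input facilityid pickupdate uniqueno str10 description
1 100 11 "box"
1 100 11 "box"
1 100 11 "box"
2 101 12 "bag"
2 101 12 "bag"
3 102 13 "crate"
end

* Sort on all variables, then drop the first copy in each group of
* identical observations; unique observations are dropped entirely
by _all, sort: drop if _n == 1

* 2 observations remain from the triple and 1 from the pair
count
```

If the duplicates should be judged on a subset of variables only, the variable list can replace `_all`; in that case, which copy counts as "first" within a group depends on the sort order.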
Marianne Sie

Join Date: Nov 2021

Posts: 4
#5

04 Nov 2021, 11:28

Hi Clyde Schechter, this code worked perfectly! Thanks a ton!

I have another question: is it possible to keep the duplicates (500 observations) plus the observations with missing values in the variable pickupdate?
Is there code that lets me keep or drop observations based on two conditions at once?

Thank you!!
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#6

04 Nov 2021, 11:57

Code:

by _all, sort: keep if _n > 1 | missing(pickupdate)
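
A minimal sketch of how the combined condition behaves, again on made-up data (the values are hypothetical; only the variable names come from the original question). The `|` operator is a logical "or", so an observation is kept if it is a surplus copy, or if pickupdate is missing, or both.

```stata
* Toy data (hypothetical values): one pair, one unique observation with a
* missing pickupdate, and one unique observation with nothing missing
clear
input facilityid pickupdate uniqueno str10 description
1 100 11 "box"
1 100 11 "box"
2 .   12 "bag"
3 102 13 "crate"
end

* Keep surplus copies OR any observation whose pickupdate is missing
by _all, sort: keep if _n > 1 | missing(pickupdate)

* The second "box" and the "bag" observation remain
count
```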
Marianne Sie

Join Date: Nov 2021

Posts: 4
#7

05 Nov 2021, 05:19

Thank you very much, Clyde.
Both commands worked perfectly.