Remove Duplicate Observations

Eliot Assoudeh

Join Date: Dec 2021

Posts: 7
#1

06 Aug 2024, 08:27

Greetings,

I have a large dataset in which some observations have up to 17 variations, meaning that their ID is the same but their operational codes differ. For a project, I classified more than 50 operational codes into three categories (MH = 0, 1, and 3). I wrote the following script to drop every duplicate (same ID) that has dup greater than 1 and an MH code of 0 or 3. But when I tabulate the dup variable after dropping, I still have many duplicate observations that I don't need. I would appreciate your advice.

drop if (dup > 1 & (MH == 0 | MH == 3))

Thanks,
Eliot
Clyde Schechter

Join Date: Apr 2014

Posts: 30064
#2

06 Aug 2024, 08:55

Well, first of all, you should still be left with duplicates where MH == 1. So before you conclude that something is wrong, check whether the surviving duplicates have MH == 1.

If not, then probably you did something wrong in calculating the variable dup, so that dup > 1 doesn't correctly capture all and only surplus observations. You don't show how you did that, so it's anybody's guess what (if anything) went wrong and how it might be fixed.
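One way to run that check is sketched below. The toy data are invented for illustration and only stand in for the real dataset; the dup variable is built with the running-counter definition that appears later in the thread.

```stata
* Toy data (hypothetical values, not the poster's dataset)
clear
input long ID byte MH
1234 1
1234 1
1234 3
5678 3
5678 3
9012 0
end

* Running counter within ID: 0 if the ID is unique, otherwise 1, 2, ...
sort ID
quietly by ID: gen dup = cond(_N==1, 0, _n)

drop if dup > 1 & (MH == 0 | MH == 3)

* Inspect the survivors that are still flagged as surplus
list ID dup MH if dup > 1, sepby(ID)

* Count flagged survivors whose MH is not 1; this should be zero
count if dup > 1 & MH != 1
```

If the count is not zero in the real data, the next thing to examine is how dup and MH were created and stored.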
Eliot Assoudeh

Join Date: Dec 2021

Posts: 7
#3

06 Aug 2024, 09:02

Thanks, Clyde. I have checked, and the number of duplicates with MH == 1 is correct. Here is the code:

sort ID
quietly by ID: gen dup = cond(_N==1,0,_n)
order dup, after(ID)
tabulate dup

drop if (dup > 1 & (MH == 0 | MH == 3))

I still have IDs with MH == 3 that I don't need.

For instance, I have the following after the drop command above:

ID dup MH
1234 1 1
1234 2 3

5678 2 3
5678 3 3

Last edited by Eliot Assoudeh; 06 Aug 2024, 09:22.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1371
#4

07 Aug 2024, 01:17

The core problem here, I believe, is that you are checking for dup > 1 instead of dup > 0. For observations that are unique, dup is zero. dup = 1 means there is one duplicate (i.e. two observations are identical), and so on. This is the standard way in which the variable is produced by duplicates tag.

Code:

clear
set obs 3
gen x = _n
gen y = (_n < 3) * 5
gen z = 10
duplicates tag x, gen(dupx)
duplicates tag y, gen(dupy)
duplicates tag z, gen(dupz)

which yields:

Code:

. list, noobs

  +---------------------------------+
  | x   y    z   dupx   dupy   dupz |
  |---------------------------------|
  | 1   5   10      0      1      2 |
  | 2   5   10      0      1      2 |
  | 3   0   10      0      0      2 |
  +---------------------------------+

Last edited by Hemanshu Kumar; 07 Aug 2024, 01:27.
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1371
#5

07 Aug 2024, 03:31

Sorry, I missed your post in #3 where you show how you generate your dup variable, which looks a bit different from the one produced by duplicates tag. So ignore my response in #4.
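To make the difference concrete, here is a small sketch that builds both variables side by side on invented data. The variable name tag is an assumption chosen for illustration.

```stata
* Toy data (hypothetical): ID 1 appears once, ID 2 twice, ID 3 three times
clear
input long ID
1
2
2
3
3
3
end

* Definition from #3: 0 for a unique ID, otherwise a running counter 1, 2, ...
sort ID
quietly by ID: gen dup = cond(_N==1, 0, _n)

* duplicates tag: number of OTHER observations sharing the ID,
* identical for every member of the group
duplicates tag ID, gen(tag)

sort ID dup
list, sepby(ID)
```

With the definition from #3, dup > 1 flags every observation in a group except the first, whereas with duplicates tag the same condition would flag whole groups of three or more and miss pairs entirely.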

Before we go any further, it should be impossible to have any observations with dup greater than 1 and also MH equal to 3 if you were literally doing what you said:

Code:

drop if (dup > 1 & (MH == 0 | MH == 3))

So can you confirm this is actually what you use in your code, or did you try and show us some toy code to reproduce the problem and so this is not exactly what you do?

Last edited by Hemanshu Kumar; 07 Aug 2024, 03:52.
Eliot Assoudeh

Join Date: Dec 2021

Posts: 7
#6

08 Aug 2024, 05:45

Thanks for your note, Hemanshu. That is exactly the code that I used, and there are multiple instances where dup is greater than 1 and MH == 3. Someone suggested that I use -sortby- rather than -sort-; that removed more of the duplicates, but not all of them.