Remove duplicates

Adriana Peci

Join Date: Nov 2014

Posts: 32
#1

Remove duplicates

18 Mar 2015, 12:57

Hi all,

I need some help transforming a multiple record dataset to unique person. In my dataset only 80% of persons have id present and even when is present it may have errors in it. Therefore, to account for that I need to use the help of other identifiers such as : first name, last name and birth date which may have errors present . I am illustrating below my dataset and some of issues relating to it to help understanding.

The way I have done the transformation was using three steps: In the first step, I identified unique persons using duplicates report and tag commands among those with id present by using the id first name, last name and birth date. During the second step, I identified unique persons among those that do not have id present based on the first name, last name and birth date only. In the last I repeat the second step by looking at all records. I am just wondering if there is a more robust method and shorter way to do it ?

Your help is really appreciated,

id last_name first_name birth_date
11 Red Pencil June-01-2004
11 Red Penicl June-01-2004
20 Empty Box May-27-1967
. Empty Box May-27-1967
30 Beautiful Day April-4-1939
. Beautiful Day April-3-1939
14 Red Carpet September-01-2001
41 Red Carpet September-01-2001
14 Red Carpet September-01-2001

Adriana
Tags: None
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#2

18 Mar 2015, 13:51

If you are sure that the combination of first_name last_name and birth_date identifies individually everyone (and have no missing value), thus two possibilities are open to you :

1) Re-generate a proper individual_id, with the egen group command (it might be useful for anything else)

Code:

drop id egen id_individual = group(last_name first_name birth_date)

This will give you an id foreach combination of these three variables, hence for each person, so in your example "empty box" will have the same id_individual, rather than one missing.
Afterwhile if you want to remove duplicates use the duplicate drop command or the line : bysort id_indiv : keep if _n==1, that will only keep the first observation of each individual.

2) Directly drop duplicates based on the three identification variables (first_name last_name birth_date)

Code:

duplicates drop first_name last_name birth_date, force

This is simpler, however it solution doesn't guarantee that you'll have a proper individual_id, since it could kept the "empty box" observation where id is missing.

Hope this helps
Charlie
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#3

18 Mar 2015, 14:08

Careful about using the -force- option with duplicates drop. If there are observations that agree on first_name last_name and birth_date but have differing information about other variables, you will end up with having arbitrarily selected one version of those other variables. If those other variables don't matter, then that's ok. But really it would be safer to check them out first:

Code:

unab vbles: _all local key first_name last_name birth_date local vbles: list vbles - key gen byte discrepancy = 0 foreach v of varlist vbles { by `key' (`v'), sort: replace discrepancy = 1 if `v'[1] != `v'[_N] } assert discrepancy == 0

If the -assert- statement passes, then you can just do the -duplicates drop- command without the -force- option and it will run. (For that matter, with all the variables other than first_name last_name and birth_date concordant, you can just simplify it to -duplicates drop-

But if the -assert- statement fails, you should look at those observations where discrepancy = 1 and try to decide how to reconcile those cases. Picking one arbitrarily (which is what -duplicates drop first_name last_name birth_date, force- will do) is only occasionally a good way to handle that.
Comment
Adriana Peci

Join Date: Nov 2014

Posts: 32
#4

19 Mar 2015, 10:00

Hi both,
The first suggestion does not account for possible errors in any of the identifiers. I loved the second suggestion but I had to modify a bit since I still want to use the id to identify unique individuals . Id helps me identify reasons for the discrepancy caused by errors in the last_name first_name and date of birth. Thank you so much to both of you.
Comment

Announcement

Comment

Comment

Comment