I would like to create an id variable that uniquely identifies people within a dataset that looks like this:
cardholderid is a variable in the original data that is supposed to uniquely identify people, but it does not. I have improved the data by running code like group_id cardholderid, matchby(fsx lsx dob) to cut down on repeat cardholderids issued to people who have typos in their names but the same date of birth.
The issue is that there are also a fair number of date of birth typos, and I've made the mock dataset here show that clearly: you can easily tell that cardholderid 12345 and 60000 identify the same person, who has a relatively uncommon name, but the date of birth is off by exactly a month. The dob typos are all over the place: sometimes the date is off by a few days (e.g., the same uncommon name but two entries, one with dob 13nov1980 and one with dob 19nov1980), sometimes the month is off by one (as with kit fugate above), and sometimes the year is off by as much as five years for the same uncommon name.
How can I de-duplicate my dataset to create one id for someone like kit fugate, born 13 dec/nov 1997? Some of the typos are more obvious than others, depending on how bad the dob typo is (a date off by a day versus off by five years) and how uncommon the name is (if there are two John Smiths who share a birth month and year but have different days, we would be less certain these are really the same person than we would be for two Rashida Tlaibs who share a birth month and year but are off by a day). So it would be great to have some way to build a similarity score and manually check matches, as opposed to doing something blunt like group_id cardholderid, matchby(fsx lsx dob_month dob_year).
Thanks in advance!
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float cardholderid str9 firstname str6 lastname str4(fsx lsx) str9 dob
12345 "kit" "fugate" "K300" "F230" "13dec1997"
12345 "kit" "fugate" "K300" "F230" "13dec1997"
60000 "kit" "fugate" "K300" "F230" "13nov1997"
23456 "ben" "smith" "B500" "S530" "27jan2004"
23456 "ben" "smith" "B500" "S530" "27jan2004"
70000 "ben" "smith" "B500" "S530" "26jan2004"
end
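To illustrate the kind of scoring I have in mind, here is an untested sketch: self-join the data on the soundex codes with -joinby-, then score each candidate pair by how many days apart the two dobs are, and review the closest pairs by hand. The _2 suffixes and the tempfile name are just placeholders; a name-frequency count (e.g., bysort fsx lsx: gen n_name = _N before the join) could be folded in to discount common names.

Code:
* Untested sketch, not a working solution
preserve
keep cardholderid fsx lsx dob
duplicates drop
rename (cardholderid dob) (cardholderid_2 dob_2)
tempfile pairs
save `pairs'
restore
joinby fsx lsx using `pairs'
* keep each cross-id pair once and drop self-pairs
drop if cardholderid >= cardholderid_2
gen dob_days = abs(date(dob, "DMY") - date(dob_2, "DMY"))
* smaller dob_days = more likely the same person; inspect by hand
gsort dob_days
list cardholderid cardholderid_2 dob dob_2 dob_days if dob_days <= 366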