Matching two datasets against the first three letters of both first name and surname

Colin S Brown

Join Date: Mar 2015

Posts: 5
#1

Matching two datasets against the first three letters of both first name and surname

05 Mar 2015, 02:48

Many thanks in advance to all for your help. I have two datasets, one from a clinic, the second for a list of admissions to regional hospitals. I am looking to see if any of our patients were readmitted elsewhere after having being discharged from our clinic.

The problem is many names are poorly entered, particularly the latter characters, or have multiple spelling variations, e.g.:

Smith, Smiths
Stephen, Stephan
Sharon, Sharyn

I was hoping to do the following rather straightforward analysis (although on about 10,000 readmissions) using first name, last name, and age:

sort first last age
quietly by first last age: gen dup = cond(_N==1,0,_n)
tab dup
drop if dup==0
browse

Then I'd look at those duplicates seen in our clinic and then at another hospital at a date following been seen with us. However even though I've gone through the lists to sort out any immediate problem e.g. inverted first names and surnames, I cannot correct for the differences in name spelling. This may be input error or simply people giving their name spelt differently to different hospitals.

Is there a way to sort and duplicate based on the first three characters of a name? That way I could generate an albeit larger list but one I could look through for duplicates? I've looked through a lot of topic posts but can't find a way to do this. Any help greatly appreciated.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

05 Mar 2015, 02:54

The first three characters of each name can be extracted using

Code:

gen fir = substr(first, 1, 3) gen las = substr(last, 1, 3)

and then those variables can be checked for duplicates. There is a dedicated duplicates command as you probably know.
Comment
Colin S Brown

Join Date: Mar 2015

Posts: 5
#3

05 Mar 2015, 03:49

Many thanks Nick, that is much appreciated.

One final question is is there any way to automatically search using two .dta files, rather than merge them into a rather unwieldy larger set to look for duplicates. The first file is about 500 patients, the second about 10,000.

If they are called clinic.dta and hospital.dta, ideally I'd sort each dataset by the new variables fir las age, then say for each entry in clinic.dta, can I look at duplicates in hospital.dta purely using the line listing of clinic.dta.

The reason for this is there is already a lot of duplication in the hospital.dta that I don't need to look at unless they relate to an entry form our clinic.dta database. I could merge them and create a separate variable in clinic.dta e.g. match, and label it as 1, then look for duplicates if match==1 but am not sure if that would discount the same names in the hospital.dta if they did not have a match==1 variable.

Sorry, am sure this is rather simplistic to answer but have spent a few days thinking about it and have not come up trumps.

Best wishes,

Colin
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

05 Mar 2015, 04:15

It is not clear to me whether these datasets should be appended or merged.

To get better advice, you may have to give concrete details that are realistic. Clearly they don't have to be real.
Comment

Colin S Brown

Join Date: Mar 2015
Posts: 5

05 Mar 2015, 07:01

Thanks indeed Nick,

I've used your substr command for the first three letters and that works very well. The two datasets are:

Data Set A - Clinic.dta, made-up patients, in one clinic that we operate (in actual fact this file is four clinics, clinics a-d, about 500 entries)

A

age	first	last	middle	address	admissiondate	dischargedate	location
50	steve	smith	j	111 pine st	27/12/2014	29/12/2014	clinic a
42	james	jones	robert	222 pine road	23/11/2014	25/11/2014	clinic a
25	claire	cousins	r	333 pine hill	23/10/2014	25/10/2014	clinic a
27	david	duck	g	444 pine lane	01/02/2015	10/02/2015	clinic a

Data Set B - Hospital.dta, made-up patients in a laboratory dataset with results from multiple hospitals/clinics sent for testing - what we are interesting in is a sample sent following discharge from our clinic a-d from any of the 20+ hospitals that will have sent results to this master spreadsheet, about 10,500 entries

B

age	first	last	middle	address	sampledate	location
50	steve	smith		222 pine st	01/01/2015	hospital h
50	steve	smith	j	222 pine road	25/12/2014	hospital g
25	claire	cousins	r	333 pine hill	01/10/2014	hospital f
27	david	duck	g	444 pine lane	15/02/2015	hospital e

Duplicates

Therefore I'm not interesting in the first line, as they are at a different address, but I would like to see that as a duplicate as they had a sample sent from a hospital after discharge from our clinic. I would look through that manually as the addresses are too discrepantly entered to match otherwise.

For the second and third I don't want to see them as duplicates if possible, as they were sent from other hospitals before they were seen in our clinic. I only want to look at upwards referral from our clinics onwards.

Number 4 I would want, matched patient with same address discharged from our clinic and seen in hospital afterwards.

I do NOT want to see the many many multiple samples sent from clinics or hospitals that are not in the original dataset Clinic.dta (clinics a-d). There will be many of these but they do not interest me.

Best wishes,

Colin

Last edited by Colin S Brown; 05 Mar 2015, 07:16.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#6

05 Mar 2015, 08:48

Going back to your original question, if you are not entirely happy with the way matching on the first three letters of the names is working out, you might look at matching on Soundex codes, which can be calculated using the built-in soundex() function [-help soundex()-]. Soundex codes were specifically invented for the purpose of overcoming variation in spellings of names and they work pretty well.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

05 Mar 2015, 10:37

The number of observations appears pretty small so here's an approach using cross to find all pairwise combinations of observations between the two datasets. All combinations where the sample date precedes the discharge date can be dropped. You can fine-tune each variable to improve the matching. You can even decide to give additional weight to some variables in terms of matching.

Code:

clear
input age    str20 (first    last    middle    address    admissiondate    dischargedate location)
50    "steve"    "smith"    "j"    "111 pine st"    "27/12/2014"    "29/12/2014"    "clinic a"
42    "james"    "jones"    "robert"    "222 pine road"    "23/11/2014"    "25/11/2014"    "clinic a"
25    "claire"    "cousins"    "r"    "333 pine hill"    "23/10/2014"    "25/10/2014"    "clinic a"
27    "david"    "duck"    "g"    "444 pine lane"    "01/02/2015"    "10/02/2015"    "clinic a"
end

* get the data in shape
gen datein = date(admissiondate,"DMY")
gen dateout = date(dischargedate,"DMY")
drop admissiondate dischargedate
format %td date*
rename * =_a
gen id_a = _n
tempfile a
save "`a'"

clear
input age    str20 (first    last    middle    address    sampledate    location)
50 "steve" "smith" "" "222 pine st" "01/01/2015" "hospital h"
50 "steve" "smith" "j" "222 pine road" "25/12/2014" "hospital g"
25 "claire" "cousins" "r" "333 pine hill" "01/10/2014" "hospital f"
27 "david" "duck" "g" "444 pine lane" "15/02/2015" "hospital e"
end 

gen id_b = _n
gen date_b = date(sampledate,"DMY")
drop sampledate
format %td date_b

* form all pairwise combinations
cross using "`a'"
sort id_a id_b

* samples must be after discharge date
keep if date_b >= dateout_a

* Make any adjustments to improve the matching here
gen address6 = substr(address,1,6)
gen address6_a = substr(address_a,1,6)

* Compare all variables that should normally match and count matches
gen nmatch = abs(age - age_a) <= 1
foreach v of varlist first last middle address6 {
    replace nmatch = nmatch + (`v' == `v'_a)
}

* Order by increasing match level, would normally check for ties
* in best match
sort id_a nmatch id_b
by id_a: gen best = _n == _N

list age* first* last* middle* address6* id_a nmatch if best, noobs

Comment

Colin S Brown

Join Date: Mar 2015

Posts: 5
#8

06 Mar 2015, 08:15

Dear Clyde and Richard,

Clyde, many thanks for this - I had thought of soundex bur didn't have experience using it. I'll play around with it and see what I come up with.

Richard, that's great - very useful indeed (those were made up data as I'm working with patient names so there will likely be about 50 matches in total in 10,000 observations) - I'll try this in the larger dataset.

Best wishes indeed,

Colin
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

06 Mar 2015, 12:04

If you liked the concept, let me suggest a variation that scales up better for larger problems at the cost of some extra data management gymnastics. The idea is to use joinby to form all pairwise combination of ids when the value of a variable matches. The joinby command is simply cross within groups of observations that have the same values.

Code:

clear
input age    str20 (first    last    middle    address    admissiondate    dischargedate location)
50    "steve"    "smith"    "j"    "111 pine st"    "27/12/2014"    "29/12/2014"    "clinic a"
42    "james"    "jones"    "robert"    "222 pine road"    "23/11/2014"    "25/11/2014"    "clinic a"
25    "claire"    "cousins"    "r"    "333 pine hill"    "23/10/2014"    "25/10/2014"    "clinic a"
27    "david"    "duck"    "g"    "444 pine lane"    "01/02/2015"    "10/02/2015"    "clinic a"
end

gen datein = date(admissiondate,"DMY")
gen dateout = date(dischargedate,"DMY")
drop admissiondate dischargedate
format %td date*

* add any transformed variables that would help the matching
gen address6 = substr(address,1,6)
gen soundslike = soundex(first + last)
gen id_a = _n
tempfile a
save "`a'"

clear
input age    str20 (first    last    middle    address    sampledate    location)
50 "steve" "smith" "" "222 pine st" "01/01/2015" "hospital h"
50 "steve" "smith" "j" "222 pine road" "25/12/2014" "hospital g"
25 "claire" "cousins" "r" "333 pine hill" "01/10/2014" "hospital f"
27 "david" "duck" "g" "444 pine lane" "15/02/2015" "hospital e"
end 

gen id_b = _n
gen date_b = date(sampledate,"DMY")
drop sampledate
format %td date_b

* add any transformed variables that would help the matching
gen address6 = substr(address,1,6)
gen soundslike = soundex(first + last)
tempfile b
save "`b'"

* Match by variables
foreach v of varlist age first last middle address6 soundslike {
    use "`a'", clear
    joinby `v' using "`b'"
    keep `v' id_a id_b
    tempfile `v'
    save "``v''"
}

* Combine all matches (can't use a varlist in the loop here)
use "`age'", clear
foreach v in first last middle address6 soundslike {
    merge 1:1 id_a id_b using "``v''", nogen
}

* Apply the date constraint
merge m:1 id_a using "`a'", keep(match) keepusing(dateout) nogen
merge m:1 id_b using "`b'", keep(match) keepusing(date_b) nogen
keep if date_b >= dateout

* Make sure that id_a id_b are still unique
isid id_a id_b, sort
order id_a id_b

* Count matches
gen nmatch = 0
foreach v of varlist age first last middle address6 soundslike {
    replace nmatch = nmatch + 1 if !mi(`v')
}

list, sepby(id_a) noobs

Announcement