Many thanks in advance to all for your help. I have two datasets: one from our clinic, the second a list of admissions to regional hospitals. I am looking to see if any of our patients were readmitted elsewhere after being discharged from our clinic.
The problem is many names are poorly entered, particularly the latter characters, or have multiple spelling variations, e.g.:
Smith, Smiths
Stephen, Stephan
Sharon, Sharyn
I was hoping to do the following rather straightforward analysis (although on about 10,000 readmissions) using first name, last name, and age:
sort first last age
quietly by first last age: gen dup = cond(_N==1,0,_n)
tab dup
drop if dup==0
browse
Then I'd look at those duplicates seen in our clinic and then at another hospital at a date after being seen by us. However, even though I've gone through the lists to sort out any immediate problems, e.g. inverted first names and surnames, I cannot correct for the differences in name spelling. This may be input error or simply people giving their name spelt differently to different hospitals.
Is there a way to sort and flag duplicates based on the first three characters of a name? That way I could generate a list that, albeit larger, I could look through for duplicates. I've looked through a lot of topic posts but can't find a way to do this. Any help greatly appreciated.
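To make the idea concrete, here is a rough sketch of what I imagine, adapting the commands above. It is only a guess at an approach, not something I have tested on the real data. The variables first3, last3, and dup3 are names I made up for illustration; first, last, and age are the variables used above, and I am assuming first and last are string variables.

```stata
* Sketch only: flag records that share the first three letters of
* first name and surname, plus age.
* first3, last3, and dup3 are hypothetical helper variables.
gen first3 = lower(substr(trim(first), 1, 3))
gen last3  = lower(substr(trim(last), 1, 3))

* Same duplicate-flagging logic as above, but on the truncated names
sort last3 first3 age
quietly by last3 first3 age: gen dup3 = cond(_N==1, 0, _n)
tab dup3

* Keep only records that have at least one candidate match
drop if dup3 == 0
browse
```

Lowercasing and trimming are there so that differences in case or stray spaces don't hide a match. Under this scheme Stephen/Stephan and Sharon/Sharyn would fall into the same group, at the cost of also grouping genuinely different names that share a three-letter prefix, so the resulting list would still need checking by eye.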