Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matching two datasets against the first three letters of both first name and surname

    Many thanks in advance to all for your help. I have two datasets, one from a clinic, the second for a list of admissions to regional hospitals. I am looking to see if any of our patients were readmitted elsewhere after having being discharged from our clinic.

    The problem is many names are poorly entered, particularly the latter characters, or have multiple spelling variations, e.g.:

    Smith, Smiths
    Stephen, Stephan
    Sharon, Sharyn

    I was hoping to do the following rather straightforward analysis (although on about 10,000 readmissions) using first name, last name, and age:

    sort first last age
    quietly by first last age: gen dup = cond(_N==1,0,_n)
    tab dup
    drop if dup==0
    browse


    Then I'd look at those duplicates seen in our clinic and then at another hospital at a date following been seen with us. However even though I've gone through the lists to sort out any immediate problem e.g. inverted first names and surnames, I cannot correct for the differences in name spelling. This may be input error or simply people giving their name spelt differently to different hospitals.

    Is there a way to sort and duplicate based on the first three characters of a name? That way I could generate an albeit larger list but one I could look through for duplicates? I've looked through a lot of topic posts but can't find a way to do this. Any help greatly appreciated.


  • #2
    The first three characters of each name can be extracted using

    Code:
    gen fir = substr(first, 1, 3)
    gen las = substr(last, 1, 3)
    and then those variables can be checked for duplicates. There is a dedicated duplicates command as you probably know.

    Comment


    • #3
      Many thanks Nick, that is much appreciated.

      One final question is is there any way to automatically search using two .dta files, rather than merge them into a rather unwieldy larger set to look for duplicates. The first file is about 500 patients, the second about 10,000.

      If they are called clinic.dta and hospital.dta, ideally I'd sort each dataset by the new variables fir las age, then say for each entry in clinic.dta, can I look at duplicates in hospital.dta purely using the line listing of clinic.dta.

      The reason for this is there is already a lot of duplication in the hospital.dta that I don't need to look at unless they relate to an entry form our clinic.dta database. I could merge them and create a separate variable in clinic.dta e.g. match, and label it as 1, then look for duplicates if match==1 but am not sure if that would discount the same names in the hospital.dta if they did not have a match==1 variable.

      Sorry, am sure this is rather simplistic to answer but have spent a few days thinking about it and have not come up trumps.

      Best wishes,

      Colin

      Comment


      • #4
        It is not clear to me whether these datasets should be appended or merged.

        To get better advice, you may have to give concrete details that are realistic. Clearly they don't have to be real.

        Comment


        • #5
          Thanks indeed Nick,

          I've used your substr command for the first three letters and that works very well. The two datasets are:

          Data Set A - Clinic.dta, made-up patients, in one clinic that we operate (in actual fact this file is four clinics, clinics a-d, about 500 entries)

          A
          age first last middle address admissiondate dischargedate location
          50 steve smith j 111 pine st 27/12/2014 29/12/2014 clinic a
          42 james jones robert 222 pine road 23/11/2014 25/11/2014 clinic a
          25 claire cousins r 333 pine hill 23/10/2014 25/10/2014 clinic a
          27 david duck g 444 pine lane 01/02/2015 10/02/2015 clinic a
          Data Set B - Hospital.dta, made-up patients in a laboratory dataset with results from multiple hospitals/clinics sent for testing - what we are interesting in is a sample sent following discharge from our clinic a-d from any of the 20+ hospitals that will have sent results to this master spreadsheet, about 10,500 entries

          B
          age first last middle address sampledate location
          50 steve smith 222 pine st 01/01/2015 hospital h
          50 steve smith j 222 pine road 25/12/2014 hospital g
          25 claire cousins r 333 pine hill 01/10/2014 hospital f
          27 david duck g 444 pine lane 15/02/2015 hospital e
          Duplicates

          Therefore I'm not interesting in the first line, as they are at a different address, but I would like to see that as a duplicate as they had a sample sent from a hospital after discharge from our clinic. I would look through that manually as the addresses are too discrepantly entered to match otherwise.

          For the second and third I don't want to see them as duplicates if possible, as they were sent from other hospitals before they were seen in our clinic. I only want to look at upwards referral from our clinics onwards.

          Number 4 I would want, matched patient with same address discharged from our clinic and seen in hospital afterwards.

          I do NOT want to see the many many multiple samples sent from clinics or hospitals that are not in the original dataset Clinic.dta (clinics a-d). There will be many of these but they do not interest me.

          Best wishes,

          Colin
          Last edited by Colin S Brown; 05 Mar 2015, 07:16.

          Comment


          • #6
            Going back to your original question, if you are not entirely happy with the way matching on the first three letters of the names is working out, you might look at matching on Soundex codes, which can be calculated using the built-in soundex() function [-help soundex()-]. Soundex codes were specifically invented for the purpose of overcoming variation in spellings of names and they work pretty well.

            Comment


            • #7
              The number of observations appears pretty small so here's an approach using cross to find all pairwise combinations of observations between the two datasets. All combinations where the sample date precedes the discharge date can be dropped. You can fine-tune each variable to improve the matching. You can even decide to give additional weight to some variables in terms of matching.

              Code:
              clear
              input age    str20 (first    last    middle    address    admissiondate    dischargedate location)
              50    "steve"    "smith"    "j"    "111 pine st"    "27/12/2014"    "29/12/2014"    "clinic a"
              42    "james"    "jones"    "robert"    "222 pine road"    "23/11/2014"    "25/11/2014"    "clinic a"
              25    "claire"    "cousins"    "r"    "333 pine hill"    "23/10/2014"    "25/10/2014"    "clinic a"
              27    "david"    "duck"    "g"    "444 pine lane"    "01/02/2015"    "10/02/2015"    "clinic a"
              end
              
              * get the data in shape
              gen datein = date(admissiondate,"DMY")
              gen dateout = date(dischargedate,"DMY")
              drop admissiondate dischargedate
              format %td date*
              rename * =_a
              gen id_a = _n
              tempfile a
              save "`a'"
              
              clear
              input age    str20 (first    last    middle    address    sampledate    location)
              50 "steve" "smith" "" "222 pine st" "01/01/2015" "hospital h"
              50 "steve" "smith" "j" "222 pine road" "25/12/2014" "hospital g"
              25 "claire" "cousins" "r" "333 pine hill" "01/10/2014" "hospital f"
              27 "david" "duck" "g" "444 pine lane" "15/02/2015" "hospital e"
              end 
              
              gen id_b = _n
              gen date_b = date(sampledate,"DMY")
              drop sampledate
              format %td date_b
              
              * form all pairwise combinations
              cross using "`a'"
              sort id_a id_b
              
              * samples must be after discharge date
              keep if date_b >= dateout_a
              
              * Make any adjustments to improve the matching here
              gen address6 = substr(address,1,6)
              gen address6_a = substr(address_a,1,6)
              
              * Compare all variables that should normally match and count matches
              gen nmatch = abs(age - age_a) <= 1
              foreach v of varlist first last middle address6 {
                  replace nmatch = nmatch + (`v' == `v'_a)
              }
              
              * Order by increasing match level, would normally check for ties
              * in best match
              sort id_a nmatch id_b
              by id_a: gen best = _n == _N
              
              list age* first* last* middle* address6* id_a nmatch if best, noobs

              Comment


              • #8
                Dear Clyde and Richard,

                Clyde, many thanks for this - I had thought of soundex bur didn't have experience using it. I'll play around with it and see what I come up with.

                Richard, that's great - very useful indeed (those were made up data as I'm working with patient names so there will likely be about 50 matches in total in 10,000 observations) - I'll try this in the larger dataset.

                Best wishes indeed,

                Colin

                Comment


                • #9
                  If you liked the concept, let me suggest a variation that scales up better for larger problems at the cost of some extra data management gymnastics. The idea is to use joinby to form all pairwise combination of ids when the value of a variable matches. The joinby command is simply cross within groups of observations that have the same values.

                  Code:
                  clear
                  input age    str20 (first    last    middle    address    admissiondate    dischargedate location)
                  50    "steve"    "smith"    "j"    "111 pine st"    "27/12/2014"    "29/12/2014"    "clinic a"
                  42    "james"    "jones"    "robert"    "222 pine road"    "23/11/2014"    "25/11/2014"    "clinic a"
                  25    "claire"    "cousins"    "r"    "333 pine hill"    "23/10/2014"    "25/10/2014"    "clinic a"
                  27    "david"    "duck"    "g"    "444 pine lane"    "01/02/2015"    "10/02/2015"    "clinic a"
                  end
                  
                  gen datein = date(admissiondate,"DMY")
                  gen dateout = date(dischargedate,"DMY")
                  drop admissiondate dischargedate
                  format %td date*
                  
                  * add any transformed variables that would help the matching
                  gen address6 = substr(address,1,6)
                  gen soundslike = soundex(first + last)
                  gen id_a = _n
                  tempfile a
                  save "`a'"
                  
                  clear
                  input age    str20 (first    last    middle    address    sampledate    location)
                  50 "steve" "smith" "" "222 pine st" "01/01/2015" "hospital h"
                  50 "steve" "smith" "j" "222 pine road" "25/12/2014" "hospital g"
                  25 "claire" "cousins" "r" "333 pine hill" "01/10/2014" "hospital f"
                  27 "david" "duck" "g" "444 pine lane" "15/02/2015" "hospital e"
                  end 
                  
                  gen id_b = _n
                  gen date_b = date(sampledate,"DMY")
                  drop sampledate
                  format %td date_b
                  
                  * add any transformed variables that would help the matching
                  gen address6 = substr(address,1,6)
                  gen soundslike = soundex(first + last)
                  tempfile b
                  save "`b'"
                  
                  * Match by variables
                  foreach v of varlist age first last middle address6 soundslike {
                      use "`a'", clear
                      joinby `v' using "`b'"
                      keep `v' id_a id_b
                      tempfile `v'
                      save "``v''"
                  }
                  
                  * Combine all matches (can't use a varlist in the loop here)
                  use "`age'", clear
                  foreach v in first last middle address6 soundslike {
                      merge 1:1 id_a id_b using "``v''", nogen
                  }
                  
                  * Apply the date constraint
                  merge m:1 id_a using "`a'", keep(match) keepusing(dateout) nogen
                  merge m:1 id_b using "`b'", keep(match) keepusing(date_b) nogen
                  keep if date_b >= dateout
                  
                  * Make sure that id_a id_b are still unique
                  isid id_a id_b, sort
                  order id_a id_b
                  
                  * Count matches
                  gen nmatch = 0
                  foreach v of varlist age first last middle address6 soundslike {
                      replace nmatch = nmatch + 1 if !mi(`v')
                  }
                  
                  list, sepby(id_a) noobs

                  Comment

                  Working...
                  X