Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fuzzy match across multiple variables within a dataset

    I have a dataset of about 15000 observations of different patients, many of which are duplicates. There is a lot of missing information, however, and they are not exact duplicates, so I would like to do a fuzzy matching process based on (ideally) three string variables. I have experimented with using matchit and reclink, but there are obvious problems if I try to merge the dataset to itself (because a perfect match exists), and I haven't worked out how to overcome this without knowing in advance which variables are duplicates. Please accept my apologies in advance if the answer is posted somewhere - I'm sure I'm not the first person to come across this problem but I haven't found it anywhere. The data are confidential but I've created a toy example below. Ideally, a solution to this problem would identify four unique patients (Remy, Eleanor, Jo and Josh). I'm using Stata 15.

    Thanks.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str18(name dob_y dob_m)
    "Jo Bloggs"     "1997" "March"    
    "Remi W"        "1984" "September"
    "Jo"            "1997" "March"    
    "Remy W"        "1984" "Sept"    
    "Eleanor Jones" "1989" ""        
    "Eleanor J"     "1989" "January"  
    "Remy W"        "1984" "September"
    "Jo Blogg"      "1997" ""        
    "Eleanor J"     "1989" "January"  
    "Josh Hastings" "2001" "January"  
    end
    Last edited by Susan Wilson; 28 Oct 2020, 19:16.

  • #2
    If you have full dates of birth, you may increase your chances of identifying imperfect duplicates. But with 15,000 entries, there is a chance of having distinct individuals who share the same name and date of birth. You will usually need more information than name and date of birth to identify individuals. There is a similar approach highlighted in https://www.statalist.org/forums/for...order-to-group. However, with fuzzy matching, there are no guarantees. The following based on the approach using matchit (SSC) does achieve your desired result.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str18(name dob_y dob_m)
    "Jo Bloggs"     "1997" "March"    
    "Remi W"        "1984" "September"
    "Jo"            "1997" "March"    
    "Remy W"        "1984" "Sept"    
    "Eleanor Jones" "1989" ""        
    "Eleanor J"     "1989" "January"  
    "Remy W"        "1984" "September"
    "Jo Blogg"      "1997" ""        
    "Eleanor J"     "1989" "January"  
    "Josh Hastings" "2001" "January"  
    end
    
    tempfile list  
    preserve
    rename (*) (*2)
    gen name=name2
    save `list'
    restore, preserve
    cross using `list'
    drop if name<=name2
    *ssc install matchit
    *ssc install freqindex
    matchit name name2
    rename similscore score1
    matchit dob_m dob_m2
    rename similscore score2
    matchit dob_y dob_y2
    rename similscore score3
    keep if (score1>0.35 &score2>0.35 &score3>0.85)| (score1>0.35 &score2==0 &score3>0.85)
    keep name name2 score*
    contract name name2
    replace _freq=_n
    rename name name1
    reshape long name, i(_freq) j(which)
    bys _freq (which): gen name2=name[1]
    keep name name2
    gen order=_n
    bys name (order): keep if _n==_N
    save `list', replace
    restore
    merge m:1 name using `list',keep (master match) nogen
    replace name2=name if missing(name2)
    sort name2
    drop order
    l, sepby(name2)
    Res.:

    Code:
    . l, sepby(name2)
    
         +---------------------------------------------------+
         |          name   dob_y       dob_m           name2 |
         |---------------------------------------------------|
      1. |     Eleanor J    1989     January   Eleanor Jones |
      2. |     Eleanor J    1989     January   Eleanor Jones |
      3. | Eleanor Jones    1989               Eleanor Jones |
         |---------------------------------------------------|
      4. |            Jo    1997       March       Jo Bloggs |
      5. |      Jo Blogg    1997                   Jo Bloggs |
      6. |     Jo Bloggs    1997       March       Jo Bloggs |
         |---------------------------------------------------|
      7. | Josh Hastings    2001     January   Josh Hastings |
         |---------------------------------------------------|
      8. |        Remi W    1984   September          Remy W |
      9. |        Remy W    1984   September          Remy W |
     10. |        Remy W    1984        Sept          Remy W |
         +---------------------------------------------------+
    
    .

    Comment


    • #3
      This is fantastic, thank you very much. I will take your advice on the dates of birth and see if I can get access to another matching variable or two as well.
      Last edited by Susan Wilson; 29 Oct 2020, 17:18.

      Comment

      Working...
      X