  • Matching fuzzy string variables

    Hi Statalisters!

    I haven't managed to find a solution to this problem online, but I presume it's a fairly straightforward one...

    I've merged two datasets based on a unique identifier. My goal is to go through the successfully merged individuals and check for any false negatives based on their name. The trouble is, the two datasets frequently have names entered with different spellings, with titles, with only a first or last name, etc.

    I would like to create a variable that identifies whether the two name variables have, say, a string of 5 characters in common, then 4 characters, then 3 characters, and so on. From there I can manually look over the results to identify any irregularities.

    I'm also open to any other suggestions you think might be better.

    An example of the sort of dataset I'm using is below. Here I would like to identify whether variable name1 and name2 share a common string of 3 characters.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int uid str13 name1 str15 name2
      1 "Mr John Smith" "Jon smith"      
     14 "Shanon Russel" "Shannon Russell"
     22 "Tim Clyde"     "Tim Clyde"      
     56 "Jeremy "       "Jerimy Blaine"  
     76 "Fiona Jones"   "David Blake"    
     39 "Sian"          "Sean"           
    104 "Nancy Tugwell" "Nat Togwel"     
      8 "Marry Ann"     "Mrs Ann"        
    145 "W Blaire"      "Darren Blaire"  
    120 "Md Smith"      "Md Duncan Smith"
    end

    Hope this all makes sense! Thanks, all, in advance.

    Chris

  • #2
    Jargon-wise, we more commonly see (and search for, both on Statalist and in more general searches of the web) "fuzzy matching" rather than "fuzzy strings" (or "fuzzy data").

    With that said, rather than invent your own technique, several already have been implemented by Stata users. Others here have had success with the user-written matchit command from Julio Raffo, as discussed in the following two threads.

    http://www.statalist.org/forums/foru...s-observations

    http://www.statalist.org/forums/foru...-e-fuzzy-match
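
    For a quick first pass on the example data in #1, here is a minimal sketch rather than a definitive solution. The first part uses only built-in string functions to flag pairs in which name2 contains some run of k consecutive characters from name1 (case-insensitive, spaces included). The helper names n1, n2, len1, piece, share3 and simscore are my own choices for illustration. The last part assumes matchit's two-column syntax and its generate() option; please confirm both against -help matchit- before relying on them.

    Code:
    * Sketch: flag pairs whose names share any run of k characters
    local k = 3
    generate n1 = lower(strtrim(name1))
    generate n2 = lower(strtrim(name2))
    generate len1 = strlen(n1)
    summarize len1, meanonly
    * last start position in n1 that can still yield k characters
    local last = r(max) - `k' + 1
    generate byte share`k' = 0
    forvalues i = 1/`last' {
        * k characters of n1 starting at position i, looked up in n2
        quietly generate piece = substr(n1, `i', `k')
        quietly replace share`k' = 1 if strlen(piece) == `k' & strpos(n2, piece) > 0
        drop piece
    }
    drop len1

    * Alternative: a similarity score from matchit (assumed syntax, see -help matchit-)
    * To install: ssc install matchit, then ssc install freqindex
    matchit n1 n2, generate(simscore)
    sort simscore
    list uid name1 name2 share`k' simscore

    Changing the local k to 5 or 4 and rerunning the first part gives the stricter flags described in #1. Sorting on the score puts the least similar pairs first, which is where a manual check is most likely to pay off.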



    • #3
      Thank you William, really helpful!
