Comparing strings where the order of names may vary

Matt Gibbons

Join Date: Feb 2018

Posts: 13
#1


18 Feb 2018, 00:25

I've been matching names on the 2014 and 2017 New Zealand electoral rolls and am now down to the last 10% of cases. People frequently change or correct details when they move, e.g. adding or deleting a second or third name, but they do not usually change a forename completely.

After matching by surname (I'd previously had to allow for people taking their flatmate's surname) and age band, I initially wrote a program that counted the number of identical names across the two periods, because some people list up to six. I then switched to: levenshtein f_14_n f_17_n, gen(ss_fore_sim) to compare the forenames for the two periods, after reducing them to a single string with no gaps, apostrophes, etc. I then repeated this calculation for some other situations, such as people correcting the order of their first two names or going from two forenames to one.

With the program I've got, about 80% of cases have a lowest Levenshtein score of 0 to 2. Most scores higher than this are increasingly bad matches, e.g. people with completely different middle names. There are still cases, though, where the Levenshtein score is high but a careful reading shows the two records are the same person, e.g. Dickiegeorgetewhakahonore and Dicktewhakahonoregeorge.

Some of the webpages I've seen suggest an n-grams approach. Is there a way of doing this in Stata, or am I likely to have to compare results manually? Can anyone point me to an example in Stata?
Matt Gibbons

Join Date: Feb 2018

Posts: 13
#2

19 Feb 2018, 00:53

I've also noticed that for short forenames a low Levenshtein score doesn't differentiate well between names. So I'm going to have to either check these names manually or divide the score by the length of the first name to get more robust results.
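A minimal sketch of the length adjustment mentioned above, assuming the score is divided by the longer of the two strings so that it runs from 0 (identical) to 1 (nothing in common). The toy rows and the names len_max and ss_fore_norm are hypothetical, and the two ss_fore_sim values are typed in by hand here; in the real data they would come from the levenshtein command in #1.

```stata
* Toy data standing in for the real rolls (hypothetical values).
* WARNING: -clear- drops the data in memory; skip these lines on real data.
clear
input str20 f_14_n str20 f_17_n byte ss_fore_sim
"ann"         "amy"        2
"christopher" "christophe" 1
end

* Length of the longer forename string in each pair.
* strlen() counts bytes; ustrlen() (Stata 14+) counts characters,
* which may matter for names containing macrons.
gen len_max = max(strlen(f_14_n), strlen(f_17_n))

* Edits per character of the longer string: 0 = identical, 1 = no overlap.
gen double ss_fore_norm = ss_fore_sim / len_max if len_max > 0

list f_14_n f_17_n ss_fore_sim ss_fore_norm
```

In the toy rows the raw scores (2 and 1) look similar, while the scaled scores separate the short, clearly different pair from the long near-identical pair.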

I've also calculated the Levenshtein score for people's flatmates, because sometimes this quickly resolves ambiguous cases.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

19 Feb 2018, 10:01

You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. Your problem is one that someone might program for you but they'd need the data.

There are other user-written Stata routines that do Levenshtein - see strdist for example. I don't know if these are better. If you google ngrams stata, there is a user-written ngrams program.
See also strutil:
https://www.statalist.org/forums/for...larity-metrics
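For the n-gram idea raised in #1, here is a rough sketch in Mata rather than any of the user-written packages mentioned above. It computes a Dice coefficient on character bigrams, which tends to be less sensitive than Levenshtein distance to name parts changing order. The function names bigram_dice() and add_bigram_dice() and the output variable name are hypothetical, the comparison is case-sensitive, and it has not been tuned for speed on a full electoral roll.

```stata
mata:
// Dice coefficient on character bigrams: 2 * shared / (total bigrams).
// Returns 1 for identical strings and 0 when no bigram is shared.
real scalar bigram_dice(string scalar a, string scalar b)
{
    real scalar na, nb, i, j, hits
    real rowvector used

    na = strlen(a) - 1              // number of bigrams in a
    nb = strlen(b) - 1              // number of bigrams in b
    if (na < 1 | nb < 1) return(a == b)

    used = J(1, nb, 0)              // bigrams of b already matched
    hits = 0
    for (i = 1; i <= na; i++) {
        for (j = 1; j <= nb; j++) {
            if (!used[j] & substr(a, i, 2) == substr(b, j, 2)) {
                used[j] = 1         // each bigram of b is matched once only
                hits++
                break
            }
        }
    }
    return(2 * hits / (na + nb))
}

// Apply bigram_dice() row by row to two string variables and
// store the result in a new double variable.
void add_bigram_dice(string scalar v1, string scalar v2, string scalar newv)
{
    string colvector s1, s2
    real colvector d
    real scalar i

    s1 = st_sdata(., v1)
    s2 = st_sdata(., v2)
    d  = J(rows(s1), 1, .)
    for (i = 1; i <= rows(s1); i++) d[i] = bigram_dice(s1[i], s2[i])
    st_store(., st_addvar("double", newv), d)
}
end
```

With the variables from #1 in memory, something like mata: add_bigram_dice("f_14_n", "f_17_n", "ss_fore_dice") would add the score as a new variable. Pairs with a high Levenshtein score but a high bigram score would then be the ones worth reading by eye.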