Correction of misspellings

Hannah Staab

Join Date: Mar 2015

Posts: 2
#1

23 Mar 2015, 06:20

Hello,

I have a variable called "Name". My problem is that the same name is spelled in several different ways.
For example, one observation is named "Smith, Andrew", another "Smith, A." and another "Smith, Andrew M.", but they are all the same person.
Is there a way to combine all these observations or to correct the misspellings?

Thank you!
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

23 Mar 2015, 06:29

It is not clear what your goal is here; I would start with -extrname- (use -search- to find and install it), which will give you a set of variables that will probably be easier to deal with.
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

23 Mar 2015, 08:16

1. Match by person ID.
2. "but it is all the same person": what makes you sure? Use that information to match individuals.
3. With typos you can only work in context. There are many similar last names: Smith, Smit, Smitz, Smidt, etc.; see http://en.wikipedia.org/wiki/Smith_%28surname%29
If you only have Smit, how do you tell whether the name is correct or not?
If you have a context, e.g. Smit & "Constructivist and Ecological Rationality in Economics", then you can be relatively confident that Smit should in fact be Smith.
This is the idea behind http://www.stata.com/meeting/canada09/ca09_radyakin.pdf, especially slide #9.
Both programs shown there no longer work since Google and Yahoo changed their output formats, but the idea is still viable.

Best, Sergiy Radyakin
Comment
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#4

24 Mar 2015, 04:34

Perhaps you have another way to identify your individuals?
Do you have a person ID variable that is properly set?
1) If you do:

Code:

bysort person_ID : replace name=name[1]

This will harmonize the name within each individual, taking the first value in each ID group.
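Because -bysort person_ID- alone does not fix the order of observations within an ID, "the first value" can be an arbitrary spelling. A minimal sketch of one possible refinement, assuming you would prefer the longest spelling as the canonical one (the toy data and the helper variable len are made up for illustration):

```stata
* Toy data for illustration only; -clear- drops whatever is in memory
clear
input person_ID str20 name
1 "Smith, Andrew"
1 "Smith, A."
1 "Smith, Andrew M."
2 "Jones, B."
end

* Sort within ID by length (ties broken alphabetically), then copy the
* last, i.e. longest, spelling to every observation of that ID
gen len = strlen(name)
bysort person_ID (len name): replace name = name[_N]
drop len
list, sepby(person_ID)
```

Whether the longest spelling is really the best one depends on your data; the most frequent spelling would be another reasonable choice.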

2) If you don't have a proper ID variable, perhaps you could build one from individual characteristics.
- Do you have birth date? Location? Activity?
Do this if (and only if) these variables are well identified (no missing values; if you have panel data, make sure to use characteristics that do not vary within an individual).
Gather as many of these individual-specific variables as you can and use the -group()- function of -egen- to create an ID variable:

Code:

egen ID = group(var1 var2 ...)

It will give the same identifier to all observations that share the same values of all these variables (e.g. individuals born on the same day, who live in the same county/town and have the same profession, ...). This is why you should check whether the individual characteristics you have are sufficient to identify each individual.
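One way to do that check is to list every constructed ID that carries more than one spelling and read through the result by eye. A rough sketch, assuming the hypothetical variables birthdate, county and profession exist in your data:

```stata
* birthdate, county and profession are placeholder variable names
* Note: by default egen's group() leaves ID missing whenever any of
* the listed variables is missing
egen ID = group(birthdate county profession)

* Flag the first occurrence of each distinct spelling within an ID,
* then count the distinct spellings per ID
bysort ID name: gen byte first_spelling = _n == 1
by ID: egen n_spellings = total(first_spelling)

* IDs with several spellings are the ones to inspect by hand
list ID name if n_spellings > 1 & !missing(ID), sepby(ID)
```

If clearly different people show up under the same ID, the grouping variables are not sufficient.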

Hint: in the example you quote, every spelling of the name starts with the (full) last name followed by a comma ",". If this is always the case, then you could use the -substr()- function to generate a variable with the first three characters of the name variable. (Assuming that no family name is shorter than two letters, so that together with the following comma there are always at least three characters, the first three characters are common to all observations of a person, whatever the way the name is spelt.) Then use it together with the individual-specific variables above.

Date of birth and the first three letters of the family name should be enough to distinguish over 90% of your individuals.

Then, once the ID is created, go back to step 1 and harmonize the name by ID.
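Put together, the hint could look roughly like this. It is only a sketch, assuming a string variable name and a hypothetical birthdate variable; name3, ID_name3 and name_orig are names invented here:

```stata
* First three characters of the name, trimmed and upper-cased so that
* " smith, a." and "Smith, A." fall into the same group
gen str3 name3 = upper(substr(trim(name), 1, 3))

* birthdate is a placeholder for whatever person-level variables you have
egen ID_name3 = group(name3 birthdate)

* Keep the original spelling so every change can be audited or undone
gen name_orig = name

* Harmonize within ID; skip observations whose ID could not be built
bysort ID_name3 (name): replace name = name[1] if !missing(ID_name3)
```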

This is still a hand-made modification of the dataset, so it is not 100% reliable and requires a lot of verification. You should consider it only as a last resort, and always keep an original version of your dataset before trying this.
Comment
Hannah Staab

Join Date: Mar 2015

Posts: 2
#5

25 Mar 2015, 03:18

Thank you for your help!
Unfortunately I don't have any further information on the names to identify the persons.

I just found the two user-written commands "strdist" and "strgroup" and will try to do a "fuzzy collapse" that way.
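For readers who want to try the same route, a tentative sketch follows. It assumes -strgroup- has been installed from SSC and that its syntax is as I recall it from the help file (generate() and threshold() options); please check -help strgroup- before relying on it. The threshold value and the variable names name_grp and n_in_grp are arbitrary choices made here.

```stata
* Run once if the command is not yet installed:
* ssc install strgroup

* Group spellings whose normalized edit distance is at most the threshold
strgroup name, generate(name_grp) threshold(0.25)

* List groups with more than one observation for manual review
bysort name_grp (name): gen n_in_grp = _N
list name_grp name if n_in_grp > 1, sepby(name_grp)
```

Note that "Smith, A." and "Smith, Andrew M." differ by many characters, so an edit-distance threshold tuned for typos may not put them together; comparing the surname part separately may work better.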

Last edited by Hannah Staab; 25 Mar 2015, 03:20.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

25 Mar 2015, 09:01

In addition to those, you might also look at the Stata built-in function -soundex()-, which implements the Soundex code, patented about a century ago and long used by the US Census Bureau for the specific purpose of matching names with variant spellings. And Michael Blasnik's -reclink- may also be of service here.
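A small sketch of how -soundex()- might be applied here, assuming the names are stored as "Last, First" in a string variable name (comma, surname, sdx and n_sdx are variable names invented for this example):

```stata
* Take the part before the comma as the surname; if there is no comma,
* fall back to the whole string
gen comma = strpos(name, ",")
gen surname = cond(comma > 0, substr(name, 1, comma - 1), name)
drop comma

* Soundex code of the surname; "Smith" and "Smit" both give S530
gen sdx = soundex(trim(surname))

* Observations sharing a code are candidates for the same surname,
* not confirmed matches, so review them before changing anything
bysort sdx (name): gen n_sdx = _N
list sdx name if n_sdx > 1, sepby(sdx)
```

Soundex only helps with variant surnames; first names and initials still have to be reconciled some other way.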
Comment
Julio Raffo

Join Date: May 2014

Posts: 132
#7

02 Apr 2015, 18:05

I hope it's not too late, but you can also try my ado -matchit-, which is now available from SSC.
Comment
