Fuzzy match inconsistent identifiers in a panel

Jack Landry

Join Date: May 2017

Posts: 9
#1

Fuzzy match inconsistent identifiers in a panel

13 Oct 2019, 12:11

Dear Statalist,

I've been working on what I feel like should be a common problem but I can't find any threads on it. I'm trying to create consistent identifiers for a panel dataset of cities that have slight spelling differences over time using some kind of fuzzy matching algorithm. A very simple example is below, in my real data, I have many hundreds of place names spelled differently over 30 years, so manual fixes are infeasible. I know there are not multiple observations of the same place for the same year. The end result I'm looking for is to get a third variable that is some numeric place_id that would consistently identify Chicago and New York in the example below. If anyone has experienced this problem before and could just give a general overview of their workflow, it would be much appreciated, I don't necessarily need specific code. I've been using reclink from ssc, but I'm open to anything. Let me know if I've left out important details or can give more background. Thanks!

Code:

clear input str20 place_name year "chicago" 1990 "chicag" 1991 "chicago" 1992 "new york" 1990 "new york city" 1991 "new york c" 1992 end
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

13 Oct 2019, 12:59

It's quicker and easier to show some code that you can adapt than to try to explain it in words.

Code:

clear input str20 place_name year "chicago" 1990 "chicag" 1991 "chicago" 1992 "new york" 1990 "new york city" 1991 "new york c" 1992 end tempfile original_data save `original_data' keep place_name duplicates drop tempfile copy save `copy' rename place_name place_name2 cross using `copy' matchit place_name place_name2 keep if inlist(place_name2, "chicago", "new york city") & similscore > 0.8 merge 1:m place_name using `original_data' rename place_name2 correct_place_name

Now, you may find instances of place_name that didn't match with any correction. (In your example, "new york" did not match with anything.) Hopefully, the number of such instances (all characterized by _merge == 2) will be small enough that you can fix them individually with some -replace correct_place_name = whatever if place_name == whatever_else- commands. If there are too many for that to be practical, try again using a lower threshold than 0.8 for similscore. (If you go down to 0.75, "new york" will match with "new york city". Of course, the lower you put the threshold, the more likely you will come out with spurious matches that you then have to weed out with individual -drop- commands after the -keep- command.

Anyway, if you play around with this, you will eventually find a threshold value for similscore that reduces the burden of handling individual cases separately to a manageable minimum.

Note also that -matchit- also allows you to try a large number of different string similarity measures, so if you can't find a workable threshold value with the default methods, try some of the others.
Comment

Announcement

Fuzzy match inconsistent identifiers in a panel

Comment