Help matching city names in two data sets

David Merriman

Join Date: Oct 2014

Posts: 15
#1

Help matching city names in two data sets

06 Jul 2019, 16:58

I am trying to match data on cities in two data sets using only the names of the cities. Unfortunately, there are many variants for how names can be presented.

For example,
in one data set the name of a city is TERREBONNE PARISH CONSOLIDATED GOVERNMENT but
in the other data set the name of the city is TERREBONNE CONSOLIDATED GOVERNMENT.

These are almost certainly the same city but when I use the matchit function they do not give me a perfect match.

I cannot just use matchit's similarity scores because there are also cases in the data like APPLEGATE VILLAGE which is matched to THE VILLAGE OF DOUGLAS. These two are almost certainly not the same city but I get a matchit score of .5 because they both contain the word VILLAGE

I thought about creating a variable for each word in the city name and then dropping words like "Village", "Town" "of" etc. but even if I programmed that in I can't figure out any reasonablely efficient strategy to find that "TERREBONNE" is in each name since the word order can differ across data sets.

Suggestions for how to proceed would be appreciated.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

06 Jul 2019, 18:43

I thought about creating a variable for each word in the city name and then dropping words like "Village", "Town" "of" etc. but even if I programmed that in I can't figure out any reasonablely efficient strategy to find that "TERREBONNE" is in each name since the word order can differ across data sets.

That's an interesting idea you have, and below I will discuss how I would go about getting rid of extraneous words.

But first, I'm a fan of matchit, but not a major user of it. The author, Julio Raffo, often sees and responds to questions on Statalist about matchit. Perhaps he will see this and share his experience.

Now, what I think you want to do is not to split each name into separate words, trying to find the meaningful one, but instead to just delete common words from the city names and then apply matchiit. Here's some example code that may point you in a useful direction.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str41 city "TERREBONNE PARISH CONSOLIDATED GOVERNMENT" "TERREBONNE CONSOLIDATED GOVERNMENT" "APPLEGATE VILLAGE" "THE VILLAGE OF DOUGLAS" end generate city2 = " " + city + " " replace city2 = subinstr(city2," PARISH "," ",.) replace city2 = subinstr(city2," VILLAGE "," ",.) replace city2 = subinstr(city2," THE "," ",.) replace city2 = subinstr(city2," OF "," ",.) replace city2 = trim(city2)

Code:

. list, clean noobs city city2 TERREBONNE PARISH CONSOLIDATED GOVERNMENT TERREBONNE CONSOLIDATED GOVERNMENT TERREBONNE CONSOLIDATED GOVERNMENT TERREBONNE CONSOLIDATED GOVERNMENT APPLEGATE VILLAGE APPLEGATE THE VILLAGE OF DOUGLAS DOUGLAS .

Doing this I think lets you make more effective use of the similarity scores, once they're not being thrown off by common words.

Again, I'm no expert on this, but your idea is an interesting one and I hope you can get closer to what you need by using it.
1 like
Comment
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

07 Jul 2019, 11:45

You may also find that you can do a lot relatively quickly with the kind of replace statements that Williams demonstrates. While it is not elegant, you can make a do file with these replacements to clean up a bunch of these. A day's work can let you id and program a great many such replacements.

replace city2="TERREBONE" if strpos(city2,"TERREBONE") != 0
1 like
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2404

07 Jul 2019, 12:16

Here's a way to find and strip out the most common words before trying -matchit-, for what it's worth. For the current data example, in which "TERREBONE" is a common word, it gets stripped out, but presumably that's not an issue in the actual data.

Code:

// Get a list of most common words
preserve
local top = 5 // whatever
keep city
replace city = trim(city)
split city, gen(word)
stack word*, into(w) clear
drop if missing(w)
drop _stack
contract w
gsort -_freq
levelsof w in 1/`top', local(common)
restore
//
// Trim out most common words
gen trimcity = city
foreach w of local common {
   quiet replace trimcity = subinstr(trimcity, "`w'", "", .)
}

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

07 Jul 2019, 14:04

In post #2 I used a common hack to ensure that I would be replacing whole words and not changing "LOFTON" to "LTON". By (a) adding spaces around the string to be matched (" PARISH ") and (b) adding a space at the start and the end of the string in which the replacement is to occur I (hopefully!) guarantee (a) that full words will be matched (b) even if they appear at the start or end of the string in which the replacement is to occur. And replacing the matched string (which has a leading and trailing space) with a single space rather than nothing preserves a space between the words surrounding it.

Code:

generate city2 = " " + city + " " replace city2 = subinstr(city2," PARISH "," ",.)

So I very much like Mike's technique for finding common words, but I would change his replacement code to

Code:

gen trimcity = " " + city + " " foreach w of local common { quiet replace trimcity = subinstr(trimcity, " `w' ", " ", .) } replace trimcity = trim(trimcity)
2 likes
Comment
David Merriman

Join Date: Oct 2014

Posts: 15
#6

12 Jul 2019, 07:55

Wow, Mike and William those are really helpful comments.
Comment

Announcement

Help matching city names in two data sets

Comment

Comment

Comment

Comment

Comment