Text Matching

Erik Tiersten-Nyman

Join Date: Oct 2021

Posts: 7
#1

18 Apr 2023, 09:11

I am working with a dataset of geographically identified entries. Each entry has a string variable "city" that holds the name of the city associated with the entry.

Currently, my city variable is quite messy, with inconsistent spelling, stray punctuation, and appended state codes. For example, "PEORIA" may show up as "PORIA", "POERIA", "PEOROA", "PEORIA, IL", or "...PEORIA...."

My goal is to have a dataset of cleaned city names where I can use the "city" variable to accurately perform data manipulations by city. For example, I would like to be able to use "bysort city:" or "collapse (), by(city)".

Does any sort of convenient fuzzy matching algorithm exist for my purposes? I have looked into -matchit- and -regex- but neither seemed applicable to my case.

Thanks,

Erik
Sebastian Schirner

Join Date: Jan 2023

Posts: 53
#2

18 Apr 2023, 09:32

Hey Erik,

I used the command strgroup by Julian Reif for a similar purpose and it worked well. It groups strings based on the Levenshtein edit distance between them, and the threshold() option sets how much "tolerance" is allowed before two strings count as a match.

Code:

strgroup city, threshold(0.1) gen(group_match)
bysort group_match: gen city_clean = city[1]
// Note: city_clean does not necessarily contain the correct city name. It just contains the same city name for every entry in a matched group.
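
A possible extension of the code above, as a rough sketch rather than a tested recipe: standardizing the strings first (case, state suffix, punctuation) removes the easy variation before the edit-distance matching, and labelling each group with its most frequent spelling is one way to make city_clean more likely to be the correct name. The example data come from the question; the names city_std and n_neg, the two-letter state suffix pattern, and the threshold of 0.25 are assumptions for illustration and would need to be adapted to the real data. The string functions used require Stata 14 or later.

```stata
* Requires the user-written command strgroup (ssc install strgroup)

* Hypothetical example data, taken from the spellings in the question
clear
input str20 city
"PEORIA"
"PORIA"
"POERIA"
"PEOROA"
"PEORIA, IL"
"...PEORIA...."
end

* Standardize case and outer whitespace
gen city_std = upper(strtrim(city))

* Drop a trailing ", XX" state suffix (assumes two-letter state codes)
replace city_std = ustrregexra(city_std, ",[ ]*[A-Z][A-Z]$", "")

* Remove every character that is not a capital letter or a space
replace city_std = ustrregexra(city_std, "[^A-Z ]", "")

* Collapse repeated internal spaces and trim again
replace city_std = stritrim(strtrim(city_std))

* Fuzzy-match the standardized names; 0.25 is an arbitrary threshold to tune
strgroup city_std, threshold(0.25) gen(group_match)

* Label each group with its most frequent spelling (ties broken alphabetically)
bysort group_match city_std: gen n_neg = -_N
bysort group_match (n_neg city_std): gen city_clean = city_std[1]
drop n_neg
```

It is worth checking the result with something like "tab city_clean" or "list city city_clean", because a threshold that is too loose can merge distinct cities with similar names.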
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#3

18 Apr 2023, 12:35

Do you have coordinates for these cities?