I am working with a dataset of geographically identified entries. Each entry has a string variable "city" that holds the name of the city associated with the entry.
Currently, my city variable is quite messy, with inconsistent spelling, stray punctuation, and appended state codes. For example, "PEORIA" may show up as "PORIA", "POERIA", "PEOROA", "PEORIA, IL", or "...PEORIA....".
My goal is to have a dataset of cleaned city names where I can use the "city" variable to accurately perform data manipulations by city. For example, I would like to be able to use "bysort city:" or "collapse (), by(city)".
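To make the goal concrete, here is a minimal sketch of the purely mechanical part of the cleanup, assuming Stata 14 or later for strtrim() and ustrregexra(). The variable names city_clean and n_entries are hypothetical. It would handle cases like "PEORIA, IL" and "...PEORIA....", but not misspellings such as "PORIA" or "POERIA", which is where I need fuzzy matching.

```stata
* Sketch only: city_clean and n_entries are hypothetical names.
* Normalize case and trim leading/trailing blanks.
generate city_clean = upper(strtrim(city))

* Drop everything from the first comma on, e.g. "PEORIA, IL" -> "PEORIA".
* Assumes a comma is always followed by a state code or other suffix.
replace city_clean = regexr(city_clean, ",.*$", "")

* Remove every character other than A-Z and spaces, e.g. "...PEORIA....".
* Caution: this also strips hyphens and apostrophes from legitimate names.
replace city_clean = ustrregexra(city_clean, "[^A-Z ]", "")
replace city_clean = strtrim(city_clean)

* The kind of by-city operation I want to be able to trust afterwards.
bysort city_clean: generate n_entries = _N
```

After these steps, "PORIA", "POERIA", and "PEOROA" would still form their own groups, separate from "PEORIA".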
Does any sort of convenient fuzzy matching algorithm exist for my purposes? I have looked into -matchit- and -regex- but neither seemed applicable to my case.
Thanks,
Erik