
  • Text Matching

    I am working with a dataset of geographically identified entries. Each entry has a string variable "city" that holds the name of the city associated with the entry.

    Currently, my city variable is quite messy, with inconsistent spelling, stray punctuation, and appended state codes. For example, "PEORIA" may show up as "PORIA", "POERIA", "PEOROA", "PEORIA, IL", or "...PEORIA....".

    My goal is to have a dataset of cleaned city names where I can use the "city" variable to accurately perform data manipulations by city. For example, I would like to be able to use "bysort city:" or "collapse (), by(city)".

    Does any sort of convenient fuzzy matching algorithm exist for my purposes? I have looked into -matchit- and -regex- but neither seemed applicable to my case.

    Thanks,

    Erik

  • #2
    Hey Erik,

    I used the command strgroup by Julian Reif for a similar purpose and it worked well. It uses Levenshtein edit distance between strings and you can set different thresholds for "tolerance".

    Code:
    strgroup city, threshold(0.1) gen(group_match)
    bysort group_match: gen city_clean = city[1]    // Note: city_clean is not necessarily the correct city name. It is just one spelling shared by every observation in the matched group
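
    It may help to strip the purely cosmetic noise from the examples in the question (the ", IL" suffix and the leading or trailing dots) before any fuzzy matching, so that the edit distance only has to absorb genuine misspellings. The lines below are a minimal sketch, not a tested recipe: city_std and group_std are hypothetical variable names, the comma rule assumes that everything after a comma is a state code, and threshold(0.25) is an arbitrary illustrative value.

    ```stata
    * Sketch only: city_std and group_std are hypothetical names, not from the thread.
    * Uppercase and trim surrounding blanks
    gen city_std = upper(trim(city))
    * Drop everything from the first comma on (assumes a ", IL"-style suffix)
    replace city_std = regexr(city_std, ",.*$", "")
    * Drop leading and trailing characters that are not letters, such as "..."
    replace city_std = regexr(city_std, "^[^A-Z]+", "")
    replace city_std = regexr(city_std, "[^A-Z]+$", "")
    replace city_std = trim(city_std)
    * Fuzzy-match the standardized names, using the same syntax as above
    strgroup city_std, threshold(0.25) gen(group_std)
    ```

    One point on the threshold: if the edit distance is normalized by string length, as I understand strgroup does, a single-character error in a six-letter name like "PEORIA" already gives a distance of roughly 0.17 to 0.2, so threshold(0.1) may be too strict to group "PORIA" or "PEOROA" with it. It is worth tabulating the resulting groups by hand before trusting any cutoff.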



    • #3
      Do you have coordinates for these cities?
