Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fuzzy match inconsistent identifiers in a panel

    Dear Statalist,

    I've been working on what I feel like should be a common problem but I can't find any threads on it. I'm trying to create consistent identifiers for a panel dataset of cities that have slight spelling differences over time using some kind of fuzzy matching algorithm. A very simple example is below, in my real data, I have many hundreds of place names spelled differently over 30 years, so manual fixes are infeasible. I know there are not multiple observations of the same place for the same year. The end result I'm looking for is to get a third variable that is some numeric place_id that would consistently identify Chicago and New York in the example below. If anyone has experienced this problem before and could just give a general overview of their workflow, it would be much appreciated, I don't necessarily need specific code. I've been using reclink from ssc, but I'm open to anything. Let me know if I've left out important details or can give more background. Thanks!


    Code:
    clear
    input str20 place_name year
    "chicago" 1990
    "chicag" 1991
    "chicago" 1992
    "new york" 1990
    "new york city" 1991
    "new york c" 1992
    end

  • #2
    It's quicker and easier to show some code that you can adapt than to try to explain it in words.

    Code:
    clear
    input str20 place_name year
    "chicago" 1990
    "chicag" 1991
    "chicago" 1992
    "new york" 1990
    "new york city" 1991
    "new york c" 1992
    end
    tempfile original_data
    save `original_data'
    
    keep place_name
    duplicates drop
    tempfile copy
    save `copy'
    rename place_name place_name2
    cross using `copy'
    matchit place_name place_name2
    keep if inlist(place_name2, "chicago", "new york city") & similscore > 0.8
    
    merge 1:m place_name using `original_data'
    rename place_name2 correct_place_name
    Now, you may find instances of place_name that didn't match with any correction. (In your example, "new york" did not match with anything.) Hopefully, the number of such instances (all characterized by _merge == 2) will be small enough that you can fix them individually with some -replace correct_place_name = whatever if place_name == whatever_else- commands. If there are too many for that to be practical, try again using a lower threshold than 0.8 for similscore. (If you go down to 0.75, "new york" will match with "new york city". Of course, the lower you put the threshold, the more likely you will come out with spurious matches that you then have to weed out with individual -drop- commands after the -keep- command.

    Anyway, if you play around with this, you will eventually find a threshold value for similscore that reduces the burden of handling individual cases separately to a manageable minimum.

    Note also that -matchit- also allows you to try a large number of different string similarity measures, so if you can't find a workable threshold value with the default methods, try some of the others.

    Comment

    Working...
    X