Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help matching city names in two data sets

    I am trying to match data on cities in two data sets using only the names of the cities. Unfortunately, there are many variants for how names can be presented.

    For example,
    in one data set the name of a city is TERREBONNE PARISH CONSOLIDATED GOVERNMENT but
    in the other data set the name of the city is TERREBONNE CONSOLIDATED GOVERNMENT.

    These are almost certainly the same city but when I use the matchit function they do not give me a perfect match.

    I cannot just use matchit's similarity scores because there are also cases in the data like APPLEGATE VILLAGE which is matched to THE VILLAGE OF DOUGLAS. These two are almost certainly not the same city but I get a matchit score of .5 because they both contain the word VILLAGE


    I thought about creating a variable for each word in the city name and then dropping words like "Village", "Town" "of" etc. but even if I programmed that in I can't figure out any reasonablely efficient strategy to find that "TERREBONNE" is in each name since the word order can differ across data sets.

    Suggestions for how to proceed would be appreciated.

  • #2
    I thought about creating a variable for each word in the city name and then dropping words like "Village", "Town" "of" etc. but even if I programmed that in I can't figure out any reasonablely efficient strategy to find that "TERREBONNE" is in each name since the word order can differ across data sets.
    That's an interesting idea you have, and below I will discuss how I would go about getting rid of extraneous words.

    But first, I'm a fan of matchit, but not a major user of it. The author, Julio Raffo, often sees and responds to questions on Statalist about matchit. Perhaps he will see this and share his experience.

    Now, what I think you want to do is not to split each name into separate words, trying to find the meaningful one, but instead to just delete common words from the city names and then apply matchiit. Here's some example code that may point you in a useful direction.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str41 city
    "TERREBONNE PARISH CONSOLIDATED GOVERNMENT"
    "TERREBONNE CONSOLIDATED GOVERNMENT"       
    "APPLEGATE VILLAGE"                        
    "THE VILLAGE OF DOUGLAS"                   
    end
    generate city2 = " " + city + " "
    replace city2 = subinstr(city2," PARISH "," ",.)
    replace city2 = subinstr(city2," VILLAGE "," ",.)
    replace city2 = subinstr(city2," THE "," ",.)
    replace city2 = subinstr(city2," OF "," ",.)
    replace city2 = trim(city2)
    Code:
    . list, clean noobs
    
                                             city                                city2  
        TERREBONNE PARISH CONSOLIDATED GOVERNMENT   TERREBONNE CONSOLIDATED GOVERNMENT  
               TERREBONNE CONSOLIDATED GOVERNMENT   TERREBONNE CONSOLIDATED GOVERNMENT  
                                APPLEGATE VILLAGE                            APPLEGATE  
                           THE VILLAGE OF DOUGLAS                              DOUGLAS  
    
    .
    Doing this I think lets you make more effective use of the similarity scores, once they're not being thrown off by common words.

    Again, I'm no expert on this, but your idea is an interesting one and I hope you can get closer to what you need by using it.

    Comment


    • #3
      You may also find that you can do a lot relatively quickly with the kind of replace statements that Williams demonstrates. While it is not elegant, you can make a do file with these replacements to clean up a bunch of these. A day's work can let you id and program a great many such replacements.

      replace city2="TERREBONE" if strpos(city2,"TERREBONE") != 0

      Comment


      • #4
        Here's a way to find and strip out the most common words before trying -matchit-, for what it's worth. For the current data example, in which "TERREBONE" is a common word, it gets stripped out, but presumably that's not an issue in the actual data.

        Code:
        // Get a list of most common words
        preserve
        local top = 5 // whatever
        keep city
        replace city = trim(city)
        split city, gen(word)
        stack word*, into(w) clear
        drop if missing(w)
        drop _stack
        contract w
        gsort -_freq
        levelsof w in 1/`top', local(common)
        restore
        //
        // Trim out most common words
        gen trimcity = city
        foreach w of local common {
           quiet replace trimcity = subinstr(trimcity, "`w'", "", .)
        }

        Comment


        • #5
          In post #2 I used a common hack to ensure that I would be replacing whole words and not changing "LOFTON" to "LTON". By (a) adding spaces around the string to be matched (" PARISH ") and (b) adding a space at the start and the end of the string in which the replacement is to occur I (hopefully!) guarantee (a) that full words will be matched (b) even if they appear at the start or end of the string in which the replacement is to occur. And replacing the matched string (which has a leading and trailing space) with a single space rather than nothing preserves a space between the words surrounding it.
          Code:
          generate city2 = " " + city + " "
          replace city2 = subinstr(city2," PARISH "," ",.)
          So I very much like Mike's technique for finding common words, but I would change his replacement code to
          Code:
          gen trimcity = " " + city + " "
          foreach w of local common {
             quiet replace trimcity = subinstr(trimcity, " `w' ", " ", .)
          }
          replace trimcity = trim(trimcity)

          Comment


          • #6
            Wow, Mike and William those are really helpful comments.

            Comment

            Working...
            X