
  • Identifying when words within a string variable are the same across observations


    I am working with commercial food business data and attempting to identify observations that are likely duplicates, but whose information is not exactly the same. All of the variables are strings. Each observation contains a street address, a business name, and a retailer type (e.g., eating place, grocery store, gas station). Sometimes stores are listed more than once. It is easy to identify those that are exact duplicates on address and business name, and we’ve deleted these observations. However, there are a few (roughly 1% of the observations) that are exact address and retailer type matches, but not exact business name matches, yet we think they are in fact the same business. So, for example, we might have an observation at 100 Park St that is named “Pizza Boli” and a second observation at 100 Park St that is named “Boli Pizzeria”. I’d like to be able to identify this instance of the repeated word “Boli”, and all other similar situations.

    I’ve created a variable to identify businesses with the same address. I’m looking for a way to look within each group of businesses sharing an address and identify whether any of the words in the business names are the same (i.e., I do not necessarily know which words I am looking for; I am just looking for any word that matches across different observations in a group).

    The other important piece of information is that only some of the businesses that have the same address and retailer type, but different names, are likely duplicates. There are “food court” type instances where 10 businesses have the same address, same retailer type, yet clearly different names. So, I can’t just treat all repeat address and retailer types as duplicates.

    Thank you in advance for any guidance/ideas!!


  • #2
    Hi Jesse,

    You can try using -matchit- (available from SSC via -ssc install matchit-).

    As I see it, you can apply it in two different ways depending on the size of your dataset and the confidence you have in the spelling of your address variable. If your dataset is not too large, or your address variable contains different spellings (e.g., "100 Park St." can also be found as "100 Park Street"), you can try to match your dataset against itself directly using the business name. Then (after merging the results with the original dataset to recover the address variable for both the master and using business names) you can reapply -matchit- to the address columns and filter the results for high similarity in both names and addresses. You can also concatenate the name and address into one field and apply -matchit- only once; the results should be similar.
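    A minimal sketch of this first approach, assuming the data are in memory with string variables name and address (placeholder names), and -matchit- is installed. The using-file syntax compares every master record against every using record, so it is quadratic in the number of observations; exact option names are worth confirming in -help matchit-:

    ```stata
    * row identifier required by -matchit-'s using-file syntax
    gen long id = _n
    tempfile full
    save `full'

    * copy of the data with renamed variables to serve as the using file,
    * so the master and using columns do not clash
    rename (id name) (id1 name1)
    tempfile copy
    save `copy'

    use `full', clear
    matchit id name using `copy', idusing(id1) txtusing(name1)

    * drop trivial self-matches and keep each unordered pair once
    drop if id >= id1
    gsort -similscore
    ```

    From here you would merge the candidate pairs back to the original data on id and id1 to recover both addresses, and re-score those with -matchit-.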

    Alternatively, if addresses are standardized, you can first merge your dataset to itself by address (merge m:m address, keeping only the "both") and then apply -matchit- to the resulting two business names columns (remember to rename the column in one of the two files). This method is very likely to be faster, but relies on the quality of the address field.
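    A sketch of this second approach (again with placeholder variable names; the two-column syntax of -matchit- scores the pair of string variables row by row, writing the result to a new variable):

    ```stata
    * build all within-address pairs of businesses
    gen long id = _n
    tempfile copy
    save `copy'
    rename (id name) (id1 name1)
    joinby address using `copy'     // all pairwise combinations per address
    drop if id >= id1               // self-pairs out; each unordered pair once

    * row-by-row similarity of the two business-name columns
    matchit name name1, gen(simscore)
    gsort -simscore
    ```

    Because joinby only produces pairs within an exactly matching address, a misspelled address silently drops candidate pairs, which is why this method relies on the quality of the address field.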

    Under both approaches, and given your Pizza Boli example, I suggest using -matchit- with a bigram (or trigram) similarity function together with weights. The former can help with the differences between "Pizza" and "Pizzeria", and the latter will be useful to give more importance to a rare term like "Boli" than to a common one like "Pizza".
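    Roughly, that combination would be specified as below (I believe -sim()- and -weights()- are the relevant options, but double-check the exact syntax in -help matchit-):

    ```stata
    * trigram similarity with frequency-based term weighting
    matchit name name1, gen(simscore) sim(ngram, 3) weights(simple)
    ```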

    I hope this helps.


    • #3
      Hi Julio,
      Thanks so much!! This works and it is fantastic; I had been spinning my wheels on this for a while. I used the trigram, as suggested, along with the 'simple' weight option. As you note above, I would like it to put less weight on things like "inc", but I have to admit that I do not know what the simple, log, and root options actually do for the weighting. Is the simple option the one I would want for this, or is it something else? Thanks so much!


      • #4
        In practical terms, the resulting scores are very similar. Theoretically, the difference is how quickly each weighting scheme discounts terms that are too frequent.


        • #5
          PS: William Lisowski correctly points out to me that in my previous post where I wrote "merge m:m address", I should actually have written "joinby address". Apologies.