Identifying when words within a string variable are the same across observations

Jesse JonesSmith

Join Date: Jan 2016

Posts: 2
#1

Identifying when words within a string variable are the same across observations

09 Jan 2016, 07:06

Hello,

I am working with commercial food business data and attempting to identify observations that are likely duplicates, but their information is not exactly the same. All of the variables are strings. Each observation contains a street address, a business name and a retailer type (ie eating place, grocery store, gas station). Sometimes stores are listed more than once. It is easy to identify those that are exact duplicates on address and business name and we’ve deleted these observations. However, there are a few (ie 1% of the observations) that are exact address and retailer type matches, but not exact business name matches, yet we think they are in fact the same business. So, for example, we might have an observation at 100 Park St that is named “Pizza Boli” and a second observation at 100 Park St that is named “Boli Pizzeria”. I’d like to be able to identify this instance of the repeated word “Boli” and all other similar situations.

I’ve created a variable to identify businesses with the same address. I’m looking for a way to look within each group of businesses with the same address to identify whether any of the words in the business name are the same (i.e. I do not necessarily know which words I am looking for, I am just looking for a match of words (across different observations in a group).

The other important piece of information is that only some of the businesses that have the same address and retailer type, but different names, are likely duplicates. There are “food court” type instances where 10 businesses have the same address, same retailer type, yet clearly different names. So, I can’t just treat all repeat address and retailer types as duplicates.

Thank you in advance for any guidance/ideas!!

Jesse
Tags: None
Julio Raffo

Join Date: May 2014

Posts: 132
#2

10 Jan 2016, 05:17

Hi Jesse,

You can try using -matchit- (available in ssc, read more here: http://www.statalist.org/forums/foru...-e-fuzzy-match)

As I see it, you can apply it in two different ways depending in the size of your dataset and the confidence you have in the spelling of your address variable. If your dataset is not too large or your the address variable contains different spellings (i.e. "100 Park St." can also be found as "100 Park Street"), you can try to match your dataset against itself directly using the business name. Then (and after merging the results with the original dataset to recover the address variable for both master and using business names) you can reapply -matchit- to the addresses columns and filter the results for high similarity in names and addresses. You can also concatenate the name and address into one field and then apply -matchit- only once, results should converge.

Alternatively, if addresses are standardized, you can first merge your dataset to itself by address (merge m:m address, keeping only the "both") and then apply -matchit- to the resulting two business names columns (remember to rename the column in one of the two files). This method is very likely to be faster, but relies on the quality of the address field.

Under both approaches, and given your example of Boli Pizza, I suggest using -matchit- with a bigram (or trigram) similarity function and using weights. The former can help you with the differences between "Pizza" and "Pizzeria" and the latter will be useful to give more importance to "Boli" than "Pizzeria".

I hope this helps.
Comment
Jesse JonesSmith

Join Date: Jan 2016

Posts: 2
#3

10 Jan 2016, 10:27

Hi Julio,
Thanks so much!! This works and it is fantastic. I had been spinning my wheels on this for awhile. I used the trigram, as suggested. I also used the 'simple' weight option. As you note above, I would like it to put less weight on things like "inc", but I have to admit that I do not know what simple, log, and root options actually do for the weighting. Is it the simple option that I would want for this or is it something else? Thanks so much!
Jesse
Comment
Julio Raffo

Join Date: May 2014

Posts: 132
#4

10 Jan 2016, 21:20

In practical terms results are very similar in score. But theoretically it's all about how fast or slow it ignores terms that are too frequent.
Comment
Julio Raffo

Join Date: May 2014

Posts: 132
#5

15 Jan 2016, 02:13

PS: William Lisowski correctly points out to me that in my previous post where I wrote "merge m:m address", I should actually have written "joinby address". Apologies.
Comment
allison gaines

Join Date: Apr 2020

Posts: 1
#6

08 Apr 2020, 08:50

Julio Raffo I am trying to use the 'matchit' function to fuzzy match values within a single column (specifically, I have a list of ingredients and many are misspelled!). Do you have an advice on the best approach by which to do this? Do I need to use a loop so that each entry to the variable is compared to every other entry?

Thanks in advance!
Allie
Comment
Julio Raffo

Join Date: May 2014

Posts: 132
#7

08 Apr 2020, 09:06

Hi Allison,

The easiest way is to run -matchit- using the file against itself. Meaning something like this:

Code:

use yourfile.dta keep id ingredient // id can be the row number if you don't have an id matchit id ingredient using yourfile.dta, idu(id) txtu(ingredient)
Comment
Luca Iovino

Join Date: Apr 2020

Posts: 2
#8

26 Apr 2020, 17:51

Hi Julio,

Firstly I'd just like to say thank you for making this ado command. I'm commenting on this thread because I see it was active on the 8th April 2020.

I am trying to do something extremely similar to Francois Durant in the thread: https://www.statalist.org/forums/for...tch-names-only

Like Francois, I have two data sets with company names. One data set has 6,798 unique firm names and the other has around 27,000 non-unique firms (there exists some repeats of the same firm: this dataset shows all 2007 filed patent applications by UK firms and so there are cases where the same firm may have 4 separate applications, and hence it's listed in 4 separate cells). Correct me if I'm wrong, but I don't think this is a problem when it comes to matching. If the firms' name is the same, it has identical ID numbers.

My query lays more in making sure the matching is the most accurate it can be. In both datasets, the names of the firms have "LIMITED" "LTD." "LTD" "LLC" "()" "2" etc. In short, there are parentheses, fullstops, numbers and less informative texts that may complicate/lengthen the matching time.

I am currently using this code: "matchit myid ucompanyname using patent_2007_names_for_matching.dta, idu(urid) txtu (ucompanyname) sim(bigram) w(root) f(1) t(0) di override"
In both my datasets, I have generated IDs using the codes you suggest.

However, it takes a long time to compute the results, which is understandable as it's a complex match. Do you think this code is the most effective? I have seen on this thread, you suggest using specific sims (bigram/trigram) and weights to make sure matches are made more efficiently. Are my "sim" and "w" suitable for my type of matching? As I am writing this, I have entered to above code but by my estimates, it'll be running for about 6 more hours before everything is done. Therefore, for future reference, it'll be very useful to understand what weights and sims I should use to make sure my type of matching is as efficient as possible. What do the different options really do?

Thanks a lot for your time, I really appreciate it!

Best,

Luca
Comment
Julio Raffo

Join Date: May 2014

Posts: 132
#9

27 Apr 2020, 02:48

Hi Luca,

I agree, I don't think that matching ~7k against ~30k, should take that long (but of course it always depends on where you are running Stata on). Let me first list the my usual tips for improving the performance that can also benefit others in this thread:

First, what I call cleaning, but some call parsing or simply preparing the data. You can do several things to your data to improve performance (and quality) before running matchit. Just remember to do the same to both text fields (i.e. in the master and using files):

i- Change accentuated or similar characters (i.e. á, è, þ, ü, ß, ø, etc. etc.) to their Latin version. Since Stata handles Unicode, there are some cool functions that do this in one or two steps.¶

ii- Remove any non-alphanumeric characters from your texts (e.g. change any character [^a-z0-9] to blanks). I mean non-alphanumeric in order to keep the numbers if you are using names of entities (e.g. companies or universities) to avoid missing companies like 3M. But if you are using names of people you could simply just keep the alphabet characters.

iii- Change everything to the same case. Note that matchit is case-sensitive (except for a few built-in MATA functions like soundex).

iv- Remove redundant information, such as LTD, INC, UNIV, GMBH, etc,etc. What is redundant is tricky, as it is completely specific of your data. You can use freqindex (you should have installed it when installing matchit) to assist you on this. Anything that appears too frequently in your data is less useful to establish a fuzzy match and is punishing the speed of the match.

v- Some more convoluted techniques include changing your data. For example, expanding from or contracting to acronyms (e.g. "IBM", "I.B.M." and "Int. Bus. Machines") or harmonizing terms (e.g. "International->Int", "University->Univ", "Limited->LTD"). This is also very specific to your data.

vi. After using all or any of the above, remove any double spacing and trim the texts. matchit can be very dumb on this. For instance, if you leave double spaces on most entries and running it with bigram (default) you are basically killing all gains of matchit's indexation because all these are being compared unnecessarily. A similar case applies for heading and trailing blanks.

vii. Last, remove duplicate entries that existed (or that exist now after cleaning) from both datasets to avoid running unnecessary comparisons multiple times.

Note that points 1.i to 1.iii improve quality but likely decrease speed as they increase the number of comparisons. Yet, it is still worthwhile. Points 1.iv to 1.vii increase speed, and often also quality.

Second, optimizing the matching. Here the main trade-off is between comparing too many or too little. The whole point of using a fuzzy-matching technique (e.g. matchit) is to try to fish those cases that a simple merge will miss. If you have the time and/or the computing power, don't worry and compare everything. However, most of the comparisons are unnecessarily costly in time. The main speed impact has to do with the quality of the index (or hash table, dictionary, etc.). Here some tips to improve the speed by implicitly improving/tweaking the index:

i- Use the option stopwordsauto. This is basically doing an automated version of steps 1.iv and 1.v by removing all the index's entries that are too frequent to be informative. What is the risk of doing this? On the extreme case that your data has a name entirely composed with too frequent "grams" you may miss its comparison (e.g. a company named "The Limited Company"). In any case, you can fine-tune stopwordsauto by running first the diagnose option. This will give you the lists of most frequent grams in each dataset and in both together. The latter is extremely informative of what might be making matchit to go slow.

ii- Algorithms change the size and depth of the index. Typically, the size increases in gram entries from bigram (or ngram, 2) to token, but it gets less records on each entry. So roughly it gets worse but faster results in the following order:

bigram (best, but slower) > ngram,3 > ngram, 4>token_metaphone>token_soundex>token>metaphone>so undex>Stata's merge (worse, but faster)

iii- When you cannot sort the speed issues with 2.i or 2.ii, an alternative is to split the comparisons based on a third variable (country, city, state, birth date/year, industry, fields, etc). For instance, if both datasets have the country for each name, you can perform the comparisons only within the same country. Alternatively, I know some people have resorted to group names by their first initial(s) and compare only within these. In any of these cases, you have two ways to go: (a) you could split master and using files according to your criteria and perform matchit in a loop for all the pairs of files you deem necessary; or (b) you can joinby the master and using files based on your third variable (e.g. country codes) and then perform matchit using the column-syntax to the resulting mega file.

iv -Contrary to intuition, the choice of weight and similarity scores has little impact on the speed. It's not zero, but it is quite often quite negligible if compared to the previous points.

Luca, coming back to your specific case, I think point 2.i and some cleaning (i.e. some points in 1) should do the trick.

Best,

J.
Comment

Luca Iovino

Join Date: Apr 2020
Posts: 2

#10

27 Apr 2020, 15:05

Hi Julio,

Thanks so much for all of this! I read everything carefully and you were right, 2.i and some cleaning in 1 did the trick!

Using the code:

Code:

matchit myid ucompanyname using patent_2007_names_for_matching.dta, idu(urid) txtu (ucompanyname) stopwordsauto sim(bigram) w(root) f(1) t(0) di override

I feel like this captures the suggestions you made above

I had an indexation of 51%

I then

Code:

drop if similscore<0.75

And manually made a few matches. Is this the most efficient way about going things? Not dropping low similscores led to 18,000,000 or matches.

Moreover, I used

Code:

replace ucompanyname = subinstr(ucompanyname, " LIMITED", "", .)

replace ucompanyname = subinstr(ucompanyname, " LTD", "", .)

replace ucompanyname = subinstr(ucompanyname, " LLP", "", .)

replace ucompanyname = subinstr(ucompanyname, " LIMTED", "", .)

replace ucompanyname = subinstr(ucompanyname, " INC", "", .)

replace ucompanyname = subinstr(ucompanyname, " PLC", "", .)

replace ucompanyname = subinstr(ucompanyname,".", "",.)

replace ucompanyname = subinstr(ucompanyname," UK", "",.)

replace ucompanyname = subinstr(ucompanyname," PLC", "",.)

replace ucompanyname = subinstr(ucompanyname," HOLDINGS", "",.)

replace ucompanyname = subinstr(ucompanyname,"(HOLDINGS)", "",.)

replace ucompanyname = subinstr(ucompanyname," GROUP", "",.)

replace ucompanyname = subinstr(ucompanyname," (", "",.)

replace ucompanyname = subinstr(ucompanyname,")", "",.)

replace ucompanyname = subinstr(ucompanyname,"  ", "",.)

replace ucompanyname = subinstr(ucompanyname,"(", "",.)

replace ucompanyname = subinstr(ucompanyname,",", "",.)

replace ucompanyname = subinstr(ucompanyname,"COMPANY", "",.)

replace ucompanyname = subinstr(ucompanyname,"INVESTMENTS", "",.)

replace ucompanyname = subinstr(ucompanyname,"SERVICES", "",.)

replace ucompanyname = subinstr(ucompanyname,"EUROPE", "",.)

duplicates drop ucompanyname, force

as suggested in point 1.iv

I hope I didn't go overboard here, would be interested to hear your suggestions on all of this if you had the time. Thanks again

Best,

Luca

Comment

Julio Raffo

Join Date: May 2014

Posts: 132
#11

28 Apr 2020, 01:55

Hi Luca,

On matchit, you should be able to get even higher than 51%, but if you are happy with the current speed there's no need for tweaking it. On the scores and manual checking, unfortunately it is usually a necessary step. What I do before dropping anything is to sort the results by decreasing similscore (gsort -similscore) and manually inspect them as such.

If you have secondary data (address, industry, etc.), it is often useful to bring those comparable fields from both master and using files to help disambiguate results. Note that you can also use matchit in column syntax on these new variables.

Code:

matchit addr_master addr_using, g(addrsimil) gsort -similscore -addrsimil br

On the cleaning using subinstr remember that it risks taking too much. Specially when not looking for entire words.

A quick trick is to add a blank at both ends of the string and only then look for words like

Code:

gen newucompanyname = " "+ucompanyname+" " // alternatively you can replace it with the alphanumeric only version using regex: // gen newucompanyname = " "+ustrregexra(ucompanyname,[^a-z0-9], " ",1)+" " replace newucompanyname= subinstr(newucompanyname," COMPANY ", "",.) replace newucompanyname= subinstr(newucompanyname," INVESTMENTS ", "",.) replace newucompanyname= subinstr(newucompanyname," SERVICES ", "",.) replace newucompanyname= subinstr(newucompanyname," EUROPE ", "",.) replace newucompanyname=trim(itrim(newucompanyname)) // note: untested code, errors may subsist
Comment
Rafael Acevedo

Join Date: Apr 2019

Posts: 17
#12

12 Feb 2022, 20:19

@JulioRaffo I am posting on this thread because my case could be some similar. I have this problem... I am using different sources' dataset (economics variables), but some of them country names differ (i.e Dataset 1: Venezuela; Dataset2: Venezuela, R.B.; Dataset3: Venezuela, Rep. Bol.; etc... other country... Congo; Congo, R.D.; Congo, Rep. Dem.; etc...)
Not all of those Dataset have ISO-Code 3...

There is a way using your command to merge all those datsets without matching manually country by country or adding manually de Isocode3?

Thank you very much
Comment
Julio Raffo

Join Date: May 2014

Posts: 132
#13

18 Feb 2022, 11:39

Originally posted by Rafael Acevedo View Post

There is a way using your command to merge all those datsets without matching manually country by country or adding manually de Isocode3?

Hi Rafael,

Yes, you could run -matchit- and keep the top similar pair of names from the two (or more sources) you want to merge. In my experience, you still would need to inspect manually the results.

The code should look something like this (untested):

Code:

matchit id1 ctry_name1 using file2, idu(id2) txtu(ctry_name2) score(minsimple) keep ctry_name1 ctry_name2 similscore gsort -similscore duplicates drop ctry_name1, force save bridge_files_1_2

After that, you would need to merge file1 to bridge_files_1_2 and, in turn, merge the result with file2.

J.
1 like
Comment

Announcement