  • #16
    Yet another batch of updates. Some of them are cosmetic (like changes in what is reported in the output window), and some simply add new similarity functions (like nysiis and other hybrid phonetic algorithms).

    But I think the most significant one is the introduction of the stopwordsauto option. This option generates a list of stopwords automatically based on overall frequencies (i.e. grams per observation). In a nutshell, -matchit- will ignore these grams throughout the whole process (indexation, weights, and computation of final results), which will likely improve the efficiency of indexation, at the likely cost of ignoring some potential matches.

    As you can see below, this option is applied to the same example from the previous post. Everything is set exactly the same except for the option stopw (short for stopwordsauto). Note that the output of the diagnose option has changed slightly in order to refer more clearly to the stopwordsauto threshold (which can be set with the option swthreshold()). What was previously reported as percent is now reported as grams_per_obs. By default this threshold is set to .2, which means that grams found on average more than once every five observations are ignored. In this case, these are only ", ", "an", and "er", as reported in the third table of the diagnose output.

    Comparing the two posts, what took slightly less than 7 minutes now takes 2 minutes. However, it is also worth mentioning that results may differ, as the similarity score is not computed in exactly the same way.

    Code:
    . use medium, clear
    . matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) ti di f(1) stopw
    Matching current dataset with mediumlarge.dta
    Similarity function: bigram
     4 May 2016 10:35:58
     
    Performing preliminary diagnosis
    --------------------------------
     
    Analyzing Master file
    List of most frequent grams in Master file:
    
           grams   freq   grams_per_obs  
      1.      ,    1139          1.1390  
      2.      er    217          0.2170  
      3.      an    205          0.2050  
      4.       J    183          0.1830  
      5.       C    176          0.1760  
      6.      on    171          0.1710  
      7.      ar    167          0.1670  
      8.      or    162          0.1620  
      9.       I    149          0.1490  
     10.      en    141          0.1410  
     11.       S    124          0.1240  
     12.       M    121          0.1210  
     13.       R    114          0.1140  
     14.      ch    113          0.1130  
     15.      ra    111          0.1110  
     16.       A    110          0.1100  
     17.      in    110          0.1100  
     18.       D    109          0.1090  
     19.       L    106          0.1060  
     20.      n,    104          0.1040  
     
    Analyzing Using file
    List of most frequent grams in Using file:
    
           grams    freq   grams_per_obs  
      1.      ,    11079          1.1079  
      2.      an    2144          0.2144  
      3.      er    2115          0.2115  
      4.       J    1795          0.1795  
      5.      ar    1794          0.1794  
      6.      on    1632          0.1632  
      7.       C    1539          0.1539  
      8.       I    1448          0.1448  
      9.       M    1349          0.1349  
     10.      en    1307          0.1307  
     11.      or    1302          0.1302  
     12.       R    1260          0.1260  
     13.       A    1252          0.1252  
     14.      ic    1191          0.1191  
     15.       S    1132          0.1132  
     16.      n,    1125          0.1125  
     17.       D    1124          0.1124  
     18.      in    1085          0.1085  
     19.      ha    1025          0.1025  
     20.      ra    1024          0.1024  
    (638 real changes made)
    (1 real change made)
     
    Overall diagnosis
    Pairs being compared: Master(1000) x Using(10000) = 10000000
    Estimated maximum reduction by indexation (%):0
    (note: this is an indication, final results may differ)
     
    List of grams with greater negative impact to indexation:
    (note: values are estimated, final results may differ)
    
           grams   crosspairs   max_common_space   grams_per_obs  
      1.      ,      12618981             100.00          1.1107  
      2.      er       458955               4.59          0.2120  
      3.      an       439520               4.40          0.2135  
      4.       J       328485               3.28          0.1798  
      5.      ar       299598               3.00          0.1783  
      6.      on       279072               2.79          0.1639  
      7.       C       270864               2.71          0.1559  
      8.       I       215752               2.16          0.1452  
      9.      or       210924               2.11          0.1331  
     10.      en       184287               1.84          0.1316  
     11.       M       163229               1.63          0.1336  
     12.       R       143640               1.44          0.1249  
     13.       S       140368               1.40          0.1142  
     14.       A       137720               1.38          0.1238  
     15.       D       122516               1.23          0.1121  
     16.      in       119350               1.19          0.1086  
     17.      n,       117000               1.17          0.1117  
     18.      ra       113664               1.14          0.1032  
     19.      ch       112322               1.12          0.1006  
     20.      ic       104808               1.05          0.1163  
     
    Loading USING file: mediumlarge.dta
    Generating stopwords automatically, threshold set at:.2
    Done!
    Indexing USING file.
     4 May 2016 10:36:04-> 0%
     4 May 2016 10:36:04-> 1%
     4 May 2016 10:36:04-> 2%
     4 May 2016 10:36:04-> 3%
    ...
     4 May 2016 10:36:07-> 97%
     4 May 2016 10:36:07-> 98%
     4 May 2016 10:36:07-> 99%
     4 May 2016 10:36:07-> Done!
    Computing results
     4 May 2016 10:36:07->  Percent completed ...   (search space saved by index so far)
     4 May 2016 10:36:09->  1%                ...   (48%)
     4 May 2016 10:36:10->  2%                ...   (52%)
     4 May 2016 10:36:11->  3%                ...   (53%)
     4 May 2016 10:36:12->  4%                ...   (54%)
    ...
     4 May 2016 10:37:54->  97%               ...   (57%)
     4 May 2016 10:37:55->  98%               ...   (57%)
     4 May 2016 10:37:55->  99%               ...   (57%)
     4 May 2016 10:37:57->  Done!
    Total search space saved by index: 57%
     4 May 2016 10:37:57
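
    For reference, a minimal sketch of the same command with the threshold tightened to an illustrative value of .1 (swt() abbreviates swthreshold()):

    Code:
    . matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) ti di f(1) stopw swt(.1)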



    • #17
      Hello Julio,

      Thank you for creating this new command! I am trying to fuzzy match two datasets with company names and websites. An error message popped up saying "'weburl' found where numeric variable expected". Do you have any suggestions on how to deal with this error? Also, do you have an example for the similarity score option? Thank you!



      • #18
        Hi Fiona,

        I probably need more information on what you are trying to do and what your variables are. It seems to me that you are using a string variable as the "id" (in either the master or using file) when you need a numeric one. If "weburl" is the identifier you want to use, just do something like the code below and use the new variable as the id:

        Code:
        egen mynewid=group(weburl)


        The similarity scores are explained in the help section "Notes on the different scoring options". My practical suggestion is to use minsimple if you care less about what does not match than about what you actually match. For instance, if you do not care about the difference between "My Big Corporation" and "The Small Company, part of My Big Corporation", or between "My Great University" and "My Great University, Lab of Smaller topics", then use minsimple. If you do care, use the default.
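
        To illustrate the difference, here is a minimal sketch (the data are made up, and it assumes -matchit-'s two-column syntax as described in its help file; s() abbreviates score()):

        Code:
        clear
        input str20 name1 str50 name2
        "My Big Corporation" "The Small Company, part of My Big Corporation"
        end
        matchit name1 name2, gen(score_default)
        matchit name1 name2, gen(score_min) s(minsimple)
        list // minsimple should score this pair much higher than the default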

        Best,

        J.



        • #19
          Thank you for your response! I actually need the weburl to be a string because I am fuzzy matching URLs from two different datasets. Would it be possible to work around this error?

          Thank you!



          • #20
            You will need to create a numeric id for each weburl string variable (i.e. in each dataset). Assuming your files are named file1 and file2, your code will look something like the following:

            Code:
            use file1.dta
            egen id1=group(weburl)
            save newfile1.dta
            use file2.dta
            egen id2=group(weburl)
            save newfile2.dta
            matchit id2 weburl using newfile1.dta, idu(id1) txtu(weburl)



            • #21
              Oh! I see what you are saying. Thanks a lot!



              • #22
                Hi Julio, I was wondering if matchit is also able to determine similarities within one string variable. I have a variable with around 600,000 self-reported occupations. I would like to somehow cluster them first and then assign numbers to them.

                Example:

                Job                               Similarity of Jobs
                Starbucks                         1
                Sterbuksch                        1
                work at Starbucks                 1
                brewing coffee at Starbucks       1
                waiter Starbucks since a while    1

                University Arkansas               2
                University Durham                 2
                Eberhard University               2
                LMU Universität Deutschland       2


                Or do you know any other ado file or code to identify similar jobs? Thereafter, I would like to cluster these and create numerical values for them.

                Many thanks

                Philip






                • #23
                  Hi Philip,

                  Simply using -matchit- to match a file against itself could do the trick. Of course, fine-tuning the precise algorithm might take some thought. But here is a working example:



                  Code:
                  // just the example file
                  tempfile myfile
                  clear all
                  input str244 Job
                  "Starbucks"
                  "Sterbuksch"
                  "work at Starbucks"
                  "brewing coffee at Starbucks"
                  "waiter Starbucks since a while"
                  "University Arkansas"
                  "University Durham"
                  "Eberhard University"
                  "LMU Universität Deutschland"
                  end
                  list
                  gen id=_n
                  save `myfile', replace
                  // This is the command
                  use `myfile', clear
                  matchit id Job using `myfile', idu(id) txtu(Job) s(minsimple)
                  gsort -similscore
                  list
                  keep if similscore>=.5 // You will need to think about a threshold here
                  
                  // What follows rebuilds your data with the new group id
                  keep id*
                  ren id id2
                  gen long new_id = _n
                  reshape long id, i(new_id) j(n)
                  drop n
                  duplicates drop
                  * ssc install group_id // only if not already installed (by Robert Picard)
                  group_id new_id , matchby(id)
                  duplicates drop
                  merge 1:1 id using `myfile'
                  list



                  • #24
                    I'm adding here my slides from the 2016 Swiss Stata Users Group meeting, which contain some useful examples. My previous post is based on slide #10.

                    The slides can also be found here: http://www.stata.com/meeting/switzerland16/#proceedings



                    • #25
                      Great! Thank you very much!



                      • #26
                        I tried it with the little dataset and it worked perfectly. Now, with the 600,000 occupations, the matchit command seems to take way too long (6 hours without any result). Is that possibly too much data? Or do I need to make an adjustment?



                        • #27
                          This is where it stops:

                          Indexing USING file.
                          0%
                          20%
                          40%
                          60%
                          80%
                          Done!
                          Computing results
                          Percent completed ... (search space saved by index so far)



                          • #28
                            Hi Philip, if I understand correctly you are trying to compare 36*10^10 pairs (i.e. 600,000 × 600,000), which is a lot of computation. There are some tips that can help reduce the actual space you are searching (see the sketch after this list):

                            - First, you should be sure you are removing any duplicate jobs in the original file before using matchit.
                            - Second, you could use a different algorithm aimed at reducing comparisons. By default -matchit- uses sim(bigram) (which is the same as sim(ngram,2)), but you could use sim(ngram,3) or sim(ngram,4) instead. The longer the grams you select, the fewer pairs of observations should be compared (at the expense of taking longer to produce the index and maybe missing some potential matches).
                            - Third, you could use the stopw option, which avoids comparing pairs based on grams that are too common. The default threshold is .2 grams per observation, but you can change it by setting the option swt(). The more you reduce it, the fewer pairs are compared, at the expense of potentially missing good matches (by skipping the most frequent grams) and of inflating the similarity score (by ignoring those grams in the comparison).
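
                            Putting the three tips together, a minimal sketch (file and variable names are illustrative, and both the gram length and the threshold are starting points to tune):

                            Code:
                            * drop exact duplicates before matching, then self-match using
                            * longer grams and automatic stopwords at a stricter threshold
                            use occupations.dta, clear
                            duplicates drop Job, force
                            gen long id = _n
                            save occupations_nodup.dta, replace
                            matchit id Job using occupations_nodup.dta, idu(id) txtu(Job) ///
                                sim(ngram,3) stopw swt(.1)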

                            Best,

                            J.




                            • #29
                              Hi Julio,

                              First of all props for writing this very useful package, and also for replying to questions and comments for three years now.

                              I come to you because something very odd happened today. I had been using the package throughout the day without running into any issues. Stata crashed 3 or 4 times, but the same has happened to me on this computer many times before. I only mention it because at some point after a crash I figured the computer could use some rest, so I turned it off normally and let it rest for 15 minutes. After turning it back on and running the same code I had been using earlier today, I now get the following error:
                              Code:
                              tokenwrap not found as a similarity function. Check spelling.
                              Mata run-time error
                              r(3499);
                              This happened while trying sim(tokenwrap, "soundex_fk") and sim(nysiis_fk). I then tried it with the default and it did run. I prefer soundex_fk because it runs MUCH faster than the default, and it also makes more sense given the string variables that I am using to match these datasets.

                              I tried reinstalling the package and used adoupdate, update to make sure all my packages are updated; my Stata is also up to date (version 14.2).

                              Have you run into similar issues, or do you have an idea of what could be going wrong and what I could do to fix it?

                              Thanks in advance,



                              • #30
                                Just in case you might want to inspect my code, here it is (after renaming the variables):
                                Code:
                                tempfile master using
                                
                                use tempmastersource , clear
                                keep idmaster txtmaster 
                                duplicates drop
                                drop if txtmaster==""
                                duplicates report idmaster
                                compress
                                save `master'
                                              
                                use tempusingsource, clear
                                keep idusing txtusing
                                duplicates drop
                                drop if txtusing==""
                                duplicates report idusing
                                compress
                                save `using'
                                
                                
                                use `master', clear
                                matchit idmaster txtmaster using `using', idusing(idusing) txtusing(txtusing) sim(tokenwrap, "soundex_fk") di time stopw gen(namematch11)
                                Everything runs well, but at the last step this happens:
                                Code:
                                . matchit idmaster txtmaster using `using', idusing(idusing) txtusing(txtusing) di time sim(tokenwrap, "soundex_fk") stopw gen(namematch11)
                                Matching current dataset with C:\Users\zambrana\AppData\Local\Temp\ST_04000002.tmp
                                Similarity function: tokenwrap
                                22 Mar 2017 17:36:11
                                 
                                Performing preliminary diagnosis
                                --------------------------------
                                 
                                Analyzing Master file
                                tokenwrap not found as a similarity function. Check spelling.
                                Mata run-time error
                                r(3499);
                                
                                end of do-file
                                
                                r(3499);
                                The text variables I am using to match the tables contain names of educational institutions, and neither has missing values. There are no duplicates by either ID variable. Also, the first file has 2457 observations, and the second one 6496.
