
  • #16
    William Blackmon : I'm confused by the setup of your problem. How likely is it that observations in the first file are repeated in the second file (and vice versa)? Are all observations in the first file expected to have one (or many) matches in the second file?

    In any case, you could use duplicates to help you eliminate cases. For instance,

    Code:
    gsort id1 -superscore3
    duplicates drop id1, force
    will keep only the best score for each id1 in my previous example posted in #3. Beware that you may miss cases with identical superscore3 within the same id1.



    • #17
      Sorry for the confusion. There should not be any actual duplicates across the files. I'm talking about the duplicates generated by matchit itself. The issue I'm struggling with is automating the process of choosing the best match (according to matchit similarity score) in a way that doesn't create duplicates in terms of either dataset. For example:

      Dataset 1:       Dataset 2:
      John Doe         John Doe
      Jon Dow          Jonn Doe
      Mark Williams    Mark Williams

      Clearly, John Doe and Mark Williams will match, no problem. However, it is possible that Jonn Doe will match with John Doe rather than Jon Dow as it should. Because of cases like this, I need the program to choose only the top match in dataset 2 for any obs in dataset 1, and then choose the 2nd-best match (from the pool of unmatched dataset 2 obs) for the remaining unmatched obs in dataset 1. In fact, this process would need to repeat until each obs has a unique match.
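      The iterative procedure described above amounts to a greedy one-to-one assignment: take the highest-scoring candidate pair whose two ids are both still unmatched, repeat. A minimal sketch in Python (not Stata; the names and scores below are made up for illustration):

```python
def greedy_one_to_one(pairs):
    """pairs: list of (id1, id2, score) candidate matches.
    Returns a one-to-one matching, best scores first."""
    matched1, matched2, result = set(), set(), []
    # Highest similarity first, so the best available pair always wins.
    for id1, id2, score in sorted(pairs, key=lambda p: -p[2]):
        if id1 not in matched1 and id2 not in matched2:
            result.append((id1, id2, score))
            matched1.add(id1)
            matched2.add(id2)
    return result

pairs = [
    ("John Doe", "John Doe", 1.00),
    ("Jon Dow", "John Doe", 0.80),   # loses: "John Doe" is already taken
    ("Jon Dow", "Jonn Doe", 0.70),
    ("Mark Williams", "Mark Williams", 1.00),
]
matches = greedy_one_to_one(pairs)
```

      Here "Jon Dow" ends up paired with "Jonn Doe" because "John Doe" was claimed by its exact match first, which is the behavior requested above.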



      • #18
        Julio Raffo Yes, the blocking in reclink2 (and in reclink) relies on blocking variables without errors, which is the standard. Therefore, you typically run sequential phases using different combinations of blocking variables. Two main references are Christen (2012), and Harron, Goldstein, and Dibben (2015). This standard blocking logic originates from AutoMatch which was the dominant record linkage software up until maybe 10 years ago.

        Your original description of matchit as using joinby for the blocking is clearer than your more recent reference to many-to-many matching, I guess. Stata's joinby is better known outside the Stata community as an SQL outer join. Like your matchit program, R's record linkage package RecordLinkage by Sariyar and Borg (2010) also uses this "joinby" logic for blocking. In my testing, calling R's RecordLinkage from Stata was faster than using reclink2, because the R package uses C code.
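        For intuition, the blocking-as-join idea can be sketched as follows. This is a hypothetical Python illustration, not how matchit or RecordLinkage are actually coded: records are indexed by a blocking key, and candidate pairs are formed only within blocks sharing that key, exactly as a join on the key would do.

```python
from collections import defaultdict

def block_pairs(records_a, records_b, key):
    """Form candidate pairs only within blocks that share the same
    blocking key, instead of comparing every record with every other."""
    index = defaultdict(list)
    for r in records_b:
        index[key(r)].append(r)
    return [(a, b) for a in records_a for b in index[key(a)]]

a = ["John Doe", "Mark Williams"]
b = ["Jonn Doe", "Mark Williams", "Mary Wilson"]
# Toy blocking key: first letter of the name (illustrative assumption only).
candidates = block_pairs(a, b, key=lambda s: s[0])
```

        With this toy key the blocking yields 3 candidate pairs instead of the 6 a full cross-product would produce; real blocking keys (phonetic codes, tokens, etc.) scale this saving up dramatically.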

        reclink2 is based on the well-known theory of Fellegi and Sunter (1969), which, in practice, requires a clerical review of potential matches. Is matchit instead based on unsupervised machine learning? In that case, my concern with matchit is not overall speed (that should be its main advantage!) but rather how we can trust the results. Sariyar and Borg (2010, 65) note that, for unsupervised machine learning, "the quality of the resulting classification varies significantly for different datasets and poor results can occur".

        References:
        Christen, Peter. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
        Fellegi, Ivan, and Alan Sunter. 1969. A Theory of Record Linkage. Journal of the American Statistical Association 64 (328), 1183-1210.
        Harron, Katie, Harvey Goldstein, and Chris Dibben (editors). 2015. Methodological Developments in Data Linkage. Wiley.
        Sariyar, Murat, and Andreas Borg. 2010. The RecordLinkage Package: Detecting Errors in Data. The R Journal 2(2), 61-67.



        • #19
          William Blackmon (#17): Then the simple approach suggested in #16 should help you with your step a. I guess step b will imply applying it recursively. For instance, something like this may help you:

          Code:
          * flag whether either id still has duplicate candidate matches
          duplicates report id1
          scalar dup=cond(`r(unique_value)'!=_N,1,0)
          duplicates report id2
          scalar dup=cond(`r(unique_value)'!=_N,1,dup)
          while dup==1 {
            * rank candidates by score within each id (field: 1 = best)
            bysort id1: egen rank1=rank(superscore3), field
            bysort id2: egen rank2=rank(superscore3), field
            * keep pairs that are the best match for both ids
            gen bestmatch=cond(rank1==1 & rank2==1,1,0)
            bysort id1: egen done1=max(bestmatch)
            bysort id2: egen done2=max(bestmatch)
            * drop losing candidates for any id that is already matched
            drop if (done1==1 | done2==1) & bestmatch==0
            * recompute the duplicates flag for the next iteration
            duplicates report id1
            scalar dup=cond(`r(unique_value)'!=_N,1,0)
            duplicates report id2
            scalar dup=cond(`r(unique_value)'!=_N,1,dup)
            drop rank1 rank2 done1 done2 bestmatch
          }
          Please check it thoroughly because I haven't tested it beyond adding the following entries to the #3 example.

          Code:
          // file1
          "Daniel Kennedy"   "Seattle, Washington" 12
          "Daniel Kennedy"   "Washington" 20
           
          // file2
          "Damiel Kennedy"   "seattle" 12
          "Damiel Kennedy"   "WASHINGTON, DC" 20



          • #20
            @Anders Alexandersson (#18): thanks.

            As I see it, matchit is just a tool (or platform, if you prefer) to produce fuzzy string similarity across variables in the same dataset or observations of different datasets. It attempts to provide a wide range of well-established algorithms while benefiting from its indexation capabilities. In addition, it allows users to easily add new algorithms by coding them in a relatively simple fashion as Mata functions.

            Of course, as such, it can be used as part of a larger deterministic, probabilistic, or even machine-learning record-matching strategy. But matchit will always be subject to the assumptions the users impose on any of these. Please note that users can always combine matchit with a clerical review. For instance, the example mentioned in #3 can be combined with clrevmatch simply by running:
            Code:
            save matched12.dta
            clrevmatch using matched12.dta, ///
            idmaster(id1) idusing(id2) varM(name1 city1 age1) varU(name2 city2 age2) ///
            clrev_result(myscore) clrev_note(mynotes) newfilename(matched_clrev.dta)
            But again, this is a matter of taste (or methodology) for each user to decide.



            • #21
              Julio Raffo : Thank you. That gave me a better understanding of matchit.

              A string matching method I would like to see implemented in Stata is Double-Metaphone. "In general, Double-Metaphone seems to be generating encodings that are closer to correct pronunciation of names than NYSIIS." --Christen in Data Matching: Concepts and Techniques for Record Linkage (2012, 78).

              The method is implemented in the R package PGRdup and is being implemented in the R package phonics. You can call R from Stata, so a Stata implementation is not critical, but it would be nice to have.



              • #22
                @Anders Alexandersson: you probably know it already, but I think that wbuchanan has coded Double-Metaphone. But it might well be just the simple Metaphone.



                • #23
                  Anders Alexandersson Julio Raffo is correct. The source is available at https://github.com/wbuchanan/StataStringUtilities and can be installed using:

                  Code:
                  net inst http://wbuchanan.github.io/StataStringUtilities
                  help phoneticenc
                  That should install it and open the help file for the phonetic encodings that are supported. The package is based on the Apache Commons Codec library and has some string distance methods included as well.



                  • #24
                    @Julio Raffo -- Thanks, your code works well!



                    • #25
                      Julio Raffo wbuchanan : Thanks, I was not aware of it. Stata did not find it under search metaphone. I can access the files at https://github.com/wbuchanan/StataStringUtilities using a browser. I get this error message using updated Stata/MP 14.2 for Windows (64-bit x86-64):

                      Code:
                      . net inst http://wbuchanan.github.io/StataStringUtilities
                      file http://wbuchanan.github.io/stata.toc not found
                      http://wbuchanan.github.io/ either
                        1)  is not a valid URL, or
                        2)  could not be contacted, or
                        3)  is not a Stata download site (has no stata.toc file).
                      r(601);
                      I will try again later and on another PC to test if this is a temporary problem.



                      • #26
                        Anders Alexandersson that's pretty odd/bizarre. Maybe you're connecting from a proxy that doesn't want to allow Stata to communicate with that domain? When I check to see if the toc is present this is what I get:

                        Code:
                        . net from http://wbuchanan.github.io/StataStringUtilities
                        ---------------------------------------------------------------------------------------------------------------------------
                        http://wbuchanan.github.io/StataStringUtilities/
                        Stata String Utilities
                        ---------------------------------------------------------------------------------------------------------------------------
                        
                        PACKAGES you could -net describe-:
                            strutil           A Java-plugin based package for string distance/similarity metrics and phonetic string encodings.
                        ---------------------------------------------------------------------------------------------------------------------------
                        You could try using https, but you would need to restart your Stata in order for the Java binary to be found on the classpath.



                        • #27
                          wbuchanan : I can reproduce the results in #26.

                          The syntax for net install is
                          Code:
                          net install pkgname [, all replace force from(directory_or_url)]
                          Therefore, I am surprised your code in #23 works for you, since you provided a URL instead of a package name. I do not use a proxy, as checked in Stata under Edit -> Preferences -> General preferences -> Internet.

                          I was able to install the package strutil using this code:
                          Code:
                          net install strutil, from(http://wbuchanan.github.io/StataStringUtilities)




                          • #28
                            Earlier post was a mistake on my part but glad to hear you were able to get things installed.



                            • #29
                              Hi all, thank you very much for your previous explanations. I have a question in line with the above discussion of multiple variables. I am trying to merge two panel datasets, and I am wondering how I could best use either reclink or matchit to merge them.
                              I need to merge the data on company name and year. The two datasets differ in the name variable (e.g. "ABN AMRO holding" vs. "Abn Amro holding NV"), so for this variable I need a fuzzy match. The year variable, on the other hand, should be matched exactly (1:1).
                              The first dataset furthermore contains variables on assurance services, and the second one contains financials.
                              A final important remark is that the two datasets do not contain the same number of observations. It could be that a company is found in the first dataset but not in the second.
                              Here’s an example of what the data looks like:
                              Source A
                              name           year
                              AUDI AG        2013
                              AUDI AG        2015
                              AUDI AG        2017
                              AVINOR         2014
                              AVINOR         2015
                              AVINOR         2016
                              AVINOR         2017
                              AWISTA GmbH    2015
                              AWISTA GmbH    2016
                              AXA            2012
                              AXA            2013
                              AXA            2014
                              AXA            2015
                              AXA            2016
                              AXA            2017

                              Source B
                              Name                       year
                              AUDI AG                    2012
                              AUDI AG                    2013
                              AUDI AG                    2014
                              AUDI AG                    2015
                              AUDI AG                    2016
                              AUDI AG                    2017
                              AUDIKA GROUPE              2012
                              AUDIKA GROUPE              2013
                              AUDIKA GROUPE              2014
                              AUDINATE PTY LTD           2014
                              AUDINATE PTY LTD           2015
                              AUDINATE PTY LTD           2016
                              AUDINATE PTY LTD           2017
                              AUDIO PIXELS HOLDINGS LTD  2012
                              AUDIO PIXELS HOLDINGS LTD  2013



                              • #30
                                Fleur Ponne , if I were you I'd:

                                1) keep the unique names in the two sources
                                2) match these using matchit
                                3) drop those not meeting the desired similarity threshold
                                4) merge with the original two datasets
                                5) drop all observations where yearA != yearB
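
                                The five steps might be sketched like this (a Python illustration only, using the standard library's difflib as a stand-in for matchit's similarity score; the sample data and the 0.8 threshold are made-up assumptions):

```python
from difflib import SequenceMatcher

# Toy stand-ins for the two panel sources: (company name, year) rows.
src_a = [("ABN AMRO holding", 2015), ("AUDI AG", 2013)]
src_b = [("Abn Amro holding NV", 2015), ("AUDIKA GROUPE", 2013)]

def similarity(x, y):
    # Case-insensitive similarity in [0, 1]; matchit would supply this score.
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# 1-3) keep unique names, score all name pairs, keep those above a threshold
names_a = {n for n, _ in src_a}
names_b = {n for n, _ in src_b}
links = {a: b for a in names_a for b in names_b if similarity(a, b) > 0.8}

# 4-5) merge back to the original rows and keep only rows where years agree
merged = [(a, links[a], ya)
          for a, ya in src_a if a in links
          for b, yb in src_b if b == links[a] and ya == yb]
```

                                In this toy run only "ABN AMRO holding" / "Abn Amro holding NV" clears the threshold, and the exact-year condition in the last step plays the role of step 5.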
