  • What are the differences between -matchit- and -reclink-?

    Someone asked me the above question in a private message, but I suspect the answer may interest other Statalist users. Moreover, others may have a different take on it that corrects or improves my view.

    As a starting point, both -reclink- and -matchit- share the ability to join two different Stata datasets based on non-exact string keys (i.e. variables). However, they differ in many other respects, which makes them sometimes complementary and sometimes alternatives.

    Last time I checked, the main difference in favor of -reclink- over -matchit- was that it applies bigram fuzzy matching to a set of columns of each dataset in one step (also allowing different scores for each pair of columns). -matchit- can replicate this functionality, but in several steps: you need to run -matchit- using the two-datafiles syntax for one text column of each dataset and then (after merging the results with the original master and using files) rerun -matchit- using the two-columns syntax for as many pairs of columns as you would like to compare. Another main advantage of -reclink- is the orblock functionality, which one colleague told me is very useful. Again, this can also be performed with -matchit-, but in two steps: you first perform a -joinby- of the master and using datasets by the variable(s) identifying the subgroups, and then apply -matchit- using the two-columns syntax.
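    A minimal sketch of that two-step emulation, using hypothetical files master.dta and using.dta that share a blocking variable zipcode and hold the name variables name_m and name_u:

    Code:
    * hypothetical sketch: emulating -reclink-'s orblock with -joinby- + -matchit-
    use master.dta, clear
    joinby zipcode using using.dta       // pair only records sharing the blocking variable
    matchit name_m name_u, g(simname)    // score each candidate pair with the column syntax
    gsort -simname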

    Another advantage of -reclink- is that it can be faster than -matchit-. But this goes hand in hand with an often hidden disadvantage of -reclink-: it does not report other potential matches for those pairs of records that are exactly the same. Indeed, because -matchit- takes advantage of indexation, it will be faster than -reclink- the fewer perfect matches there are between the two datasets.

    I see the fact that -matchit- reports all potential matches as a strong advantage in its favor. First, it allows cleaning one dataset with messy string entries by simply matching it against itself, which is impossible with the current version of -reclink-. But even in the case of two datasets, this could be a strong limitation of -reclink-, as I quite often need all potential matches (those that score one, but also those that score less) for each observation.
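    A minimal sketch of that self-matching idea, with a hypothetical file messy.dta holding variables id and name, and assuming -matchit- renames clashing using-file variables by suffixing a 1:

    Code:
    * hypothetical sketch: cleaning one messy file by matching it against itself
    use messy.dta, clear
    matchit id name using messy.dta, idu(id) txtu(name) t(0.7)
    drop if id >= id1        // keep each pair once and drop trivial self-matches
    gsort -similscore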

    A second group of advantages for -matchit- relates to its greater flexibility in the similarity function, as it allows users to pick from a large variety of similarity functions beyond bigram, in particular when these are combined with the gram-weighting functionality and the different ways of computing the similarity score. Moreover, if all of these fail to satisfy users, it also allows them to include their own custom techniques by coding them in Mata. In my experience, there is no silver-bullet similarity technique for all cases; it really depends on the kind of data you're struggling with, which makes flexibility a must.
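    For illustration, a sketch of a few of these options in the column syntax; the sim() and weights() option names are from my reading of -matchit-'s help file, so check -help matchit- for the exact spelling in your installed version:

    Code:
    * sketch: different similarity functions and gram weights on the same pair of columns
    matchit name1 name2, g(sim_bigram)                // default: bigram
    matchit name1 name2, g(sim_token) sim(token)      // token (word) based
    matchit name1 name2, g(sim_sndx)  sim(soundex)    // phonetic
    matchit name1 name2, g(sim_wlog)  weights(log)    // down-weight very frequent grams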

    Third, as mentioned above, -matchit- can also be applied to columns within the same dataset. This makes it simple to compare the text similarity of two columns with all the different approaches available. Somewhat related to this, -matchit- can be applied to str245+ and strL variables, which makes it possible to use -matchit- to compare long pieces of text instead of only names (e.g. matching scientific papers and patents by the similarity of their abstracts or even full text).
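    A minimal sketch of that long-text use, assuming hypothetical strL variables abstract1 and abstract2 sitting side by side in one dataset; token similarity is usually a more natural choice than bigram for running text:

    Code:
    * sketch: scoring the similarity of two long-text (strL) columns
    matchit abstract1 abstract2, g(simabs) sim(token)
    gsort -simabs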

    I welcome any comments or corrections if someone finds this unclear or not correct enough. Cheers,

    J.

  • #2
    Hi Julio Raffo,

    First -- thanks much for matchit. It's been of real value to my work. Two quick questions, if you wouldn't mind...

    First, about using matchit to match on multiple variables.

    You said...
    Last time I checked, the main difference in favor of -reclink- over -matchit- was that it applies bigram fuzzy matching to a set of columns of each dataset in one step (also allowing different scores for each pair of columns). -matchit- can replicate this functionality, but in several steps: you need to run -matchit- using the two-datafiles syntax for one text column of each dataset and then (after merging the results with the original master and using files) rerun -matchit- using the two-columns syntax for as many pairs of columns as you would like to compare.
    I'm still a little fuzzy (no pun intended) on the details of how you would implement this. For the very simple example datasets provided below, how would you use matchit on both name and city?

    Second, is there any way of fuzzy matching both string and numeric variables at the same time? For example, with the example datasets, would it be possible to fuzzy match on name, city, and age? My only idea was to use matchit on name and city, then choose the closest match that is also within, say, two years in terms of age. Any better ideas?

    Thanks so much! Here are a couple of example datasets...

    Source A:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str13 name str14 city byte age
    "John Doe"      "Atlanta"        28
    "William Green" "Seattle"        22
    "Emily Jones"   "Denver"         48
    "Sarah Brown"   "Chicago"        60
    "Dan Kennedy"   "Washington, DC" 12
    end
    Source B:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str14 name str10 city byte age
    "Doe, Jon"       "ATL"        27
    "Will Green"     "SEA"        23
    "Jone, Emily"    "denver"     46
    "Sara Brown"     "chicago"    62
    "Daniel Kennedy" "WASHINGTON" 12
    end



    • #3

      Hi William,

      Here follows a practical example of how to use matchit recursively in order to mimic reclink's multiple-column trait. If you want to apply matchit to numeric variables, you simply need to convert them to strings first. But I see little value in such an approach, because "1999" will be considered completely different from "2000", while "2001" will be considered relatively similar. Instead, you can use more creative numeric approaches; I put one example below.



      Code:
      clear
      input str13 name1 str14 city1 byte age1
      "John Doe"      "Atlanta"        28
      "William Green" "Seattle"        22
      "Emily Jones"   "Denver"         48
      "Sarah Brown"   "Chicago"        60
      "Dan Kennedy"   "Washington, DC" 12
      end
      gen id1=_n
      save file1.dta, replace
      
      clear
      input str14 name2 str10 city2 byte age2
      "Doe, Jon"       "ATL"        27
      "Will Green"     "SEA"        23
      "Jone, Emily"    "denver"     46
      "Sara Brown"     "chicago"    62
      "Daniel Kennedy" "WASHINGTON" 12
      end
      gen id2=_n
      save file2.dta, replace
      
      // 1- matchit first by the most relevant pair of columns
      matchit id2 name2 using file1.dta, idu(id1) txtu(name1) t(0)
      gsort -similscore
      list
      // 2- bring back the other columns
      joinby id1 using file1.dta
      joinby id2 using file2.dta
      
      // 3- match it using the column syntax for as many columns as you want
      matchit city1 city2 , g(simcity)
      // clean the data (e.g. same case or removing symbols)
      gen CITY1=upper(city1)
      gen CITY2=upper(city2)
      matchit CITY1 CITY2 , g(simCITY)
      
      // 4- you can also use non-fuzzy string approaches here (e.g. numeric variables)
      gen simage=1-abs(age1-age2)/max(age1,age2)
      
      // 5- Generate the uber similarity score of your taste/needs
      gen superscore1=similscore*simCITY*simage
      gen superscore2=(similscore+simCITY+simage)/3
      gen superscore3=.5*similscore+.25*simCITY+.25*simage
      gsort -superscore3



      • #4
        I would just add that to do approximate matching on a numeric variable such as age, William should look at -rangejoin-, by Robert Picard, available from SSC.
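        For instance, a minimal sketch using -rangejoin- on the example files created in #3 (low and high are hypothetical helper variables):

        Code:
        * sketch: pair each file1 record with all file2 records within +/- 2 years of age
        use file1.dta, clear
        gen low  = age1 - 2
        gen high = age1 + 2
        rangejoin age2 low high using file2.dta   // age2 is the key variable in the using file
        matchit name1 name2, g(simname)           // then fuzzy-score names within each pair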



        • #5
          Julio Raffo thanks so much, that was exactly what I was looking for. Clyde Schechter -- great, I'll check out rangejoin as well.



          • #6
            Yes, if the most informative pair of variables is numeric, you could use rangejoin as step one and then follow with matchit in the column syntax. But take a moment to think about which pair is best to start with, because things can easily get out of hand. In the above example, starting with name seems more appropriate than city or age. For instance, if you are working with census data, you may have too many observations in the same city or the same age range that you do not need to compare, so you will spare a lot of computation time by starting with names. But it really depends on the data you're attempting to match.



            • #7
              Yes, in my data (survey data on similarly aged school children), there's huge overlap in age, but much less in names. I also have gender and would like to restrict matches to those with the same gender (I would use "required" with reclink). Any thoughts on the most efficient way to do that? You said...
              Another main advantage of -reclink- is the orblock functionality, which one colleague told me is very useful. Again, this can also be performed with -matchit-, but in two steps: you first perform a -joinby- of the master and using datasets by the variable(s) identifying the subgroups, and then apply -matchit- using the two-columns syntax.
              I'm thinking this would have the same issue you just mentioned above -- there would obviously be MANY overlapping observations if I started with a joinby on gender.



              • #8
                I think that matchit should rather be compared with reclink2 (SJ-15-3), which introduces two new options, manytoone and npairs(). My concern is that reclink2 is slow. So, if -matchit- is slower than the older reclink, then there is a problem. Approximately how fast or slow are the programs with larger datasets, in your experience? For example, what is the performance with 1,000, 100,000, or 1 million records in both datasets?



                • #9
                  William Blackmon : yes, the principle is the same. You could first do a joinby using the orblock variable and then apply matchit in the column syntax. But this is only computationally efficient if that variable does not inflate the number of unnecessary comparisons. Gender definitely is not a great one, but postal codes may be a good one. This being said, you can always split the files by gender and run matchit on each part separately, as sketched below. If gender is correctly coded in your files, this will greatly reduce the unnecessary comparisons.
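                  A minimal sketch of that split-and-match approach, with hypothetical file and variable names (baseline.dta and endline.dta, a 0/1 gender variable, ids id_bl and id_el, and a name variable in each file):

                  Code:
                  * sketch: run matchit within each gender separately to avoid cross-gender comparisons
                  forvalues g = 0/1 {
                      use baseline.dta, clear
                      keep if gender == `g'
                      save bl_`g'.dta, replace
                      use endline.dta, clear
                      keep if gender == `g'
                      matchit id_el name_el using bl_`g'.dta, idu(id_bl) txtu(name_bl)
                      save matched_`g'.dta, replace
                  }
                  use matched_0.dta, clear
                  append using matched_1.dta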



                  • #10
                    Anders Alexandersson : in my experience reclink seems faster, but it is actually slower than matchit; the appearance of speed comes from reclink removing the perfect matches before doing the fuzzy match. The performance of matchit varies with the search space created by the two files and with how efficiently the indexation avoids looking everywhere in that search space. The latter can be tweaked by choosing the most appropriate similarity algorithm and by using the automatic stopword functionality. You can find some examples here: http://www.statalist.org/forums/foru...-e-fuzzy-match
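                    For example, a sketch of the kind of tuning I have in mind; stopwordsauto is the option name as I recall it from -matchit-'s help file, so double-check it in your installed version:

                    Code:
                    * sketch: speed up a large match by switching the similarity function and
                    * letting matchit drop very frequent grams automatically
                    matchit id name using big.dta, idu(id2) txtu(name2) sim(token) stopwordsauto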

                    I have tried reclink2 once, but I couldn't make it behave differently from reclink regarding the perfect matches. But I probably didn't try hard enough. Any example on this that you could share with me?



                    • #11
                      Julio Raffo: You wrote,
                      I have tried reclink2 once, but I couldn't make it behave differently from reclink regarding the perfect matches. But I probably didn't try hard enough. Any example on this that you could share with me?
                      That is what the manytoone option is for. The help file has an example, but the example in the SJ article on pages 690-691 is much better. In the SJ example, only by adding the manytoone option do you find the potential matches:

                      Now, we see that rid #6 and rid #7 are matched with the correct record from the firm dataset.



                      • #12
                        I remember having the same issue described here: http://www.statalist.org/forums/foru...-with-reclink2, but I'll try the SJ example tomorrow when I'm at my computer.



                        • #13
                          Also check out -strgroup- (SSC), which does fuzzy matching via Levenshtein distance and lets you set a threshold to 'tune' how many typos the matches tolerate. FWIW, this saved my bacon on a project!
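                          A minimal sketch of -strgroup- with hypothetical file and variable names; gen() and threshold() are the options as I recall them, so check -help strgroup- for details:

                          Code:
                          * sketch: group names whose normalized Levenshtein distance is within the threshold
                          use names.dta, clear
                          strgroup name, gen(grp) threshold(0.25)
                          sort grp
                          list name grp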
                          ____________________________________________________
                          Assistant Professor, Department of Biostatistics and Epidemiology
                          School of Public Health and Health Sciences
                          University of Massachusetts- Amherst



                          • #14
                            Okay, one more question -- and it may be that this is too much of an ask -- if so, please disregard!

                            I'm still struggling with the next step after actually implementing matchit -- that is, selecting the best matches. I am matching across two datasets: baseline and endline. Not sure if this really matters, but conceptually I'm trying to find a match for every endline observation from the baseline dataset (and not the other way around).

                            I'd like to, for every endline observation, take the best match from the baseline set. However, this will lead to duplicates in terms of baseline data, as some baseline observations will be the best match for multiple endline observations. So whenever this happens, I'd like to keep the best match, but "unmatch" the rest of the duplicate baseline matches. For all the endline observations that get unmatched, I'd like to find the 2nd best match (that isn't already a first best match), and use these. Again, there will be duplicates on baseline, so I'd continue this process until all endline observations have a match, and the final dataset has no duplicates in terms of endline and baseline observations. Quite confusing! Maybe there's an easier way to think about it?

                            Anyway, here's an attempt at code that follows that thought process -- but I think I still have many errors. No need to provide detailed debugging, but any advice on how to approach the problem is much appreciated.

                            Code:
                            // 7- Select the best matches and remove the rest.
                            //      a. For every endline observation, select the baseline observation with the best matchscore.
                                        egen maxscore = max(matchscore_overall), by(id_individual_el) // max score among baseline obs matched to each endline ob
                                        gen id_individual_bl_matched = .
                                        replace id_individual_bl_matched = id_individual_bl if matchscore_overall == maxscore & matchscore_overall >= $minscore
                                        drop maxscore
                            //      b. It is possible that the same baseline observation is the best match for multiple endline observations.
                            //           In this case, keep the baseline observation matched to the endline observation with the highest match score, and "unmatch" the rest.
                                        egen maxscore = max(matchscore_overall), by(id_individual_bl_matched) // max score among endline obs claiming the same baseline ob
                                        replace id_individual_bl_matched = . if matchscore_overall != maxscore // "unmatching" the rest
                                        gen matchscore_final = .
                                        replace matchscore_final = matchscore_overall if id_individual_bl_matched != . // final matchscore for finally matched obs
                                        replace matchscore_overall = -1 if matchscore_final != . // reset overall matchscore for finally matched obs to -1
                                        drop maxscore
                            //      c. For those endline observations now without a baseline match, assign the second-highest baseline match.
                            //           Note: the conditions belong in -if- qualifiers; a command-level -if- only tests the first observation.
                            //           (This does not yet stop a baseline ob that is already a final match from being claimed again.)
                                        egen maxscore = max(matchscore_overall) if id_individual_bl_matched == ., by(id_individual_el) // next-highest score per endline ob
                                        replace id_individual_bl_matched = id_individual_bl if id_individual_bl_matched == . & matchscore_overall == maxscore & matchscore_overall >= $minscore
                                        drop maxscore
                            //      d. Repeat steps b-c until all endline observations have a unique baseline match.
                            //           (This loops forever if some endline obs never reach $minscore, so drop or flag those first.)
                                        local incomplete = 1
                                        while `incomplete' > 0 {
                                            repeatbc
                                            quietly count if matchscore_final == .
                                            local incomplete = r(N)
                                        }
                            
                            save "$dta\intermediate\matched_`1'", replace
                            
                            end
                            
                            * Define a program that repeats steps b-c (it must be defined before the code above runs)
                            capture program drop repeatbc
                            program define repeatbc
                            //      b. Keep each duplicated baseline ob only for its highest-scoring endline match, and "unmatch" the rest.
                                        egen maxscore = max(matchscore_overall), by(id_individual_bl_matched)
                                        replace id_individual_bl_matched = . if matchscore_overall != maxscore
                                        replace matchscore_final = matchscore_overall if id_individual_bl_matched != . & matchscore_final == . // do not overwrite earlier finals
                                        replace matchscore_overall = -1 if matchscore_final != .
                                        drop maxscore
                            //      c. For endline observations still without a baseline match, assign the next-best remaining candidate.
                                        egen maxscore = max(matchscore_overall) if id_individual_bl_matched == ., by(id_individual_el)
                                        replace id_individual_bl_matched = id_individual_bl if id_individual_bl_matched == . & matchscore_overall == maxscore & matchscore_overall >= $minscore
                                        drop maxscore
                            end



                            • #15
                              Anders Alexandersson : The manytoone option is then not doing what I thought, which is probably better described as many-to-many. Both reclink and reclink2 stop looking for alternative candidates once they find a perfect match; matchit keeps looking for them. That is why it is hard to compare performance: they do not deliver the same kind of output. I personally prefer to get everything, as you never know what will slip through the cracks. But I can see that other people may prefer such an option.

                              See the following example using the SJ data.

                              Code:
                              . use respondent_employers, clear
                              . cap stnd_compname firm_name, gen(stn_name stn_dbaname stn_fkaname entitytype attn_name)
                              . save respondent_employers.dta, replace
                              file respondent_employers.dta saved
                              
                              . use firm_dataset, clear
                              . cap stnd_compname firm_name, gen(stn_name stn_dbaname stn_fkaname entitytype attn_name)
                              . save firm_dataset.dta, replace
                              file firm_dataset.dta saved
                              
                              . use respondent_employers, clear
                              . qui reclink2 stn_name using firm_dataset, idm(rid) idu(firm_id) gen(rlsc)
                              . list rid stn_name firm_id Ustn_name rlsc , table str(15) sep(0)
                              
                                   +------------------------------------------------------------+
                                   | rid          stn_name   firm_id         Ustn_name     rlsc |
                                   |------------------------------------------------------------|
                                1. |   1              7 11         1              7 11   1.0000 |
                                2. |   2            BT & T        14            BB & T   0.7712 |
                                3. |   3            AT & T         2            AT & T   1.0000 |
                                4. |   4            KROGER         7            KROGER   1.0000 |
                                5. |   5   WAL MART STORES         8   WAL MART STORES   1.0000 |
                                6. |   6           WLAMART         .                          . |
                                7. |   7           WALMART         .                          . |
                                   +------------------------------------------------------------+
                              
                              . use respondent_employers, clear
                              . qui reclink2 stn_name using firm_dataset, idm(rid) idu(firm_id) gen(rlsc) many
                              . list rid stn_name firm_id Ustn_name rlsc , table str(15) sep(0)
                              
                                   +------------------------------------------------------------+
                                   | rid          stn_name   firm_id         Ustn_name     rlsc |
                                   |------------------------------------------------------------|
                                1. |   1              7 11         1              7 11   1.0000 |
                                2. |   2            BT & T         2            AT & T   0.9471 |
                                3. |   3            AT & T         2            AT & T   1.0000 |
                                4. |   4            KROGER         7            KROGER   1.0000 |
                                5. |   5   WAL MART STORES         8   WAL MART STORES   1.0000 |
                                6. |   6           WLAMART         .                          . |
                                7. |   7           WALMART         8   WAL MART STORES   0.7901 |
                                   +------------------------------------------------------------+
                              
                              . use respondent_employers, clear
                              . qui matchit rid stn_name using firm_dataset.dta, idu(firm_id) txtu(stn_name)
                              . gsort rid -similscore
                              . list, table str(15) sep(0)
                              
                                   +---------------------------------------------------------------+
                                   | rid          stn_name   firm_id         stn_name1   similsc~e |
                                   |---------------------------------------------------------------|
                                1. |   1              7 11         1              7 11           1 |
                                2. |   2            BT & T         2            AT & T          .8 |
                                3. |   2            BT & T        14            BB & T          .6 |
                                4. |   3            AT & T         2            AT & T           1 |
                                5. |   3            AT & T        14            BB & T          .6 |
                                6. |   4            KROGER         7            KROGER           1 |
                                7. |   5   WAL MART STORES         8   WAL MART STORES           1 |
                                8. |   6           WLAMART         9             KMART   .61237244 |
                                9. |   7           WALMART         9             KMART   .61237244 |
                               10. |   7           WALMART         8   WAL MART STORES   .54554473 |
                                   +---------------------------------------------------------------+

