  • What are the differences between -matchit- and -reclink-?

    Someone asked me the above question in a private message, but I suspect the answer may interest other Statalist users. Moreover, others may have a different take on it that corrects or improves my view.

    As a starting point, both -reclink- and -matchit- share the ability to join two different Stata datasets based on non-exact string keys (i.e. variables). However, they differ in many other respects, which makes them sometimes complementary and sometimes alternatives.

    Last time I checked, the main difference in favor of -reclink- over -matchit- was that it applies bigram fuzzy matching to a set of columns of each dataset in one step (also allowing different scores for each pair of columns). -matchit- can replicate this functionality, but in several steps: you need to run -matchit- using the two-datafiles syntax for one text column of each dataset and then (after merging the results with the original master and using files) rerun -matchit- using the two-columns syntax for as many pairs of columns as you would like to compare. Another main advantage of -reclink- is the orblock functionality, which one colleague told me is very useful. Again, this can also be performed with -matchit-, but in two steps: you first perform a -joinby- of the master and using datasets by the variable(s) identifying the subgroups, and then apply -matchit- using the two-columns syntax.
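    A minimal sketch of that two-step emulation, using hypothetical files master.dta and using.dta that share a blocking variable zipcode and hold the name variables name_m and name_u:

    Code:
    * hypothetical sketch: emulating -reclink-'s orblock with -joinby- + -matchit-
    use master.dta, clear
    joinby zipcode using using.dta       // pair only records sharing the blocking variable
    matchit name_m name_u, g(simname)    // score each candidate pair with the column syntax
    gsort -simname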

    Another advantage of -reclink- is that it can be faster than -matchit-. But this goes hand in hand with an often hidden disadvantage of -reclink-: it does not report other potential matches for those pairs of records that are exactly the same. Indeed, because -matchit- takes advantage of indexation, it will be faster than -reclink- the fewer perfect matches there are between the two datasets.

    I see the fact that -matchit- reports all potential matches as a strong advantage in its favor. First, it allows cleaning one dataset with messy string entries by simply matching it against itself, which is impossible with the current version of -reclink-. But even in the case of two datasets, this could be a strong limitation of -reclink-, as I quite often need all potential matches (those that score one, but also those that score less) for each observation.
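    A minimal sketch of that self-matching idea, with a hypothetical file messy.dta holding variables id and name, and assuming -matchit- renames clashing using-file variables by suffixing a 1:

    Code:
    * hypothetical sketch: cleaning one messy file by matching it against itself
    use messy.dta, clear
    matchit id name using messy.dta, idu(id) txtu(name) t(0.7)
    drop if id >= id1        // keep each pair once and drop trivial self-matches
    gsort -similscore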

    A second group of advantages for -matchit- relates to its greater flexibility in the similarity function, as it allows users to pick from a large variety of similarity functions beyond bigram, in particular when these are combined with the gram-weighting functionality and the different ways of computing the similarity score. Moreover, if all of these fail to satisfy users, it also allows them to include their own custom techniques by coding them in Mata. In my experience, there is no silver-bullet similarity technique for all cases; it really depends on the kind of data you're struggling with, which makes flexibility a must.
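    For illustration, a sketch of a few of these options in the column syntax; the sim() and weights() option names are from my reading of -matchit-'s help file, so check -help matchit- for the exact spelling in your installed version:

    Code:
    * sketch: different similarity functions and gram weights on the same pair of columns
    matchit name1 name2, g(sim_bigram)                // default: bigram
    matchit name1 name2, g(sim_token) sim(token)      // token (word) based
    matchit name1 name2, g(sim_sndx)  sim(soundex)    // phonetic
    matchit name1 name2, g(sim_wlog)  weights(log)    // down-weight very frequent grams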

    Third, as mentioned above, -matchit- can also be applied to columns within the same dataset. This makes it simple to compare the text similarity of two columns with all the different approaches available. Somewhat related to this, -matchit- can be applied to str245+ and strL variables, which makes it possible to use -matchit- to compare long pieces of text instead of only names (e.g. matching scientific papers and patents by the similarity of their abstracts or even full text).
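    A minimal sketch of that long-text use, assuming hypothetical strL variables abstract1 and abstract2 sitting side by side in one dataset; token similarity is usually a more natural choice than bigram for running text:

    Code:
    * sketch: scoring the similarity of two long-text (strL) columns
    matchit abstract1 abstract2, g(simabs) sim(token)
    gsort -simabs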

    I welcome any comments or corrections if someone finds this unclear or not correct enough. Cheers,

    J.

  • #2
    Hi Julio Raffo,

    First -- thanks much for matchit. It's been of real value to my work. Two quick questions, if you wouldn't mind...

    First, about using matchit to match on multiple variables.

    You said...
    Last time I checked, the main difference in favor of -reclink- over -matchit- was that it applies bigram fuzzy matching to a set of columns of each dataset in one step (also allowing different scores for each pair of columns). -matchit- can replicate this functionality, but in several steps: you need to run -matchit- using the two-datafiles syntax for one text column of each dataset and then (after merging the results with the original master and using files) rerun -matchit- using the two-columns syntax for as many pairs of columns as you would like to compare.
    I'm still a little fuzzy (no pun intended) on the details of how you would implement this. For the very simple example datasets provided below, how would you use matchit on both name and city?

    Second, is there any way of fuzzy matching both string and numeric variables at the same time? For example, with the example datasets, would it be possible to fuzzy match on name, city, and age? My only idea was to use matchit on name and city, then choose the closest match that is also within, say, two years in terms of age. Any better ideas?

    Thanks so much! Here are a couple of example datasets...

    Source A:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str13 name str14 city byte age
    "John Doe"      "Atlanta"        28
    "William Green" "Seattle"        22
    "Emily Jones"   "Denver"         48
    "Sarah Brown"   "Chicago"        60
    "Dan Kennedy"   "Washington, DC" 12
    end
    Source B:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str14 name str10 city byte age
    "Doe, Jon"       "ATL"        27
    "Will Green"     "SEA"        23
    "Jone, Emily"    "denver"     46
    "Sara Brown"     "chicago"    62
    "Daniel Kennedy" "WASHINGTON" 12
    end



    • #3

      Hi William,

      Here follows a practical example of how to use matchit recursively in order to mimic reclink's multiple-column trait. If you want to apply matchit to numeric variables, you simply need to convert them to strings first. But I see little value in such an approach, because "1999" will be considered completely different from "2000", while "2001" will be considered relatively similar. Instead, you can use more creative numeric approaches; I put one example below.



      Code:
      clear
      input str13 name1 str14 city1 byte age1
      "John Doe"      "Atlanta"        28
      "William Green" "Seattle"        22
      "Emily Jones"   "Denver"         48
      "Sarah Brown"   "Chicago"        60
      "Dan Kennedy"   "Washington, DC" 12
      end
      gen id1=_n
      save file1.dta, replace
      
      clear
      input str14 name2 str10 city2 byte age2
      "Doe, Jon"       "ATL"        27
      "Will Green"     "SEA"        23
      "Jone, Emily"    "denver"     46
      "Sara Brown"     "chicago"    62
      "Daniel Kennedy" "WASHINGTON" 12
      end
      gen id2=_n
      save file2.dta, replace
      
      // 1- matchit first by the most relevant pair of columns
      matchit id2 name2 using file1.dta, idu(id1) txtu(name1) t(0)
      gsort -similscore
      list
      // 2- bring back the other columns
      joinby id1 using file1.dta
      joinby id2 using file2.dta
      
      // 3- match it using the column syntax for as many columns as you want
      matchit city1 city2 , g(simcity)
      // clean the data (e.g. same case or removing symbols)
      gen CITY1=upper(city1)
      gen CITY2=upper(city2)
      matchit CITY1 CITY2 , g(simCITY)
      
      // 4- you can also use non-fuzzy string approaches here (e.g. numeric variables)
      gen simage=1-abs(age1-age2)/max(age1,age2)
      
      // 5- Generate the uber similarity score of your taste/needs
      gen superscore1=similscore*simCITY*simage
      gen superscore2=(similscore+simCITY+simage)/3
      gen superscore3=.5*similscore+.25*simCITY+.25*simage
      gsort -superscore3



      • #4
        I would just add that to do approximate matching on a numeric variable such as age, William should look at -rangejoin-, by Robert Picard, available from SSC.
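        For instance, a minimal sketch using -rangejoin- on the example files created in #3 (low and high are hypothetical helper variables):

        Code:
        * sketch: pair each file1 record with all file2 records within +/- 2 years of age
        use file1.dta, clear
        gen low  = age1 - 2
        gen high = age1 + 2
        rangejoin age2 low high using file2.dta   // age2 is the key variable in the using file
        matchit name1 name2, g(simname)           // then fuzzy-score names within each pair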



        • #5
          Julio Raffo thanks so much, that was exactly what I was looking for. Clyde Schechter -- great, I'll check out rangejoin as well.



          • #6
            Yes, if the most informative pair of variables is numeric, you could use rangejoin as step one and then follow with matchit in the column syntax. But take a moment to think about which pair is best to start with, because things can easily get out of hand. In the above example, starting with name seems more appropriate than city or age. For instance, if you are working with census data, you may have too many observations in the same city or the same age range that you do not need to compare, so you will spare a lot of computation time by starting with names. But it really depends on the data you're attempting to match.



            • #7
              Yes, in my data (survey data on similarly aged school children), there's huge overlap in age, but much less in names. I also have gender and would like to restrict matches to those with the same gender (I would use "required" with reclink). Any thoughts on the most efficient way to do that? You said...
              Another main advantage of -reclink- is the orblock functionality, which one colleague told me is very useful. Again, this can also be performed with -matchit-, but in two steps: you first perform a -joinby- of the master and using datasets by the variable(s) identifying the subgroups, and then apply -matchit- using the two-columns syntax.
              I'm thinking this would have the same issue you just mentioned above -- there would obviously be MANY overlapping observations if I started with a joinby on gender.



              • #8
                I think that matchit should rather be compared with reclink2 (SJ-15-3), which introduces two new options, manytoone and npairs(). My concern is that reclink2 is slow. So, if -matchit- is slower than the older reclink, then there is a problem. Approximately how fast or slow are the programs with larger datasets, in your experience? For example, what is the performance with 1,000, 100,000, or 1 million records in both datasets?



                • #9
                  William Blackmon : yes, the principle is the same. You could first do a joinby using the orblock variable and then apply matchit in the column syntax. But this is only computationally efficient if that variable does not inflate the number of unnecessary comparisons. Gender definitely is not a great one, but postal codes may be a good one. This being said, you can always split the files by gender and run matchit on each part separately, as sketched below. If gender is correctly coded in your files, this will greatly reduce the unnecessary comparisons.
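                  A minimal sketch of that split-and-match approach, with hypothetical file and variable names (baseline.dta and endline.dta, a 0/1 gender variable, ids id_bl and id_el, and a name variable in each file):

                  Code:
                  * sketch: run matchit within each gender separately to avoid cross-gender comparisons
                  forvalues g = 0/1 {
                      use baseline.dta, clear
                      keep if gender == `g'
                      save bl_`g'.dta, replace
                      use endline.dta, clear
                      keep if gender == `g'
                      matchit id_el name_el using bl_`g'.dta, idu(id_bl) txtu(name_bl)
                      save matched_`g'.dta, replace
                  }
                  use matched_0.dta, clear
                  append using matched_1.dta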



                  • #10
                    Anders Alexandersson : in my experience reclink seems faster, but it is actually slower than matchit; the appearance of speed comes from reclink removing the perfect matches before doing the fuzzy match. The performance of matchit varies with the search space created by the two files and with how efficiently the indexation avoids looking everywhere in that search space. The latter can be tweaked by choosing the most appropriate similarity algorithm and by using the automatic stopword functionality. You can find some examples here: http://www.statalist.org/forums/foru...-e-fuzzy-match
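                    For example, a sketch of the kind of tuning I have in mind; stopwordsauto is the option name as I recall it from -matchit-'s help file, so double-check it in your installed version:

                    Code:
                    * sketch: speed up a large match by switching the similarity function and
                    * letting matchit drop very frequent grams automatically
                    matchit id name using big.dta, idu(id2) txtu(name2) sim(token) stopwordsauto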

                    I have tried reclink2 once, but I couldn't make it behave differently from reclink regarding the perfect matches. But I probably didn't try hard enough. Any example on this that you could share with me?



                    • #11
                      Julio Raffo: You wrote,
                      I have tried reclink2 once, but I couldn't make it behave differently from reclink regarding the perfect matches. But I probably didn't try hard enough. Any example on this that you could share with me?
                      That is what the manytoone option is for. The help file has an example, but the example in the SJ article on pages 690-691 is much better. In the SJ example, only by adding the manytoone option do you find the potential matches:

                      Now, we see that rid #6 and rid #7 are matched with the correct record from the firm dataset.



                      • #12
                        I remember having the same issue described here: http://www.statalist.org/forums/foru...-with-reclink2, but I'll try the SJ example tomorrow when I'm at my computer.



                        • #13
                          Also check out -strgroup- (SSC), which does fuzzy matching via Levenshtein distance and lets you set a threshold to 'tune' how many typos the matches tolerate. FWIW, this saved my bacon on a project!
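                          A minimal sketch of -strgroup- with hypothetical file and variable names; gen() and threshold() are the options as I recall them, so check -help strgroup- for details:

                          Code:
                          * sketch: group names whose normalized Levenshtein distance is within the threshold
                          use names.dta, clear
                          strgroup name, gen(grp) threshold(0.25)
                          sort grp
                          list name grp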
                          ____________________________________________________
                          Assistant Professor, Department of Biostatistics and Epidemiology
                          School of Public Health and Health Sciences
                          University of Massachusetts- Amherst



                          • #14
                            Okay, one more question -- and it may be that this is too much of an ask -- if so, please disregard!

                            I'm still struggling with the next step after actually implementing matchit -- that is, selecting the best matches. I am matching across two datasets: baseline and endline. Not sure if this really matters, but conceptually I'm trying to find a match for every endline observation from the baseline dataset (and not the other way around).

                            I'd like to, for every endline observation, take the best match from the baseline set. However, this will lead to duplicates in terms of baseline data, as some baseline observations will be the best match for multiple endline observations. So whenever this happens, I'd like to keep the best match, but "unmatch" the rest of the duplicate baseline matches. For all the endline observations that get unmatched, I'd like to find the 2nd best match (that isn't already a first best match), and use these. Again, there will be duplicates on baseline, so I'd continue this process until all endline observations have a match, and the final dataset has no duplicates in terms of endline and baseline observations. Quite confusing! Maybe there's an easier way to think about it?

                            Anyway, here's an attempt at code that follows that thought process -- but I think I still have many errors. No need to provide detailed debugging, but any advice on how to approach the problem is much appreciated.

                            Code:
                            // 7- Select the best matches and remove the rest.
                            //      a. For every endline observation, select the baseline observation with the best matchscore.
                                        egen maxscore = max(matchscore_overall), by(id_individual_el) // max score among baseline obs matched to each endline ob
                                        gen id_individual_bl_matched = .
                                        replace id_individual_bl_matched = id_individual_bl if matchscore_overall == maxscore & matchscore_overall >= $minscore
                                        drop maxscore
                            //      b. It is possible that the same baseline observation is the best match for multiple endline observations.
                            //           In this case, keep the baseline observation matched to the endline observation with the highest match score, and "unmatch" the rest.
                                        egen maxscore = max(matchscore_overall), by(id_individual_bl_matched) // max score among endline obs claiming the same baseline ob
                                        replace id_individual_bl_matched = . if matchscore_overall != maxscore // "unmatching" the rest
                                        gen matchscore_final = .
                                        replace matchscore_final = matchscore_overall if id_individual_bl_matched != . // final matchscore for finally matched obs
                                        replace matchscore_overall = -1 if matchscore_final != . // reset overall matchscore for finally matched obs to -1
                                        drop maxscore
                            //      c. For those endline observations now without a baseline match, assign the second-highest baseline match.
                            //           Note: the conditions belong in -if- qualifiers; a command-level -if- only tests the first observation.
                            //           (This does not yet stop a baseline ob that is already a final match from being claimed again.)
                                        egen maxscore = max(matchscore_overall) if id_individual_bl_matched == ., by(id_individual_el) // next-highest score per endline ob
                                        replace id_individual_bl_matched = id_individual_bl if id_individual_bl_matched == . & matchscore_overall == maxscore & matchscore_overall >= $minscore
                                        drop maxscore
                            //      d. Repeat steps b-c until all endline observations have a unique baseline match.
                            //           (This loops forever if some endline obs never reach $minscore, so drop or flag those first.)
                                        local incomplete = 1
                                        while `incomplete' > 0 {
                                            repeatbc
                                            quietly count if matchscore_final == .
                                            local incomplete = r(N)
                                        }
                            
                            save "$dta\intermediate\matched_`1'", replace
                            
                            end
                            
                            * Define a program that repeats steps b-c (it must be defined before the code above runs)
                            capture program drop repeatbc
                            program define repeatbc
                            //      b. Keep each duplicated baseline ob only for its highest-scoring endline match, and "unmatch" the rest.
                                        egen maxscore = max(matchscore_overall), by(id_individual_bl_matched)
                                        replace id_individual_bl_matched = . if matchscore_overall != maxscore
                                        replace matchscore_final = matchscore_overall if id_individual_bl_matched != . & matchscore_final == . // do not overwrite earlier finals
                                        replace matchscore_overall = -1 if matchscore_final != .
                                        drop maxscore
                            //      c. For endline observations still without a baseline match, assign the next-best remaining candidate.
                                        egen maxscore = max(matchscore_overall) if id_individual_bl_matched == ., by(id_individual_el)
                                        replace id_individual_bl_matched = id_individual_bl if id_individual_bl_matched == . & matchscore_overall == maxscore & matchscore_overall >= $minscore
                                        drop maxscore
                            end



                            • #15
                              Anders Alexandersson : The manytoone option is then not doing what I thought, which is probably better described as many-to-many. Both reclink and reclink2 stop looking for alternative candidates once they find a perfect match; matchit keeps looking for them. That is why it is hard to compare performance: they do not deliver the same kind of output. I personally prefer to get everything, as you never know what will slip through the cracks. But I can see that other people may prefer such an option.

                              See the following example using the SJ data.

                              Code:
                              . use respondent_employers, clear
                              . cap stnd_compname firm_name, gen(stn_name stn_dbaname stn_fkaname entitytype attn_name)
                              . save respondent_employers.dta, replace
                              file respondent_employers.dta saved
                              
                              . use firm_dataset, clear
                              . cap stnd_compname firm_name, gen(stn_name stn_dbaname stn_fkaname entitytype attn_name)
                              . save firm_dataset.dta, replace
                              file firm_dataset.dta saved
                              
                              . use respondent_employers, clear
                              . qui reclink2 stn_name using firm_dataset, idm(rid) idu(firm_id) gen(rlsc)
                              . list rid stn_name firm_id Ustn_name rlsc , table str(15) sep(0)
                              
                                   +------------------------------------------------------------+
                                   | rid          stn_name   firm_id         Ustn_name     rlsc |
                                   |------------------------------------------------------------|
                                1. |   1              7 11         1              7 11   1.0000 |
                                2. |   2            BT & T        14            BB & T   0.7712 |
                                3. |   3            AT & T         2            AT & T   1.0000 |
                                4. |   4            KROGER         7            KROGER   1.0000 |
                                5. |   5   WAL MART STORES         8   WAL MART STORES   1.0000 |
                                6. |   6           WLAMART         .                          . |
                                7. |   7           WALMART         .                          . |
                                   +------------------------------------------------------------+
                              
                              . use respondent_employers, clear
                              . qui reclink2 stn_name using firm_dataset, idm(rid) idu(firm_id) gen(rlsc) many
                              . list rid stn_name firm_id Ustn_name rlsc , table str(15) sep(0)
                              
                                   +------------------------------------------------------------+
                                   | rid          stn_name   firm_id         Ustn_name     rlsc |
                                   |------------------------------------------------------------|
                                1. |   1              7 11         1              7 11   1.0000 |
                                2. |   2            BT & T         2            AT & T   0.9471 |
                                3. |   3            AT & T         2            AT & T   1.0000 |
                                4. |   4            KROGER         7            KROGER   1.0000 |
                                5. |   5   WAL MART STORES         8   WAL MART STORES   1.0000 |
                                6. |   6           WLAMART         .                          . |
                                7. |   7           WALMART         8   WAL MART STORES   0.7901 |
                                   +------------------------------------------------------------+
                              
                              . use respondent_employers, clear
                              . qui matchit rid stn_name using firm_dataset.dta, idu(firm_id) txtu(stn_name)
                              . gsort rid -similscore
                              . list, table str(15) sep(0)
                              
                                   +---------------------------------------------------------------+
                                   | rid          stn_name   firm_id         stn_name1   similsc~e |
                                   |---------------------------------------------------------------|
                                1. |   1              7 11         1              7 11           1 |
                                2. |   2            BT & T         2            AT & T          .8 |
                                3. |   2            BT & T        14            BB & T          .6 |
                                4. |   3            AT & T         2            AT & T           1 |
                                5. |   3            AT & T        14            BB & T          .6 |
                                6. |   4            KROGER         7            KROGER           1 |
                                7. |   5   WAL MART STORES         8   WAL MART STORES           1 |
                                8. |   6           WLAMART         9             KMART   .61237244 |
                                9. |   7           WALMART         9             KMART   .61237244 |
                               10. |   7           WALMART         8   WAL MART STORES   .54554473 |
                                   +---------------------------------------------------------------+

