
  • #16
    William Blackmon : I'm confused by the setup of your problem. How likely is it that observations in the first file are repeated in the second file (and vice versa)? Are all observations in the first file expected to have one (or many) matches in the second file?

    In any case, you could use duplicates to help you eliminate cases. For instance,

    Code:
    gsort id1 -superscore3
    duplicates drop id1, force
    will keep only the best score for each id1 in my previous example posted in #3. Beware that you may miss cases with identical superscore3 within the same id1.



    • #17
      Sorry for the confusion. There should not be any actual duplicates across the files. I'm talking about the duplicates generated by matchit itself. The issue I'm struggling with is automating the process of choosing the best match (according to matchit similarity score) in a way that doesn't create duplicates in terms of either dataset. For example:

      Dataset 1:       Dataset 2:
      John Doe         John Doe
      Jon Dow          Jonn Doe
      Mark Williams    Mark Williams

      Clearly, John Doe and Mark Williams will match, no problem. However, it is possible that Jonn Doe will match with John Doe rather than Jon Dow as it should. Because of cases like this, I need the program to choose only the top match in dataset 2 for any obs in dataset 1, and then choose the 2nd-best match (from the pool of unmatched dataset 2 obs) for the remaining unmatched obs in dataset 1. In fact, this process would need to repeat until each obs has a unique match.
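      The iterative procedure described above amounts to a greedy one-to-one assignment: take the highest-scoring candidate pair whose two ids are both still unmatched, repeat. A minimal sketch in Python (not Stata; the names and scores below are made up for illustration):

```python
def greedy_one_to_one(pairs):
    """pairs: list of (id1, id2, score) candidate matches.
    Returns a one-to-one matching, best scores first."""
    matched1, matched2, result = set(), set(), []
    # Highest similarity first, so the best available pair always wins.
    for id1, id2, score in sorted(pairs, key=lambda p: -p[2]):
        if id1 not in matched1 and id2 not in matched2:
            result.append((id1, id2, score))
            matched1.add(id1)
            matched2.add(id2)
    return result

pairs = [
    ("John Doe", "John Doe", 1.00),
    ("Jon Dow", "John Doe", 0.80),   # loses: "John Doe" is already taken
    ("Jon Dow", "Jonn Doe", 0.70),
    ("Mark Williams", "Mark Williams", 1.00),
]
matches = greedy_one_to_one(pairs)
```

      Here "Jon Dow" ends up paired with "Jonn Doe" because "John Doe" was claimed by its exact match first, which is the behavior requested above.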



      • #18
        Julio Raffo Yes, the blocking in reclink2 (and in reclink) relies on blocking variables without errors, which is the standard. Therefore, you typically run sequential phases using different combinations of blocking variables. Two main references are Christen (2012), and Harron, Goldstein, and Dibben (2015). This standard blocking logic originates from AutoMatch which was the dominant record linkage software up until maybe 10 years ago.

        Your original description of matchit as using joinby for the blocking is clearer than your more recent reference to many-to-many matching, I guess. Stata's joinby is better known outside the Stata community as an SQL outer join. Like your matchit program, R's record linkage package RecordLinkage by Sariyar and Borg (2010) also uses this "joinby" logic for blocking. In my testing, calling R's RecordLinkage from Stata was faster than using reclink2, because the R package uses C code.
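        For intuition, the blocking-as-join idea can be sketched as follows. This is a hypothetical Python illustration, not how matchit or RecordLinkage are actually coded: records are indexed by a blocking key, and candidate pairs are formed only within blocks sharing that key, exactly as a join on the key would do.

```python
from collections import defaultdict

def block_pairs(records_a, records_b, key):
    """Form candidate pairs only within blocks that share the same
    blocking key, instead of comparing every record with every other."""
    index = defaultdict(list)
    for r in records_b:
        index[key(r)].append(r)
    return [(a, b) for a in records_a for b in index[key(a)]]

a = ["John Doe", "Mark Williams"]
b = ["Jonn Doe", "Mark Williams", "Mary Wilson"]
# Toy blocking key: first letter of the name (illustrative assumption only).
candidates = block_pairs(a, b, key=lambda s: s[0])
```

        With this toy key the blocking yields 3 candidate pairs instead of the 6 a full cross-product would produce; real blocking keys (phonetic codes, tokens, etc.) scale this saving up dramatically.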

        reclink2 is based on the well-known theory of Fellegi and Sunter (1969), which, in practice, requires a clerical review of potential matches. Is matchit instead based on unsupervised machine learning? In that case, my concern with matchit is not overall speed (that should be its main advantage!) but rather how we can trust the results. Sariyar and Borg (2010, 65) note that, for unsupervised machine learning, "the quality of the resulting classification varies significantly for different datasets and poor results can occur".

        References:
        Christen, Peter. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
        Fellegi, Ivan, and Alan Sunter. 1969. A Theory of Record Linkage. Journal of the American Statistical Association 64 (328), 1183-1210.
        Harron, Katie, Harvey Goldstein, and Chris Dibben (editors). 2015. Methodological Developments in Data Linkage. Wiley.
        Sariyar, Murat, and Andreas Borg. 2010. The RecordLinkage Package: Detecting Errors in Data. The R Journal 2(2), 61-67.



        • #19
          William Blackmon (#17): Then the simple approach suggested in #16 should help you with your step a. I guess step b will imply applying it recursively. For instance, something like this may help you:

          Code:
          * flag whether either id still has duplicate candidate matches
          duplicates report id1
          scalar dup=cond(`r(unique_value)'!=_N,1,0)
          duplicates report id2
          scalar dup=cond(`r(unique_value)'!=_N,1,dup)
          while dup==1 {
            * rank candidates by score within each id (field: 1 = best)
            bysort id1: egen rank1=rank(superscore3), field
            bysort id2: egen rank2=rank(superscore3), field
            * keep pairs that are the best match for both ids
            gen bestmatch=cond(rank1==1 & rank2==1,1,0)
            bysort id1: egen done1=max(bestmatch)
            bysort id2: egen done2=max(bestmatch)
            * drop losing candidates for any id that is already matched
            drop if (done1==1 | done2==1) & bestmatch==0
            * recompute the duplicates flag for the next iteration
            duplicates report id1
            scalar dup=cond(`r(unique_value)'!=_N,1,0)
            duplicates report id2
            scalar dup=cond(`r(unique_value)'!=_N,1,dup)
            drop rank1 rank2 done1 done2 bestmatch
          }
          Please check it thoroughly because I haven't tested it beyond adding the following entries to the #3 example.

          Code:
          // file1
          "Daniel Kennedy"   "Seattle, Washington" 12
          "Daniel Kennedy"   "Washington" 20
           
          // file2
          "Damiel Kennedy"   "seattle" 12
          "Damiel Kennedy"   "WASHINGTON, DC" 20



          • #20
            @Anders Alexandersson (#18): thanks.

            As I see it, matchit is just a tool (or platform, if you prefer) to produce fuzzy string similarity across variables in the same dataset or observations of different datasets. It attempts to provide a wide range of well-established algorithms while benefiting from its indexation capabilities. In addition, it allows users to easily add new algorithms by coding them in a relatively simple fashion as Mata functions.

            Of course, as such, it can be used as part of a larger deterministic, probabilistic, or even machine-learning record-matching strategy. But matchit will always be subject to the assumptions the users impose on any of these. Please note that users can always combine matchit with a clerical review. For instance, the example mentioned in #3 can be combined with clrevmatch simply by running:
            Code:
            save matched12.dta
            clrevmatch using matched12.dta, ///
            idmaster(id1) idusing(id2) varM(name1 city1 age1) varU(name2 city2 age2) ///
            clrev_result(myscore) clrev_note(mynotes) newfilename(matched_clrev.dta)
            But again, this is a matter of taste (or methodology) for each user to decide.



            • #21
              Julio Raffo : Thank you. That gave me a better understanding of matchit.

              A string matching method I would like to see implemented in Stata is Double-Metaphone. "In general, Double-Metaphone seems to be generating encodings that are closer to correct pronunciation of names than NYSIIS." --Christen in Data Matching: Concepts and Techniques for Record Linkage (2012, 78).

              The method is implemented in the R package PGRdup and is being implemented in the R package phonics. You can call R from Stata, so a Stata implementation is not critical, but it would be nice to have.



              • #22
                @Anders Alexandersson: you probably know it already, but I think that wbuchanan has coded Double-Metaphone. But it might well be just the simple Metaphone.



                • #23
                  Anders Alexandersson Julio Raffo is correct. The source is available at https://github.com/wbuchanan/StataStringUtilities and can be installed using:

                  Code:
                  net inst http://wbuchanan.github.io/StataStringUtilities
                  help phoneticenc
                  That should install it and open the help file for the phonetic encodings that are supported. The package is based on the Apache Commons Codec library and has some string distance methods included as well.



                  • #24
                    @Julio Raffo -- Thanks, your code works well!



                    • #25
                      Julio Raffo wbuchanan : Thanks, I was not aware of it. Stata did not find it under search metaphone. I can access the files at https://github.com/wbuchanan/StataStringUtilities using a browser. I get this error message using updated Stata/MP 14.2 for Windows (64-bit x86-64):

                      Code:
                      . net inst http://wbuchanan.github.io/StataStringUtilities
                      file http://wbuchanan.github.io/stata.toc not found
                      http://wbuchanan.github.io/ either
                        1)  is not a valid URL, or
                        2)  could not be contacted, or
                        3)  is not a Stata download site (has no stata.toc file).
                      r(601);
                      I will try again later and on another PC to test if this is a temporary problem.



                      • #26
                        Anders Alexandersson that's pretty odd/bizarre. Maybe you're connecting from a proxy that doesn't want to allow Stata to communicate with that domain? When I check to see if the toc is present this is what I get:

                        Code:
                        . net from http://wbuchanan.github.io/StataStringUtilities
                        ---------------------------------------------------------------------------------------------------------------------------
                        http://wbuchanan.github.io/StataStringUtilities/
                        Stata String Utilities
                        ---------------------------------------------------------------------------------------------------------------------------
                        
                        PACKAGES you could -net describe-:
                            strutil           A Java-plugin based package for string distance/similarity metrics and phonetic string encodings.
                        ---------------------------------------------------------------------------------------------------------------------------
                        You could try using https, but you would need to restart your Stata in order for the Java binary to be found on the classpath.



                        • #27
                          wbuchanan : I can reproduce the results in #26.

                          The syntax for net install is
                          Code:
                          net install pkgname [, all replace force from(directory_or_url)]
                          Therefore, I am surprised your code in #23 works for you, since you provided a URL instead of a package name. I do not use a proxy, as checked in Stata under Edit -> Preferences -> General preferences -> Internet.

                          I was able to install the package strutil using this code:
                          Code:
                          net install strutil, from(http://wbuchanan.github.io/StataStringUtilities)




                          • #28
                            Earlier post was a mistake on my part but glad to hear you were able to get things installed.



                            • #29
                              Hi all, thank you very much for your previous explanations. I have a question in line with the above discussion of multiple variables. I am trying to merge two panel datasets, and I am wondering how I could best use either reclink or matchit to merge them.
                              I need to merge the data on company name and year. The two datasets differ in the name variable (e.g. "ABN AMRO holding" vs. "Abn Amro holding NV"), so for this variable I need a fuzzy match. The year variable, on the other hand, should be matched exactly (1:1).
                              The first dataset furthermore contains variables on assurance services, and the second one contains financials.
                              A final important remark is that the two datasets do not contain the same number of observations. It could be that a company is found in the first dataset but not in the second.
                              Here’s an example of what the data looks like:
                              Source A
                              name           year
                              AUDI AG        2013
                              AUDI AG        2015
                              AUDI AG        2017
                              AVINOR         2014
                              AVINOR         2015
                              AVINOR         2016
                              AVINOR         2017
                              AWISTA GmbH    2015
                              AWISTA GmbH    2016
                              AXA            2012
                              AXA            2013
                              AXA            2014
                              AXA            2015
                              AXA            2016
                              AXA            2017

                              Source B
                              Name                       year
                              AUDI AG                    2012
                              AUDI AG                    2013
                              AUDI AG                    2014
                              AUDI AG                    2015
                              AUDI AG                    2016
                              AUDI AG                    2017
                              AUDIKA GROUPE              2012
                              AUDIKA GROUPE              2013
                              AUDIKA GROUPE              2014
                              AUDINATE PTY LTD           2014
                              AUDINATE PTY LTD           2015
                              AUDINATE PTY LTD           2016
                              AUDINATE PTY LTD           2017
                              AUDIO PIXELS HOLDINGS LTD  2012
                              AUDIO PIXELS HOLDINGS LTD  2013



                              • #30
                                Fleur Ponne , if I were you I'd:

                                1) keep the unique names in the two sources
                                2) match these using matchit
                                3) drop those not meeting the desired similarity threshold
                                4) merge with the original two datasets
                                5) drop all observations where yearA != yearB
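
                                The five steps might be sketched like this (a Python illustration only, using the standard library's difflib as a stand-in for matchit's similarity score; the sample data and the 0.8 threshold are made-up assumptions):

```python
from difflib import SequenceMatcher

# Toy stand-ins for the two panel sources: (company name, year) rows.
src_a = [("ABN AMRO holding", 2015), ("AUDI AG", 2013)]
src_b = [("Abn Amro holding NV", 2015), ("AUDIKA GROUPE", 2013)]

def similarity(x, y):
    # Case-insensitive similarity in [0, 1]; matchit would supply this score.
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# 1-3) keep unique names, score all name pairs, keep those above a threshold
names_a = {n for n, _ in src_a}
names_b = {n for n, _ in src_b}
links = {a: b for a in names_a for b in names_b if similarity(a, b) > 0.8}

# 4-5) merge back to the original rows and keep only rows where years agree
merged = [(a, links[a], ya)
          for a, ya in src_a if a in links
          for b, yb in src_b if b == links[a] and ya == yb]
```

                                In this toy run only "ABN AMRO holding" / "Abn Amro holding NV" clears the threshold, and the exact-year condition in the last step plays the role of step 5.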
