
  • Jaro-Winkler matching

    Hello everybody,
    I would like to know whether it is possible in Stata to match pairs of records using the Jaro-Winkler distance. I have found the command jarowinkler, which calculates the string edit distance, but it appears to require that the record pairs already be matched beforehand. Am I wrong?
    Thank you in advance.

  • #2
    What about creating a file of paired records, and then applying the -jarowinkler- command [from SSC]? If you are working with a large data file, -cross- as used below would yield an excessively large file, but there are ways to handle that problem. To start with, though, would the following be something you could work with?

    Code:
    clear
    // Example data
    input str10 name
    "Smith"
    "Miller"
    "White"
    "Smythe"
    "Mueller"
    "Weiss"
    end
    //
    // Make a file of pairs
    tempfile temp
    save `temp'
    rename name name2
    cross using `temp'
    // Closeness of all possible pairs.
    jarowinkler name name2, gen(closeness)
    sort name closeness
    by name: keep if _n > (_N-2)  // perfect and next closest match
    list



    • #3
      Thank you very much Mike, this is really useful information. But I am working with two large datasets (approximately 350,000 and 55,000 records). What is the best way to handle this?



      • #4
        I would break the big file into K different smaller parts (perhaps as many as K = 100), and then proceed as in #2 to create K files showing the closeness of each record in your small file to each part of the large file. These K files can then be combined along the way, and you can pick out the matches you want. Exactly what to do with this file, how many close matches to keep (per above), and so forth, would likely depend on the substance of your situation and what you want to do with these matches.

        Here's a quick attempt to *illustrate* something in that direction. I don't have time to really check out the following right now, so you (and others) should examine it and see if it's on target and without errors.

        Code:
        clear
        cd "SomeWorkingDataFolder"
        // Simulate two data files.  Better test data than this would be nice.
        // Two-character strings chosen from a small set.
        local charset = "a b c d e"
        local nchar = wordcount("`charset'")
        set seed `nchar'739
        clear
        set obs 100
        gen name = word(c(alpha), ceil(runiform() * `nchar')) + ///
                   word(c(alpha), ceil(runiform() * `nchar'))
        tempfile big
        save `big'
        //
        clear
        set obs 10
        gen name = word(c(alpha), ceil(runiform() * `nchar')) + ///
                   word(c(alpha), ceil(runiform() * `nchar'))
        tempfile small
        save `small'
        // end simulate data
        //
        // Real stuff starts here.
        // Break the big file into parts.
        local npart = `nchar' // you would probably need more parts
        describe using `big'
        local last = r(N)
        local size = ceil(`last'/`npart')
        local start = 1
        use `big', clear
        forval i = 1/`npart' {
            local stop = min(`start' + `size' -1, `last')
            preserve
            keep in `start'/`stop'
            save part`i', replace
            restore
            local start = `stop' + 1
        }
        //  Match the small file to each part of the big file, and accumulate
        //  a file of pairs/closeness along the way.
        clear
        save allparts.dta, emptyok replace
        forval i = 1/`npart' {
            use `small', clear
            rename name name2    // avoid clashing with the big-file variable name
            cross using part`i'  // pair small-file records with this part of the big file
            // Closeness of all possible pairs.
            jarowinkler name name2, gen(closeness)
            sort name closeness
            by name: keep if _n > (_N-2)  // perfect and next closest match
            //
            // Build up file of matches
            append using allparts.dta
            save allparts, replace
            clear
        }
        use allparts.dta

        Comment


        • #5
          I just noticed some misleading, if not harmful, typos I made. In my code block at #4, the statement
          Code:
          local npart = `nchar' // you would probably need more parts
          should be
          Code:
          local npart = 5 // you would probably need more parts
          as an illustration of some chosen number of parts. (A global substitution for "5" bit me. There's no reason to make the number of parts equal to the size of the character set used for creating the example data.) A similar red herring occurred at
          Code:
          set seed `nchar'739
          Sorry for any confusion.



          • #6
            Thank you again Mike. I have tried to split the big database into 5,000 parts, but when Stata saved part 4,963 I got the following error:
            observation numbers out of range
            r(198);
            What can I do to solve it? Would it be possible to divide the "small" file (mine has more than 50,000 records) into several parts, or would the process become too complex?

            Thanks for all your help.



            • #7
              This puzzles me. I just simulated a "big" file with 355,000 observations on my own machine, and ran the relevant part of the code above to divide it into 5,000 parts. In about 5 minutes, the 5,000 part files were produced without any error. First, as a quick experiment, can you get this to work if you try to break the big file into just 10 parts? Presuming that does work, let's try to diagnose the problem with 5,000 parts by inserting a line at the relevant point to let us know what is happening:
              Code:
              local npart = 5000
              .... snip snip ...
              ...
              forval i = 1/`npart' {
                 local stop = min(`start' + `size' -1, `last')
                 preserve
                 di "working on part `i', start = `start', stop = `stop'." _newline
                 keep in `start'/`stop'
                 save part`i', replace
                 restore
                 local start = `stop' + 1
              }
              And yes, we could break up *both* files into parts, but it would be nice to avoid that. There likely are other ways to do this, but what I'm suggesting here should, in principle, be simple and fast enough, even if not very elegant or efficient.

              After running the preceding, please post back the exact code you used and the results you get.



              • #8
                Hello,

                Firstly, thank you for all your help.

                To be more precise, my datasets are:

                "Big dataset":
                Code:
                use "HDSS_fullnames(NONDUPLICATES)", clear
                count // 333,157

                "Small dataset":
                Code:
                use "ePTS_fullnames(NONDUPLICATES).dta", clear
                count // 64,225

                I tried to split the big dataset into 10 parts and it worked, but when I tried to do it in 1,000 parts it didn't work. I am attaching the code and the error:

                The code:

                Code:
                local npart = 1000
                describe using "HDSS_fullnames(NONDUPLICATES)"
                local last = r(N)
                local size = ceil(`last'/`npart')
                local start = 1
                use "HDSS_fullnames(NONDUPLICATES)", clear
                forval i = 1/`npart' {
                    local stop = min(`start' + `size' -1, `last')
                    preserve
                    keep in `start'/`stop'
                    save part`i', replace
                    restore
                    local start = `stop' + 1
                }

                The error:

                (337,324 observations deleted)
                file part999.dta saved
                observation numbers out of range
                r(198);



                • #9
                  Using a test file with _N = 333,157, and crucially including the -display- line from the code I suggested at #7, I saw:
                  Code:
                   working on part 999, start = 333158, stop = 333157.
                   observation numbers out of range
                   r(198);
                  This revealed that I had made the common mistake of not handling the last part of the file correctly. The simple solution is to insert an if/else in the code and break out of the loop if the starting point goes past the last observation of the original big file.

                  I also realized that using the [in] range option on the -use- command was about 10X faster than preserve/restore as I did before, so, with those two changes, give this a try:

                  Code:
                  describe using "HDSS_fullnames(NONDUPLICATES)"
                  local npart = 1000 
                  local last = r(N)
                  local size = ceil(`last'/`npart')
                  local start = 1
                  forval i = 1/`npart' {
                      local stop = min(`start' + `size' -1, `last')
                      if (`start' <= `last') {
                         use "HDSS_fullnames(NONDUPLICATES)" in `start'/`stop', clear
                         save part`i', replace
                         local start = `stop' + 1
                      }    
                      else {  // past last observation of file so quit
                         continue, break
                      }
                  }
                  Presuming this works, you'll want to pick a value for -npart- that yields part files which, when crossed with your _N = 64,225 file, yield something small enough to fit into your Stata's memory, since each crossed file will have (64,225 * size) observations. You'll want to strip your two files down to the minimum before crossing, i.e., just the string variables of interest, with -compress- applied to make them as small as possible.
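
                  A minimal sketch of that stripping-down step; the variable and output file names here are just for illustration, so adjust them to your files:

                  Code:
                  * Keep only the string variable to be matched, shrink
                  * storage types, and save a lean copy for the crossing step.
                  use "HDSS_fullnames(NONDUPLICATES)", clear
                  keep fullname_hdss
                  compress
                  save "HDSS_fullnames_lean", replace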



                  • #10
                    Thanks Mike, all of this is being very useful for my work.
                    I tried to divide the big file into 20,000 parts. No error occurred, but only 19,583 parts were saved. I don't think this is a problem, is it?
                    But now I have a new problem with the second part of the code... a syntax error occurs, and I can't figure out what is happening. I copy below the exact code I am using:

                    The code:

                    Code:
                    describe using "HDSS_fullnames(NONDUPLICATES)_JW"
                    local npart = 10
                    local last = r(N)
                    local size = ceil(`last'/`npart')
                    local start = 1
                    forval i = 1/`npart' {
                        local stop = min(`start' + `size' -1, `last')
                        if (`start' <= `last') {
                            use "HDSS_fullnames(NONDUPLICATES)_JW" in `start'/`stop', clear
                            save part`i', replace
                            local start = `stop' + 1
                        }
                        else { // past last observation of file so quit
                            continue, break
                        }
                    }
                    // Match the small file to each part of the big file, and accumulate
                    // a file of pairs/closeness along the way.
                    save allparts.dta, emptyok replace
                    forval i = 1/`npart' {
                        use "ePTS_fullnames(NONDUPLICATES)_JW"
                        tempfile temp
                        save `temp', replace
                        cross using `temp'
                        // Closeness of all possible pairs.
                        jarowinkler fullname_hdss fullname_epts, gen(closeness)
                        sort name closeness
                        by name: keep if _n > (_N-2) // perfect and next closest match
                        //
                        // Build up file of matches
                        append using allparts.dta
                        save allparts, replace
                        clear
                    }


                    The error:

                    invalid syntax
                    r(198);

                    Thanks again



                    • #11
                      Hi Anna,

                      Several points to consider here:

                      1) "but only 19583 parts were saved"
                      a) The use of
                      -size = ceil(`last'/`npart')-
                      is just a way to get approximately the right number of parts, so getting something different from 20,000 is not necessarily a problem. *However,* if it is *really* true that your big file has 333,157 observations, I don't see how this code could end up with 19,583 parts saved. That can be determined by algebra, but you can just run the following demonstration (as I did) to see what's happening:
                      Code:
                      local last = 333157
                      local npart = 20000
                      local size = ceil(`last'/`npart')
                      local start = 1
                      forval i = 1/`npart' {
                         local stop = min(`start' + `size' -1, `last')
                         if (`start' <= `last') {
                            // would -save etc.- here
                            // show what is happening at the end
                            if `i' > 19580   di "part = `i', start = `start', stop = `stop'"
                         }
                         local start = `stop' + 1
                      }
                      Perhaps your computer operating system is not allowing 20,000 files in a folder?

                      b) In any event, I think that choosing 20,000 parts is not necessary and not a good idea. If you chose just, e.g., 3,000 parts, each part would have ceil(333157/3000) = 112 observations. When crossed with your file that has 64,225 observations, this would yield a file with 112 * 64225 = 7,193,200 observations. If each file contained just the string variable to be matched, and if each string variable is 30 bytes, this would give a crossed file of about 430 megabytes, which would not cause Stata any problems on even my modest laptop computer.
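
                      If you want to check that arithmetic yourself, here is the same back-of-the-envelope calculation done in Stata; the 30-byte string width is an assumption, as above:

                      Code:
                      display ceil(333157/3000)                    // observations per part: 112
                      display 112 * 64225                          // crossed observations: 7,193,200
                      display %9.1f (112 * 64225 * 2 * 30) / 1e6   // approx. megabytes with two 30-byte strings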

                      2) “invalid syntax r(198);”

                      My best wild guess is that you tried to run the second part of the code at #10 all on its own. That will not work, because the local -npart- will be undefined at that point: a local only exists for Stata within the current block of code being run, and does not remain after some previous block was run. That issue would be solved by resetting -npart- to its actual value before starting that part of the code, e.g., local npart = 19583.
                      (I'm saying that without thinking that 19583 is necessarily the right value for the number of part files.) And in any event, we know that there will not necessarily be exactly -npart- files created anyway, so setting the -npart- local yourself is a good idea.
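
                      You can see this failure mode for yourself: run just the block below, with no prior code defining the local. `npart' then expands to nothing, Stata sees -forval i = 1/-, and exits with the same r(198):

                      Code:
                      * Run this block by itself, without first defining the local.
                      * `npart' expands to an empty string, so the loop header
                      * becomes -forval i = 1/-, which is invalid syntax, r(198).
                      forval i = 1/`npart' {
                          display `i'
                      }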

                      Anyway, we don’t need to *guess* what the syntax problem is. A good first step to diagnose a syntax problem in Stata is to -set trace on- and see what it shows about exactly where the problem occurred.

                      Code:
                      set trace on
                      save allparts.dta, emptyok replace
                      forval i = 1/`npart' {
                      use "ePTS_fullnames(NONDUPLICATES)_JW"
                      ...
                      ... snip, snip
                      ...
                      append using allparts.dta
                      save allparts, replace
                      clear
                      }
                      set trace off
                      This will be messy, but should show exactly where the syntax problem occurred.


                      3) Are you completely tied to using the Jaro-Winkler method of comparing strings to match your files? I have no particular knowledge about the details of various string-matching methods, but I'd note that there is a very nice community-contributed program -matchit- that might solve your problem and save us from trying to work it out ourselves. I quote -ssc describe- here:

                      “matchit is a tool to join observations from two datasets based on string variables which do not necessarily need to be exactly the same. It performs many different string-based matching techniques, allowing for a fuzzy similarity between the two text variables.”

                      I have not used this program, but it appears that it would automatically do much of what we are trying to do here, and presumably more efficiently.
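
                      For what it's worth, based on its help file the file-joining syntax looks roughly like the sketch below. I have not tested this; the id variables (id_epts, id_hdss) are hypothetical, and you should verify the option names against -help matchit-:

                      Code:
                      * ssc install matchit
                      * Untested sketch: fuzzy-join the small file against the big one.
                      * id_epts and id_hdss are hypothetical id variables you would
                      * need to create in each file first.
                      use "ePTS_fullnames(NONDUPLICATES)_JW", clear
                      matchit id_epts fullname_epts using "HDSS_fullnames(NONDUPLICATES)_JW.dta", ///
                          idusing(id_hdss) txtusing(fullname_hdss) threshold(0.7)
                      gsort id_epts -similscore   // best candidates first for each small-file record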



                      • #12
                        Also see the user-written command strdist, a Stata wrapper for a Java plugin; see https://github.com/wbuchanan/StataStringUtilities. Also see phoneticenc in my separate response about Double Metaphone for related phonetic encodings, including alternatives such as NYSIIS. Nowadays, for the related broader problem of record linkage, I prefer the R package fastLink.

