
  • Jaro-Winkler matching

    Hello everybody,
    I would like to know whether it is possible in Stata to match pairs of records using the Jaro-Winkler distance. I have found the command jarowinkler, which calculates the string edit distance, but it appears to require that the record pairs already be matched beforehand. Am I wrong?
    Thank you in advance.

  • #2
    What about creating a file of paired records, and then applying the -jarowinkler- command [from SSC]? If you are working with a large data file, -cross- as used below would yield an excessively large file, but there are ways to handle that problem. To start with, though, would the following be something you could work with?

    Code:
    clear
    // Example data
    input str10 name
    "Smith"
    "Miller"
    "White"
    "Smythe"
    "Mueller"
    "Weiss"
    end
    //
    // Make a file of pairs
    tempfile temp
    save `temp'
    rename name name2
    cross using `temp'
    // Closeness of all possible pairs.
    jarowinkler name name2, gen(closeness)
    sort name closeness
    by name: keep if _n > (_N-2)  // perfect and next closest match
    list



    • #3
      Thank you very much Mike, this is really useful information. But I am working with two large datasets (approximately 350,000 and 55,000 records). What is the best way to handle this?



      • #4
        I would break the big file into K different smaller parts (perhaps as many as K = 100), and then proceed as in #2 to create K files showing the closeness of each record in your small file to each part of the large file. These K files can then be combined along the way, and you can pick out the matches you want. Exactly what to do with this file, how many close matches to keep (per above), and so forth, would likely depend on the substance of your situation and what you want to do with these matches.

        Here's a quick attempt to *illustrate* something in that direction. I don't have time to really check out the following right now, so you (and others) should examine it and see if it's on target and without errors.

        Code:
        clear
        cd "SomeWorkingDataFolder"
        // Simulate two data files.  Better test data than this would be nice.
        // Two-character strings chosen from a small set.
        local charset = "a b c d e"
        local nchar = wordcount("`charset'")
        set seed `nchar'739
        clear
        set obs 100
        gen name = word(c(alpha), ceil(runiform() * `nchar')) + ///
                   word(c(alpha), ceil(runiform() * `nchar'))
        tempfile big
        save `big'
        //
        clear
        set obs 10
        gen name = word(c(alpha), ceil(runiform() * `nchar')) + ///
                   word(c(alpha), ceil(runiform() * `nchar'))
        tempfile small
        save `small'
        // end simulate data
        //
        // Real stuff starts here.
        // Break the big file into parts.
        local npart = `nchar' // you would probably need more parts
        describe using `big'
        local last = r(N)
        local size = ceil(`last'/`npart')
        local start = 1
        use `big', clear
        forval i = 1/`npart' {
            local stop = min(`start' + `size' -1, `last')
            preserve
            keep in `start'/`stop'
            save part`i', replace
            restore
            local start = `stop' + 1
        }
        //  Match the small file to each part of the big file, and accumulate
        //  a file of pairs/closeness along the way.
        clear
        save allparts.dta, emptyok replace
        forval i = 1/`npart' {
            use `small', clear
            rename name name2    // avoid clashing with the big-file variable name
            cross using part`i'  // pair small-file records with this part of the big file
            // Closeness of all possible pairs.
            jarowinkler name name2, gen(closeness)
            sort name closeness
            by name: keep if _n > (_N-2)  // perfect and next closest match
            //
            // Build up file of matches
            append using allparts.dta
            save allparts, replace
            clear
        }
        use allparts.dta

        Comment


        • #5
          I just noticed some misleading, if not harmful, typos I made. In my code block at #4, the statement
          Code:
          local npart = `nchar' // you would probably need more parts
          should be
          Code:
          local npart = 5 // you would probably need more parts
          as an illustration of some chosen number of parts. (A global substitution for "5" bit me. There's no reason to make the number of parts equal to the size of the character set used for creating the example data.) A similar red herring occurred at
          Code:
          set seed `nchar'739
          Sorry for any confusion.



          • #6
            Thank you again Mike. I have tried to split the big database into 5,000 parts, but when Stata saved part 4,963 I got the following error:
            observation numbers out of range
            r(198);
            What can I do to solve it? Would it be possible to divide the "small" file (mine has more than 50,000 records) into several parts, or would the process become too complex?

            Thanks for all your help.



            • #7
              This puzzles me. I just simulated a "big" file with 355,000 observations on my own machine, and ran the relevant part of the code above to divide it into 5,000 parts. In about 5 minutes, the 5,000 part files were produced without any error. First, as a quick experiment, can you get this to work if you try to break the big file into just 10 parts? Presuming that does work, let's try to diagnose the problem with 5,000 parts by inserting a line at the relevant point to let us know what is happening:
              Code:
              local npart = 5000
              .... snip snip ...
              ...
              forval i = 1/`npart' {
                 local stop = min(`start' + `size' -1, `last')
                 preserve
                 di "working on part `i', start = `start', stop = `stop'." _newline
                 keep in `start'/`stop'
                 save part`i', replace
                 restore
                 local start = `stop' + 1
              }
              And yes, we could break up *both* files into parts, but it would be nice to avoid that. There likely are other ways to do this, but what I'm suggesting here should, in principle, be simple and fast enough, even if not very elegant or efficient.

              After running the preceding, please post back the exact code you used and the results you get.



              • #8
                Hello,

                Firstly, thank you for all your help.

                To be more precise, my datasets are:

                "Big dataset":
                Code:
                use "HDSS_fullnames(NONDUPLICATES)", clear
                count // 333,157

                "Small dataset":
                Code:
                use "ePTS_fullnames(NONDUPLICATES).dta", clear
                count // 64,225

                I tried to split the big dataset into 10 parts and it worked, but when I tried to do it in 1,000 parts it didn't work. I am attaching the code and the error:

                The code:

                Code:
                local npart = 1000
                describe using "HDSS_fullnames(NONDUPLICATES)"
                local last = r(N)
                local size = ceil(`last'/`npart')
                local start = 1
                use "HDSS_fullnames(NONDUPLICATES)", clear
                forval i = 1/`npart' {
                    local stop = min(`start' + `size' -1, `last')
                    preserve
                    keep in `start'/`stop'
                    save part`i', replace
                    restore
                    local start = `stop' + 1
                }

                The error:

                (337,324 observations deleted)
                file part999.dta saved
                observation numbers out of range
                r(198);



                • #9
                  Using a test file with _N = 333,157, and crucially including the -display- line from the code I suggested at #7, I saw:
                  Code:
                   working on part 999, start = 333158, stop = 333157.
                   observation numbers out of range
                   r(198);
                  This revealed that I had made the common mistake of not handling the last part of the file correctly. The simple solution is to insert an if/else in the code and break out of the loop if the starting point goes past the last observation of the original big file.

                  I also realized that using the [in] range option on the -use- command was about 10X faster than preserve/restore as I did before, so, with those two changes, give this a try:

                  Code:
                  describe using "HDSS_fullnames(NONDUPLICATES)"
                  local npart = 1000 
                  local last = r(N)
                  local size = ceil(`last'/`npart')
                  local start = 1
                  forval i = 1/`npart' {
                      local stop = min(`start' + `size' -1, `last')
                      if (`start' <= `last') {
                         use "HDSS_fullnames(NONDUPLICATES)" in `start'/`stop', clear
                         save part`i', replace
                         local start = `stop' + 1
                      }    
                      else {  // past last observation of file so quit
                         continue, break
                      }
                  }
                  Presuming this works, you'll want to pick a value for -npart- that yields part files which, when crossed with your _N = 64,225 file, yield something small enough to fit into your Stata's memory, since each crossed file will have (64,225 * size) observations. You'll want to strip your two files down to the minimum before crossing, i.e., just the string variables of interest, with -compress- applied to make them as small as possible.
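
                  A minimal sketch of that stripping-down step; the variable and output file names here are just for illustration, so adjust them to your files:

                  Code:
                  * Keep only the string variable to be matched, shrink
                  * storage types, and save a lean copy for the crossing step.
                  use "HDSS_fullnames(NONDUPLICATES)", clear
                  keep fullname_hdss
                  compress
                  save "HDSS_fullnames_lean", replace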



                  • #10
                    Thanks Mike, all of this is being very useful for my work.
                    I tried to divide the big file into 20,000 parts. No error occurred, but only 19,583 parts were saved. I don't think this is a problem, is it?
                    But now I have a new problem with the second part of the code... a syntax error occurs, and I can't figure out what is happening. I copy below the exact code I am using:

                    The code:

                    Code:
                    describe using "HDSS_fullnames(NONDUPLICATES)_JW"
                    local npart = 10
                    local last = r(N)
                    local size = ceil(`last'/`npart')
                    local start = 1
                    forval i = 1/`npart' {
                        local stop = min(`start' + `size' -1, `last')
                        if (`start' <= `last') {
                            use "HDSS_fullnames(NONDUPLICATES)_JW" in `start'/`stop', clear
                            save part`i', replace
                            local start = `stop' + 1
                        }
                        else { // past last observation of file so quit
                            continue, break
                        }
                    }
                    // Match the small file to each part of the big file, and accumulate
                    // a file of pairs/closeness along the way.
                    save allparts.dta, emptyok replace
                    forval i = 1/`npart' {
                        use "ePTS_fullnames(NONDUPLICATES)_JW"
                        tempfile temp
                        save `temp', replace
                        cross using `temp'
                        // Closeness of all possible pairs.
                        jarowinkler fullname_hdss fullname_epts, gen(closeness)
                        sort name closeness
                        by name: keep if _n > (_N-2) // perfect and next closest match
                        //
                        // Build up file of matches
                        append using allparts.dta
                        save allparts, replace
                        clear
                    }


                    The error:

                    invalid syntax
                    r(198);

                    Thanks again



                    • #11
                      Hi Anna,

                      Several points to consider here:

                      1) "but only 19583 parts were saved"
                      a) The use of
                      -size = ceil(`last'/`npart')-
                      is just a way to get approximately the right number of parts, so getting something different from 20,000 is not necessarily a problem. *However,* if it is *really* true that your big file has 333,157 observations, I don't see how this code could end up with 19,583 parts saved. That can be determined by algebra, but you can just run the following demonstration (as I did) to see what's happening:
                      Code:
                      local last = 333157
                      local npart = 20000
                      local size = ceil(`last'/`npart')
                      local start = 1
                      forval i = 1/`npart' {
                         local stop = min(`start' + `size' -1, `last')
                         if (`start' <= `last') {
                            // would -save etc.- here
                            // show what is happening at the end
                            if `i' > 19580   di "part = `i', start = `start', stop = `stop'"
                         }
                         local start = `stop' + 1
                      }
                      Perhaps your computer operating system is not allowing 20,000 files in a folder?

                      b) In any event, I think that choosing 20,000 parts is not necessary and not a good idea. If you chose just, e.g., 3,000 parts, each part would have ceil(333157/3000) = 112 observations. When crossed with your file that has 64,225 observations, this would yield a file with 112 * 64225 = 7,193,200 observations. If each file contained just the string variable to be matched, and if each string variable is 30 bytes, this would give a crossed file of about 430 megabytes, which would not cause Stata any problems on even my modest laptop computer.
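
                      If you want to check that arithmetic yourself, here is the same back-of-the-envelope calculation done in Stata; the 30-byte string width is an assumption, as above:

                      Code:
                      display ceil(333157/3000)                    // observations per part: 112
                      display 112 * 64225                          // crossed observations: 7,193,200
                      display %9.1f (112 * 64225 * 2 * 30) / 1e6   // approx. megabytes with two 30-byte strings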

                      2) “invalid syntax r(198);”

                      My best wild guess is that you tried to run the second part of the code at #10 all on its own. That will not work, because the local -npart- will be undefined at that point: a local only exists for Stata within the current block of code being run, and does not remain after some previous block was run. That issue would be solved by resetting -npart- to its actual value before starting that part of the code, e.g., local npart = 19583.
                      (I'm saying that without thinking that 19583 is necessarily the right value for the number of part files.) And in any event, we know that there will not necessarily be exactly -npart- files created anyway, so setting the -npart- local yourself is a good idea.
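
                      You can see this failure mode for yourself: run just the block below, with no prior code defining the local. `npart' then expands to nothing, Stata sees -forval i = 1/-, and exits with the same r(198):

                      Code:
                      * Run this block by itself, without first defining the local.
                      * `npart' expands to an empty string, so the loop header
                      * becomes -forval i = 1/-, which is invalid syntax, r(198).
                      forval i = 1/`npart' {
                          display `i'
                      }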

                      Anyway, we don’t need to *guess* what the syntax problem is. A good first step to diagnose a syntax problem in Stata is to -set trace on- and see what it shows about exactly where the problem occurred.

                      Code:
                      set trace on
                      save allparts.dta, emptyok replace
                      forval i = 1/`npart' {
                      use "ePTS_fullnames(NONDUPLICATES)_JW"
                      ...
                      ... snip, snip
                      ...
                      append using allparts.dta
                      save allparts, replace
                      clear
                      }
                      set trace off
                      This will be messy, but should show exactly where the syntax problem occurred.


                      3) Are you completely tied to using the Jaro-Winkler method of comparing strings to match your files? I have no particular knowledge about the details of various string-matching methods, but I'd note that there is a very nice community-contributed program -matchit- that might solve your problem and save us from trying to work it out ourselves. I quote -ssc describe- here:

                      “matchit is a tool to join observations from two datasets based on string variables which do not necessarily need to be exactly the same. It performs many different string-based matching techniques, allowing for a fuzzy similarity between the two text variables.”

                      I have not used this program, but it appears that it would automatically do much of what we are trying to do here, and presumably more efficiently.
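
                      For what it's worth, based on its help file the file-joining syntax looks roughly like the sketch below. I have not tested this; the id variables (id_epts, id_hdss) are hypothetical, and you should verify the option names against -help matchit-:

                      Code:
                      * ssc install matchit
                      * Untested sketch: fuzzy-join the small file against the big one.
                      * id_epts and id_hdss are hypothetical id variables you would
                      * need to create in each file first.
                      use "ePTS_fullnames(NONDUPLICATES)_JW", clear
                      matchit id_epts fullname_epts using "HDSS_fullnames(NONDUPLICATES)_JW.dta", ///
                          idusing(id_hdss) txtusing(fullname_hdss) threshold(0.7)
                      gsort id_epts -similscore   // best candidates first for each small-file record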



                      • #12
                        Also see the user-written command strdist, a Stata wrapper for a Java plugin; see https://github.com/wbuchanan/StataStringUtilities. Also see phoneticenc in my separate response about Double Metaphone for related phonetic encodings, including alternatives such as NYSIIS. Nowadays, for the related broader problem of record linkage, I prefer the R package fastLink.

