Longest Common Subsequence (LCS) (special Levenshtein distance command)

Walter Hopkins

Join Date: Jul 2014

Posts: 14
#1

Longest Common Subsequence (LCS) (special Levenshtein distance command)

04 Jul 2014, 07:22

Dear readers,

"Definition: The Levenshtein distance (or edit distance) between two strings is the minimal number of insertions, deletions, and substitutions of one character for another that will transform one string into the other."

"the longest common subsequence (LCS) metric allows only insertion and deletion, not substitution"

The problem: below are 2 examples. I want to keep the first example and eventually drop the second example. I believe that it's possible with the LCS-method to distinct the 1st example with the 2nd in a good manner, however, I don't know how to use this command in Stata. (I've already searched the Internet for 3,5 hours).

mgr_name _merge Fundmanagername levendistance
J. Luther King, Jr. 3 Luther King 8

mgr_name _merge Fundmanagername levendistance
Matthew Friedman 3 Matthew Schuldt 8

I appreciate any suggestion and any other method to make a distinction between the examples. I've more than 100,000 observations and except for this problem, the multiple datasets were correctly merged.

Kind regards,

Walter Hopkins
Tags: None
Walter Hopkins

Join Date: Jul 2014

Posts: 14
#2

04 Jul 2014, 08:50

To make it more clear...................I know how to run the levenshtein-distance command, however, I want to add substitution costs so that Matthew Friedman Matthew Schuldt doesn't result in 8 but in 13 or so.

http://odur.let.rug.nl/kleiweg/lev/ <--That's exactly what I mean....here I can just change the substitution costs to 5 instead of 1 (for example)....This results in 13 for the Matthew combination and just 8 for the Luther combination.

I also found this on an other website, however, I don't know yet if I can use this in Stata and how I could use it. Changing var1name into s and var2name int t doesn't work.

if s[i] = t[j] then d[i, j] := d[i-1, j-1] else if i > 0 and j > 0 and s[i] = t[j - 1] and s[i - 1] = t[j] then d[i, j] := minimum ( d[i-2, j-2] + 1 // transpose d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + 1 // substitution ) else d[i, j] := minimum ( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + 1 // substitution )

In PHP it's levenshtein(var1name, var2name, 1, 5, 1) ...however, I need it in Stata. The use of other programs isn't allowed.

Last edited by Walter Hopkins; 04 Jul 2014, 09:03.
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#3

04 Jul 2014, 09:31

My suggestion would be to use the record linkage software developed by Schnell et al. (2004), which you can download from the German Record Linkage Center. That can work with Stata files as input, and includes a large number of distance measures appropriate for comparing names (including LCS).

I presume (based on your other post) that you are using either levenshtein or strgroup (written by Julian Reif). Those work well (I've used them), but IIRC they are written as C plugins, and I don't know whether the source code is available. As an alternative, you could try strdist (written by Michael Barker) which computes Levenshtein distance and is written in Mata (with source available). You could then modify the source to add the relative weights you describe.

Schnell, R.; Bachteler, T. & Bender, S. (2004): A Toolbox for record linkage; in: Austrian Journal of Statistics 33 (1-2) 125-133.
Comment
Walter Hopkins

Join Date: Jul 2014

Posts: 14
#4

04 Jul 2014, 09:49

Thank you for your for fast advice.The hamming_distance() command is also a possible solution which I'm also trying to use. I checked the record linkage software you advised, however, it's necessary to use Java in combination with Stata. I'm only allowed to use Stata unfortunately. I already checked strdist, but probably not good enough because I did not see a way to add the relative weights. At the moment, I'm looking again into strdist thanks to you otherwise I would not have done that so soon again.

What should I change to let Stata run the following code for my 2 variables: var1name and var2name.

real scalar matalev(string scalar key, string scalar trymatch) { keylength = strlen(key) trylength = strlen(trymatch) // declare distance matrix D = J(keylength , trylength , .) // Add starting penalties in first column D = ((1::keylength) , D) // Add starting penalties in first row D = ((0..trylength) \ D) // add penalty for each operation required to reconcile the two strings for (i = 1 ; i <= keylength ; i++ ) { for (j = 1 ; j <= trylength ; j++ ) { if (substr(key, i, 1) == substr(trymatch, j, 1)) { D[i+1,j+1] = D[i,j] } else { // (deletion , insertion , substition) D[i+1,j+1] = min((D[i,j+1]+1 , D[i+1,j]+1 , D[i,j]+1 )) } } } return(D[i,j]) } end Thank you for this code Phil.

The part at the bottom of http://fmwww.bc.edu/repec/bocode/s/strdist.ado

Last edited by Walter Hopkins; 04 Jul 2014, 10:21.
Comment
Walter Hopkins

Join Date: Jul 2014

Posts: 14
#5

04 Jul 2014, 11:03

*The hamming_distance() command can only be used for words with equal length, so I cannot use it*
Comment
Walter Hopkins

Join Date: Jul 2014

Posts: 14
#6

04 Jul 2014, 12:35

page/slide 14 of http://web.stanford.edu/class/cs124/lec/med.pdf describes the solution I need....instead of cost 2 for substitution I would need a higher number,,,a 4 for example......however, I cannot use the D-command in Stata

D(i,0) = i
unrecognized command: D not defined by D.ado
r(199);
Comment
Phil Schumm

Join Date: Mar 2014

Posts: 169
#7

04 Jul 2014, 14:28

Originally posted by MrHopkins View Post

The part at the bottom of http://fmwww.bc.edu/repec/bocode/s/strdist.ado

That's Mata code—not Stata. You can't run it in Stata. You may (and I can't guarantee this because I haven't looked at it carefully enough) be able to make a small modification to strdist.ado so that it does what you want. Ideally, you'd spend some time with [U] 18 Programming Stata and the Mata Reference Manual first, or at least consult these if you have questions. People here can help you if you run into trouble.
Comment
Walter Hopkins

Join Date: Jul 2014

Posts: 14
#8

05 Jul 2014, 14:52

Thanks for clearing that up Phil. Unfortunately, I don't have any programming skills. However, if noone post a reply how to make that small modification, I'll try to learn the related programming if I get enough time for my deadlines.
Comment

Tobias Keller

Join Date: Aug 2014
Posts: 2

08 Aug 2014, 07:32

Hi Walter,

I was looking for the same thing today and found a pseudo code on Wikipedia (http://en.wikipedia.org/wiki/Longest...string_problem) that I was able to translate into a STATA program.

Since I am used to MATA yet, usage of the program below is somewhat cumbersome and probably inefficient but it seems to work nicely. Also, the program is merely 5 minutes old, so be careful, it might contain bugs ...

Good luck
Tobias

Example and program:

Code:

// This code has to be run befor you can use the program
capture program drop longestcommonsubstr
program define longestcommonsubstr, rclass
    
    syntax, s(string) t(string)
    
    local m = strlen("`s'")
    local n = strlen("`t'")
    tempname L
    matrix `L' = J(`m',`n',0) 
    local z = 0
        
    forvalues i = 1 / `m' {
        forvalues j = 1 / `n' {
            local s_i = substr("`s'", `i', 1)
            local t_j = substr("`t'", `j', 1)
            if "`s_i'" == "`t_j'" {
                if `i' == 1 | `j' == 1 {
                    matrix `L'[`i',`j'] = 1
                }
                else {
                    matrix `L'[`i',`j'] = `L'[`=`i'-1',`=`j'-1']  + 1
                }
                if `L'[`i',`j'] > `z' {
                    local z = `L'[`i',`j']
                    local lcs = substr("`s'", `=`i'-`z'+1', `z')
                }
                else {
                    if `L'[`i',`j'] == `z' {
                        local lcs = "`lcs'" + substr("`s'", `=`i'-`z'+1', `z')
                    }     
                }
            }
        }
    }
    return local lcs `lcs'
end

// This will compare two strings and save the longest common substring in r(lcs)
longestcommonsubstr, s("This is a test string") t("This is another test string")

// This is how you access the result:
display "`r(lcs)'"

Last edited by Tobias Keller; 08 Aug 2014, 07:39. Reason: minor change: using temporary name for matrix L and local instead of scalar for z

Comment

Matthew J. Baker

Join Date: Mar 2014
Posts: 126

#10

08 Aug 2014, 09:27

Walter --

I don't know if this helps (it may actually hurt!), but I had a project going awhile ago that required string distances where I had to assign different costs to deletions and substitutions. Anyways, I had the following mata code that 1) might not be what you need, 2) might not work exactly right, but 3) seems to give the right answers (at least in terms of your examples). The mata function:

Code:

mata: 
real scalar wordscore(string scalar word1,string scalar word2,|gap,subs)
{
    real scalar i,j
    string matrix W1,W2
    real matrix Sc,TV
    if (args()==2) {
        gap=1
        subs=1
    }
    
    Sc=J(strlen(word1)+1,strlen(word2)+1,.)
    Sc[1,.]=(0..strlen(word2))*gap
    Sc[.,1]=(0::strlen(word1))*gap
    W1=J(1,strlen(word1),"")
    W2=J(1,strlen(word2),"")

    for (i=1;i<=strlen(word1);i++) {
        W1[i]=substr(word1,i,1)
    }

    for (j=1;j<=strlen(word2);j++) {
        W2[j]=substr(word2,j,1)
    }

    for (i=2;i<=rows(Sc);i++) {
        for (j=2;j<=cols(Sc);j++) {
            TV=Sc[i-1,j-1]+subs*(W1[i-1]!=W2[j-1]),Sc[i-1,j]+gap,Sc[i,j-1]+gap
            Sc[i,j]=min(TV)
        }
    }
    return(Sc[rows(Sc),cols(Sc)])
}
end

If first create the function, and then try it out with examples like:

Code:

mata: wordscore("kitten","sitting")
mata: wordscore("matthew friedman","matthew schuldt")
mata: wordscore("matthew friedman","matthew schuldt",1,3)

It seems to give the right answers. The last example assigns a higher cost to a substitution than a deletion, I think. Again, I don't know if this helps, but at least it might be used as a template or something.

Best,

Matt Baker

Comment

Tobias Keller

Join Date: Aug 2014

Posts: 2
#11

08 Aug 2014, 10:16

Thanks, Matt! I have just realized that I was actually working on a somewhat different problem.
Comment
Mike Barker

Join Date: Apr 2014

Posts: 37
#12

08 Aug 2014, 16:16

I can give you advice on the strdist program, if you want to modify it for your purpose.
If you want to change the penalties for each operation, you should be able to do it by changing the code on line 135 of the .ado file.

Code:

134: else { // (deletion , insertion , substition) 135: D[i+1,j+1] = min((D[i,j+1]+1 , D[i+1,j]+1 , D[i,j]+1 )) 136: }

The final +1 for each argument of the min() function are the penalties for each operation. If you want to change the penalty for substitution to 4, you could change the code like this:

Code:

134: else { // (deletion , insertion , substition) 135: D[i+1,j+1] = min((D[i,j+1]+1 , D[i+1,j]+1 , D[i,j]+4 )) 136: }

I would just install the original strdist program from ssc. Then type -which strdist- to find where the .ado file is located.
Make the change and save the program.
You may have to type -discard- in Stata before the changes you made will take effect.

Mike
Comment

Announcement