Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Longest Common Subsequence (LCS) (special Levenshtein distance command)

    Dear readers,

    "Definition: The Levenshtein distance (or edit distance) between two strings is the minimal number of insertions, deletions, and substitutions of one character for another that will transform one string into the other."

    "the longest common subsequence (LCS) metric allows only insertion and deletion, not substitution"

    The problem: below are 2 examples. I want to keep the first example and eventually drop the second example. I believe that it's possible with the LCS-method to distinct the 1st example with the 2nd in a good manner, however, I don't know how to use this command in Stata. (I've already searched the Internet for 3,5 hours).


    mgr_name _merge Fundmanagername levendistance
    J. Luther King, Jr. 3 Luther King 8

    mgr_name _merge Fundmanagername levendistance
    Matthew Friedman 3 Matthew Schuldt 8

    I appreciate any suggestion and any other method to make a distinction between the examples. I've more than 100,000 observations and except for this problem, the multiple datasets were correctly merged.

    Kind regards,

    Walter Hopkins



  • #2
    To make it more clear...................I know how to run the levenshtein-distance command, however, I want to add substitution costs so that Matthew Friedman Matthew Schuldt doesn't result in 8 but in 13 or so.

    http://odur.let.rug.nl/kleiweg/lev/ <--That's exactly what I mean....here I can just change the substitution costs to 5 instead of 1 (for example)....This results in 13 for the Matthew combination and just 8 for the Luther combination.

    I also found this on an other website, however, I don't know yet if I can use this in Stata and how I could use it. Changing var1name into s and var2name int t doesn't work.

    if s[i] = t[j] then d[i, j] := d[i-1, j-1] else if i > 0 and j > 0 and s[i] = t[j - 1] and s[i - 1] = t[j] then d[i, j] := minimum ( d[i-2, j-2] + 1 // transpose d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + 1 // substitution ) else d[i, j] := minimum ( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + 1 // substitution )

    In PHP it's levenshtein(var1name, var2name, 1, 5, 1) ...however, I need it in Stata. The use of other programs isn't allowed.
    Last edited by Walter Hopkins; 04 Jul 2014, 09:03.

    Comment


    • #3
      My suggestion would be to use the record linkage software developed by Schnell et al. (2004), which you can download from the German Record Linkage Center. That can work with Stata files as input, and includes a large number of distance measures appropriate for comparing names (including LCS).

      I presume (based on your other post) that you are using either levenshtein or strgroup (written by Julian Reif). Those work well (I've used them), but IIRC they are written as C plugins, and I don't know whether the source code is available. As an alternative, you could try strdist (written by Michael Barker) which computes Levenshtein distance and is written in Mata (with source available). You could then modify the source to add the relative weights you describe.


      Schnell, R.; Bachteler, T. & Bender, S. (2004): A Toolbox for record linkage; in: Austrian Journal of Statistics 33 (1-2) 125-133.

      Comment


      • #4
        Thank you for your for fast advice.The hamming_distance() command is also a possible solution which I'm also trying to use. I checked the record linkage software you advised, however, it's necessary to use Java in combination with Stata. I'm only allowed to use Stata unfortunately. I already checked strdist, but probably not good enough because I did not see a way to add the relative weights. At the moment, I'm looking again into strdist thanks to you otherwise I would not have done that so soon again.

        What should I change to let Stata run the following code for my 2 variables: var1name and var2name.

        real scalar matalev(string scalar key, string scalar trymatch) { keylength = strlen(key) trylength = strlen(trymatch) // declare distance matrix D = J(keylength , trylength , .) // Add starting penalties in first column D = ((1::keylength) , D) // Add starting penalties in first row D = ((0..trylength) \ D) // add penalty for each operation required to reconcile the two strings for (i = 1 ; i <= keylength ; i++ ) { for (j = 1 ; j <= trylength ; j++ ) { if (substr(key, i, 1) == substr(trymatch, j, 1)) { D[i+1,j+1] = D[i,j] } else { // (deletion , insertion , substition) D[i+1,j+1] = min((D[i,j+1]+1 , D[i+1,j]+1 , D[i,j]+1 )) } } } return(D[i,j]) } end Thank you for this code Phil.

        The part at the bottom of http://fmwww.bc.edu/repec/bocode/s/strdist.ado
        Last edited by Walter Hopkins; 04 Jul 2014, 10:21.

        Comment


        • #5
          *The hamming_distance() command can only be used for words with equal length, so I cannot use it*

          Comment


          • #6
            page/slide 14 of http://web.stanford.edu/class/cs124/lec/med.pdf describes the solution I need....instead of cost 2 for substitution I would need a higher number,,,a 4 for example......however, I cannot use the D-command in Stata

            D(i,0) = i
            unrecognized command: D not defined by D.ado
            r(199);

            Comment


            • #7
              Originally posted by MrHopkins View Post
              That's Mata code—not Stata. You can't run it in Stata. You may (and I can't guarantee this because I haven't looked at it carefully enough) be able to make a small modification to strdist.ado so that it does what you want. Ideally, you'd spend some time with [U] 18 Programming Stata and the Mata Reference Manual first, or at least consult these if you have questions. People here can help you if you run into trouble.

              Comment


              • #8
                Thanks for clearing that up Phil. Unfortunately, I don't have any programming skills. However, if noone post a reply how to make that small modification, I'll try to learn the related programming if I get enough time for my deadlines.

                Comment


                • #9
                  Hi Walter,

                  I was looking for the same thing today and found a pseudo code on Wikipedia (http://en.wikipedia.org/wiki/Longest...string_problem) that I was able to translate into a STATA program.

                  Since I am used to MATA yet, usage of the program below is somewhat cumbersome and probably inefficient but it seems to work nicely. Also, the program is merely 5 minutes old, so be careful, it might contain bugs ...

                  Good luck
                  Tobias


                  Example and program:

                  Code:
                  // This code has to be run befor you can use the program
                  capture program drop longestcommonsubstr
                  program define longestcommonsubstr, rclass
                      
                      syntax, s(string) t(string)
                      
                      local m = strlen("`s'")
                      local n = strlen("`t'")
                      tempname L
                      matrix `L' = J(`m',`n',0) 
                      local z = 0
                          
                      forvalues i = 1 / `m' {
                          forvalues j = 1 / `n' {
                              local s_i = substr("`s'", `i', 1)
                              local t_j = substr("`t'", `j', 1)
                              if "`s_i'" == "`t_j'" {
                                  if `i' == 1 | `j' == 1 {
                                      matrix `L'[`i',`j'] = 1
                                  }
                                  else {
                                      matrix `L'[`i',`j'] = `L'[`=`i'-1',`=`j'-1']  + 1
                                  }
                                  if `L'[`i',`j'] > `z' {
                                      local z = `L'[`i',`j']
                                      local lcs = substr("`s'", `=`i'-`z'+1', `z')
                                  }
                                  else {
                                      if `L'[`i',`j'] == `z' {
                                          local lcs = "`lcs'" + substr("`s'", `=`i'-`z'+1', `z')
                                      }     
                                  }
                              }
                          }
                      }
                      return local lcs `lcs'
                  end
                  
                  // This will compare two strings and save the longest common substring in r(lcs)
                  longestcommonsubstr, s("This is a test string") t("This is another test string")
                  
                  // This is how you access the result:
                  display "`r(lcs)'"
                  Last edited by Tobias Keller; 08 Aug 2014, 07:39. Reason: minor change: using temporary name for matrix L and local instead of scalar for z

                  Comment


                  • #10
                    Walter --

                    I don't know if this helps (it may actually hurt!), but I had a project going awhile ago that required string distances where I had to assign different costs to deletions and substitutions. Anyways, I had the following mata code that 1) might not be what you need, 2) might not work exactly right, but 3) seems to give the right answers (at least in terms of your examples). The mata function:

                    Code:
                    mata: 
                    real scalar wordscore(string scalar word1,string scalar word2,|gap,subs)
                    {
                        real scalar i,j
                        string matrix W1,W2
                        real matrix Sc,TV
                        if (args()==2) {
                            gap=1
                            subs=1
                        }
                        
                        Sc=J(strlen(word1)+1,strlen(word2)+1,.)
                        Sc[1,.]=(0..strlen(word2))*gap
                        Sc[.,1]=(0::strlen(word1))*gap
                        W1=J(1,strlen(word1),"")
                        W2=J(1,strlen(word2),"")
                    
                        for (i=1;i<=strlen(word1);i++) {
                            W1[i]=substr(word1,i,1)
                        }
                    
                        for (j=1;j<=strlen(word2);j++) {
                            W2[j]=substr(word2,j,1)
                        }
                    
                        for (i=2;i<=rows(Sc);i++) {
                            for (j=2;j<=cols(Sc);j++) {
                                TV=Sc[i-1,j-1]+subs*(W1[i-1]!=W2[j-1]),Sc[i-1,j]+gap,Sc[i,j-1]+gap
                                Sc[i,j]=min(TV)
                            }
                        }
                        return(Sc[rows(Sc),cols(Sc)])
                    }
                    end
                    If first create the function, and then try it out with examples like:
                    Code:
                    mata: wordscore("kitten","sitting")
                    mata: wordscore("matthew friedman","matthew schuldt")
                    mata: wordscore("matthew friedman","matthew schuldt",1,3)
                    It seems to give the right answers. The last example assigns a higher cost to a substitution than a deletion, I think. Again, I don't know if this helps, but at least it might be used as a template or something.

                    Best,

                    Matt Baker




                    Comment


                    • #11
                      Thanks, Matt! I have just realized that I was actually working on a somewhat different problem.

                      Comment


                      • #12
                        I can give you advice on the strdist program, if you want to modify it for your purpose.
                        If you want to change the penalties for each operation, you should be able to do it by changing the code on line 135 of the .ado file.

                        Code:
                        134:            else {         //    (deletion   , insertion  , substition)
                        135:                D[i+1,j+1] = min((D[i,j+1]+1 , D[i+1,j]+1 , D[i,j]+1  ))
                        136:           }
                        The final +1 for each argument of the min() function are the penalties for each operation. If you want to change the penalty for substitution to 4, you could change the code like this:

                        Code:
                        134:            else {         //    (deletion   , insertion  , substition)
                        135:                D[i+1,j+1] = min((D[i,j+1]+1 , D[i+1,j]+1 , D[i,j]+4  ))    
                        136:           }
                        I would just install the original strdist program from ssc. Then type -which strdist- to find where the .ado file is located.
                        Make the change and save the program.
                        You may have to type -discard- in Stata before the changes you made will take effect.


                        Mike





                        Comment

                        Working...
                        X