Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying pairs of strings that differ by only two consecutive, inverted characters

    I want to identify when two putatively matched surnames (stored as two string variables) differ only by the switching of two consecutive letters. For example, to flag someone who has "CARSLAKE" in String1 and "CARLSAKE" in String2. I don't want to accept other pairs with a Levenshtein distance of 2 as they look more like distinct names (not typos).

    I can imagine something looping through each letter in turn using substr(), but this would be very long-winded and clunky since the surnames of course vary in length between pairs (within pairs, I'm only interested if they're the same length). Does anyone know of a more sensible solution? Thanks.


  • #2
    I think this will do it:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str8(var1 var2)
    "carslake" "carlsake"
    "smith"    "simth"  
    "jones"    "jones"  
    "label"    "labrl"  
    end
    
    capture program drop one_obs
    program define one_obs
        gen byte pair_reversal = 0
        strdist var1 var2, levenshtein(ldist)
        if ldist == 2 {
            charlist var1
            local c1 `r(chars)'
            charlist var2
            local c2 `r(chars)'
            replace pair_reversal = (`"`c1'"' == `"`c2'"')
        }
        drop ldist
        exit
    end
    
    runby one_obs, by(var1 var2)
    Note: This requires installation (if you do not already have them) of Michael Barker's -strdist-, and Robert Picard and my -runby-, both available from SSC.

    Added: -charlist- is also from SSC, written by Nick Cox.

    Also, this may not be exactly what you want. This will pick up situations where var1 and var2 differ by reversal of two letters that are not necessarily adjacent. (E.g. this will call a pair_reversal on "program" vs "pragrom")
    Last edited by Clyde Schechter; 04 Dec 2018, 09:04.

    Comment


    • #3
      (crossed in the ether with Clyde's posting)

      David, if I'm correctly understanding and implementing what you ask for, I don't think a brute-force loop is a problem.

      The code below took my machine less than 2 sec. to do 100,000 observations. My example data uses names of the same length, to make sure to randomly produce enough reversal cases to see that the code works. I suppose the code might have some errors, but at least in concept, I don't see this as too clunky. To implement this for varying lengths of names, the "20" could just be the maximum name length, since (e.g.) -substr()- does not choke when pointed at a substring that goes beyond the length of the candidate string.

      Code:
      // simulate data
      clear
      set seed 8476
      set obs 100000
      gen name1 = ""
      gen name2 = ""
      forval j = 1/2 {
          forval i = 1/20 {     // 20 character names.
             quiet replace name`j' = name`j' + char(98 + trunc(runiform() * 25))
          }
      }
      // The canonical example
      replace name1 = "Carslake" in 1
      replace name2 = "Carlsake" in 1
      // end simulate data
      //
      // Compare successive pairs in each name
      gen str2 pair1 = ""
      gen str2 pair2 = ""
      gen rpairs = ""  // hold a list of the reversed pairs of letters to check the results
      gen byte found = 0
      gen L1 = strlen(name1)
      gen L2 = strlen(name2)
      forval i = 1/19 {
         quiet {
            replace pair1 = substr(name1, `i', 2)
            replace pair2 = substr(name2, `i', 2)
            replace found = (L1 == L2) & ///
                            (substr(pair1,1,1) == substr(pair2,2,1)) & ///
                            (substr(pair1,2,1) == substr(pair2,1,1))
            replace rpairs = rpairs + pair1 + pair2 + " " if found
         }  
      }
      br if !missing(rpairs)
      Last edited by Mike Lacy; 04 Dec 2018, 09:16.

      Comment


      • #4
        Just to update - here's my solution with a loop and substr(), which seems to work and isn't as long as I'd feared. It also tags identical names which include a double letter, which I don't mind; these could easily be excluded. I'd still be interested to know if anyone knows of a cleaner way of doing it!

        generate Length1 = length(Name1)
        generate Length2 = length(Name2)
        summarize Length1 if Length1==Length2
        local max_start = r(max)-1
        forvalues L = 1/`max_start'{
        replace Flag = 1 if Length1==Length2 & Length1>=5 & substr(Name1,1,`L'-1)==substr(Name2,1,`L'-1) & substr(Name1,`L',2)==strreverse(substr(Name2,`L',2 )) & substr(Name1,`L'+2,.)==substr(Name2,`L'+2,.)
        }

        Comment


        • #5
          Thank you both - your solutions arrived while I was writing my update.

          Comment


          • #6
            Code:
            * differ only by the switching of two consecutive letters:
                
            clear
            input str20 (v1 v2)
            "CARSLAK" "CARLSAK" 
            "CARSLAKABC" "KARSLACABC" 
            "CASRLAKABC" "KARSLACABC" 
            end
            
            strdist v1 v2, levenshtein(ldist)
            
            local maxlen 20
            
            forvalues c = 1/`maxlen' {
            
                local cc = cond(`c' < `maxlen' , " + " , "" ) 
                local cmd = `"`cmd'"' + `"string(substr(v1,`c',1)==substr(v2,`c',1),"%1.0f") `cc' "'
            } 
            
            gen test = substr(`cmd',1,length(v1)) if length(v1)==length(v2) & ldist==2
            gen byte OK = strpos(test,"00") > 0 
            
            list
            Code:
                 +---------------------------------------------------+
                 |         v1           v2   ldist         test   OK |
                 |---------------------------------------------------|
              1. |    CARSLAK      CARLSAK       2      1110011    1 |
              2. | CARSLAKABC   KARSLACABC       2   0111110111    0 |
              3. | CASRLAKABC   KARSLACABC       4                 0 |
                 +---------------------------------------------------+

            Comment


            • #7

              My suggested solution in #6 did match with more than just alterations.

              Below is an improved solution based on that a Damerau–Levenshtein distance (also matching transpositions) of 1 combined with a Levenshtein distance of 2 should match 1 alteration only.

              The program strdist used is from the strutil package https://github.com/wbuchanan/StataStringUtilities/

              Code:
              clear
              input str20 (v1 v2)
              "CARSLAK" "CARSLAK" 
              "CARSLAK" "CARSLAX" 
              "CARSLAK" "CARLSAK" 
              "CARSLAK" "CARSLXX" 
              "PAULA" "PUAAL" 
              "CASRLAKABC" "KARSLACABC" 
              end
              
              strdist v1 v2, levenshtein(LD) damerau(DD) 
              gen byte one_alt = ( LD == 2 & DD == 1 ) if ( length(v1) == length(v2) ) 
              
              list v1 v2 LD DD one_alt , abbrev(20) sepby(one_alt)
              Code:
                   +---------------------------------------------+
                   |         v1           v2   LD   DD   one_alt |
                   |---------------------------------------------|
                1. |    CARSLAK      CARSLAK    0    0         0 |
                2. |    CARSLAK      CARSLAX    1    1         0 |
                   |---------------------------------------------|
                3. |    CARSLAK      CARLSAK    2    1         1 |
                   |---------------------------------------------|
                4. |    CARSLAK      CARSLXX    2    2         0 |
                5. |      PAULA        PUAAL    3    2         0 |
                6. | CASRLAKABC   KARSLACABC    4    3         0 |
                   +---------------------------------------------+


              Comment

              Working...
              X