Identifying pairs of strings that differ by only two consecutive, inverted characters

David Carslake

Join Date: Jun 2014

Posts: 3
#1

Identifying pairs of strings that differ by only two consecutive, inverted characters

04 Dec 2018, 05:47

I want to identify when two putatively matched surnames (stored as two string variables) differ only by the switching of two consecutive letters. For example, to flag someone who has "CARSLAKE" in String1 and "CARLSAKE" in String2. I don't want to accept other pairs with a Levenshtein distance of 2 as they look more like distinct names (not typos).

I can imagine something looping through each letter in turn using substr(), but this would be very long-winded and clunky since the surnames of course vary in length between pairs (within pairs, I'm only interested if they're the same length). Does anyone know of a more sensible solution? Thanks.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

04 Dec 2018, 09:00

I think this will do it:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str8(var1 var2) "carslake" "carlsake" "smith" "simth" "jones" "jones" "label" "labrl" end capture program drop one_obs program define one_obs gen byte pair_reversal = 0 strdist var1 var2, levenshtein(ldist) if ldist == 2 { charlist var1 local c1 `r(chars)' charlist var2 local c2 `r(chars)' replace pair_reversal = (`"`c1'"' == `"`c2'"') } drop ldist exit end runby one_obs, by(var1 var2)

Note: This requires installation (if you do not already have them) of Michael Barker's -strdist-, and Robert Picard and my -runby-, both available from SSC.

Added: -charlist- is also from SSC, written by Nick Cox.

Also, this may not be exactly what you want. This will pick up situations where var1 and var2 differ by reversal of two letters that are not necessarily adjacent. (E.g. this will call a pair_reversal on "program" vs "pragrom")

Last edited by Clyde Schechter; 04 Dec 2018, 09:04.
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2416

04 Dec 2018, 09:12

(crossed in the ether with Clyde's posting)

David, if I'm correctly understanding and implementing what you ask for, I don't think a brute-force loop is a problem.

The code below took my machine less than 2 sec. to do 100,000 observations. My example data uses names of the same length, to make sure to randomly produce enough reversal cases to see that the code works. I suppose the code might have some errors, but at least in concept, I don't see this as too clunky. To implement this for varying lengths of names, the "20" could just be the maximum name length, since (e.g.) -substr()- does not choke when pointed at a substring that goes beyond the length of the candidate string.

Code:

// simulate data
clear
set seed 8476
set obs 100000
gen name1 = ""
gen name2 = ""
forval j = 1/2 {
    forval i = 1/20 {     // 20 character names.
       quiet replace name`j' = name`j' + char(98 + trunc(runiform() * 25))
    }
}
// The canonical example
replace name1 = "Carslake" in 1
replace name2 = "Carlsake" in 1
// end simulate data
//
// Compare successive pairs in each name
gen str2 pair1 = ""
gen str2 pair2 = ""
gen rpairs = ""  // hold a list of the reversed pairs of letters to check the results
gen byte found = 0
gen L1 = strlen(name1)
gen L2 = strlen(name2)
forval i = 1/19 {
   quiet {
      replace pair1 = substr(name1, `i', 2)
      replace pair2 = substr(name2, `i', 2)
      replace found = (L1 == L2) & ///
                      (substr(pair1,1,1) == substr(pair2,2,1)) & ///
                      (substr(pair1,2,1) == substr(pair2,1,1))
      replace rpairs = rpairs + pair1 + pair2 + " " if found
   }  
}
br if !missing(rpairs)

Last edited by Mike Lacy; 04 Dec 2018, 09:16.

Comment

David Carslake

Join Date: Jun 2014

Posts: 3
#4

04 Dec 2018, 09:23

Just to update - here's my solution with a loop and substr(), which seems to work and isn't as long as I'd feared. It also tags identical names which include a double letter, which I don't mind; these could easily be excluded. I'd still be interested to know if anyone knows of a cleaner way of doing it!

generate Length1 = length(Name1)
generate Length2 = length(Name2)
summarize Length1 if Length1==Length2
local max_start = r(max)-1
forvalues L = 1/`max_start'{
replace Flag = 1 if Length1==Length2 & Length1>=5 & substr(Name1,1,`L'-1)==substr(Name2,1,`L'-1) & substr(Name1,`L',2)==strreverse(substr(Name2,`L',2 )) & substr(Name1,`L'+2,.)==substr(Name2,`L'+2,.)
}
1 like
Comment
David Carslake

Join Date: Jun 2014

Posts: 3
#5

04 Dec 2018, 09:25

Thank you both - your solutions arrived while I was writing my update.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

04 Dec 2018, 09:44

Code:

* differ only by the switching of two consecutive letters:
    
clear
input str20 (v1 v2)
"CARSLAK" "CARLSAK" 
"CARSLAKABC" "KARSLACABC" 
"CASRLAKABC" "KARSLACABC" 
end

strdist v1 v2, levenshtein(ldist)

local maxlen 20

forvalues c = 1/`maxlen' {

    local cc = cond(`c' < `maxlen' , " + " , "" ) 
    local cmd = `"`cmd'"' + `"string(substr(v1,`c',1)==substr(v2,`c',1),"%1.0f") `cc' "'
} 

gen test = substr(`cmd',1,length(v1)) if length(v1)==length(v2) & ldist==2
gen byte OK = strpos(test,"00") > 0 

list

Code:

     +---------------------------------------------------+
     |         v1           v2   ldist         test   OK |
     |---------------------------------------------------|
  1. |    CARSLAK      CARLSAK       2      1110011    1 |
  2. | CARSLAKABC   KARSLACABC       2   0111110111    0 |
  3. | CASRLAKABC   KARSLACABC       4                 0 |
     +---------------------------------------------------+

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

05 Dec 2018, 06:40

My suggested solution in #6 did match with more than just alterations.

Below is an improved solution based on that a Damerau–Levenshtein distance (also matching transpositions) of 1 combined with a Levenshtein distance of 2 should match 1 alteration only.

The program strdist used is from the strutil package https://github.com/wbuchanan/StataStringUtilities/

Code:

clear
input str20 (v1 v2)
"CARSLAK" "CARSLAK" 
"CARSLAK" "CARSLAX" 
"CARSLAK" "CARLSAK" 
"CARSLAK" "CARSLXX" 
"PAULA" "PUAAL" 
"CASRLAKABC" "KARSLACABC" 
end

strdist v1 v2, levenshtein(LD) damerau(DD) 
gen byte one_alt = ( LD == 2 & DD == 1 ) if ( length(v1) == length(v2) ) 

list v1 v2 LD DD one_alt , abbrev(20) sepby(one_alt)

Code:

     +---------------------------------------------+
     |         v1           v2   LD   DD   one_alt |
     |---------------------------------------------|
  1. |    CARSLAK      CARSLAK    0    0         0 |
  2. |    CARSLAK      CARSLAX    1    1         0 |
     |---------------------------------------------|
  3. |    CARSLAK      CARLSAK    2    1         1 |
     |---------------------------------------------|
  4. |    CARSLAK      CARSLXX    2    2         0 |
  5. |      PAULA        PUAAL    3    2         0 |
  6. | CASRLAKABC   KARSLACABC    4    3         0 |
     +---------------------------------------------+

Announcement