Compare strings with different ordering of words

Felix Stips

Join Date: Nov 2014

Posts: 110
#1

14 Feb 2019, 02:18

Dear statalists,

I am not sure if this has been asked already as this is not easy to search for.

I was wondering if it is possible to check whether two strings contain the same words irrespective of their ordering. My data look like this:

Name 1       Name 2
David Joe    Joe David
...          ...

Any ideas? Thanks!

Felix
Felix Stips

Join Date: Nov 2014

Posts: 110
#2

14 Feb 2019, 05:15

Dear all,

One of the functions to use is indexnot(), for example

gen tag = indexnot(Name1,Name2)

and then do whatever you need with the result.
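A caveat worth checking before relying on this: as far as I read the documentation, indexnot(s1,s2) works character by character. It returns the position of the first character of s1 that does not occur anywhere in s2, and 0 if every character of s1 occurs in s2. A minimal sketch with made-up strings (not from my data) shows that a result of 0 does not by itself mean the words are the same:

```stata
* indexnot() compares characters, not words
display indexnot("David Joe", "Joe David")   // 0: same words, reordered
display indexnot("dog", "god")               // 0 as well, although the words differ
display indexnot("dog cat", "cat pig")       // 1: "d" does not occur in "cat pig"
```

So the function can serve as a first screen, but a word-level comparison is needed to be sure.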

Best,
Felix
William Lisowski

Join Date: Dec 2014
Posts: 10150

14 Feb 2019, 08:12

Here is some sample code that may start you in a useful direction.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str20 name1 str20 name2
"David Joe" "Joe David"  
"dog cat" "cat pig dog"
"dog dog" "cat cat"
end

split name1, generate(A)
split name2, generate(B)
generate id = _n
rename (A* B*) (word=)
list, clean noobs

reshape long word, i(id) j(wid) string
drop if missing(word)
replace wid = substr(wid,1,1)
list, clean noobs

by id word wid, sort: generate count=_N
duplicates drop id wid word, force
reshape wide count, i(id word) j(wid) string
list, clean noobs

by id, sort: egen same = min(countA==countB)
drop word countA countB
by id, sort: keep if _n==1
list, clean noobs

Code:

. list, clean noobs

    id       name1         name2   same  
     1   David Joe     Joe David      1  
     2     dog cat   cat pig dog      0  
     3     dog dog       cat cat      0
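A more compact variant of the same comparison, offered only as a sketch: sort the words within each string and test the two sorted strings for equality. The Mata function name sortwords() and the variables sorted1 and sorted2 are hypothetical names introduced for this illustration. It assumes the names contain no double quotes, since tokens() treats quotes specially.

```stata
clear
input str20 name1 str20 name2
"David Joe" "Joe David"
"dog cat" "cat pig dog"
"dog dog" "cat cat"
end

mata:
// Hypothetical helper: return the words of s in sorted order, single-spaced
string scalar sortwords(string scalar s)
{
    string rowvector w
    w = tokens(s)
    if (cols(w) == 0) return("")
    return(invtokens(sort(w', 1)'))
}
end

* Start from copies so the new variables are already wide enough
generate sorted1 = name1
generate sorted2 = name2
forvalues i = 1/`=_N' {
    mata: st_sstore(`i', "sorted1", sortwords(st_sdata(`i', "name1")))
    mata: st_sstore(`i', "sorted2", sortwords(st_sdata(`i', "name2")))
}
generate byte same = sorted1 == sorted2
list name1 name2 same, clean noobs
```

Like the reshape approach, this counts repeated words, so "dog" and "dog dog" would be treated as different.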

Mike Lacy

Join Date: Apr 2014
Posts: 2423

14 Feb 2019, 09:16

Here's another approach that I believe works. The basis is: If each member of the set of words occurring in *either* name can be found in *each* name, the two names must have contained the same words to start with. Note that, according to this approach, "dog" is the same as "dog dog."

Code:

clear
// Combine the example data I had already made with William Lisowski's.
input str30 (name1  name2)
"Bill Fred"  "Fred Bill"
"Alice Betty Cheryl" "Alice"
"Danielle" "Alice Betty Cheryl"
"Eve Fred George" "George Eve Fred"
"Helen Helen" "Helen"
"David Joe" "Joe David"  
"dog cat" "cat pig dog"
"dog dog" "cat cat"
"dog" "dog dog"
end
//
gen str WordsEither = name1 + " " + name2
// Need maximum count of words to control loop below.
gen int wordcount = wordcount(WordsEither)
summ wordcount
local maxwords = r(max)
//
gen byte FailedToFind = 0
gen str next = ""
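forval i = 1/`maxwords' {
   // Take the next word; it is empty once a row has run out of words.
   replace next = word(WordsEither, `i')
   // Pad with spaces so that only whole words match (cat must not match catalog),
   // and skip rows whose next word is empty.
   replace FailedToFind = 1 if ///
      (FailedToFind == 0) & (next != "") & ///
      ((strpos(" " + name1 + " ", " " + next + " ") == 0) | ///
       (strpos(" " + name2 + " ", " " + next + " ") == 0))
}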
tab FailedToFind
