  • Compare strings with different ordering of words

    Dear statalists,

    I am not sure whether this has been asked already, as it is not easy to search for.

    I was wondering if it is possible to check whether two strings contain the same words irrespective of their ordering. My data looks like this:
    Name 1       Name 2
    David Joe    Joe David
    ...          ...
    Any ideas? Thanks!


    Felix

  • #2
    Dear all,

    One function to use is indexnot(), for example
    gen tag = indexnot(Name1,Name2)
    and then work with tag as needed.

    Best,
    Felix
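
    As a hedged illustration of what indexnot() reports: it returns the position of the first character of its first argument that does not occur anywhere in its second argument, or 0 if there is none. The comparison is therefore made character by character rather than word by word, so a result of 0 does not by itself show that two names contain the same words. The strings below are made up for the demonstration.

    ```stata
    * Same words in a different order: every character is found, so the result is 0.
    display indexnot("David Joe", "Joe David")
    * Different words, but every character of "dive" occurs in "David Joe": also 0.
    display indexnot("dive", "David Joe")
    * The final "s" (position 9) does not occur in "Joe David": result is 9.
    display indexnot("Joe Davis", "Joe David")
    ```

    A tag built this way is best treated as a quick screen for obviously different strings, not as a test of word equality.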



    • #3
      Here is some sample code that may start you in a useful direction.
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str20 name1 str20 name2
      "David Joe" "Joe David"  
      "dog cat" "cat pig dog"
      "dog dog" "cat cat"
      end
      
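      * Split each name into one variable per word, prefixed by source (A or B).
      split name1, generate(A)
      split name2, generate(B)
      generate id = _n
      rename (A* B*) (word=)
      list, clean noobs
      
      * One row per word; wid is reduced to the source letter (A or B).
      reshape long word, i(id) j(wid) string
      drop if missing(word)
      replace wid = substr(wid,1,1)
      list, clean noobs
      
      * Count each word's occurrences per source, then put the counts side by side.
      by id word wid, sort: generate count=_N
      duplicates drop id wid word, force
      reshape wide count, i(id word) j(wid) string
      list, clean noobs
      
      * same is 1 only if every word has equal counts in both names.
      by id, sort: egen same = min(countA==countB)
      drop word countA countB
      by id, sort: keep if _n==1
      list, clean noobs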
      Code:
      . list, clean noobs
      
          id       name1         name2   same  
           1   David Joe     Joe David      1  
           2     dog cat   cat pig dog      0  
           3     dog dog       cat cat      0
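
      For small datasets, a more compact cross-check is possible. The sketch below is not the method used above: it swaps in Stata's macro list sort to put each name's words in alphabetical order and then compares the sorted strings. Like the reshape approach, it treats repeated words as significant. The variable names sorted1 and sorted2 are illustrative, and the loop assumes the names contain no quotation marks or macro characters. It runs one observation at a time, so it may be slow on large data.

      ```stata
      clear
      input str20 name1 str20 name2
      "David Joe" "Joe David"
      "dog cat" "cat pig dog"
      "dog dog" "cat cat"
      end

      gen str sorted1 = ""
      gen str sorted2 = ""
      forvalues i = 1/`=_N' {
          * Copy each name into a local macro and sort its words alphabetically.
          local w1 = name1[`i']
          local w2 = name2[`i']
          local w1 : list sort w1
          local w2 : list sort w2
          replace sorted1 = "`w1'" in `i'
          replace sorted2 = "`w2'" in `i'
      }
      * Two names hold the same words (with the same repeats) if the sorted strings match.
      gen byte same = sorted1 == sorted2
      list name1 name2 same, clean noobs
      ```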



      • #4
        Here's another approach that I believe works. The basis is: If each member of the set of words occurring in *either* name can be found in *each* name, the two names must have contained the same words to start with. Note that, according to this approach, "dog" is the same as "dog dog."
        Code:
        clear
        // Combine the example data I had already made with William Lisowski's.
        input str30 (name1  name2)
        "Bill Fred"  "Fred Bill"
        "Alice Betty Cheryl" "Alice"
        "Danielle" "Alice Betty Cheryl"
        "Eve Fred George" "George Eve Fred"
        "Helen Helen" "Helen"
        "David Joe" "Joe David"  
        "dog cat" "cat pig dog"
        "dog dog" "cat cat"
        "dog" "dog dog"
        end
        //
        gen str WordsEither = name1 + " " + name2
        // Need maximum count of words to control loop below.
        gen int wordcount = wordcount(WordsEither)
        summ wordcount
        local maxwords = r(max)
        //
        gen byte FailedToFind = 0
        gen str next = ""
        forval i = 1/`maxwords' {
           replace next = word(WordsEither, `i')
           // Skip rows that have run out of words, where next is empty.
           replace FailedToFind = 1 if ///
              (FailedToFind == 0) & (next != "") & ///
              ((strpos(name1,next) ==0) | (strpos(name2,next) == 0))
        }
        tab FailedToFind
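
        One caveat about the test above: strpos() looks for substrings, so a word such as "cat" would be found inside "catalog". A minimal sketch of a stricter variant follows, assuming words are separated by single spaces. It pads both the name and the word with spaces so that only whole words match. The variables nwords, padded1, and padded2 and the two example rows are illustrative, not part of the code above.

        ```stata
        clear
        input str30 (name1 name2)
        "cat dog" "dog catalog"
        "cat dog" "dog cat"
        end

        gen str WordsEither = name1 + " " + name2
        gen int nwords = wordcount(WordsEither)
        summarize nwords, meanonly
        local maxwords = r(max)

        * Pad with spaces so that " cat " cannot match inside " catalog ".
        gen str padded1 = " " + name1 + " "
        gen str padded2 = " " + name2 + " "

        gen byte FailedToFind = 0
        gen str next = ""
        forvalues i = 1/`maxwords' {
            replace next = word(WordsEither, `i')
            * Flag the row if a word from either name is missing from one of them.
            replace FailedToFind = 1 if next != "" & ///
                (strpos(padded1, " " + next + " ") == 0 | ///
                 strpos(padded2, " " + next + " ") == 0)
        }
        list name1 name2 FailedToFind, clean noobs
        ```

        With these rows, the first pair should be flagged because "cat" and "catalog" are different words, while the second pair should pass.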
