Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Finding matching string(s) within two different variables

    Hi All,

    I am hoping there is an "easier" solution than a row by row analysis:

    I've got one variable called "group1" and another called "group2". I want to evaluate whether any of the string values in group 1 are found within group2. If so, a third variable "found" is coded as 1, else 0.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str33 group1 str34 group2 byte found
    "restart hold-still closer further" "closer right"                       1
    "closer up look-down right"         "restart hold-still"                 0
    "closer look-down up right"         "closer left"                        1
    "closer look-down up right"         "closer up"                          1
    "closer up look-down left"          "restart hold-still"                 0
    "up closer look-down right"         "right up"                           1
    "closer up look-down left"          "up right"                           1
    "closer up look-down left"          "look-right hold-still further left" 1
    "closer right up left"              "look-right look-up closer"          1
    "closer left right up"              "restart hold-still further"         0
    end
    Thanks in advance!

    Ariel


  • #2
    Hi Ariel. Hmmmmmmmm.... does this approach, approach a solution?

    I don't know how much python you know, but this
    Code:
    import pandas as pd
    
    # Create a DataFrame
    data = {
        'v1': ["Up down", "Here there"],
        'v2': ["down there", "blue car"]
    }
    
    df = pd.DataFrame(data)
    
    # Function to check for common words
    def has_common_word(row):
        words1 = set(row['v1'].split())
        words2 = set(row['v2'].split())
        return bool(words1 & words2)
    
    # Apply the function to the DataFrame
    df['found'] = df.apply(has_common_word, axis=1)
    
    # Print the DataFrame
    print(df)
    gives me
    Code:
               v1          v2  found
    0     Up down  down there   True
    1  Here there    blue car  False
    May be easier doing this in Python, if it's possible, but presumably there's a way in native Stata, too
    Last edited by Jared Greathouse; 05 Oct 2023, 19:04.

    Comment


    • #3
      First, I guess I'm not entirely understanding your sense of a value in group1 being found in group2, since in your example the first value of group1 is "restart hold-still closer further," which is not in its entirety found among the values of group2, yet your found = 1. (My sense of "found" fits more closely for values of group2 found within group1 for your example data, but even that is not perfect.)

      I'm thinking of found as "does the entire string of group1 occur anywhere among all the values of group2." If this sense is what you had in mind, and there's just some inadvertent error in your example, then the following approach might be of use:

      Presuming that the number of observations in the file and the length of the strings in group2 are not so big as to exceed the limits of characters in a macro (about 15e6 for Stata MP), I'd consider the following:

      Code:
      levelsof group2 
      local g2strings = r(levels) 
      gen found = strpos(`"`g2strings'"', group1) > 0
      If this approach is attractive but the local is too big, you could use more than one local to contain the list of strings in group2:
      Code:
      local half1 = trunc(_N/2)
      local half2 = half1 + 1
      levelsof group2 in 1/`half1'
      local list1 = r(levels)
      levels of group2 in `half2'/`=_N'
      local list2  = r(levels)
      gen found = (strpos(`"`list1'"', group1) > 0 ) | (strpos(`"`list2'"', group1) > 0
      Neither of the preceding has been rigorously tested.

      Comment


      • #4
        First off, Thank you Jared for your code. It does what I want it to do, but I'd prefer to keep the process in Stata. Nonetheless, I'll use this code if there is no good option in Stata.

        Second, Mike, you misunderstood my request. The idea is that if ANY of the string values within group1 are also within group2 the variable found will equal 1. So in the first example,"closer" is found in both group1 and group2 and thus found = 1. In the second row, none of the string values in group1 are found in group2, thus found = 0.

        I hope that clarifies the issue?

        Thanks!

        Ariel

        Comment


        • #5
          I like Mike Lacy do not get it.

          So in the first example,"closer" is found in both group1 and group2 and thus found = 1. In the second row, none of the string values in group1 are found in group2, thus found = 0.
          The second observation of group 1 is "closer up look-down right" and closer, up and right are all found in group 2. Can you be more explicit?

          Comment


          • #6
            Hi Andrew!

            Yes, I'll be more explicit!

            The second observation in group1 has 4 "sub-strings" (individual words): closer up look-down right. The second observation in group2 has 2 "substrings": restart hold-still.

            None of those individual words (or substrings or whatever we call them) from row 2 in group1 is found in row 2 of group2 . More specifically, "closer" "up" "look-down" "right" is not found in "restart" "hold-still"

            I am hoping to avoid evaluating this at each observation, but I am guessing it will involve (1) parsing the string value of group1 and group2 into individual strings, and then running a loop over each string from group1 (for each observation) and seeing if it matches a string value in group2...

            Thanks in advance!

            Ariel

            Comment


            • #7
              Code:
              generate `c(obs_t)' i = _n
              
              split group1 , generate(words1)
              split group2 , generate(words2)
              
              reshape long words , i(i)
              
              drop if words == ""
              
              sort i words
              by i words : generate byte found = (_N > 1)
              
              sort i found
              by i (found) : keep if _n == _N
              
              drop i _j words

              Edit:

              I should probably mention that the approach rests on the assumption of no duplicate words within group1 or group2, as implied in the example. Duplicates within groups will be (mis-)classified found.
              Last edited by daniel klein; 06 Oct 2023, 14:24.

              Comment


              • #8
                Originally posted by Ariel Linden View Post
                None of those individual words (or substrings or whatever we call them) from row 2 in group1 is found in row 2 of group2 . More specifically, "closer" "up" "look-down" "right" is not found in "restart" "hold-still"
                Got it. Here's how I'd do it.

                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str33 group1 str34 group2 byte found
                "restart hold-still closer further" "closer right"                       1
                "closer up look-down right"         "restart hold-still"                 0
                "closer look-down up right"         "closer left"                        1
                "closer look-down up right"         "closer up"                          1
                "closer up look-down left"          "restart hold-still"                 0
                "up closer look-down right"         "right up"                           1
                "closer up look-down left"          "up right"                           1
                "closer up look-down left"          "look-right hold-still further left" 1
                "closer right up left"              "look-right look-up closer"          1
                "closer left right up"              "restart hold-still further"         0
                end
                
                gen tocompare= "("+subinstr(trim(itrim(group2)), " ", "|", .)+")"
                gen wanted= ustrregexm(group1, tocompare)
                Res.:

                Code:
                . l, sep(0)
                
                     +--------------------------------------------------------------------------------------------------------------------------------+
                     |                            group1                               group2   found                              tocompare   wanted |
                     |--------------------------------------------------------------------------------------------------------------------------------|
                  1. | restart hold-still closer further                         closer right       1                         (closer|right)        1 |
                  2. |         closer up look-down right                   restart hold-still       0                   (restart|hold-still)        0 |
                  3. |         closer look-down up right                          closer left       1                          (closer|left)        1 |
                  4. |         closer look-down up right                            closer up       1                            (closer|up)        1 |
                  5. |          closer up look-down left                   restart hold-still       0                   (restart|hold-still)        0 |
                  6. |         up closer look-down right                             right up       1                             (right|up)        1 |
                  7. |          closer up look-down left                             up right       1                             (up|right)        1 |
                  8. |          closer up look-down left   look-right hold-still further left       1   (look-right|hold-still|further|left)        1 |
                  9. |              closer right up left            look-right look-up closer       1            (look-right|look-up|closer)        1 |
                 10. |              closer left right up           restart hold-still further       0           (restart|hold-still|further)        0 |
                     +--------------------------------------------------------------------------------------------------------------------------------+

                Comment


                • #9
                  Andrew, you are awesome!!! Thanks! Very succinct!

                  Comment


                  • #10
                    #8 seems elegant. The problem with the regular expressions, as suggested, is that, e.g., (restart) will match "restart", but it will also match "SOMETHINGrestartSOMETHING_ELSE". Perhaps this can be solved with more advanced regular expression syntax; but I am no expert on this.

                    Here is a revised version #7 that deals with possible duplicates within groups:

                    Code:
                    generate `c(obs_t)' i = _n
                    
                    reshape long group , i(i) j(j)
                    
                    split group , generate(words)
                    
                    reshape long words , i(i j)
                    
                    drop if words == ""
                    
                    sort i words j
                    by i words (j) : generate byte found = (_N > 1)
                    by i words j   : replace       found = found*(_N == 1)
                    
                    drop _j words
                    
                    sort i found
                    by i (found) : replace found = found[_N]
                    
                    sort i j
                    by i j : keep if _n == _N
                    
                    reshape wide group , i(i) j(j)
                    
                    drop i

                    Comment


                    • #11
                      OK, I believe this fixes potential problems in #8:

                      Code:
                      generate tocompare = "(\b"+usubinstr(ustrtrim(stritrim(group2)), " ", "\b|\b", .)+"\b)"

                      Comment


                      • #12
                        No need to enclose each word with word boundaries. You can instead have

                        Code:
                        gen tocompare= "\b("+subinstr(trim(itrim(group2)), " ", "|", .)+")\b"
                        Compare:

                        Code:
                        di ustrregexm("bordering", "(one|order|two)")
                        di ustrregexm("bordering", "\b(one|order|two)\b")
                        di ustrregexm("order", "\b(one|order|two)\b")
                        Res.:

                        Code:
                        . di ustrregexm("bordering", "(one|order|two)")
                        1
                        
                        . 
                        . di ustrregexm("bordering", "\b(one|order|two)\b")
                        0
                        
                        . 
                        . di ustrregexm("order", "\b(one|order|two)\b")
                        1

                        Comment


                        • #13
                          Thanks for #12, Andrew. I have and continue to benefit a lot from your knowledge on regular expressions.

                          Comment

                          Working...
                          X