Finding matching string(s) within two different variables

Ariel Linden

Join Date: Apr 2014
Posts: 170

Finding matching string(s) within two different variables

05 Oct 2023, 17:03

Hi All,

I am hoping there is an "easier" solution than a row by row analysis:

I've got one variable called "group1" and another called "group2". I want to evaluate whether any of the string values in group 1 are found within group2. If so, a third variable "found" is coded as 1, else 0.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str33 group1 str34 group2 byte found
"restart hold-still closer further" "closer right"                       1
"closer up look-down right"         "restart hold-still"                 0
"closer look-down up right"         "closer left"                        1
"closer look-down up right"         "closer up"                          1
"closer up look-down left"          "restart hold-still"                 0
"up closer look-down right"         "right up"                           1
"closer up look-down left"          "up right"                           1
"closer up look-down left"          "look-right hold-still further left" 1
"closer right up left"              "look-right look-up closer"          1
"closer left right up"              "restart hold-still further"         0
end

Thanks in advance!

Ariel

Tags: None

Jared Greathouse

Join Date: Sep 2021
Posts: 2172

05 Oct 2023, 18:41

Hi Ariel. Hmmmmmmmm.... does this approach, approach a solution?

I don't know how much python you know, but this

Code:

import pandas as pd

# Create a DataFrame
data = {
    'v1': ["Up down", "Here there"],
    'v2': ["down there", "blue car"]
}

df = pd.DataFrame(data)

# Function to check for common words
def has_common_word(row):
    words1 = set(row['v1'].split())
    words2 = set(row['v2'].split())
    return bool(words1 & words2)

# Apply the function to the DataFrame
df['found'] = df.apply(has_common_word, axis=1)

# Print the DataFrame
print(df)

gives me

Code:

           v1          v2  found
0     Up down  down there   True
1  Here there    blue car  False

May be easier doing this in Python, if it's possible, but presumably there's a way in native Stata, too

Last edited by Jared Greathouse; 05 Oct 2023, 19:04.

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2421
#3

05 Oct 2023, 20:34

First, I guess I'm not entirely understanding your sense of a value in group1 being found in group2, since in your example the first value of group1 is "restart hold-still closer further," which is not in its entirety found among the values of group2, yet your found = 1. (My sense of "found" fits more closely for values of group2 found within group1 for your example data, but even that is not perfect.)

I'm thinking of found as "does the entire string of group1 occur anywhere among all the values of group2." If this sense is what you had in mind, and there's just some inadvertent error in your example, then the following approach might be of use:

Presuming that the number of observations in the file and the length of the strings in group2 are not so big as to exceed the limits of characters in a macro (about 15e6 for Stata MP), I'd consider the following:

Code:

levelsof group2 local g2strings = r(levels) gen found = strpos(`"`g2strings'"', group1) > 0

If this approach is attractive but the local is too big, you could use more than one local to contain the list of strings in group2:

Code:

local half1 = trunc(_N/2) local half2 = half1 + 1 levelsof group2 in 1/`half1' local list1 = r(levels) levels of group2 in `half2'/`=_N' local list2 = r(levels) gen found = (strpos(`"`list1'"', group1) > 0 ) | (strpos(`"`list2'"', group1) > 0

Neither of the preceding has been rigorously tested.
Comment
Ariel Linden

Join Date: Apr 2014

Posts: 170
#4

06 Oct 2023, 11:43

First off, Thank you Jared for your code. It does what I want it to do, but I'd prefer to keep the process in Stata. Nonetheless, I'll use this code if there is no good option in Stata.

Second, Mike, you misunderstood my request. The idea is that if ANY of the string values within group1 are also within group2 the variable found will equal 1. So in the first example,"closer" is found in both group1 and group2 and thus found = 1. In the second row, none of the string values in group1 are found in group2, thus found = 0.

I hope that clarifies the issue?

Thanks!

Ariel
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10273
#5

06 Oct 2023, 12:59

I like Mike Lacy do not get it.

So in the first example,"closer" is found in both group1 and group2 and thus found = 1. In the second row, none of the string values in group1 are found in group2, thus found = 0.

The second observation of group 1 is "closer up look-down right" and closer, up and right are all found in group 2. Can you be more explicit?
Comment
Ariel Linden

Join Date: Apr 2014

Posts: 170
#6

06 Oct 2023, 13:44

Hi Andrew!

Yes, I'll be more explicit!

The second observation in group1 has 4 "sub-strings" (individual words): closer up look-down right. The second observation in group2 has 2 "substrings": restart hold-still.

None of those individual words (or substrings or whatever we call them) from row 2 in group1 is found in row 2 of group2 . More specifically, "closer" "up" "look-down" "right" is not found in "restart" "hold-still"

I am hoping to avoid evaluating this at each observation, but I am guessing it will involve (1) parsing the string value of group1 and group2 into individual strings, and then running a loop over each string from group1 (for each observation) and seeing if it matches a string value in group2...

Thanks in advance!

Ariel
Comment
daniel klein

Join Date: Mar 2014

Posts: 3885
#7

06 Oct 2023, 14:00

Code:

generate `c(obs_t)' i = _n split group1 , generate(words1) split group2 , generate(words2) reshape long words , i(i) drop if words == "" sort i words by i words : generate byte found = (_N > 1) sort i found by i (found) : keep if _n == _N drop i _j words

Edit:

I should probably mention that the approach rests on the assumption of no duplicate words within group1 or group2, as implied in the example. Duplicates within groups will be (mis-)classified found.

Last edited by daniel klein; 06 Oct 2023, 14:24.
1 like
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10273

06 Oct 2023, 14:27

Originally posted by Ariel Linden View Post

None of those individual words (or substrings or whatever we call them) from row 2 in group1 is found in row 2 of group2 . More specifically, "closer" "up" "look-down" "right" is not found in "restart" "hold-still"

Got it. Here's how I'd do it.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str33 group1 str34 group2 byte found
"restart hold-still closer further" "closer right"                       1
"closer up look-down right"         "restart hold-still"                 0
"closer look-down up right"         "closer left"                        1
"closer look-down up right"         "closer up"                          1
"closer up look-down left"          "restart hold-still"                 0
"up closer look-down right"         "right up"                           1
"closer up look-down left"          "up right"                           1
"closer up look-down left"          "look-right hold-still further left" 1
"closer right up left"              "look-right look-up closer"          1
"closer left right up"              "restart hold-still further"         0
end

gen tocompare= "("+subinstr(trim(itrim(group2)), " ", "|", .)+")"
gen wanted= ustrregexm(group1, tocompare)

Res.:

Code:

. l, sep(0)

     +--------------------------------------------------------------------------------------------------------------------------------+
     |                            group1                               group2   found                              tocompare   wanted |
     |--------------------------------------------------------------------------------------------------------------------------------|
  1. | restart hold-still closer further                         closer right       1                         (closer|right)        1 |
  2. |         closer up look-down right                   restart hold-still       0                   (restart|hold-still)        0 |
  3. |         closer look-down up right                          closer left       1                          (closer|left)        1 |
  4. |         closer look-down up right                            closer up       1                            (closer|up)        1 |
  5. |          closer up look-down left                   restart hold-still       0                   (restart|hold-still)        0 |
  6. |         up closer look-down right                             right up       1                             (right|up)        1 |
  7. |          closer up look-down left                             up right       1                             (up|right)        1 |
  8. |          closer up look-down left   look-right hold-still further left       1   (look-right|hold-still|further|left)        1 |
  9. |              closer right up left            look-right look-up closer       1            (look-right|look-up|closer)        1 |
 10. |              closer left right up           restart hold-still further       0           (restart|hold-still|further)        0 |
     +--------------------------------------------------------------------------------------------------------------------------------+

Comment

Ariel Linden

Join Date: Apr 2014

Posts: 170
#9

06 Oct 2023, 15:08

Andrew, you are awesome!!! Thanks! Very succinct!
Comment
daniel klein

Join Date: Mar 2014

Posts: 3885
#10

06 Oct 2023, 15:14

#8 seems elegant. The problem with the regular expressions, as suggested, is that, e.g., (restart) will match "restart", but it will also match "SOMETHINGrestartSOMETHING_ELSE". Perhaps this can be solved with more advanced regular expression syntax; but I am no expert on this.

Here is a revised version #7 that deals with possible duplicates within groups:

Code:

generate `c(obs_t)' i = _n reshape long group , i(i) j(j) split group , generate(words) reshape long words , i(i j) drop if words == "" sort i words j by i words (j) : generate byte found = (_N > 1) by i words j : replace found = found*(_N == 1) drop _j words sort i found by i (found) : replace found = found[_N] sort i j by i j : keep if _n == _N reshape wide group , i(i) j(j) drop i
Comment
daniel klein

Join Date: Mar 2014

Posts: 3885
#11

06 Oct 2023, 15:20

OK, I believe this fixes potential problems in #8:

Code:

generate tocompare = "(\b"+usubinstr(ustrtrim(stritrim(group2)), " ", "\b|\b", .)+"\b)"
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10273

#12

06 Oct 2023, 16:17

No need to enclose each word with word boundaries. You can instead have

Code:

gen tocompare= "\b("+subinstr(trim(itrim(group2)), " ", "|", .)+")\b"

Compare:

Code:

di ustrregexm("bordering", "(one|order|two)")
di ustrregexm("bordering", "\b(one|order|two)\b")
di ustrregexm("order", "\b(one|order|two)\b")

Res.:

Code:

. di ustrregexm("bordering", "(one|order|two)")
1

. 
. di ustrregexm("bordering", "\b(one|order|two)\b")
0

. 
. di ustrregexm("order", "\b(one|order|two)\b")
1

Comment

daniel klein

Join Date: Mar 2014

Posts: 3885
#13

08 Oct 2023, 04:22

Thanks for #12, Andrew. I have and continue to benefit a lot from your knowledge on regular expressions.
1 like
Comment

Announcement

Finding matching string(s) within two different variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment