Find a specific word in a string

Megan Forecki

Join Date: Nov 2021

Posts: 7
#1

30 Sep 2022, 12:44

Hi all,

I am trying to find a specific word in a string. This seems like it should be simple, but after looking through the documentation and prior forum messages on strpos, substr, and regex, I haven't been able to find something that works for the data I am using. Example below. I am trying to create a new variable "apple" that only flags observations where the variable fruithashtags contains the word "APPLE" (and, as such, excludes "REAPPLE").

strpos doesn't seem to work because it will include REAPPLE. The regex commands are tricky for me, and it seems like most of the indicators (e.g. ^, ., $) require the word to be in a certain spot in the string(?) - I feel like I am misunderstanding the regex documentation, so feel free to correct me there. My issue is that the word could show up anywhere in the string, and APPLE and REAPPLE could show up in the same string. I'm wondering if there is maybe a solution that says: search for a word that starts with "AP", or search for the characters "APPLE" and exclude the match if the word has more than 5 characters? Any help is so appreciated.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 fruithashtags
"GRAPE FRUIT REAPPLE"
"GRAPE FRUIT REAPPLE REBANANA"
"FRUIT REAPPLE REBANANA REKIWI"
"APPLE CANTALOUPE MELON KIWI"
"APPLE BANANA"
"KIWI APPLE GRAPE REAPPLE"
"KIWI FRUIT REAPPLE REBANANA APPLE"
"CANTALOUPE MELON APPLE BANANA"
"REAPPLE"
"APPLE"
end
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#2

30 Sep 2022, 12:59

My guess is that there is a simpler solution using regular expressions, but as I am a complete nitwit when it comes to those, here's one that does what you want using simple string functions:

Code:

gen byte foundit = strpos(fruithashtags, "APPLE") == 1 ///
    | strpos(fruithashtags, " APPLE ") > 0 ///
    | strpos(strreverse(fruithashtags), strreverse(" APPLE")) == 1

That said, this relies heavily on there being no extra blanks padding the beginning or end of the strings in fruithashtags. If you are not sure that constraint holds in your data, -replace fruithashtags = strtrim(fruithashtags)- will remove any such blanks.
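A minimal sketch of that clean-up step followed by the same three-part test. The toy data and the variable name foundit_clean are hypothetical and not from the original post. stritrim() is added here as an assumption beyond the suggestion above: it collapses runs of internal blanks, which would otherwise defeat the " APPLE " search.

```stata
* Hypothetical toy data with stray blanks (not from the original post)
clear
input str40 fruithashtags
"  APPLE BANANA"
"KIWI   APPLE"
"GRAPE REAPPLE  "
end

* Strip leading/trailing blanks, then collapse repeated internal blanks to one
replace fruithashtags = stritrim(strtrim(fruithashtags))

* Same three conditions as above: word at the start, in the middle, or at the end
gen byte foundit_clean = strpos(fruithashtags, "APPLE") == 1 ///
    | strpos(fruithashtags, " APPLE ") > 0 ///
    | strpos(strreverse(fruithashtags), strreverse(" APPLE")) == 1

list
```

Note that the first condition also matches any string that merely begins with the letters APPLE (a hypothetical "APPLESAUCE", say). That case does not arise in the example data.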
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

30 Sep 2022, 13:01

Code:

generate apple = ustrregexm(fruithashtags,"\bAPPLE\b")

Code:

                           fruithashtags   apple
  1.                 GRAPE FRUIT REAPPLE       0
  2.        GRAPE FRUIT REAPPLE REBANANA       0
  3.       FRUIT REAPPLE REBANANA REKIWI       0
  4.         APPLE CANTALOUPE MELON KIWI       1
  5.                        APPLE BANANA       1
  6.            KIWI APPLE GRAPE REAPPLE       1
  7.   KIWI FRUIT REAPPLE REBANANA APPLE       1
  8.       CANTALOUPE MELON APPLE BANANA       1
  9.                             REAPPLE       0
 10.                               APPLE       1

I have assumed that all your observations have no lower-case letters in the fruithashtags variable (e.g. "Apple" instead of "APPLE").
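If that assumption might not hold, one possible variant (a sketch, not part of the solution above) uses the optional third argument of ustrregexm(), which requests case-insensitive matching when nonzero. The toy data and the variable name apple_ci are hypothetical.

```stata
* Hypothetical mixed-case data (not from the original post)
clear
input str20 fruithashtags
"Apple banana"
"KIWI apple"
"grape reapple"
"APPLE"
end

* Third argument nonzero => match without regard to case
generate byte apple_ci = ustrregexm(fruithashtags, "\bAPPLE\b", 1)

list
```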

Using the Unicode regular expression function ustrregexm allows us to take advantage of the regular expression metacharacter "\b", which matches a word boundary: the position between a "word" character and a non-word character (space, punctuation, etc.), or the start or end of the string.

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his GitHub blog.

Last edited by William Lisowski; 30 Sep 2022, 13:04.
Megan Forecki

Join Date: Nov 2021

Posts: 7
#4

30 Sep 2022, 14:17

Thank you both SO MUCH! I think either of these solutions would work for my issue.
Nick Cox

Join Date: Mar 2014

Posts: 35651
#5

01 Oct 2022, 04:47

As mentioned somewhere on Statalist previously, a Tip on this topic is in press for Stata Journal 22(4), but that won't be visible for 3 months. In addition to searching for

Code:

" APPLE "

within

Code:

" " + fruithashtags + " "

that Tip covers this method.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 fruithashtags
"GRAPE FRUIT REAPPLE"              
"GRAPE FRUIT REAPPLE REBANANA"     
"FRUIT REAPPLE REBANANA REKIWI"    
"APPLE CANTALOUPE MELON KIWI"      
"APPLE BANANA"                     
"KIWI APPLE GRAPE REAPPLE"         
"KIWI FRUIT REAPPLE REBANANA APPLE"
"CANTALOUPE MELON APPLE BANANA"    
"REAPPLE"                          
"APPLE"                            
end

gen found = strlen(fruithashtags) > strlen(subinword(fruithashtags, "APPLE", "", .)) 

l 

     +-------------------------------------------+
     |                     fruithashtags   found |
     |-------------------------------------------|
  1. |               GRAPE FRUIT REAPPLE       0 |
  2. |      GRAPE FRUIT REAPPLE REBANANA       0 |
  3. |     FRUIT REAPPLE REBANANA REKIWI       0 |
  4. |       APPLE CANTALOUPE MELON KIWI       1 |
  5. |                      APPLE BANANA       1 |
     |-------------------------------------------|
  6. |          KIWI APPLE GRAPE REAPPLE       1 |
  7. | KIWI FRUIT REAPPLE REBANANA APPLE       1 |
  8. |     CANTALOUPE MELON APPLE BANANA       1 |
  9. |                           REAPPLE       0 |
 10. |                             APPLE       1 |
     +-------------------------------------------+

So, we get Stata to tell us whether the length of the string variable is greater than the length that the string variable would be if we replaced

Code:

 "APPLE"

by an empty string, conditional on that string occurring as a word (which is the nub of the problem). If it is greater, we found the word in question.

Logically, we just need to see what would happen if we replaced with anything that is a shorter string, but an empty string works fine. Note that we don't in fact replace the variable or create a new string variable -- although do that if you want to.
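For comparison, the padded-search alternative mentioned at the top of this post can be sketched as follows, on a subset of the example data. The variable name found_pad is hypothetical. The assert checks that both approaches agree on these observations, which they should whenever words are separated by single spaces.

```stata
clear
input str33 fruithashtags
"GRAPE FRUIT REAPPLE"
"APPLE BANANA"
"KIWI APPLE GRAPE REAPPLE"
"KIWI FRUIT REAPPLE REBANANA APPLE"
"REAPPLE"
"APPLE"
end

* Length comparison: the string shrinks only if APPLE occurs as a word
gen byte found = strlen(fruithashtags) > strlen(subinword(fruithashtags, "APPLE", "", .))

* Pad both ends with a blank so every word, first and last included,
* is delimited by blanks, then search for the blank-delimited word
gen byte found_pad = strpos(" " + fruithashtags + " ", " APPLE ") > 0

* Stops with an error if the two indicators ever disagree
assert found == found_pad

list
```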

Nick Cox

Join Date: Mar 2014

Posts: 35651
#6

31 Jan 2023, 15:04

https://journals.sagepub.com/doi/pdf...6867X221141068 is the publication predicted in #5.
Jake Scollan-Rowley

Join Date: Feb 2024

Posts: 1
#7

15 Apr 2024, 02:51

This thread is so helpful! Thanks to all! I went with the code you put forth, Nick, and it works great! Thank you!