
  • Find a specific word in a string

    Hi all,

    I am trying to find a specific word in a string. This seems like it should be simple, but looking through all the documentation and prior forum messages on strpos, substr, and regex, I haven't been able to find something that will work for the data I am using. Example below. I am trying to create a new variable "apple" that flags only the observations in which fruithashtags contains the word "APPLE" (and, as such, excludes "REAPPLE").

    strpos doesn't seem to work because it will include REAPPLE. The regex commands are tricky for me and it seems like most of the indicators (e.g. ^, ., $) require the word to be in a certain spot in the string(?) - I feel like I am misunderstanding the regex documentation so feel free to correct me there. My issue is that the word could show up at any time in the string, and APPLE and REAPPLE could show up in the same string. I'm wondering if there is maybe a solution that says, search for a word that starts with "AP" or search for these characters "APPLE" and exclude if more than 5 characters? Any help is so appreciated.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str33 fruithashtags
    "GRAPE FRUIT REAPPLE"              
    "GRAPE FRUIT REAPPLE REBANANA"     
    "FRUIT REAPPLE REBANANA REKIWI"    
    "APPLE CANTALOUPE MELON KIWI"      
    "APPLE BANANA"                     
    "KIWI APPLE GRAPE REAPPLE"         
    "KIWI FRUIT REAPPLE REBANANA APPLE"
    "CANTALOUPE MELON APPLE BANANA"    
    "REAPPLE"                          
    "APPLE"                            
    end

  • #2
    My guess is that there is a simpler solution using regular expressions, but as I am a complete nitwit when it comes to those, here's one that does what you want using simple string functions:
    Code:
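    * APPLE as the whole string, as the first word, as a middle word, or as the last word
    gen byte foundit = fruithashtags == "APPLE" ///
                        | strpos(fruithashtags, "APPLE ") == 1 ///
                        | strpos(fruithashtags, " APPLE ") > 0 ///
                        | strpos(strreverse(fruithashtags), strreverse(" APPLE")) == 1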
    That said, this relies heavily on there being no extra blanks padding the beginning or end of the strings in fruithashtags. If you are not sure about that constraint holding in your data, -replace fruithashtags = strtrim(fruithashtags)- will accomplish that.
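The clean-up step mentioned above can be sketched as follows. This is only an illustration on a toy variable (the name tags and its values are made up for the example), and it assumes you also want repeated internal blanks collapsed, which stritrim() does in addition to what strtrim() does at the ends.

```stata
* Toy data with padding at the ends and a run of blanks in the middle
clear
input str20 tags
"  APPLE BANANA "
"KIWI   APPLE"
end

* strtrim() drops leading and trailing blanks;
* stritrim() collapses consecutive internal blanks into one
replace tags = stritrim(strtrim(tags))
list
```

After this, searches for " APPLE " and comparisons at the start or end of the string are no longer thrown off by stray blanks.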



    • #3
      Code:
      generate apple = ustrregexm(fruithashtags,"\bAPPLE\b")
      Code:
                                 fruithashtags   apple  
        1.                 GRAPE FRUIT REAPPLE       0  
        2.        GRAPE FRUIT REAPPLE REBANANA       0  
        3.       FRUIT REAPPLE REBANANA REKIWI       0  
        4.         APPLE CANTALOUPE MELON KIWI       1  
        5.                        APPLE BANANA       1  
        6.            KIWI APPLE GRAPE REAPPLE       1  
        7.   KIWI FRUIT REAPPLE REBANANA APPLE       1  
        8.       CANTALOUPE MELON APPLE BANANA       1  
        9.                             REAPPLE       0  
       10.                               APPLE       1
      I have assumed that all your observations have no lower-case letters in the fruithashtags variable (e.g. "Apple" instead of "APPLE").
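If that assumption might not hold, one possible adjustment is the optional third argument of ustrregexm(): a nonzero value requests a case-insensitive match. The sketch below uses a made-up toy variable (tags) purely for illustration.

```stata
* Toy data with mixed case
clear
input str20 tags
"Apple BANANA"
"reapple KIWI"
"kiwi apple"
end

* Third argument nonzero = ignore case when matching
generate byte apple_ci = ustrregexm(tags, "\bAPPLE\b", 1)
list
```

Here the first and third observations are flagged and "reapple KIWI" is not.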

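      Using the Unicode regular expression function ustrregexm allows us to take advantage of the regular expression metacharacter "\b", which matches a word boundary: the zero-width position between a word character and a non-word character (space, punctuation, etc.), or the start or end of the string. Wrapping "APPLE" in "\b" on both sides therefore matches it only when it stands alone as a "word", wherever it appears in the string.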
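To see what counts as a boundary, here is a small sketch on invented values (the variable tags and its contents are not from the original data). Note the last case: a hyphen is also a non-word character, so "APPLE-PIE" is treated as containing the word APPLE, which may or may not be what you want.

```stata
* Toy data: punctuation next to the word, the word embedded, and a hyphenated form
clear
input str20 tags
"#APPLE, KIWI"
"PINEAPPLE"
"APPLE-PIE"
end

generate byte apple_wb = ustrregexm(tags, "\bAPPLE\b")
list
```

The first and third observations are flagged; "PINEAPPLE" is not, because there is no boundary between "E" and "A".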

      The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

      A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his GitHub blog.
      Last edited by William Lisowski; 30 Sep 2022, 13:04.



      • #4
        Thank you both SO MUCH! I think either of these solutions would work for my issue.



        • #5
          As mentioned somewhere on Statalist previously, a Tip on this topic is in press for Stata Journal 22(4), but that won't be visible for 3 months. In addition to searching for

          Code:
          " APPLE "
          within
          Code:
          " " + fruithashtags + " "
          that Tip covers this method.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str33 fruithashtags
          "GRAPE FRUIT REAPPLE"              
          "GRAPE FRUIT REAPPLE REBANANA"     
          "FRUIT REAPPLE REBANANA REKIWI"    
          "APPLE CANTALOUPE MELON KIWI"      
          "APPLE BANANA"                     
          "KIWI APPLE GRAPE REAPPLE"         
          "KIWI FRUIT REAPPLE REBANANA APPLE"
          "CANTALOUPE MELON APPLE BANANA"    
          "REAPPLE"                          
          "APPLE"                            
          end
          
          gen found = strlen(fruithashtags) > strlen(subinword(fruithashtags, "APPLE", "", .)) 
          
          l 
          
               +-------------------------------------------+
               |                     fruithashtags   found |
               |-------------------------------------------|
            1. |               GRAPE FRUIT REAPPLE       0 |
            2. |      GRAPE FRUIT REAPPLE REBANANA       0 |
            3. |     FRUIT REAPPLE REBANANA REKIWI       0 |
            4. |       APPLE CANTALOUPE MELON KIWI       1 |
            5. |                      APPLE BANANA       1 |
               |-------------------------------------------|
            6. |          KIWI APPLE GRAPE REAPPLE       1 |
            7. | KIWI FRUIT REAPPLE REBANANA APPLE       1 |
            8. |     CANTALOUPE MELON APPLE BANANA       1 |
            9. |                           REAPPLE       0 |
           10. |                             APPLE       1 |
               +-------------------------------------------+

          So, we get Stata to tell us whether the length of the string variable is greater than the length that the string variable would be if we replaced
          Code:
           "APPLE"
          by an empty string, conditional on that string occurring as a word (which is the nub of the problem). If it is greater, we found the word in question.

          Logically, we just need to see what would happen if we replaced with anything that is a shorter string, but empty string will work fine. Note that we don't in fact replace the variable or create a new variable -- although do that if you want to.
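For completeness, the padded-string search mentioned at the top of this post can be sketched next to the subinword() approach. The variable names found_pad and found_sub are just illustrative, and the data are a subset of the example above; like the method in #2, the padded search assumes words are separated by single blanks.

```stata
clear
input str33 fruithashtags
"GRAPE FRUIT REAPPLE"
"KIWI APPLE GRAPE REAPPLE"
"KIWI FRUIT REAPPLE REBANANA APPLE"
"REAPPLE"
"APPLE"
end

* Pad both ends with a blank so that every word, including the first and
* last, is surrounded by blanks; then look for the word with its blanks
generate byte found_pad = strpos(" " + fruithashtags + " ", " APPLE ") > 0

* The subinword() comparison from above, for checking
generate byte found_sub = strlen(fruithashtags) > strlen(subinword(fruithashtags, "APPLE", "", .))

* Both approaches should agree on these data
assert found_pad == found_sub
list
```

On these five observations both indicators are 0, 1, 1, 0, 1.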



          • #6
            https://journals.sagepub.com/doi/pdf...6867X221141068 is the publication predicted in #5.



            • #7
              This thread is so helpful! Thanks to all! I went with the code you put forth, Nick, and it works great! Thank you!

