  • Within a string, find a keyword and extract a certain number of words before and after that keyword--how?

    Dear all
    As the question title tries to summarize, assume I have a string variable; it is a long string, strL, in case that could matter. Within that string, I want to find a keyword, say "language". Then, I want to extract, say, x words before that keyword, and y words after that keyword.

    Consider the example below. My keyword is "language", and I want to create new variables that contain, for example, the 2 words before my keyword and the 4 words after it. My guess is that this can be done elegantly with a combination of regular expressions and string functions, but I have not been able to figure out how. Can you help?

    Code:
    clear
    input strL stringExample
    "The definitions in the dictionary are simple and clarify the meaning. But for help with using the language, the examples are especially important, and the 7,500 frequent words are accompanied by a wealth of examples. These examples show a variety of significant features"
    end
    Thank you in advance!
    Hannes

  • #2
    If the keyword "language" occurs only once in the string, a possible solution is
    Code:
    gen after4 = ustrregexs(1) if ustrregexm(stringExample, "language\p{P}?\s?((?:\p{L}+\s){1,4})" )
    
    gen before2 = ustrregexs(1) if ustrregexm(stringExample, "((?:\p{L}+\s){1,2})language" )
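
    For readers new to regular expressions, the first pattern above can be read piece by piece. The following is only an annotated sketch, not part of the original answer: the local macro s is a hypothetical test string shortened from the example, and the comments assume ICU regex syntax, which the ustrregex functions use.

    ```stata
    * Pattern pieces:
    *   language       the literal keyword
    *   \p{P}?         optionally one punctuation character (here, the comma)
    *   \s?            optionally one whitespace character
    *   (?:\p{L}+\s)   non-capturing group: a run of letters plus one trailing whitespace
    *   {1,4}          repeat that group between one and four times
    *   ( ... )        the capturing group that ustrregexs(1) returns
    local s "But for help with using the language, the examples are especially important"

    * Shows 1 if the pattern matches and 0 otherwise
    display ustrregexm("`s'", "language\p{P}?\s?((?:\p{L}+\s){1,4})")

    * Shows the text captured by the first group of the most recent match
    display ustrregexs(1)
    ```

    Because the word group requires trailing whitespace, a word directly followed by punctuation ends the captured run early.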
    Last edited by Bjarte Aagnes; 23 Nov 2021, 10:08.


    • #3
      Here's an approach without regular expressions:
      Code:
      local keyword = "language"
      local back = 2
      local ahead = 4
      local punctuation = ". , / ; :"   // Perhaps other stuff that would not be wanted?
      // Clean up string
      replace stringExample = strlower(stringExample)
      foreach c of local punctuation {
         quiet replace stringExample = subinstr(stringExample, "`c'", "", .)
      }
      // Assume the general case of multiple observations.
      gen nword = wordcount(stringExample)
      summ nword
      local maxword = r(max)
      // Find word position.
      gen pos = 0
      forval i = 1/`maxword' {
         quiet replace pos = `i' if (pos == 0) & (`i' <= nword) & ///
            (word(stringExample, `i') == "`keyword'")
      }
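      // Extract desired words. word() counts from the end of the string for a
      // negative index, so skip positions before the first word and
      // observations where the keyword was not found (pos == 0).
      gen str before = ""
      forval i = `back'(-1)1 {
        quiet replace before = before + word(stringExample, pos - `i') + " " if pos > `i'
      }
      gen str after = ""
      forval i = 1/`ahead' {
        quiet replace after = after + word(stringExample, pos + `i') + " " if pos > 0
      }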


      • #4
        These are two excellent and completely different solutions! Thank you so much! I am actually amazed at how different they are. I have understood the basics of regular expressions, but Bjarte's code is far beyond what I can understand. It works, and to a regex newbie it seems almost like magic. Mike's code is extremely versatile and perfectly human-readable. Thank you for those great suggestions.


        • #5
          Re #4: some useful references for understanding and testing regular expressions:

          https://regex101.com/
          https://unicode-org.github.io/icu/us...gs/regexp.html
          https://www.regular-expressions.info/unicode.html
          Last edited by Bjarte Aagnes; 24 Nov 2021, 07:50.


          • #6
            Bjarte, these are outstanding links. Thanks! Regular Expressions 101 is especially impressive. An awesome sandbox!

            Comment


            • #7
              Deleted post
              Last edited by arjun bhadhuri; 08 Nov 2022, 10:29.


              • #8
                Do you want to extract just a number or text that may include numbers? If the latter, try this:

                Code:
                gen after4 = ustrregexs(1) if ustrregexm(stringExample, "language\p{P}?\s?((?:(?:\p{L}|\p{Nd})+\s){1,4})" )
                gen before2 = ustrregexs(1) if ustrregexm(stringExample, "((?:(?:\p{L}|\p{Nd})+\s){1,2})language" )
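
                The original question asked for x words before and y words after the keyword, so the counts can also be passed in through local macros rather than typed into the pattern. This is a minimal sketch under assumptions: the macro names keyword, back, ahead and w and the variable names kw_before and kw_after are hypothetical, and a "word" is taken to be a run of letters or decimal digits, as in the patterns above.

                ```stata
                * Hypothetical macro names; set them to the keyword and window sizes needed
                local keyword "language"
                local back  2
                local ahead 4

                * One "word": a run of letters or decimal digits
                local w "(?:\p{L}|\p{Nd})+"

                * Up to `back' words immediately before the first match of the keyword
                gen kw_before = ustrregexs(1) if ustrregexm(stringExample, "((?:`w'\s){1,`back'})`keyword'")

                * Up to `ahead' words after the keyword, skipping one optional punctuation mark
                gen kw_after = ustrregexs(1) if ustrregexm(stringExample, "`keyword'\p{P}?\s?((?:`w'\s){1,`ahead'})")
                ```

                As with the patterns above, the keyword is matched as a plain substring, so it would also match inside a longer word such as "languages".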
