  • Keyword-in-context analysis

    Dear Statalist,

    I am trying to extract the five words before and after a given keyword in a string variable (keyword-in-context analysis). Each keyword can occur multiple times in a string and each context should be written to a new variable. Also, the context should be displayed only up to the period, exclamation mark or question mark. Note that the text may have double spacing and line breaks.

    Theoretically, this should be possible with "regexs" and "regexm", but I'm stuck. Maybe you have an idea?

    Here is an example where the keyword of interest is "we" (including variants such as "we're", "we've", and "WE") and the text is in the variable "string". The variables of interest are we1, we2, we3 and we_freq. In the example, apostrophes are considered separate words, but that's not really important.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte id str121 string str45 we1 str42 we2 str27 we3 byte we_freq
    1 "Overall, we're confident that it's a great example. But what we really value is your feedback."                            "Overall, we're confident that it's"            "But what we really value is your feedback." ""                            2
    2 "Now, there are  also some things we are afraid of. Here's a list of what We  think is scary.  Wait until we've shown you!" "there are  also some things we are afraid of." "Here's a list of what We  think is  scary." "Wait until we've shown you!" 3
    3 "This is just a random text. Thanks, WE love it ;-)."                                                                       "Thanks, WE love it ;-)."                       ""                                           ""                            1
    end

    With thanks and regards,

    Marvin

  • #2
    EDITED TO ACCOUNT FOR:

    "Also, the context should be displayed only up to the period, exclamation mark or question mark."

    The following works for a single match per string and should start you in a useful direction. I match the exact text " we " after lowercasing, so variants such as "we're" or "we've" have to be specified explicitly. For strings with several matches, repeat the code, deleting the already matched keyword after each pass.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id str121 string
    1 "Overall, we're confident that it's a great example. But what we really value is your feedback."                          
    2 "Now, there are  also some things we are afraid of. Here's a list of what We  think is scary.  Wait until we've shown you!"
    3 "This is just a random text. Thanks, WE love it ;-)."                                                                      
    end
    
    * Up to five words preceding " we " within the same sentence
    gen beforewe = word(ustrregexs(1), -5)+" "+ word(ustrregexs(1), -4)+" "+ ///
    word(ustrregexs(1), -3)+" " + word(ustrregexs(1), -2)+" "+ word(ustrregexs(1), -1) ///
    if ustrregexm(" "+lower(string)+ " ","([^\.\?\!]+)[\s]we[\s]")
    
    * Up to five words following " we ", ending at the sentence terminator (., ? or !)
    gen afterwe = word(ustrregexs(1), 1)+" "+ word(ustrregexs(1), 2)+" "+ ///
    word(ustrregexs(1), 3)+" "+ word(ustrregexs(1), 4)+" "+ word(ustrregexs(1), 5) ///
    if ustrregexm(" "+lower(string)+ " ","[\s]we[\s]([^\.\?\!]+[\.\?\!])")
    
    gen wanted= beforewe+ " we "+ afterwe
    Result:

    Code:
    . gen wanted= beforewe+ " we "+ afterwe
    
    . l wanted
    
         +------------------------------------------------+
         |                                         wanted |
         |------------------------------------------------|
      1. |     but what we really value is your feedback. |
      2. | there are also some things we are afraid of.   |
      3. |                      thanks, we love it ;-).   |
         +------------------------------------------------+
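
The iteration suggested above for strings with several matches could be sketched roughly as follows. This is only a sketch under stated assumptions, not the code from this post: the cap `K`, the helper variables `work`, `pre`, `kw` and `post`, and the `char(1)` mask are all illustrative choices. Instead of deleting each matched keyword it masks it, so that a later context in the same sentence still shows it. It also swaps in `\b` word boundaries and case-insensitive matching in place of the exact " we " match, so that "we're" and "WE" are picked up while "welcome" is not.

```stata
clear
input byte id str121 string
1 "Overall, we're confident that it's a great example. But what we really value is your feedback."
2 "Now, there are  also some things we are afraid of. Here's a list of what We  think is scary.  Wait until we've shown you!"
3 "This is just a random text. Thanks, WE love it ;-)."
end

* Assumed cap on the number of contexts kept per observation
local K 3
* Keyword with optional suffix: we, We, we're, we've, but not welcome
local key "\b(we(?:'\w+)?)\b"
* Group 1: sentence text before the keyword; 2: keyword; 3: text after, up to . ? or !
local re "([^.?!]*?)`key'([^.?!]*[.?!]?)"

* Working copy with line breaks and repeated blanks collapsed to single spaces
gen work = ustrregexra(string, "\s+", " ")
gen byte we_freq = 0

forvalues j = 1/`K' {
    * Third argument 1 = case-insensitive; the first unmasked occurrence is matched
    gen pre  = ustrregexs(1) if ustrregexm(work, "`re'", 1)
    gen kw   = ustrregexs(2) if ustrregexm(work, "`re'", 1)
    gen post = ustrregexs(3) if ustrregexm(work, "`re'", 1)

    * Up to five words on each side; word() returns "" when there are fewer
    gen we`j' = word(pre,-5) + " " + word(pre,-4) + " " + word(pre,-3) + " " + ///
        word(pre,-2) + " " + word(pre,-1) + " " + kw + " " + ///
        word(post,1) + " " + word(post,2) + " " + word(post,3) + " " + ///
        word(post,4) + " " + word(post,5) if kw != ""
    * Tidy blanks and remove masks left by earlier passes
    replace we`j' = strtrim(stritrim(subinstr(we`j', char(1), "", .)))
    replace we_freq = we_freq + 1 if kw != ""

    * Mask this occurrence (char(1) after its first letter) so the next pass skips it
    replace work = ustrregexrf(work, "`key'", ///
        usubstr(kw,1,1) + char(1) + usubstr(kw,2,.), 1) if kw != ""
    drop pre kw post
}
drop work
list id we_freq we1 we2 we3
```

Note that `we_freq` can never exceed `K` here, so strings with more matches would need a larger cap. Because `word()` splits on blanks only, "we're" counts as one word, and a punctuation mark directly after the keyword (as in "we, ...") ends up as a word of its own.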
    Last edited by Andrew Musau; 04 May 2021, 14:08.


    • #3
      Dear Andrew,

      Thank you very much! This is helpful indeed.

      I have been using the user-written "moss" command to get the frequency and position of keywords. This way you can take into account multiple occurrences of a keyword. My solution is less elegant than Andrew's and only returns a certain number of characters before and after the keyword, but it might still be useful for some people. Just be careful with the regular expression (regex), which would also return words such as "welcome" or "well" if you search for "we".

      Code:
      ssc install moss
      replace string = lower(string) // Convert the text to lower case
      moss string, match("(we)") prefix(word_) regex // Change the keyword here
      unab varlist: word_pos*
      foreach x of local varlist {
          gen ante_`x' = substr(string,`x'-30,32) // 30 characters before the keyword plus the 2-character keyword itself
          gen post_`x' = substr(string,`x',30) // 30 characters starting at the keyword
      }
      drop word_*
      Disclaimer: The code produces incorrect output if the keyword is near the beginning of the string, so that the "ante" context is longer than the available text (the start position `x'-30 is then zero or negative).
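
The edge case in the disclaimer could be handled by clamping the start position at 1. The following is a minimal sketch, not part of the original solution: the toy data, the window size `n` and the helper variable `start` are assumptions, and `strpos()` stands in for the position variable that -moss- would create (assumed to be named `word_pos1`, missing when there is no match).

```stata
clear
input str40 string
"we like short strings."
"Thanks, we love it."
"no keyword here"
end
replace string = lower(string)

* Stand-in for the -moss- position variable (assumption: missing if no match)
gen word_pos1 = strpos(string, "we")
replace word_pos1 = . if word_pos1 == 0

* Assumed window size in characters
local n 30
foreach x of varlist word_pos* {
    * Start of the window, never before the first character
    gen start = max(1, `x' - `n') if !missing(`x')
    * From the window start through the 2-character keyword
    gen ante_`x' = substr(string, start, `x' + 2 - start) if !missing(`x')
    * n characters starting at the keyword
    gen post_`x' = substr(string, `x', `n') if !missing(`x')
    drop start
}
list string ante_word_pos1 post_word_pos1
```

The `if !missing()` qualifier matters here: without it, a missing position would turn the length argument into missing, which `substr()` treats as "to the end of the string".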

      Best regards,

      Marvin
      Last edited by Marvin Hanisch; 06 May 2021, 02:28.
