Keyword-in-context analysis

Marvin Hanisch

Join Date: Oct 2017

Posts: 50
#1

04 May 2021, 03:18

Dear Statalist,

I am trying to extract the five words before and after a given keyword in a string variable (keyword-in-context analysis). Each keyword can occur multiple times in a string and each context should be written to a new variable. Also, the context should be displayed only up to the period, exclamation mark or question mark. Note that the text may have double spacing and line breaks.

Theoretically, this should be possible with "regexs" and "regexm", but I'm stuck. Maybe you have an idea?

Here is an example where the keyword of interest is "we" (including variants such as "we're", "we've", and "WE") and the text is in the variable "string". The desired output variables are we1, we2, we3 and we_freq. In the example, apostrophes are considered separate words, but that's not really important.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str121 string str45 we1 str42 we2 str27 we3 byte we_freq
1 "Overall, we're confident that it's a great example. But what we really value is your feedback." "Overall, we're confident that it's" "But what we really value is your feedback." "" 2
2 "Now, there are also some things we are afraid of. Here's a list of what We think is scary. Wait until we've shown you!" "there are also some things we are afraid of." "Here's a list of what We think is scary." "Wait until we've shown you!" 3
3 "This is just a random text. Thanks, WE love it ;-)." "Thanks, WE love it ;-)." "" "" 1
end

With thanks and regards,

Marvin

Andrew Musau

Join Date: Oct 2014
Posts: 10216

04 May 2021, 13:33

EDITED TO ACCOUNT FOR:

Also, the context should be displayed only up to the period, exclamation mark or question mark.

The following will work for single matches, and should start you in a useful direction. I do exact text-matching, so if you have "we're" or "we've", you have to specify these explicitly. For strings with several matches, iterate the code, at each point deleting the already matched keyword.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str121 string
1 "Overall, we're confident that it's a great example. But what we really value is your feedback."                          
2 "Now, there are  also some things we are afraid of. Here's a list of what We  think is scary.  Wait until we've shown you!"
3 "This is just a random text. Thanks, WE love it ;-)."                                                                      
end

* Up to five words before the keyword; [^\.\?\!]+ keeps the match within one sentence
gen beforewe = word(ustrregexs(1), -5)+" "+ word(ustrregexs(1), -4)+" "+ ///
word(ustrregexs(1), -3)+" " + word(ustrregexs(1), -2)+" "+ word(ustrregexs(1), -1) ///
if ustrregexm(" "+lower(string)+ " ","([^\.\?\!]+)[\s]we[\s]")

* Up to five words after the keyword, ending at the first period, question mark or exclamation mark
gen afterwe = word(ustrregexs(1), 1)+" "+ word(ustrregexs(1), 2)+" "+ ///
word(ustrregexs(1), 3)+" "+ word(ustrregexs(1), 4)+" "+ word(ustrregexs(1), 5) ///
if ustrregexm(" "+lower(string)+ " ","[\s]we[\s]([^\.\?\!]+[\.\?\!])")

gen wanted= beforewe+ " we "+ afterwe

Res.:

Code:

. gen wanted= beforewe+ " we "+ afterwe

. l wanted

     +------------------------------------------------+
     |                                         wanted |
     |------------------------------------------------|
  1. |     but what we really value is your feedback. |
  2. | there are also some things we are afraid of.   |
  3. |                      thanks, we love it ;-).   |
     +------------------------------------------------+
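
For strings with several matches, the iteration suggested above might look roughly like the sketch below. Treat it as an illustration under assumptions rather than a tested solution: the names work, kw_before, kw_after, kw_hit, kwic1 to kwic3 and kwic_freq are made up here, and the cap of three matches per string is simply taken from the example. Instead of deleting a processed keyword, the sketch switches it to upper case, so the case-sensitive pattern skips it on the next pass while it still counts as a context word. Matching stays exact, so "we're" and "we've" are not picked up. The ustrregex functions need Stata 14 or later.

```stata
* Sketch only: assumes the example data (id, string) are in memory.
* All new variable names and the cap of 3 matches are assumptions.

* Lower-case the text and collapse double spaces and line breaks
gen work = " " + ustrregexra(lower(string), "\s+", " ") + " "
gen byte kwic_freq = 0

forvalues i = 1/3 {
    * One pattern captures both sides of the first remaining " we "
    gen kw_before = ustrregexs(1) ///
        if ustrregexm(work, "([^\.\?\!]*?)\swe\s([^\.\?\!]*[\.\?\!]?)")
    gen kw_after = ustrregexs(2) ///
        if ustrregexm(work, "([^\.\?\!]*?)\swe\s([^\.\?\!]*[\.\?\!]?)")
    gen byte kw_hit = ustrregexm(work, "\swe\s")

    * Up to five words on each side of the keyword
    gen kwic`i' = ""
    replace kwic`i' = word(kw_before, -5) + " " + word(kw_before, -4) + " " + ///
        word(kw_before, -3) + " " + word(kw_before, -2) + " " + ///
        word(kw_before, -1) + " we " + ///
        word(kw_after, 1) + " " + word(kw_after, 2) + " " + ///
        word(kw_after, 3) + " " + word(kw_after, 4) + " " + ///
        word(kw_after, 5) if kw_hit
    * Tidy the spacing and undo the upper-case markers from earlier passes
    replace kwic`i' = lower(stritrim(strtrim(kwic`i')))
    replace kwic_freq = kwic_freq + 1 if kw_hit

    * Mark the processed keyword so the next pass finds the following one
    replace work = ustrregexrf(work, "\swe\s", " WE ") if kw_hit
    drop kw_before kw_after kw_hit
}
drop work
```

Both sides are taken from a single pattern so that the words before and after always belong to the same occurrence of the keyword.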

Last edited by Andrew Musau; 04 May 2021, 14:08.

Marvin Hanisch

Join Date: Oct 2017

Posts: 50
#3

06 May 2021, 02:03

Dear Andrew,

Thank you very much! This is helpful indeed.

I have been using the user-written "moss" command to get the frequency and position of keywords. This way you can take into account multiple occurrences of a keyword. My solution is less elegant than Andrew's and only returns a certain number of characters before and after the keyword, but it might still be useful for some people. Just be careful with the regular expression (regex), which would also return words such as "welcome" or "well" if you search for "we".

Code:

ssc install moss
replace string = lower(string) // Convert string to lower case
moss string, match("(we)") prefix(word_) regex // You can change the keyword here
unab varlist: word_pos*
foreach x of local varlist {
    gen ante_`x' = substr(string,`x'-30,32) // You can set the number of characters you want before the keyword here
    gen post_`x' = substr(string,`x',30) // You can set the number of characters you want after the keyword here
}
drop word_*

Disclaimer: The code produces incorrect output if the keyword is at the beginning of the string and the "ante" context is longer than the available text.
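
The two caveats above (false hits such as "welcome", and a keyword near the start of the string) could be handled along the lines of the following sketch, which would replace the loop in the code above. It is untested and rests on assumptions: moss is installed, the variables ante_* and post_* do not exist yet, and the helper names start and isword are made up for illustration. The check for a stand-alone "we" uses ustrregexm, which needs Stata 14 or later.

```stata
* Sketch only: start and isword are hypothetical helper variables.
replace string = lower(string)
moss string, match("(we)") prefix(word_) regex
unab varlist: word_pos*
foreach x of local varlist {
    * Clip the window at the first character instead of running past the start
    gen start = max(1, `x' - 30) if !missing(`x')
    gen ante_`x' = substr(string, start, `x' - start + 2) if !missing(`x')
    gen post_`x' = substr(string, `x', 30) if !missing(`x')

    * Keep a hit only if no letter sits directly before or after "we"
    gen byte isword = ustrregexm(ante_`x', "(^|[^a-z])we$") & ///
        ustrregexm(post_`x', "^we([^a-z]|$)")
    replace ante_`x' = "" if !isword
    replace post_`x' = "" if !isword
    drop start isword
}
drop word_*
```

Because an apostrophe is not a letter, variants such as "we're" and "we've" still pass the check, while "welcome" and "well" are blanked out.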

Best regards,

Marvin

Last edited by Marvin Hanisch; 06 May 2021, 02:28.
