Within a string, find a keyword and extract a certain number of words before and after that keyword--how?

Hannes Wagner

Join Date: Sep 2019

Posts: 4
#1

23 Nov 2021, 08:35

Dear all
As the title tries to summarize, assume I have a string variable; it is a long string (strL), in case that matters. Within that string, I want to find a keyword, say "language". Then I want to extract, say, x words before that keyword and y words after it.

Consider the example below. My keyword is "language", and I want to create new variables that contain, for example, the 2 words before my keyword and the 4 words after it. My guess is that this can be done elegantly with a combination of regular expressions and string functions; I just cannot figure out how. Can you help?

Code:

clear
input strL stringExample
"The definitions in the dictionary are simple and clarify the meaning. But for help with using the language, the examples are especially important, and the 7,500 frequent words are accompanied by a wealth of examples. These examples show a variety of significant features."
end

Thank you in advance!
Hannes
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#2

23 Nov 2021, 09:52

If the keyword "language" occurs only once in the string, a possible solution is

Code:

gen after4 = ustrregexs(1) if ustrregexm(stringExample, "language\p{P}?\s?((?:\p{L}+\s){1,4})" )
gen before2 = ustrregexs(1) if ustrregexm(stringExample, "((?:\p{L}+\s){1,2})language" )
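
For readers new to regular expressions, the after4 pattern can be read piece by piece. The sketch below assembles the same pattern from local macros, one piece per line, for use in a do-file. The macro names (kw, punct, gap, words) and the variable after4_check are made up for illustration and are not part of the original answer.

```stata
* Illustrative only: the same pattern, assembled from commented pieces.
* Macro names and after4_check are hypothetical.
local kw    "language"             // the literal keyword
local punct "\p{P}?"               // at most one punctuation character right after it
local gap   "\s?"                  // at most one whitespace character
local words "((?:\p{L}+\s){1,4})"  // group 1: one to four runs of letters, each followed by whitespace

gen after4_check = ustrregexs(1) if ustrregexm(stringExample, "`kw'`punct'`gap'`words'")
```

Here \p{L} matches any Unicode letter, (?: ... ) groups without capturing, and ustrregexs(1) returns whatever the single capturing group matched. The before2 pattern is the mirror image: the capturing group of one or two words comes first and the keyword follows it.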

Last edited by Bjarte Aagnes; 23 Nov 2021, 10:08.

Mike Lacy

Join Date: Apr 2014
Posts: 2411
#3

23 Nov 2021, 10:57

Here's an approach without regular expressions:

Code:

local keyword = "language"
local back = 2
local ahead = 4
local punctuation = ". , / ; :"   // Perhaps other stuff that would not be wanted?
// Clean up string
replace stringExample = strlower(stringExample)
foreach c of local punctuation {
   quiet replace stringExample = subinstr(stringExample, "`c'", "", .)
}
// Assume the general case of multiple observations.
gen nword = wordcount(stringExample)
summ nword
local maxword = r(max)
// Find word position.
gen pos = 0
forval i = 1/`maxword' {
   quiet replace pos = `i' if (pos == 0) & (`i' <= nword) & ///
      (word(stringExample, `i') == "`keyword'")
}
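// Extract desired words.
// Guard the index: word() counts from the end of the string when n is
// negative, and pos == 0 means the keyword was not found.
gen str before = ""
forval i = `back'(-1)1 {
  quiet replace before = before + word(stringExample, pos - `i') + " " if pos - `i' >= 1
}
gen str after = ""
forval i = 1/`ahead' {
  quiet replace after = after + word(stringExample, pos + `i') + " " if pos > 0
}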

Hannes Wagner

Join Date: Sep 2019

Posts: 4
#4

24 Nov 2021, 03:20

These are two excellent and completely different solutions! Thank you so much! I am amazed at how different they are. I understand the basics of regular expressions, but Bjarte's code is far beyond what I can follow. It works, and to a regex newbie it seems almost like magic. Mike's code is extremely versatile and perfectly human-readable. Thank you for those great suggestions.
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#5

24 Nov 2021, 07:48

Re #4: some useful references for understanding and testing regular expressions:

https://regex101.com/
https://unicode-org.github.io/icu/us...gs/regexp.html
https://www.regular-expressions.info/unicode.html

Last edited by Bjarte Aagnes; 24 Nov 2021, 07:50.
Hannes Wagner

Join Date: Sep 2019

Posts: 4
#6

01 Dec 2021, 08:50

Bjarte, these are outstanding links. Thanks! Regular Expressions 101 is especially impressive: an awesome sandbox!
arjun bhadhuri

Join Date: Nov 2022

Posts: 1
#7

08 Nov 2022, 10:20

Deleted post

Last edited by arjun bhadhuri; 08 Nov 2022, 10:29.

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1364
#8

08 Nov 2022, 10:31

Do you want to extract just a number or text that may include numbers? If the latter, try this:

Code:

gen after4 = ustrregexs(1) if ustrregexm(stringExample, "language\p{P}?\s?((?:(?:\p{L}|\p{Nd})+\s){1,4})" )
gen before2 = ustrregexs(1) if ustrregexm(stringExample, "((?:(?:\p{L}|\p{Nd})+\s){1,2})language" )
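
Since the original question asked for x words before and y words after the keyword, the two patterns can also be written with the counts and the keyword held in local macros. This is a minimal sketch, not a tested general solution: the macro names (keyword, x, y, w) and the variable names beforeX and afterY are hypothetical, and it assumes the keyword contains no regex metacharacters such as . ? * ( ) [ ].

```stata
* Sketch: keyword and word counts as local macros (all names are hypothetical).
* Assumes the keyword contains no regex metacharacters.
local keyword "language"
local x 2    // words wanted before the keyword
local y 4    // words wanted after the keyword
local w "(?:\p{L}|\p{Nd})+"   // one word: letters or decimal digits, as in the patterns above

gen beforeX = ustrregexs(1) if ustrregexm(stringExample, "((?:`w'\s){1,`x'})`keyword'")
gen afterY  = ustrregexs(1) if ustrregexm(stringExample, "`keyword'\p{P}?\s?((?:`w'\s){1,`y'})")
```

With x = 2 and y = 4 the macros expand to the two patterns above. The quantifiers {1,x} and {1,y} mean that between one and x (or y) words are captured, so fewer words come back when fewer are available.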
