Extracting parts of a string

Stein Janssen

Join Date: May 2014

Posts: 10
#1

18 Sep 2015, 12:54

Dear all,

I'd like to extract a part of a string from a string variable, to be specific: the indication in a radiology report.

I created the following loop to extract the part of the text that starts with "indication"/"indications"/"history" and ends with "view"/"views".

However, it looks for the last time "view"/"views" is used in the report, while it should pick up the string between "indication"/"indications"/"history" AND the first time "view"/"views" is used.

foreach x in indication indications history {
    foreach y in view views {
        replace indication = regexs(2) if regexm(lower(report_text),"(`x': )(.*)(`y')") & indication == ""
    }
}

Thank you,

Stein
Joe Canner

Join Date: Mar 2014

Posts: 580
#2

18 Sep 2015, 13:08

Stein,

I'm not sure I entirely understand your question, but I have a feeling I know what's wrong. The (.*) in the middle of your regular expression matches as much text as it can, so the match extends to the last "view" in the string rather than stopping at the first. In regular-expression-speak this is known as "greedy" matching. Most standard implementations of regular expressions have a way of making this expression non-greedy, but Stata's does not. StataCorp is aware of this limitation and has promised to fix it someday.

In the meantime, there are probably some alternatives, but to better help you it might be helpful if you can give some examples of the kinds of strings you have and what you want to extract from the string. That way, we can test possible solutions against actual data.

Regards,
Joe
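
To make the greedy behaviour described above concrete, here is a minimal sketch. The sample text is invented for illustration and the variable name greedy is hypothetical. With a pattern of the same shape as the one in #1, the captured text should run past the first "view" and stop only at the last one (the one inside "viewed").

```stata
* Sketch only: invented sample text, hypothetical variable name "greedy"
clear
set obs 1
gen report_text = "indication: knee pain 2 view. images viewed by xxx"

* (.*) is greedy: it extends to the LAST "view" it can still match before
gen greedy = regexs(2) if regexm(report_text, "(indication: )(.*)(view)")

list greedy
```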
Joe Canner

Join Date: Mar 2014

Posts: 580
#3

18 Sep 2015, 13:26

Stein,

Here is one method that assumes that view/views is at the very end of the string:

Code:

input strL report_text
"indication I want this 1 view"
"indication I want this 2 views"
"indications I want this 3 view"
"indications I want this 4 views"
"history I want this 5 view"
"history I want this 6 views"
"xxx I dont want this yyy"
"indication I dont want this either yyy"
"xxx nor this view"
end

gen indication=regexs(2) if regexm(lower(report_text),"^(indications|indication|history)(.*)(views|view)$")

list

The "$" at the end anchors (views|view) to the end of the string, which keeps the greedy (.*) from stopping anywhere else. If the assumption that view/views is at the end of the string is not valid, then perhaps you can add a preliminary step that creates a string that conforms to that assumption.

Note also that the order of (indications|indication|history) is important. If you put (indication|indications|history) and the string starts with "indications" it will match "indication" and leave the "s" behind.

Regards,
Joe
Stein Janssen

Join Date: May 2014

Posts: 10
#4

18 Sep 2015, 13:29

Thanks for your reply. To simplify it a bit:

a string variable called report_text contains the following text:

"patient 656294 elbow radiograph: indication: pain in the elbow after trauma : 2 elbow view. The imaging demonstrated xx and no fracture. end of report and images viewed by xxx"

I'd like to extract this part: " pain in the elbow after trauma : 2 elbow"

However, using this command: gen indication = regexs(2) if regexm(lower(report_text),"(indication: )(.*)(view)")

extracts this: "pain in the elbow after trauma : 2 elbow view. The imaging demonstrated xx and no fracture. end of report and images"

I'd like it to stop at the first occurrence of "view", not the last.

Joe Canner

Join Date: Mar 2014
Posts: 580
#5

18 Sep 2015, 13:32

Stein,

Here is an alternative that doesn't assume that view/views is at the end of the string:

Code:

clear

input strL report_text
"indication I want this 1 view xxx"
"indication I want this 2 views yyy"
"indications I want this 3 view zzz"
"indications I want this 4 views aaa"
"history I want this 5 view bbb"
"history I want this 6 views ccc"
"xxx I dont want this yyy"
"indication I dont want this either yyy"
"xxx nor this view"
end

gen subtext=substr(report_text,1,strpos(report_text,"view")-1)
gen indication=regexs(2) if regexm(subtext,"(indications|indication|history)(.*)")

list

Let us know if neither of these methods works with your data.

Regards,
Joe
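
As a possible variant of the cut-then-match approach above, the sketch below applies it to the longer report from #4. It is an illustration under assumptions, not a tested solution: lowtext and cut are hypothetical helper variables, lower() is used so that the search ignores case, the pattern skips an optional colon and spaces after the keyword, and strtrim() assumes Stata 14 or later.

```stata
* Sketch only: lowtext and cut are hypothetical helper variables
clear
set obs 1
gen strL report_text = "patient 656294 elbow radiograph: indication: pain in the elbow after trauma : 2 elbow view. The imaging demonstrated xx and no fracture. end of report and images viewed by xxx"

* Work on a lower-case copy so "View" and "Indication" are also found
gen lowtext = lower(report_text)

* Position of the first "view"; 0 means it does not occur
gen cut = strpos(lowtext, "view")

* Keep only the text before the first "view"
gen subtext = substr(lowtext, 1, cut - 1) if cut > 0

* Skip an optional colon and spaces after the keyword, then capture the rest
gen indication = strtrim(regexs(2)) if regexm(subtext, "(indications|indication|history):? *(.*)")

list indication
```

Because the text is cut before the regular expression runs, the greedy (.*) can no longer reach a later "view".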


Stein Janssen

Join Date: May 2014

Posts: 10
#6

18 Sep 2015, 13:32

Great example, thanks:

But how to handle this one:

"indication I want this 1 view but there are multiple view"

I only want the part "I want this 1"
Joe Canner

Join Date: Mar 2014

Posts: 580
#7

18 Sep 2015, 13:48

I think my second example will work for that, since it throws out everything after (and including) the first "view".
Stein Janssen

Join Date: May 2014

Posts: 10
#8

18 Sep 2015, 13:50

Brilliant. Great solution; I hadn't thought of this. Thank you very much.

Robert Picard

Join Date: Mar 2014
Posts: 1536
#9

18 Sep 2015, 14:09

Note that with Stata 14, the new unicode versions of regex functions do support non-greedy quantifiers.

Code:

clear
set obs 1
gen report_text = "patient 656294 elbow radiograph: indication: pain in the elbow after trauma : 2 elbow view. The imaging demonstrated xx and no fracture. end of report and images viewed by xxx"
gen indication = ustrregexs(1) if ustrregexm(report_text,"indication: (.+?) view")
list
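
Building on the non-greedy quantifier shown above, the following sketch handles the three keywords and "view"/"views" in one step. The sample rows are adapted or invented for illustration. It assumes Stata 14 or later and relies on the optional third argument of ustrregexm(), which requests a case-insensitive match when nonzero.

```stata
* Sketch only: sample rows are illustrative, not real data
clear
input strL report_text
"indication I want this 1 view but there are multiple view"
"History: I want this 5 views bbb"
"xxx nor this view"
end

* Third argument 1 = case-insensitive match, so lower() is not needed
* ":? *" skips an optional colon and spaces after the keyword
* (.+?) is non-greedy, so it stops at the first "view" or "views"
gen indication = ustrregexs(2) if ustrregexm(report_text, "(indications|indication|history):? *(.+?) *views?", 1)

list
```

Rows without a keyword, or without "view" after the keyword, are left missing.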


Joe Canner

Join Date: Mar 2014

Posts: 580
#10

18 Sep 2015, 15:02

Robert,

Thanks for pointing that out. Is this documented somewhere or did you (or someone else) discover it by accident? I'm surprised StataCorp didn't mention this when I brought up Stata's regular expression limitations at the Stata Conference.

Regards,
Joe
Robert Picard

Join Date: Mar 2014

Posts: 1536
#11

18 Sep 2015, 15:21

I haven't seen any documentation about the extra functionality of the new unicode versions. Credit goes to Dimitriy V. Masterov who noted the support for character classes in this post on Stack Overflow.
Stein Janssen

Join Date: May 2014

Posts: 10
#12

19 Sep 2015, 06:26

Incredible guys, thank you so much!
Ches Zin

Join Date: Jun 2023

Posts: 20
#13

22 Jul 2023, 05:43

Dear all,
I have the following example text, stored in the variable doser, and I want to extract "Max 8".

doser "1 tablet when needed . Max 8 tablet per day"
I tried the following, but Stata reported that doser max is an invalid name:

gen max = ustrregexs(1) if ustrregexm(lower(doser,"max [0-9]"))

I also tried the following, but it produced all missing values. For information, max can appear in various forms such as max, MAX and Max.
gen max = ustrregexs(1) if ustrregexm(doser,"(Max|max|MAX)[0-9]$")

Appreciate any sharing.
Thanks
Nick Cox

Join Date: Mar 2014

Posts: 36053
#14

22 Jul 2023, 08:00

I guess you have a parenthesis too late. Try

Code:

lower(doser)

and cut the misplaced character.
Ches Zin

Join Date: Jun 2023

Posts: 20
#15

22 Jul 2023, 09:25

Hi Nick,
Thanks. I have tried with the following but all missing values generated. I am not sure which is the misplaced character. Appreciate your help.

gen max = ustrregexs(1) if ustrregexm(lower(doser),"max [0-9]")

Thanks
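
One likely reason the command above returns only missing values is that the pattern contains no parentheses, so there is no subexpression 1 for ustrregexs(1) to return; ustrregexs(0) would return the whole match instead. A minimal sketch follows. The second and third sample rows are invented, max_dose is a hypothetical variable name, and the pattern assumes a single space between "max" and the number.

```stata
* Sketch only: max_dose is a hypothetical name; rows 2 and 3 are invented
clear
input strL doser
"1 tablet when needed . Max 8 tablet per day"
"2 tablets daily, MAX 6 per day"
"1 tablet at night"
end

* The parentheses create subexpression 1
* Third argument 1 = case-insensitive match, which keeps the original case
gen max_dose = ustrregexs(1) if ustrregexm(doser, "(max [0-9]+)", 1)

list
```

Matching on doser directly, rather than on lower(doser), keeps the text as written ("Max 8"), while the case-insensitive flag still covers max, MAX and Max.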
