Regular Expressions in Stata: Regexs

Nachiketh Soodana Prakash

Join Date: Oct 2014

Posts: 30
#1

06 Aug 2015, 22:52

I am trying to extract the string between two strings.

INDICATION: (string i want capture) DESCRIPTION
The "string I want to capture" has a high degree of variability.
In some observations it is 1 word, in others 2 words, and in others 3 words.
I have no problem capturing a one-word string, but capturing 2 or 3 words seems to be a problem.

I am capturing the first string with

Code:

gen indication = regexs(1) if regexm(notes,"INDICATION: ([a-zA-z+])[ ]*DESCRIPTION ")
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#2

06 Aug 2015, 23:47

It's not clear to me what your strings look like, but how about you start with:

Code:

clear
set more off
set obs 1
gen notes = "INDICATION: (string i want capture) DESCRIPTION"
gen indic = regexs(1) if regexm(notes, "INDICATION: (.*) DESCRIPTION")
list

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Joseph Coveney

Join Date: Apr 2014

Posts: 4421
#3

07 Aug 2015, 00:01

It's not a solution using regular expressions, but you could use something like the following to get what you want. Start at the "Begin here" comment.

. version 14.0

. 
. clear *

. set more off

. 
. input str50 input_text

                                             input_text
  1. "stuff INDICATION: 1 word DESCRIPTION"
  2. "and INDICATION: 2 words DESCRIPTION"
  3. "nonsense INDICATION: 3 words DESCRIPTION"
  4. "INDICATION: (string i want capture) DESCRIPTION"
  5. end

. 
. *
. * Begin here
. *
. local i INDICATION:

. local i strpos(input_text, "`i'") + strlen("`i'")+1

. 
. local d DESCRIPTION

. local d strpos(input_text, "`d'")

. 
. generate str indication = substr(input_text, `i', `d' - (`i'))

. 
. list indication, noobs

  +--------------------------+
  |               indication |
  |--------------------------|
  |                  1 word  |
  |                 2 words  |
  |                 3 words  |
  | (string i want capture)  |
  +--------------------------+

. 
. exit

end of do-file

.
Nachiketh Soodana Prakash

Join Date: Oct 2014

Posts: 30
#4

07 Aug 2015, 00:55

Well here you go:

The variable "notes" contains a medical report in each field. The string I want to capture is in bold. There are 526 such observations. The report below is one observation. The regex captures only "bph". "bph with luts" is one of the strings I want to capture. Any number of different strings can follow INDICATION, so it's not just "bph with luts".

DATE: 23/56/1987 PATIENT: CVFGB DATE OF BIRTH: 5/4/1876 SURGEON: John Rambo
INDICATION: bph with luts
DESCRIPTION OF PROCEDURE: A 7-French urethral catheter and a 9-French rectal manometer were atraumaticallly placed. Normal saline was instilled via the urethral catheter at a rate of 40-50mL per minute. Direct pressure transduction was used to record both the abdominal and vesicle pressures. The detrusor pressures were derived from the above tracings. Continuous EMG monitoring was performed via electrodes to the perineum.
Filling Stage:
Presence of stress urinary incontinence: No
Presence of detrusor overactivity:No
Bladder Compliance: Normal
Maximum Capacity: 520mL
First Sensation: 161mL
First Desire: 250mL
Strong Desire: 375mL
Urgency: 448mL

Voiding Stage:
Destrusor Contraction: acontractile
Voided Volume: 236
Qmax: 9mL/S
Qavg: 4mL/S
PVR: 300mL
Detrusor maximum pressure: 97cmH2O
Vesical Maximum Pressure: 104cmH2O
Detrusor Pressure at Qmax: 12cmH2O
Vesical Pressure at Qmax: 40cmH2O
Presence of abdominal straining: Yes

EMG: Tracing quality and capture was good. There was appropriate quieting noted during voiding.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

07 Aug 2015, 07:43

Your post at #4 indicates that the regular expression approach suggested in post #2 did not work. Did the substring approach in post #3 also not work?

I suspect that the difficulty is that your variable notes has multiple lines. That is, it has within it "invisible" linefeed (newline) characters or return characters. I do not have time to test at the moment, but I think that something like

Code:

clonevar note2 = notes
replace note2 = subinstr(note2,char(10)," ",.)
replace note2 = subinstr(note2,char(13)," ",.)

will create a new variable note2 with all the possible line break characters replaced by spaces. Then one of the two techniques should work.
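As a rough, untested sketch of how that cleanup combines with the regular expression from post #2: the one-observation dataset below is concocted (the text and the variable name indication are assumptions, not the original poster's data), with a linefeed, char(10), placed between the two report lines.

```stata
clear
set obs 1
* concocted example: char(10) stands in for the invisible line break
gen notes = "INDICATION: bph with luts" + char(10) + "DESCRIPTION OF PROCEDURE: ..."

* replace linefeeds and carriage returns with spaces
clonevar note2 = notes
replace note2 = subinstr(note2, char(10), " ", .)
replace note2 = subinstr(note2, char(13), " ", .)

* the literal space before DESCRIPTION can now match
gen indication = regexs(1) if regexm(note2, "INDICATION: (.*) DESCRIPTION")
list indication
```

If the line break is the culprit, indication should hold "bph with luts" here, whereas the same regexm() call on the unmodified notes should find no match, because the pattern requires a literal space before DESCRIPTION.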

If you have no luck, could you post a .dta file containing your example from post #4: 1 variable, 1 observation? I don't want to try to recreate it in Stata and not have it match exactly the particular characteristics of your data.

Last edited by William Lisowski; 07 Aug 2015, 07:48.
Nachiketh Soodana Prakash

Join Date: Oct 2014

Posts: 30
#6

07 Aug 2015, 08:26

Thank you for the suggestions. I met up with a programmer who resolved the regular expression problem. Here is the code he suggested.

Code:

gen indication1 = regexs(1) if regexm(notes,"INDICATION:\s*(.*)\s*DESCRIPTION")
gen indication = trim(indication1)
Robert Picard

Join Date: Mar 2014

Posts: 1536
#7

07 Aug 2015, 10:30

I assume that since you are matching past the current line to the next line, this text is not read with each line being stored in different observations. I copied the text as is from the sample in #4 to a text file that I called "obs.txt" and read the whole file into a strL.

If your programmer's suggestion works, it's totally by chance and is probably not what you want. That's because the \s whitespace character class is not supported by regexm(), even in Stata 14. The new Unicode versions of the regex functions do support it. Here's the difference:

Code:

clear
set obs 1
gen id = 1
generate strL notes = fileread("obs.txt")
gen indication = trim(regexs(1)) if regexm(notes,"INDICATION:\s*(.*)\s*DESCRIPTION")
gen indication2 = ustrregexs(1) if ustrregexm(notes,"INDICATION:\s*(.*?)\s*DESCRIPTION")
dis "|`=indication[1]'|"
dis "|`=indication2[1]'|"
dis length(indication)
dis strlen(indication2)
dis strlen(indication2)
dis ustrlen(indication2)
assert indication == indication2

And here are the results after the variables are generated:

Code:

. dis "|`=indication[1]'|"
|bph with luts |

. dis "|`=indication2[1]'|"
|bph with luts|

. 
. dis length(indication)
14

. dis strlen(indication2)
13

. dis strlen(indication2)
13

. dis ustrlen(indication2)
13

. 
. assert indication == indication2
assertion is false
r(9);

Note that the new Unicode versions support lazy quantifiers (i.e. the question mark in the "(.*?)" subpattern). Without it, the white space is absorbed by ".*" and therefore included in what's returned in ustrregexs(1).
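A small sketch of the greedy versus lazy difference, assuming Stata 14 or later and a concocted string with three spaces before DESCRIPTION (the variable names greedy and lazy are made up for illustration):

```stata
clear
set obs 1
* concocted example: three spaces separate the target from DESCRIPTION
gen notes = "INDICATION: bph with luts   DESCRIPTION OF PROCEDURE"

* greedy: (.*) takes as much as it can, and \s* is allowed to match nothing,
* so the spaces before DESCRIPTION end up inside the capture
gen greedy = ustrregexs(1) if ustrregexm(notes, "INDICATION:\s*(.*)\s*DESCRIPTION")

* lazy: (.*?) takes as little as it can, leaving the spaces to \s*
gen lazy = ustrregexs(1) if ustrregexm(notes, "INDICATION:\s*(.*?)\s*DESCRIPTION")

* compare the lengths of the two captures
dis ustrlen(greedy[1])
dis ustrlen(lazy[1])
```

The greedy capture should come out three characters longer than the lazy one, which is the trailing white space that would otherwise have to be removed with trim().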
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

07 Aug 2015, 12:22

Note please that the original statement of the problem was

INDICATION: (string i want capture) DESCRIPTION

while the later display of the actual data was

INDICATION: bph with luts
DESCRIPTION OF PROCEDURE:

spread across two lines: this is a significant difference, and meant that two readers solved a problem different than the one that you needed solved. That is why the Statalist FAQ linked to from the top of each page advises

Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly! If you can, reproduce the error with one of Stata's provided datasets, a small fragment of your dataset, or a simple concocted dataset that you include in your posting.
Nachiketh Soodana Prakash

Join Date: Oct 2014

Posts: 30
#9

07 Aug 2015, 13:28

I guess Roberto Ferrer solved my problem, and I specifically wanted a regex solution. Both the actual data and the original problem exist in the database; the variable's contents are not uniform. That is why I mentioned that any number of different strings can follow INDICATION. Thanks
