regexm and regexs

ulas alk

Join Date: Jul 2016

Posts: 84
#1

09 Sep 2017, 14:38

Hi everyone,

I am trying to extract a portion of a text variable. My aim is to take the characters after "aa:" up to the first "f" (or up to a pre-specified word). Below is my simplified code:

* start from an empty dataset with two observations so that "in 1" / "in 2" exist
clear
set obs 2
gen text = ""
replace text = "aa: inffant bb: insp cc: 35 yrd old. dd: ee:acad" in 1
replace text = "aa: infant ff:no fnote bb: insp cc: 35 yrd old. dd: ee:acad" in 2
gen trial = ""
replace trial = regexs(1) if regexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*)(f)")

My output for this code is as follows:

Row 1: inf
Row 2: infant ff:no

Why does this happen? I mean, why do I not get "in" for both of them, and why do I get "inf" for the first one but "infant ff:no" for the second one?

Thank you in advance for your help.

Best,
Ulas
Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#2

10 Sep 2017, 08:15

Welcome to the Statalist forum. In the forum FAQ, there is good advice on how to post data, preferably using dataex and "CODE delimiters".

In general, don't use regular expressions unless you need them. Given your data example and aim, you do not seem to need them here. The subinstr(), substr() and strpos() functions will do:

Code:

gen t1 = trim(subinstr(substr(text, 1, -1 + strpos(text, "f") ), "aa:", "", 1 ))
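To see what the one-liner above does, it can help to unpack it into steps. This is only a sketch: the intermediate variable names pos_f, head and t1_steps are made up for illustration, and the "if pos_f > 0" guard is an addition for rows that contain no "f".

```stata
* Recreate the two example rows (note: -clear- drops the data in memory)
clear
input str80 text
"aa: inffant bb: insp cc: 35 yrd old. dd: ee:acad"
"aa: infant ff:no fnote bb: insp cc: 35 yrd old. dd: ee:acad"
end

* Step 1: position of the first "f" (strpos() returns 0 if there is none)
gen pos_f = strpos(text, "f")

* Step 2: keep everything before that "f"; rows without an "f" stay empty
gen head = substr(text, 1, pos_f - 1) if pos_f > 0

* Step 3: remove the first "aa:" and trim leading and trailing blanks
gen t1_steps = trim(subinstr(head, "aa:", "", 1))

list pos_f head t1_steps
```

In both example rows the first "f" is at position 7, so head is "aa: in" and t1_steps is "in".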

why do I not get "in" for both of them

Because "f" is itself included in the character class [a-z], and Stata's regexm(), regexr() and regexs() functions are greedy, i.e. they match the longest possible string. The capture group therefore runs up to the last "f" it can reach, not the first one. This also explains your second question:

why do I get "inf" for the first one but "infant ff:no" for the second one?
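One way to check the greedy explanation without a regular expression: every character after "aa: " in both example rows belongs to the class [a-z0-9.,&/: ], so the greedy group can only stop at the last "f" in the string. The sketch below mimics that with strrpos() (position of the last occurrence, available from Stata 14) and compares it with the actual match. The variable names last_f, mimic and greedy are illustrative only.

```stata
* Recreate the two example rows (note: -clear- drops the data in memory)
clear
input str80 text
"aa: inffant bb: insp cc: 35 yrd old. dd: ee:acad"
"aa: infant ff:no fnote bb: insp cc: 35 yrd old. dd: ee:acad"
end

* Position of the LAST "f" in each row
gen last_f = strrpos(text, "f")

* Text between "aa:" and that last "f", trimmed
gen mimic = trim(subinstr(substr(text, 1, last_f - 1), "aa:", "", 1))

* The greedy match from the question, trimmed for comparison
gen greedy = regexs(1) if regexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*)(f)")
replace greedy = trim(greedy)

list last_f mimic greedy
```

In row 1 the last "f" is the second "f" of "inffant", which gives "inf". In row 2 it is the "f" of "fnote", which gives "infant ff:no".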

From Stata 14, the new regular expression functions ustrregexm(), ustrregexrf(), ustrregexra() and ustrregexs() (which use the ICU regular expression engine) support non-greedy regular expression operators:

*? Match 0 or more times. Match as few times as possible.

+? Match 1 or more times. Match as few times as possible.

So, your regular expression can be changed by using the non-greedy "*?":

Code:

gen t2 = ustrregexs(1) if ustrregexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*?)(f)")
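For a side-by-side comparison, the sketch below runs the greedy and the non-greedy pattern on the two example rows. The last line is an assumption about the "pre-specified word" case in the question: it uses " bb:" as a hypothetical stop word and a plain ".*?" instead of the character class. The variable names are illustrative.

```stata
* Recreate the two example rows (note: -clear- drops the data in memory)
clear
input str80 text
"aa: inffant bb: insp cc: 35 yrd old. dd: ee:acad"
"aa: infant ff:no fnote bb: insp cc: 35 yrd old. dd: ee:acad"
end

* Greedy (classic functions): group 1 runs up to the last reachable "f"
gen greedy = regexs(1) if regexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*)(f)")

* Non-greedy (Stata 14+): group 1 stops before the first "f"
gen lazy = ustrregexs(1) if ustrregexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*?)(f)")

* Same idea with a stop word instead of a single letter (" bb:" is just an example)
gen upto_bb = ustrregexs(1) if ustrregexm(text, "aa[. #:-]*(.*?) bb:")

list greedy lazy upto_bb
```

With the example data, lazy should be "in" in both rows, while upto_bb should be "inffant" in row 1 and "infant ff:no fnote" in row 2.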

Also see this post on Stata's regular expressions.

Last edited by Bjarte Aagnes; 10 Sep 2017, 08:23.