How do I use regular expressions to match a certain number of occurrences of a given expression?

Michael Anbar

Join Date: Aug 2014

Posts: 116
#1

02 Sep 2014, 10:15

I have data like this:

Code:

clear
input str14 var1
"a_123456"
"b_456789"
"777888_c"
"12345_d_000111"
end

and I want to match the first occurrence of a six-digit number, with missing otherwise, in Stata 13.1. I can do this:

Code:

gen num = regexs(1) if regexm(var1, "([0-9][0-9][0-9][0-9][0-9][0-9])")

but is there a way to abbreviate this? Many programming languages and software packages allow for syntax like this:

Code:

gen num = regexs(1) if regexm(var1, "([0-9]{6})")

to match exactly six occurrences, or something like this to match anywhere from three to six occurrences:

Code:

gen num = regexs(1) if regexm(var1, "([0-9]{3,6})")

or

Code:

gen num = regexs(1) if regexm(var1, "([0-9]\{3,6\})")

The POSIX standard allows this, and Stata's documentation claims that its syntax is "nearly identical" to this standard. Is this supported in Stata?

Thank you,
Michael Anbar

Last edited by Michael Anbar; 02 Sep 2014, 10:38.
Nick Cox

Join Date: Mar 2014

Posts: 35720
#2

02 Sep 2014, 10:23

I think the short answer is No.

The best documentation of what is allowed that I know is http://www.stata.com/support/faqs/da...r-expressions/
Michael Anbar

Join Date: Aug 2014

Posts: 116
#3

02 Sep 2014, 10:33

That's unfortunate. As I read it, the documentation for -regexm()- and the other regex functions, unlike the FAQ you linked to, implies that Stata supports the POSIX standard, so perhaps that documentation is simply incorrect. Maybe this functionality will be extended beyond the basic syntax in the next version (although I would argue that matching simple repetitions *is* basic). Thank you for the help.

Last edited by Michael Anbar; 02 Sep 2014, 10:36.
Phil Schumm

Join Date: Mar 2014

Posts: 169
#4

02 Sep 2014, 10:43

Originally posted by Michael Anbar

I can do this:

Code:

gen num = regexs(1) if regexm(var1, "([0-9][0-9][0-9][0-9][0-9][0-9])")

but is there a way to abbreviate this?

The notation {m,n} (matching the preceding element at least m and not more than n times) is not currently supported by Stata's implementation of regular expressions. Hopefully, it will be added at some point. Until then, one possible workaround is
regexm(var1, `"(`="[0-9]"*6')"')
This is admittedly a bit ugly, and not that much help in this particular case, but in cases where you are looking for a larger number of consecutive instances, it can be helpful.
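
As a minimal sketch of the workaround above, applied to the example data from post #1: the local macro name digit6 is an illustrative choice and not part of the original posts, and the sketch assumes a Stata version in which the string duplication operator (as in "[0-9]"*6) is available.

```stata
clear
input str14 var1
"a_123456"
"b_456789"
"777888_c"
"12345_d_000111"
end

* Build the pattern once: `="[0-9]"*6' expands to [0-9] repeated six times,
* so digit6 (a hypothetical name) holds ([0-9][0-9][0-9][0-9][0-9][0-9])
local digit6 `"(`="[0-9]"*6')"'

* regexm() only ever sees the expanded, literal pattern
gen num = regexs(1) if regexm(var1, "`digit6'")
list
```

Every example value contains a run of six digits, so num should hold 123456, 456789, 777888, and 000111. Storing the pattern in a local macro keeps the quoting in one place if the same pattern is needed in several commands.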
Nick Cox

Join Date: Mar 2014

Posts: 35720
#5

02 Sep 2014, 10:56

To be clear: Phil's neat trick here doesn't mean that regexm() supports any special syntax for repeated elements. He's merely instructing Stata to evaluate a string expression before the result is passed to regexm().
Michael Anbar

Join Date: Aug 2014

Posts: 116
#6

02 Sep 2014, 11:07

Thank you for the help. I'm familiar with Stata's syntax for macro expressions, and I agree that it's definitely a kludge. As written, it only covers matching an exact number of occurrences, not a range as in the {m,n} syntax that Phil mentioned.
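
A range can be approximated in the same spirit, on the assumption that regexm() treats ? as "zero or one of the preceding element", as the FAQ linked in post #2 describes. The sketch below stands in for {3,6} by requiring three digits and making the next three optional. The names digit3to6 and num3to6 are hypothetical, chosen only for illustration.

```stata
clear
input str14 var1
"a_123456"
"b_456789"
"777888_c"
"12345_d_000111"
end

* Three required digits followed by three optional ones; the macro expands to
* ([0-9][0-9][0-9][0-9]?[0-9]?[0-9]?) before regexm() is called
local digit3to6 `"(`="[0-9]"*3'`="[0-9]?"*3')"'

gen num3to6 = regexs(1) if regexm(var1, "`digit3to6'")
list var1 num3to6
```

Like the exact-count version, this pattern does not check what surrounds the match, so a run of more than six digits would still match on its first six. Because the leftmost match wins, the last example value matches on its leading five-digit run rather than on 000111.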
Nick Cox

Join Date: Mar 2014

Posts: 35720
#7

02 Sep 2014, 12:40

Cross-posted at

http://stackoverflow.com/questions/2...rrences-of-a-g

Please note our policy on cross-posting, which is that you should tell us about it. This is explicit in the FAQ Advice.
Keith Finlay

Join Date: Apr 2014

Posts: 6
#8

03 Sep 2014, 11:09

I find Stata's regex functions so limiting that I often use the shell command to access something more powerful, like grep. Probably absurdly slow. grep in Stata 14, please?
