regexs() in MATA

Pablo Bonilla

Join Date: Apr 2014

Posts: 34
#1

09 May 2017, 10:18

Dear all,

I was wondering whether the issue with regexs() in MATA mentioned here has already been solved. In my case the problem is the following. Assume I have a matrix X from which I need to get a matrix Y using regular expressions (the problem is simplified for explanatory purposes).

Code:

X = "H 2000 A" \ "H 2001 A" \ "H 2002 A" \ "H 2003 A"
Y = 2000 \ 2001 \ 2002 \ 2003

However, if I do the following, I only get the value of the last row:

Code:

regexm(X, "([0-9]+)")
Y = regexs(1)
Y

That is not what I need. I then tried to solve the problem using regexr(). However, notice that the second time it is executed, it does not work as expected.

Code:

Y = X
Y = regexr(Y, "([A-Z]+)", "")
Y = regexr(Y, "([A-Z]+)", "") // This does not remove the "A". I solved it by including a space " ", but it is not supposed to be so --> regexr(Y, " ([A-Z]+)", "")
Y

So, does anyone know how to use regexs() correctly in MATA?

Thank you so much,

Pablo

Last edited by Pablo Bonilla; 09 May 2017, 10:19. Reason: MATA, regexm, regexs, regexr

Best,
Pablo Bonilla
Tags: mata, regexm, regexr, regexs, Regular expressions
Sergio Correia

Join Date: Apr 2014

Posts: 420
#2

09 May 2017, 11:50

This will remove the first block of non-digits:

Code:

regexr(X, "[^0-9]+", "")

So to remove both the left and right text, do:

Code:

regexr(regexr(X, "[^0-9]+", ""), "[^0-9]+", "")

Of course, you can also use regexs(), but you have to do it element-by-element:

Code:

Y = X
for (i=1; i<=rows(X); i++) {
    if (regexm(X[i], "[0-9]+")) {
        Y[i] = regexs()
    }
}
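
The element-by-element loop above can be wrapped once in a small function so that the calling code reads as if it were vectorized. The following is only a sketch: first_match() is a hypothetical helper name, not a built-in, and unlike the loop above it returns an empty string (rather than the original text) for rows that do not match. It uses regexs(0), which refers to the entire matched string.

```mata
mata:
// Hypothetical helper (not part of Mata): for each row of X, return the
// first substring matching re, or "" when that row has no match.
string colvector first_match(string colvector X, string scalar re)
{
    real scalar i
    string colvector Y

    Y = J(rows(X), 1, "")
    for (i = 1; i <= rows(X); i++) {
        // regexm() is called on one element at a time, so regexs(0)
        // always refers to the match found in X[i]
        if (regexm(X[i], re)) Y[i] = regexs(0)
    }
    return(Y)
}

X = "H 2000 A" \ "H 2001 A" \ "H 2002 A" \ "H 2003 A"
Y = strtoreal(first_match(X, "[0-9]+"))   // convert the extracted text to numbers
Y
end
```

The loop still runs once per row; the function only hides it, so this is a convenience rather than a speed-up.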
Pablo Bonilla

Join Date: Apr 2014

Posts: 34
#3

09 May 2017, 13:09

Dear Sergio,

Thank you for your response. It is very useful.
The regexs() solution is awesome. However, I think it should be something that Stata may solve for version 15. One of the purposes of MATA is precisely to avoid those element-by-element computations when they are not necessary.

I have a small follow-up question regarding regexr(). My understanding, at least from using regex in Stata, is that the ^ symbol indicates that the string begins with the subsequent expression. However, you're using it in a different way. Could you please explain that to me or point me to some documentation on this type of regex?

Thank you so much!!

Pablo

Best,
Pablo Bonilla
Sergio Correia

Join Date: Apr 2014

Posts: 420
#4

09 May 2017, 15:57

Hi Pablo,

I agree that vectorizing is always good, although if they go that route they would need to change their approach. Currently you need two different functions (regexm and regexs), and they would need to replace them with a single function that matches the regex and does the replacement in one go.

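Anyways, the ^ sign has two meanings. Outside brackets, at the start of a pattern, it anchors the match to the beginning of the string, which is the meaning you describe. Inside a set of brackets [ ] it means negation. So if you have "[^0-9]" this means any character that is not a digit.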
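
To see the two uses of ^ side by side, here is a minimal sketch. The test string s is made up for illustration; both calls use regexr(), which replaces only the first match.

```mata
mata:
s = "2000 H 2001"   // illustrative string, not from the original data

// ^ outside brackets: anchor at the start of the string,
// so this removes the leading run of digits ("2000")
regexr(s, "^[0-9]+", "")

// ^ inside brackets: negated character class,
// so this removes the first run of non-digits (" H ")
regexr(s, "[^0-9]+", "")
end
```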
Pablo Bonilla

Join Date: Apr 2014

Posts: 34
#5

10 May 2017, 13:00

Awesome! Thank you so much, Sergio.

Best,
Pablo

Best,
Pablo Bonilla