regexm not recognizing "\$" preceded or followed by other characters

Phoebe Scollard

Join Date: Nov 2020

Posts: 2
#1

25 Nov 2020, 10:34

Hi,
I ran into an issue while using regexm to search for a string containing a "$" followed by upper- or lowercase letters (i.e., "\$[a-zA-Z]"). My expression is recognized in Python, but not when using regexm in Stata. I messed around a bit, and it seems that any time "\$" is preceded or followed by other characters (I've tried numbers and 'special' characters too), Stata doesn't recognize those other characters. I understand that "\" is needed to prevent Stata from looking for a global, and I have provided some code below that I think illustrates the issue more clearly. I've figured out a workaround using Python, but this was still bothering me, so I thought I'd see if anyone knew what was going on. I'm using Stata 16.

Code:

gen dollar = "$"
gen check = regexm(dollar,"\$")
assert check == 1

gen dollar_abc = "\$abc"
gen check_abc = regexm(dollar,"\$abc")
assert check_abc == 1 // Fails for me because check_abc has all 0's

gen abc_dollar = "abc$"
gen abc_check = regexm(dollar,"abc\$")
assert abc_check == 1 // Fails for me because abc_check also has all 0's

Thank you,
Phoebe
Tags: regular expression
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

25 Nov 2020, 12:25

First of all, your code doesn't demonstrate the problem because you left the first argument as the variable dollar in each of the three regexm commands. I think the following is what you meant to demonstrate.

Code:

. gen dollar = "$"
. gen check = regexm(dollar,"\$")
. gen dollar_abc = "\$abc"
. gen check_abc = regexm(dollar_abc,"\$abc")
. gen abc_dollar = "abc$"
. gen abc_check = regexm(abc_dollar,"abc\$")
.
. list, clean abbreviate(12) noobs

    dollar   check   dollar_abc   check_abc   abc_dollar   abc_check
         $       1         $abc           0         abc$           0

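The problem you are having is that not only is $ special to Stata (for global macros), it is also special to regex as the end-of-line anchor. Stata removes the backslash from "\$" when it suppresses macro expansion, so the regular expression code receives a bare $. Matching any expression to the regular expression "$" will yield a match, while "$abc" and "abc$" cannot match your strings, because they ask for "abc" after the end of the string or for the string to end right after "abc". You can demonstrate the first point by changing the first generate command in the above to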

Code:

gen dollar = "Dollar Sign"

and it will still match.
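For readers who, like the original poster, know Python's re module, the same behavior can be reproduced there. This is only a sketch in Python, not Stata code: it assumes the regex engine receives the bare patterns "$", "$abc" and "abc$" (that is, with the backslash already gone), and the helper name matches is made up for the illustration.

```python
import re

def matches(text, pattern):
    """Return 1 if pattern is found anywhere in text, else 0 (same 0/1 convention as regexm)."""
    return 1 if re.search(pattern, text) else 0

# Patterns written as the regex engine is assumed to see them, without the backslash.
only_dollar = matches("$", "$")            # "$" is the end anchor, so it matches
any_string = matches("Dollar Sign", "$")   # every string has an end, so this matches too
dollar_abc = matches("$abc", "$abc")       # "abc" cannot come after the end of the string
abc_dollar = matches("abc$", "abc$")       # the text ends in "$", not directly after "abc"

print(only_dollar, any_string, dollar_abc, abc_dollar)
```

The two zeros correspond to the check_abc and abc_check columns in the listing above.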

The easiest way to solve the regex problem in the Stata context is to make the dollar sign the only member of a set - thus [$] - which removes the special meaning of $ to the regular expression code and simultaneously removes the special meaning of $ to Stata, since $] is not an allowable global macro name.

Code:

. gen dollar = "$"
. gen check = regexm(dollar,"[$]")
. gen dollar_abc = "\$abc"
. gen check_abc = regexm(dollar_abc,"[$]abc")
. gen abc_dollar = "abc$"
. gen abc_check = regexm(abc_dollar,"abc[$]")
.
. list, clean abbreviate(12) noobs

    dollar   check   dollar_abc   check_abc   abc_dollar   abc_check
         $       1         $abc           1         abc$           1
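The character-class form behaves the same way in Python's re module, which may be a useful cross-check for anyone moving patterns between the two. Again, this is a Python sketch rather than Stata code, and the helper name matches is hypothetical.

```python
import re

def matches(text, pattern):
    """Return 1 if pattern is found anywhere in text, else 0 (same 0/1 convention as regexm)."""
    return 1 if re.search(pattern, text) else 0

# Inside square brackets, "$" is an ordinary character rather than the end anchor.
results = [
    matches("$", "[$]"),            # literal dollar sign on its own
    matches("$abc", "[$]abc"),      # dollar sign followed by letters
    matches("abc$", "abc[$]"),      # letters followed by dollar sign
    matches("Dollar Sign", "[$]"),  # no dollar sign in the text, so no match
]

# In Python the escaped form finds the same literal dollar sign.
escaped = matches("$abc", r"\$abc")

print(results, escaped)
```

Unlike the bare "$" pattern, "[$]" no longer matches a string that contains no dollar sign.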

With all that said, since you're good with regular expressions, let me add the following advice. I don't think it would have helped here, but it will help you better use your Python regular expression expertise in Stata.

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.
Phoebe Scollard

Join Date: Nov 2020

Posts: 2
#3

25 Nov 2020, 17:28

Thank you for your explanation and advice, William! (Yes, your correction to my code is the question I meant to ask.)