Regex matching: Ignoring observations which contain certain characters in string

Tom Cordiez

Join Date: May 2020

Posts: 7
#1

10 Jun 2020, 06:31

Hello,

I have searched online extensively and found some solutions that seem to work on regextester, but when I apply them in Stata my code does not work.

I would like to flag observations where a string variable contains the word "Single" or "single", but I would like regexm to ignore the observation if "malt" or "Malt" also appears in the string.

Code:

gen byte c_singlecask = regexm(name, "(?!.*(alt)).*ingle")==1

Stata returns the error: regexp: ?+* follows nothing

Other threads that mention this error discuss escaped characters, but I don't believe that is what I am trying to do.

Has anyone got any pointers? Thanks.

STATA 15.0
Tags: regex, Regular expressions

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2403

10 Jun 2020, 07:02

I'm not sure whether a single regular expression can do what you ask, or whether I'm simply not caffeinated enough. One solution is to use two matches, which has the advantage of being easy to understand even for those without any regex knowledge.

Code:

input str16(sometext)
"A single malt"
"Single Malt"
"Scotch"
"Single origin"
""
end

gen byte not_single_malt = regexm(sometext, "([sS]ingle)") & !regexm(sometext, "([mM]alt)")
list, abbrev(16)

which produces the following output:

Code:

     +---------------------------------+
     |      sometext   not_single_malt |
     |---------------------------------|
  1. | A single malt                 0 |
  2. |   Single Malt                 0 |
  3. |        Scotch                 0 |
  4. | Single origin                 1 |
  5. |                               0 |
     +---------------------------------+

Tom Cordiez

Join Date: May 2020

Posts: 7
#3

10 Jun 2020, 07:12

This worked a dream. Thank you so kindly.

William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

10 Jun 2020, 11:38

In post #1, the error message you received may be due to the somewhat limited subset of the regular expression syntax supported by Stata's regex* functions.

When Stata added Unicode support for strings, they also added a new set of regular expression functions, ustrregex*, that were enhanced to support Unicode text, which the original set does not. But to me, the real benefit of the Unicode regular expression functions is their much more powerful definition of regular expressions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
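As a rough sketch of what that could look like for the pattern in post #1 (not run here, so treat it as a starting point): it assumes Stata 14 or later, where ustrregexm() is available, and that the ICU engine handles the negative lookahead (?!...) that regexm() rejected. It reuses the example data from post #2; the variable name c_singlecask comes from post #1.

```stata
clear
input str16 sometext
"A single malt"
"Single Malt"
"Scotch"
"Single origin"
""
end

* Negative lookahead anchored at the start of the string:
* the match fails if "malt" or "Malt" appears anywhere,
* otherwise it succeeds when "single" or "Single" is present.
gen byte c_singlecask = ustrregexm(sometext, "^(?!.*[mM]alt).*[sS]ingle")
list, abbrev(16)
```

If this behaves as intended, c_singlecask should agree with not_single_malt from post #2. The two-match approach there is still easier to read, and it works with the older regexm() function as well.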