
  • Regex matching: Ignoring observations which contain certain characters in string

    Hello,

    I have looked around online extensively and found some solutions that seem to work on regextester, but when I apply them in Stata, my code does not work.

    I would like to find observations of a string variable that contain the word "Single" or "single", but I would like regexm to ignore the observation if "malt" or "Malt" also appears in the string.

    Code:
    gen byte c_singlecask = regexm(name, "(?!.*(alt)).*ingle")==1
    Stata returns the error: regexp: ?+* follows nothing

    Other threads that mention this error discuss escaped characters, but I do not believe that is relevant to what I am trying to achieve.

    Has anyone got any pointers? Thanks.

    Stata 15.0

  • #2
    I'm not sure if it is possible to use a single regular expression to do what you ask, or if I'm simply not caffeinated enough. One solution is to use two matches, which has the advantage of being easy to understand even for those without regex knowledge.

    Code:
    input str16(sometext)
    "A single malt"
    "Single Malt"
    "Scotch"
    "Single origin"
    ""
    end
    
    gen byte not_single_malt = regexm(sometext, "([sS]ingle)") & !regexm(sometext, "([mM]alt)")
    list, abbrev(16)
    and the output:

    Code:
         +---------------------------------+
         |      sometext   not_single_malt |
         |---------------------------------|
      1. | A single malt                 0 |
      2. |   Single Malt                 0 |
      3. |        Scotch                 0 |
      4. | Single origin                 1 |
      5. |                               0 |
         +---------------------------------+


    • #3
      This worked a dream. Thank you so kindly.


      • #4
        In post #1, the error message you received may be due to the somewhat limited subset of the regular expression syntax supported by Stata's regex* functions.

        When Stata added Unicode support for strings, they also added a new set of regular expression functions, ustrregex*, that were enhanced to support Unicode text, which the original set does not. But to me, the real benefit of the Unicode regular expression functions is their much more powerful definition of regular expressions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
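
        As a rough illustration of the point above, and not a tested solution: the ICU syntax includes negative lookahead, so the single-pattern approach attempted in post #1 might be written with ustrregexm() along the following lines. The variable name single_no_malt and the extra "Malt, single" row are my own additions for illustration, and the sketch assumes Stata 14 or later.

        ```stata
        * Sketch only: assumes Stata 14 or later, where ustrregexm() is available.
        clear
        input str16(sometext)
        "A single malt"
        "Single Malt"
        "Malt, single"
        "Scotch"
        "Single origin"
        end

        * ^              anchor at the start so the lookahead scans the whole string
        * (?!.*[mM]alt)  negative lookahead: reject if "malt" or "Malt" occurs anywhere
        * .*[sS]ingle    then require "single" or "Single" somewhere in the string
        gen byte single_no_malt = ustrregexm(sometext, "^(?!.*[mM]alt).*[sS]ingle")
        list, abbrev(16)
        ```

        The leading ^ matters. Without it, the engine is free to start the match after an earlier "Malt" (as in "Malt, single"), where the lookahead no longer sees it, so the row would be flagged by mistake. The two-match approach in post #2 avoids this subtlety altogether.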
