  • How do I use regular expressions to match a certain number of occurrences of a given expression?

    I have data like this:

    Code:
    clear
    
    input str14 var1
    "a_123456"
    "b_456789"
    "777888_c"
    "12345_d_000111"
    end
    and I want to match the first occurrence of a six-digit number, with missing otherwise, in Stata 13.1. I can do this:

    Code:
    gen num = regexs(1) if regexm(var1, "([0-9][0-9][0-9][0-9][0-9][0-9])")
    but is there a way to abbreviate this? Many programming languages and software packages allow for syntax like this:

    Code:
    gen num = regexs(1) if regexm(var1, "([0-9]{6})")
    to match exactly six occurrences, or something like this, to match anywhere from 3-6 occurrences:

    Code:
    gen num = regexs(1) if regexm(var1, "([0-9]{3,6})")
    or

    Code:
    gen num = regexs(1) if regexm(var1, "([0-9])\{3,6\}")

    The POSIX standard allows this, and Stata's documentation claims that its syntax is "nearly identical" to this standard. Is this supported in Stata?

    Thank you,
    Michael Anbar
    Last edited by Michael Anbar; 02 Sep 2014, 10:38.

  • #2
    I think the short answer is No.

    The best documentation that I know of for what is allowed is http://www.stata.com/support/faqs/da...r-expressions/


    • #3
      That's unfortunate. Perhaps the documentation for -regexm()- and the other regex functions is simply incorrect, or at least misleading: unlike the FAQ you linked to, it certainly implies that Stata supports the POSIX standard. Maybe the next version will extend this functionality beyond the basic syntax (although I would argue that searching for simple repetitions *is* basic). Thank you for the help.
      Last edited by Michael Anbar; 02 Sep 2014, 10:36.


      • #4
        Originally posted by Michael Anbar
        I can do this:
        Code:
        gen num = regexs(1) if regexm(var1, "([0-9][0-9][0-9][0-9][0-9][0-9])")
        but is there a way to abbreviate this?
        The notation {m,n} (matching the preceding element at least m and not more than n times) is not currently supported by Stata's implementation of regular expressions. Hopefully, it will be added at some point. Until then, one possible workaround is:
        Code:
        regexm(var1, `"(`="[0-9]"*6')"')
        This is admittedly a bit ugly, and not that much help in this particular case, but in cases where you are looking for a larger number of consecutive instances, it can be helpful.
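
        For readers who prefer to keep the pattern out of the regexm() call, the same idea can be sketched with a local macro. This is only an illustration of the workaround above, not a different technique; the macro name digits6 and the variable name num6 are made up for the example.

        ```stata
        * Build the pattern once: "[0-9]"*6 repeats the string six times
        local digits6 = "[0-9]" * 6

        * Same match as the long-hand version, wrapped in a capture group
        gen num6 = regexs(1) if regexm(var1, "(`digits6')")

        * On the example data: 123456, 456789, 777888, 000111
        list var1 num6
        ```

        Changing the 6 in the local definition is then the only edit needed to look for a longer run of digits.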


        • #5
          To be clear: Phil's neat trick here doesn't mean that regexm() supports any special syntax for repeated elements. He's merely instructing Stata to evaluate a string expression before the result is passed to regexm().
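
          One way to see this is to display the string that regexm() actually receives once the inline expression has been expanded. A minimal sketch:

          ```stata
          * The `=exp' macro expansion is evaluated first, so this prints
          * ([0-9][0-9][0-9][0-9][0-9][0-9])
          display `"(`="[0-9]"*6')"'
          ```

          The regex engine only ever sees the long-hand pattern; the repetition happens entirely in Stata's macro processor.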


          • #6
            Thank you for the help; I'm familiar with Stata's syntax for macro expressions, but I agree that it's definitely a kludge, albeit one that only works for matching a number of occurrences exactly, not the {m, n} syntax (as Phil mentioned).
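
            That said, a range can probably be approximated with the same kludge, assuming the ? operator (zero or one of the preceding element) listed in the FAQ linked in #2 behaves greedily: repeat the element m times, then repeat the optional element n-m times. A rough sketch for 3 to 6 digits, with made-up names d3to6 and num3to6:

            ```stata
            * Three mandatory digits followed by up to three optional ones
            local d3to6 = ("[0-9]" * 3) + ("[0-9]?" * 3)

            * Expands to [0-9][0-9][0-9][0-9]?[0-9]?[0-9]?
            gen num3to6 = regexs(1) if regexm(var1, "(`d3to6')")
            list var1 num3to6
            ```

            Note that with this pattern the leftmost run of at least three digits wins, so the last example observation would match its leading five-digit run rather than the six-digit one.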


            • #7
              Cross-posted at

              http://stackoverflow.com/questions/2...rrences-of-a-g

              Please note our policy on cross-posting, which is that you should tell us about it. This is explicit in the FAQ Advice.


              • #8
                I find Stata's regex functions so limiting that I often use the shell command to access something more powerful, like grep. Probably absurdly slow. grep in Stata 14, please?
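
                For anyone reading this with a newer Stata: versions 14 and later include Unicode regex functions, ustrregexm() and ustrregexs(), which accept the {m,n} notation. They are not available in the Stata 13.1 the original question is about, so treat the following as a sketch that assumes Stata 14 or later; the variable names are made up.

                ```stata
                * Assumes Stata 14+; these functions do not exist in Stata 13.1
                * ustrregexs(0) returns the whole match, so no capture group is needed
                gen num_u = ustrregexs(0) if ustrregexm(var1, "[0-9]{6}")

                * The range form from the original question
                gen num_r = ustrregexs(0) if ustrregexm(var1, "[0-9]{3,6}")
                ```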
