
  • regexs() in MATA

    Dear all,

    I was wondering whether the issue with regexs() in MATA mentioned here has already been solved. In my case the problem is the following. Assume I have a matrix X from which I need to obtain a matrix Y using regular expressions (the problem is simplified for explanatory purposes).

    Code:
    X = "H 2000 A" \  "H 2001 A" \ "H 2002 A" \  "H 2003 A"
    Y = 2000 \ 2001 \ 2002 \ 2003
    However, if I do the following I only get the value of the last row
    Code:
    regexm(X, "([0-9]+)")
    Y = regexs(1)
    Y
    Which is not what I need. Then, I tried to solve this problem by using regexr(). However, notice that the second time it is executed, it does not work as expected.

    Code:
    Y=X
    Y = regexr(Y, "([A-Z]+)", "")
    Y = regexr(Y, "([A-Z]+)", "")  // This does not remove the "A". I solved it by including a space " ", but it is not supposed to be so  --> regexr(Y, " ([A-Z]+)", "")
    Y
    So, does anyone know how to use regexs() correctly in MATA?

    Thank you so much,

    Pablo
    Last edited by Pablo Bonilla; 09 May 2017, 10:19. Reason: MATA, regexm, regexs, regexr
    Best,
    Pablo Bonilla

  • #2
    This will remove the first block of non-digits
    Code:
    regexr(X, "[^0-9]+", "")
    So to remove both the left and right text, do:

    Code:
    regexr(regexr(X, "[^0-9]+", ""), "[^0-9]+", "")
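    If Y should be numeric, as in the original question, the resulting strings can presumably be converted with strtoreal(). A minimal sketch, assuming X is the column vector from the first post:

```mata
// Run at the Mata prompt; X as defined in the first post
X = "H 2000 A" \ "H 2001 A" \ "H 2002 A" \ "H 2003 A"

// Strip the leading and trailing non-digit blocks, then convert to numbers
Y = strtoreal(regexr(regexr(X, "[^0-9]+", ""), "[^0-9]+", ""))
Y
```

    strtoreal() returns missing for any element that is not a valid number, so rows with leftover text would show up as missing rather than raise an error.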
    Of course, you can also use regexs(), but have to do it element-by-element:

    Code:
    Y = X
    for (i=1; i<=rows(X); i++) {
        if (regexm(X[i], "[0-9]+")) {
            Y[i] = regexs()
        }
    }
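    The loop above could be wrapped in a small function so that the element-by-element step is written only once. This is only a sketch: regex_capture() is a hypothetical helper name, not a built-in, and it assumes X is a string column vector. It uses regexs(n) to pick the n-th parenthesized group, as in the original regexm(X, "([0-9]+)") attempt.

```mata
// Hypothetical helper (not a built-in): n-th captured group for each row of X
string colvector regex_capture(string colvector X, string scalar re, real scalar n)
{
    real scalar i
    string colvector Y

    Y = J(rows(X), 1, "")             // rows without a match stay empty
    for (i = 1; i <= rows(X); i++) {
        // regexs() refers to the most recent regexm() call, so use it right away
        if (regexm(X[i], re)) Y[i] = regexs(n)
    }
    return(Y)
}

X = "H 2000 A" \ "H 2001 A" \ "H 2002 A" \ "H 2003 A"
Y = regex_capture(X, "([0-9]+)", 1)
Y
```

    Unlike the loop above, which starts from Y = X, this version leaves non-matching rows as empty strings; which default is preferable depends on the data.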



    • #3

      Dear Sergio,

      Thank you for your response. It is very useful.
      The regexs() solution is awesome. However, I think it is something that Stata may solve in version 15. One of the purposes of MATA is precisely to avoid element-by-element computations when they are not necessary.

      I have a small follow-up question regarding regexr(). My understanding--at least using regex in Stata--is that the ^ symbol indicates that the string begins with the subsequent expression. However, you're using it in a different way. Could you please explain that to me or point me to some documentation on this type of regex?

      Thank you so much!!

      Pablo



      • #4
        Hi Pablo,

        I agree that vectorizing is always good. If they go that route, though, they would need to change their approach: currently you need two different functions (regexm() and regexs()), and these would have to be replaced by a single function that matches the regex and returns the result in one go.

        Anyway, the ^ sign has two meanings. Outside brackets it anchors the match to the start of the string, as you describe. As the first character inside a set of brackets [ ] it means negation, so "[^0-9]" matches any character that is not a digit.
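        To make the two uses of ^ concrete, here is a short sketch using the strings from this thread. The expected results in the comments follow from the usual regex rules for anchoring and negated sets:

```mata
// ^ outside brackets: anchors the match to the start of the string
regexm("2000 A", "^[0-9]+")          // 1: the string begins with digits
regexm("H 2000 A", "^[0-9]+")        // 0: the string begins with "H"

// ^ as the first character inside brackets: negates the set
regexr("H 2000 A", "[^0-9]+", "")    // "2000 A": first block of non-digits removed
```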



        • #5
          Awesome! Thank you so much, Sergio.

          Best,
          Pablo

