
  • New program for regular expressions

    There isn't too much to it at the moment, but I just put together a quick regular expression replace function (it replaces either the first occurrence or all occurrences, depending on optional arguments). Unlike the native regular expression functions in Stata, -jregex- uses the regular expression capabilities available in Java. The biggest differences that users may notice between this program and the native Stata functions are the ability to use the POSIX character classes (e.g., \p{Alpha}, \p{Punct}, etc.), counting quantifiers (e.g., {2,3} matches at least twice but not more than 3 times), named groups (e.g., you can name subexpressions and reference them later by name in addition to the typical $1, $2 group indicators), and several other features that are available in Java. You can find more information about the Java implementation of regular expressions by reading the Pattern API Javadocs. To install the program use:

    net inst jregex, from("")
    The current plan is to implement all functionality using a single API with subcommands. The replace functionality is called with

    jregex replace ...
    It either replaces the values in place (pass a single variable after the replace subcommand) or places the new values into a new variable (pass an existing and a new variable name after the replace subcommand). You can find a few examples of how it can be used on the program's project page.
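For readers who have not used the Java syntax mentioned above, here is a minimal sketch written directly against java.util.regex. It is not jregex's own code; the class name, method names, and sample address are chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it is not part of jregex.
class JavaRegexFeatures {

    // POSIX character class: replace each ASCII punctuation character with a space.
    static String punctToSpace(String s) {
        return s.replaceAll("\\p{Punct}", " ");
    }

    // Counting quantifier: remove the first run of two or three digits.
    static String dropFirstDigits(String s) {
        return s.replaceFirst("[0-9]{2,3}", "");
    }

    // Named groups: capture the state and ZIP code by name, then swap them
    // by referring to those names in the replacement string.
    static String swapStateAndZip(String s) {
        Pattern p = Pattern.compile("(?<state>[A-Za-z]{2}),\\+(?<zip>[0-9]{5})$");
        return p.matcher(s).replaceFirst("${zip},+${state}");
    }

    public static void main(String[] args) {
        String addy = "22,+Oaklawn+Ave,+Cranston,+RI,+02920";
        System.out.println(punctToSpace(addy));
        System.out.println(dropFirstDigits("year 1968"));
        System.out.println(swapStateAndZip(addy));
    }
}
```

Note that the quantifier is greedy, so `{2,3}` consumes three digits when three are available.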

  • #2
    I believe that Stata's new Unicode versions of the regular expression functions (version 14) can do all of this. Here's how to replicate the example in the help file of jregex. Stata uses the standard POSIX bracket expressions while Java uses the \p syntax. It also appears that Stata's implementation of [:punct:] does not include the plus sign, so a bit of adjustment is needed.

    . clear
    . input str52 addy
      1. "6675,+Old+Canton+RD,+Ridgeland,+MS,+39157"
      2. "12313,+33RD+Ave+North,+SEATTLE,+wa,+98125"
      3. "310,+Cahir+StreeT,+Providence,+Rhode+Island,+02903"
      4. "22,+Oaklawn+Ave,+Cranston,+RI,+02920"
      5. "61,+pine+st,+Attleboro,+MA,+02703"
      6. "10,+larkspur+R0ad,+Warwick,+ri,+02886"
      7. "91,+FaLLon+Ave,+Providence,+RI,+02908"
      8. "195,+Arlington+AVE,+Providence,+RI,+02906"
      9. "74,+REGENT+aVenuE,+Providence,+RI,+02908"
     10. end
    . clonevar addy0 = addy
    . * This will replace the first instance of the '+' character with the string passed to the rep argument.
    . jregex replace addy, p(`"\+"') rep(`"_this is a replacement string_"') repf
    . gen addy2 = ustrregexrf(addy0,`"\+"',`"_this is a replacement string_"')
    . assert addy == addy2
    . * Now we can replace the replacement string to recreate the original variable.
    . jregex replace addy newaddy, rep(`"\+"') p(`"_this is a replacement string_"')
    . gen newaddy2 = ustrregexrf(addy0,`"_this is a replacement string_"',`"\+"')
    . assert newaddy == newaddy2
    . * You can also use POSIX character classes to replace values, like punctuation marks with a single space
    . jregex replace newaddy, p(`"\p{Punct}"') rep(" ")
    . replace newaddy2 = ustrregexra(newaddy2,`"([:punct:]|\+)"'," ")
    (9 real changes made)
    . assert newaddy == newaddy2
    Here's another example using a repetition quantifier
    . dis ustrregexrf("year 1968","[0-9]{1,2}","")
    year 68
    Stata's regular expressions are notoriously under-documented. I recommend trying them first before looking for alternative solutions.
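One plausible explanation for the plus-sign discrepancy (an assumption, not something confirmed by Stata's documentation) is that the Unicode functions read [:punct:] as the Unicode punctuation category, where + is classified as a math symbol instead. Java exposes the same split between its ASCII-only \p{Punct} and the Unicode category \p{P}, as this small sketch with made-up names shows:

```java
// Hypothetical demo class contrasting two notions of "punctuation" in Java.
class PunctClasses {

    // ASCII POSIX class: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    static boolean isPosixPunct(String ch) {
        return ch.matches("\\p{Punct}");
    }

    // Unicode general category P (punctuation).
    static boolean isUnicodePunct(String ch) {
        return ch.matches("\\p{P}");
    }

    // Unicode general category Sm (math symbol).
    static boolean isMathSymbol(String ch) {
        return ch.matches("\\p{Sm}");
    }

    public static void main(String[] args) {
        System.out.println("+ POSIX punct:   " + isPosixPunct("+"));
        System.out.println("+ Unicode punct: " + isUnicodePunct("+"));
        System.out.println("+ math symbol:   " + isMathSymbol("+"));
        System.out.println(", Unicode punct: " + isUnicodePunct(","));
    }
}
```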


    • #3
      Robert Picard, the new Unicode-based regular expressions are definitely an improvement. Unless they have changed, they still operate on a single variable at a time, and the Java-based functions are being developed to operate over multiple variables simultaneously. Another difference that I have planned (although not currently implemented) is string splitting based on regular expressions and/or retrieval of groups into multiple variables (e.g., if you wanted to parse an address string into its component parts, an option to pass a single address string that returns variables for house number, street, city, state, zip code, etc.).


      • #4
        I'm ready to stipulate that you know Java and that solutions in Java come to you naturally but this is a Stata forum so I would think that users generally expect a Stata solution. Of course if there's no way to do it efficiently in Stata, then by all means suggest ahead. I would say the same if some Excel expert created a script to export data from Stata to Excel to perform string manipulations using VBA and then moved the results back to Stata when all the while the same task could be done using a simple Stata command.

        The generate and replace commands operate on a single variable at a time. I could create mgenerate and mreplace alternatives that would perform the same function on multiple variables. I don't think this requires Java, and I don't think this is needed or desirable, so I won't do it.

        You can take a look at moss (with Nick Cox, from SSC), it handles matching/splitting strings based on regular expressions. Again, no need for Java.


        • #5
          I guess that I am puzzled here too at the general strategy. For example, if I wanted to apply the same operation to several variables, then I would reach for a loop or Mata. That wouldn't rule out a customised program for something specific, but there are plenty of workable wheels in Stata.

          All that said, I know (e.g. from comments at users' meetings) that StataCorp are minded to expand regex stuff in Stata and to document it better and FWIW I am cheering them on.


          • #6
            I, too, look forward to the day when Stata offers expanded support for -regex- and the quality of documentation that it typically provides for other features!


            • #7
              I end up having to do a fair amount of work with rather messy data systems on a regular basis. In some cases, several fields are closely related but contain different data. In those cases, being able to use a single command to perform an operation over multiple variables (e.g., extracting the second matching group across several variables) saves time. It is absolutely possible and plausible to use egen newvar = concat() to perform the equivalent operation on the data, or to construct a loop that handles all of that, but this reduces the workload to a single line of code. I also agree that there are other solutions that do the same or similar things and, due to a lack of initial planning, the stuff I've developed is only compatible with the Stata 14 Java API. That said, when I finally get around to porting class constructors to be compatible with the Stata 13 API methods (e.g., using int observation indices vs long observation indices), it will provide users who do not have access to the most current version of Stata with comparable functionality.
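To make the idea of one call over several variables concrete, here is a rough sketch in plain Java. It does not use the Stata Java API and is not how jregex is implemented; the columns are modelled as lists of strings, and every name is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: columns are plain lists, not Stata variables.
class MultiColumnGroups {

    // Return the requested capture group for each value, or "" when the pattern does not match.
    static List<String> extractGroup(List<String> values, Pattern p, int group) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            Matcher m = p.matcher(v);
            out.add(m.find() ? m.group(group) : "");
        }
        return out;
    }

    // Apply the same extraction to every column in a single call.
    static Map<String, List<String>> extractAll(Map<String, List<String>> cols, Pattern p, int group) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : cols.entrySet()) {
            out.put(e.getKey(), extractGroup(e.getValue(), p, group));
        }
        return out;
    }
}
```

A caller would build the map of columns once and pass the group index (2 for the second matching group); the loop still exists, it is just hidden behind one entry point.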

              Robert Picard, while I understand that this is a Stata forum, there is a Java API for Stata (and a C/C++ API as well). Unfortunately, the examples of people using the C/C++ API are few in number, and in some cases are platform dependent and/or closed source. There are also definitely times when pushing things down to a lower-level language is not necessary, but depending on how the program is written there can (definitely not in this case) be performance benefits that come with using lower-level languages (or potentially a reduction in memory consumption by operating on a smaller set of the data without the time it takes to interpret a higher-level language ...).

              Some of this started on another thread about cleaning cells containing multiple lines. Several different solutions were suggested, and in my haste I ignored the most parsimonious solution to ensuring any/all characters that cause a line break would be removed/substituted. It is absolutely possible to include "\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]" as a piece of a regular expression to ensure that any Unicode characters that could cause any form of line break were caught or we could use "\R" to get the same result. While this is likely not to be such an issue for the majority of users, those of us who spend the majority of our day coding tend to appreciate solutions that save several key strokes. There are also other predefined character class references that other users may find helpful (e.g., using \p{Sc} to capture currency symbols vs stating each currency symbol explicitly).
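As a hedged illustration of the two points above, the following sketch uses java.util.regex directly (class and method names are made up). In a Java string literal the backslashes of the explicit alternation have to be doubled so that the regex engine, not the compiler, interprets the escapes.

```java
// Hypothetical demo class for the line-break and currency shortcuts in Java regex.
class LineBreaksAndCurrency {

    // The explicit alternation quoted above, with backslashes doubled for Java source.
    static final String EXPLICIT =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Replace every line break with a space using the explicit alternation.
    static String explicitBreaksToSpace(String s) {
        return s.replaceAll(EXPLICIT, " ");
    }

    // Same result with the linebreak matcher (available since Java 8).
    static String lineBreaksToSpace(String s) {
        return s.replaceAll("\\R", " ");
    }

    // Remove any character in the Unicode currency-symbol category.
    static String stripCurrency(String s) {
        return s.replaceAll("\\p{Sc}", "");
    }

    public static void main(String[] args) {
        String messy = "line1\r\nline2\nline3";
        System.out.println(explicitBreaksToSpace(messy));
        System.out.println(lineBreaksToSpace(messy));
        System.out.println(stripCurrency("$3 and \u20ac5"));
    }
}
```

Both replacement methods treat a carriage return followed by a line feed as a single break, so a Windows line ending becomes one space and not two.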


              • #8
                I think creating an alternative way of doing something that Stata's built-in commands can do is not a bad idea at all; rather, it expands the list of choices. For example, Nick recently wrote the numdate program to easily convert strings or dates to a desired date format. Stata already has such functionality. However, the new program offers a syntax convenience that many users find useful. So, I wonder why there is so much fuss about jregex?
                Attaullah Shah, PhD.
                Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
                Check my asdoc program, or even better asdocx, that easily sends Stata output to MS Word


                • #9
                  Sorry for the perceived fuss; it's just that in #1, wbuchanan was apparently misinformed and therefore incorrectly characterized Stata's support for POSIX regular expressions. Of course, if you are confused by Stata's regular expression functions and think that jregex provides added value, by all means go ahead and use it. But it would still be correct to point out that the functionality you find convenient can be provided without a trip to Java (and more efficiently to boot). Perhaps this last observation could lead to an asregex in the future?


                  • #10
                    Robert Picard, could you share the reference that states that Stata implements the full POSIX standard in their regular expressions? I saw no mention of it in the manuals, and other information from Stata states fairly explicitly that they implement their own regular expression engine due to differences in operating systems. Also, could you clarify the efficiency statement? I haven't attempted any benchmarking, but if you've found that the program performs poorly I'm more than happy to look at and work on addressing bottlenecks in the way the program works.


                    • #11
                      wbuchanan intended, I believe, to call his question to the attention of Robert Picard but accidentally mentioned a Robert who is registered with a first name only.

                      With that said, it's not clear to me that Robert Picard asserted that Stata implements the full POSIX standard: I interpreted his meaning to be that it could accomplish the "biggest differences" you pointed out in the initial post. I was pleased to see that, however slowly, StataCorp is making its way toward a fuller implementation of regular expressions. But I don't think I'll be translating my files into Unicode to take advantage of them.


                      • #12
                        You should still be able to use the Unicode regular expression functions with the ASCII character set without issues (other than using different function names). While I definitely like the new regular expression functions, breaking the regular expression API by defining new functions in a different namespace, vs using an option to specify ASCII vs Unicode, makes it a bit more difficult to use the functionality within the context of existing code. If an additional parameter had been added to the regex API with a default value that retains the ASCII behavior, it would be nicer from the perspective of maintaining code. That said, one of the deliberate decisions I made was to make the command jregex be the same for all cases. Different handling/mode options get toggled and have defaults set so users don't need to worry about setting the values explicitly. I have already started implementing functionality that operates over multiple variables simultaneously and will add an option to allow using the same interface to define sequential behavior (to reduce the need to call loops and/or concatenate strings manually). It's fine if this isn't something that others need or want to use, but it is an option that is available, and when I get around to refactoring some of the code I use to access the data it will be accessible to users of Stata 13 as well.
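The design preference described above (one entry point, with a mode option that defaults to ASCII behavior) resembles how Java's Pattern flags work. A minimal sketch, assuming nothing about jregex's actual options; the class and method names are invented for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical demo: one method, with an option selecting ASCII or Unicode semantics.
class AsciiVsUnicodeMode {

    // With unicode == false, \w means [a-zA-Z_0-9]; with the flag set,
    // \w follows the Unicode definition of a word character.
    static boolean isWord(String s, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\w+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("cafe", false));
        System.out.println(isWord("caf\u00e9", false));
        System.out.println(isWord("caf\u00e9", true));
    }
}
```

Existing callers that never pass the option keep the ASCII behavior, which is the code-maintenance argument being made in the post.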


                        • #13
                          wbuchanan, unfortunately, you relied on an old FAQ (circa 2005) written long before Stata 14. You made a mistake when you assumed that the new Unicode regular expression functions would provide no additional functionality.

                          As I said in my original post in #2:
                          Stata's regular expressions are notoriously under-documented.
                          I do not know of any reference that provides the details of Stata's new Unicode regular expression functions. Let me also make clear that I have no inside knowledge of the inner workings of Stata. And I am not employed by or affiliated with StataCorp (nor have I ever been). I credit Dimitriy V. Masterov as the first who noticed the extra functionality on Stack Overflow in Sep. 2015.

                          You appear to have misread my comments in #2. I said

                          Stata uses the standard POSIX bracket expressions while Java uses the \p syntax.
                          I could not have said that the full POSIX standard is implemented because I don't know what is. All I did was to replicate the examples in your help file using Stata's regular expression functions. How did I manage to do that without documentation? I went to and looked up the POSIX character classes and tried them out on your examples.

                          The bottom line is that you wrote a program to perform regular expressions in Java and I pointed out that your statement in #1

                          The biggest difference that users may notice between how this program and the native Stata functions work is the ability to use the POSIX character classes (e.g., \p{Alpha}, \p{Punct}, etc...), conditional/counting meta characters (e.g., {2, 3} match at least twice but not more than 3 times)
                          is incorrect.

                          Since Java does not support POSIX bracket expressions (but does support POSIX character classes using the \p operator), I think it would be more accurate to say that jregex allows the user to specify a regular expression pattern using Java syntax. If your program can do something that can't be done as easily in Stata, that's fine too, but it's up to you to provide examples.

                          With respect to your question on efficiency, since I believe that jregex is of limited utility, I have no interest in working for you to benchmark it. I've already spent more time than I care to trying to correct your mischaracterizations of Unicode and regular expression support in Stata in this thread and in response to Stas yesterday. I comment only because I strongly believe that the language and style you used could easily misdirect Statalist readers away from a perfectly simple and correct solution in Stata. The burden is on you to explain the plus value of a trip to Java.

                          Finally, let me reiterate that if you intend to "improve" your program by including the capability of splitting strings based on regular expressions and/or retrieval of groups, this is a functionality that has been available for years via moss (from SSC). I'm not saying that you can't go that way, just don't say that this would bring a new functionality to Stata users.


                          • #14
                            William, Unicode is a superset of plain ASCII so the new Unicode functions can be used without fear on plain ASCII strings in Stata 14. The old versions remain for backwards compatibility and will work as before on characters 128-255 and continue to match based on byte values. The new Unicode functions perform character-based matches (which involve multiple bytes in UTF-8 for any non-ASCII character, and each byte that forms the multi-byte character is > 127).
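The byte-level claims above are easy to check outside Stata. The following Java sketch (hypothetical class and method names) encodes a few strings as UTF-8 and inspects the bytes:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of UTF-8 byte lengths for ASCII and non-ASCII characters.
class Utf8Bytes {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when every UTF-8 byte of the string has an unsigned value above 127.
    static boolean allBytesAbove127(String s) {
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if ((b & 0xFF) <= 127) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));            // plain ASCII letter
        System.out.println(utf8Length("\u00e9"));       // e with acute accent
        System.out.println(utf8Length("\u20ac"));       // euro sign
        System.out.println(allBytesAbove127("\u00e9"));
    }
}
```

An ASCII character takes a single byte, so plain ASCII text is byte-for-byte identical in UTF-8; only non-ASCII characters expand to two or more bytes.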


                            • #15
                              Robert, many thanks. Somewhere I gained a misunderstanding of the workings of Unicode and thought it required a minimum of 2 bytes per character.