New program for regular expressions

William Lisowski replied

20 Feb 2016, 12:37
wbuchanan intended, I believe, to call his question to the attention of Robert Picard but accidentally mentioned a Robert who is registered with a first name only.

With that said, it's not clear to me that Robert Picard asserted that Stata implements the full POSIX standard: I interpreted his meaning to be that it could accomplish the "biggest differences" you pointed out in the initial post. I was pleased to see that, however slowly, StataCorp is making its way toward a fuller implementation of regular expressions. But I don't think I'll be translating my files into Unicode to take advantage of them.
wbuchanan replied

20 Feb 2016, 11:55
Robert Picard, could you share the reference that states that Stata implements the full POSIX standard in its regular expressions? I saw no mention of it in the manuals (http://www.stata.com/manuals14/fnstr...ions.pdf#page7), and other information from Stata states fairly explicitly that they implement their own regular expression engine due to differences in operating systems (http://www.stata.com/support/faqs/da...r-expressions/). Also, could you clarify the efficiency statement? I haven't attempted any benchmarking, but if you've found that the program performs poorly, I'm more than happy to look at and work on addressing bottlenecks in the way the program works.
Robert Picard replied

20 Feb 2016, 09:20
Sorry for the perceived fuss. It's just that in #1, wbuchanan was apparently misinformed and therefore incorrectly characterized Stata's support for POSIX regular expressions. Of course, if you are confused by Stata's regular expression functions and think that jregex provides added value, by all means go ahead and use it. But it would still be correct to point out that the functionality you find convenient can be provided without a trip to Java (and more efficiently to boot). Perhaps this last observation could lead to an asregex in the future?
Attaullah Shah replied

19 Feb 2016, 21:43
I think creating an alternative way of doing something that Stata's built-in commands can already do is not a bad idea at all; rather, it expands the list of choices. For example, Nick recently wrote the numdate program to easily convert strings or dates to a desired date format. Stata already has such functionality. However, the new program offers syntax convenience that many users find useful. So, I wonder why there is so much fuss about jregex?
wbuchanan replied

19 Feb 2016, 14:06
I end up having to do a fair amount of work with rather messy data systems on a regular basis. Often there are several fields that are closely related but may contain different data. In those cases, being able to use a single command to perform an operation over multiple variables (e.g., extracting, say, the second matching group across several variables) saves time. It is absolutely possible and plausible to use egen newvar = concat() to perform the equivalent operation on the data, or to construct a loop that handles all of that, but this reduces the workload to a single line of code. I also agree that there are other solutions that do the same or similar things, and due to a lack of initial planning the stuff I've developed is only compatible with the Stata 14 Java API. That said, when I finally get around to porting class constructors to be compatible with the Stata 13 API methods (e.g., using int observation indices vs long observation indices), it will provide users who do not have access to the most current version of Stata with comparable functionality.

Robert Picard, while I understand that this is a Stata forum, Stata does have a Java API (as well as a C/C++ API). Unfortunately, examples of people using the C/C++ API are few in number, and in some cases are platform dependent and/or closed source. There are definitely times when pushing things down to a lower-level language is not necessary, but depending on how the program is written there can be performance benefits (definitely not in this case) from using lower-level languages (or potentially a reduction in memory consumption by operating on a smaller set of the data without the time it takes to interpret a higher-level language ...).

Some of this started on another thread about cleaning cells containing multiple lines. Several different solutions were suggested, and in my haste I overlooked the most parsimonious way of ensuring that any and all characters that cause a line break would be removed or substituted. It is absolutely possible to include "\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]" as a piece of a regular expression to ensure that any Unicode characters that could cause any form of line break are caught, or we could use "\R" to get the same result. While this is not likely to be an issue for the majority of users, those of us who spend the majority of our day coding tend to appreciate solutions that save several keystrokes. There are also other predefined character class references that other users may find helpful (e.g., using \p{Sc} to capture currency symbols vs stating each currency symbol explicitly).
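As a rough illustration of the two shorthands mentioned above, here is a minimal Java sketch using only the standard `String.replaceAll` method. The class name `LineBreakDemo` and its helper methods are hypothetical names chosen for this example and are not part of jregex; `\R` requires Java 8 or later.

```java
public class LineBreakDemo {
    // Replace every line break with a single space. In Java 8+, \R matches
    // \r\n as one unit, plus \n, \u000B, \u000C, \r, \u0085, \u2028, \u2029.
    static String stripLineBreaks(String s) {
        return s.replaceAll("\\R", " ");
    }

    // Remove every character in the Unicode "currency symbol" category (Sc).
    static String stripCurrency(String s) {
        return s.replaceAll("\\p{Sc}", "");
    }

    public static void main(String[] args) {
        // Hypothetical cell content containing several kinds of line breaks.
        String cell = "line one\r\nline two\u2028line three\u0085line four";
        System.out.println(stripLineBreaks(cell));

        // Dollar, euro, pound, and yen signs written as escapes for portability.
        System.out.println(stripCurrency("$10, \u20AC20, \u00A330, \u00A540"));
    }
}
```

Because `\R` treats a carriage return followed by a line feed as one match, a Windows-style line ending becomes one space rather than two.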
Clyde Schechter replied

19 Feb 2016, 11:14
I, too, look forward to the day when Stata offers expanded support for -regex- and the quality of documentation that it typically provides for other features!
Nick Cox replied

19 Feb 2016, 10:52
I guess that I am puzzled here too at the general strategy. For example, if I wanted to apply the same operation to several variables, then I would reach for a loop or Mata. That wouldn't rule out a customised program for something specific, but there are plenty of workable wheels in Stata.

All that said, I know (e.g. from comments at users' meetings) that StataCorp are minded to expand regex stuff in Stata and to document it better and FWIW I am cheering them on.
Robert Picard replied

19 Feb 2016, 10:45
I'm ready to stipulate that you know Java and that solutions in Java come to you naturally, but this is a Stata forum, so I would think that users generally expect a Stata solution. Of course, if there's no way to do it efficiently in Stata, then by all means go ahead and suggest one. I would say the same if some Excel expert created a script to export data from Stata to Excel, perform string manipulations using VBA, and then move the results back to Stata when all the while the same task could be done using a simple Stata command.

The generate and replace commands operate on a single variable at a time. I could create mgenerate and mreplace alternatives that would perform the same function on multiple variables. I don't think this requires Java, and I don't think this is needed or desirable, so I won't do it.

You can take a look at moss (with Nick Cox, from SSC), it handles matching/splitting strings based on regular expressions. Again, no need for Java.
wbuchanan replied

19 Feb 2016, 09:59
Robert Picard, the new Unicode-based regular expressions are definitely an improvement. Unless they have changed, they still operate on a single variable at a time, and the Java-based functions are being developed to operate over multiple variables simultaneously. Another difference that I have planned (although not currently implemented) is string splitting based on regular expressions and/or retrieval of groups into multiple variables (e.g., if you wanted to parse an address string into its component parts, having an option to pass a single address string that returns variables for house number, street, city, state, zip code, etc.).
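To make the idea of retrieving groups into separate fields concrete, here is a minimal Java sketch using `java.util.regex`. It is not the planned jregex feature, only an illustration. It assumes the "+"-separated address layout from the jregex help-file example quoted elsewhere in this thread (number, street, city, state, five-digit zip); the class name `AddressSplit`, the pattern, and the `split` helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddressSplit {
    // Assumed layout: number,+street,+city,+state,+zip with '+' standing in for spaces.
    private static final Pattern ADDRESS = Pattern.compile(
        "^(\\d+),\\+([^,]+),\\+([^,]+),\\+([^,]+),\\+(\\d{5})$");

    // Returns the five captured parts, or an empty array if the string does not match.
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return new String[0];
        }
        String[] parts = new String[m.groupCount()];
        for (int g = 1; g <= m.groupCount(); g++) {
            // Turn the '+' placeholders inside each part back into spaces.
            parts[g - 1] = m.group(g).replace('+', ' ');
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = split("310,+Cahir+StreeT,+Providence,+Rhode+Island,+02903");
        System.out.println(String.join(" | ", parts));
    }
}
```

Each element of the returned array would correspond to one new variable (house number, street, city, state, zip code). Real address data would need a more forgiving pattern than this one.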

Robert Picard replied

19 Feb 2016, 09:35

I believe that Stata's new Unicode versions of the regular expression functions (version 14) can do all of this. Here's how to replicate the example in the help file of jregex. Stata uses the standard POSIX bracket expressions while Java uses the \p syntax. It also appears that Stata's implementation of [:punct:] does not include the plus sign, so a bit of adjustment is needed.

Code:

. clear

. input str52 addy

                                                     addy
  1. "6675,+Old+Canton+RD,+Ridgeland,+MS,+39157"
  2. "12313,+33RD+Ave+North,+SEATTLE,+wa,+98125"
  3. "310,+Cahir+StreeT,+Providence,+Rhode+Island,+02903"
  4. "22,+Oaklawn+Ave,+Cranston,+RI,+02920"
  5. "61,+pine+st,+Attleboro,+MA,+02703"
  6. "10,+larkspur+R0ad,+Warwick,+ri,+02886"
  7. "91,+FaLLon+Ave,+Providence,+RI,+02908"
  8. "195,+Arlington+AVE,+Providence,+RI,+02906"
  9. "74,+REGENT+aVenuE,+Providence,+RI,+02908"
 10. end

. 
. clonevar addy0 = addy

. 
. * This will replace the first instance of the '+' character with the string passed to the rep argument.
. jregex replace addy, p(`"\+"') rep(`"_this is a replacement string_"') repf

. gen addy2 = ustrregexrf(addy0,`"\+"',`"_this is a replacement string_"')

. assert addy == addy2

. 
. * Now we can replace the replacement string to recreate the original variable.
. jregex replace addy newaddy, rep(`"\+"') p(`"_this is a replacement string_"')

. gen newaddy2 = ustrregexrf(addy0,`"_this is a replacement string_"',`"\+"')

. assert newaddy == newaddy2

. 
. * You can also use POSIX character classes to replace values, like punctuation marks with a single space
. jregex replace newaddy, p(`"\p{Punct}"') rep(" ")

. replace newaddy2 = ustrregexra(newaddy2,`"([:punct:]|\+)"'," ")
(9 real changes made)

. assert newaddy == newaddy2
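One possible explanation for the plus-sign difference, offered as an assumption rather than something confirmed in this thread: a Unicode-aware engine may define its punctuation class by the Unicode general category P (Punctuation), where "+" is classed as a math symbol, while Java's `\p{Punct}` defaults to the ASCII POSIX punctuation set, which does include "+". The Java sketch below only illustrates that distinction within Java itself; the class name `PunctDemo` and its `matches` helper are hypothetical.

```java
import java.util.regex.Pattern;

public class PunctDemo {
    // True when the whole string s matches the given regular expression.
    static boolean matches(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // ASCII POSIX punctuation class: includes '+'.
        System.out.println(matches("\\p{Punct}", "+"));
        // Unicode general category P (Punctuation): does not include '+'.
        System.out.println(matches("\\p{P}", "+"));
        // A comma belongs to both classes.
        System.out.println(matches("\\p{P}", ","));
        // '+' is in the Unicode math symbol category (Sm).
        System.out.println(Character.getType('+') == Character.MATH_SYMBOL);
    }
}
```

Under that reading, adding an explicit `\+` alternative, as in the Stata code above, is what brings the two results into agreement.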

Here's another example, using a repetition quantifier:

Code:

. dis ustrregexrf("year 1968","[0-9]{1,2}","")
year 68

Stata's regular expressions are notoriously under-documented. I recommend trying them first before looking for alternative solutions.
