
  • wbuchanan
    started a topic New program for regular expressions

    New program for regular expressions

    There isn't too much to it at the moment, but I just put together a quick regular expression replace function (it replaces either the first occurrence or all occurrences, depending on optional arguments). Unlike the native regular expression functions in Stata, -jregex- uses the regular expression capabilities available in Java. The biggest difference users may notice between this program and the native Stata functions is the ability to use the POSIX character classes (e.g., \p{Alpha}, \p{Punct}, etc.), counting quantifiers (e.g., {2,3} matches at least twice but not more than three times), named groups (you can name subexpressions and reference them later by name in addition to the typical $1, $2 group indicators), and several other features that are available in Java. You can find more information about the Java implementation of regular expressions by reading the Pattern API Javadocs. To install the program use:

    Code:
    net inst jregex, from("http://wbuchanan.github.io/StataRegex/")
    The current plan is to implement all functionality using a single API with subcommands. The replace functionality is called with

    Code:
    jregex replace ...
    This provides a way to replace the values in place (e.g., pass a single variable after the replace subcommand) or to place the new values into a new variable (by passing an existing and a new variable name after the replace subcommand). You can find a few examples of how it can be used on the program's project page.
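To make the Java-side features mentioned above concrete (named groups, \p{} character classes, counting quantifiers), here is a minimal standalone Java sketch using the java.util.regex Pattern API directly rather than -jregex-; the pattern and input are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFeatures {
    public static void main(String[] args) {
        // \p{Digit} is a POSIX-style class, {3}/{4} are counting quantifiers,
        // and (?<area>...) creates a named group
        Pattern p = Pattern.compile("\\((?<area>\\p{Digit}{3})\\) \\p{Digit}{3} - \\p{Digit}{4}");
        Matcher m = p.matcher("(706) 332 - 9739");
        if (m.matches()) {
            // Reference the subexpression by name instead of by index
            System.out.println(m.group("area")); // prints "706"
        }
    }
}
```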

  • wbuchanan
    replied
    Friedrich Huebler It is a hypothetical example, but not too dissimilar from other business rules that I've had to implement in the past. If you're pulling data from a large data system then values not conforming to some predefined pattern could be indicative of bugs that caused the pattern to not be enforced. A phone number was also the quickest thing I could throw together into an example. The standard you referenced also indicates that hyphens are acceptable:

    9.1 Grouping of digits in a telephone number should be accomplished by means of spaces unless an agreed upon explicit symbol (e.g. hyphen) is necessary for procedural purposes. Only spaces should be used in an international number.
    My statement was less about any data standard and more about the enforcement of a business rule (e.g., operational definition) that could be simplified by operating over multiple variables simultaneously (and without having to explicitly run things in a loop or concatenate the data prior to calling the function/command).



  • Friedrich Huebler
    replied
    Originally posted by wbuchanan View Post
    A valid phone number must have the form (###) ### - #### and you must select the first occurrence of a valid phone number among the three variables.
    This may be off-topic but I would be grateful if you could clarify something. Are you saying that a valid US phone number must be written as (###) ### - #### or is this a purely hypothetical example? According to ITU recommendation E.123 ("Notation for national and international telephone numbers, e-mail addresses and web addresses") hyphens should not be part of a telephone number.

    Grouping of digits in a telephone number should be accomplished by means of spaces unless an agreed upon explicit symbol (e.g. hyphen) is necessary for procedural purposes.
    Even if we ignore that recommendation a number written as ### ### ####, for example, would certainly be valid.



  • Robert Picard
    replied
    Sorry, I completely misunderstood the exercise and the tight bounds for the solution. Here's how I would deal with it. I don't understand why you want to reject numbers because of missing/extra spaces so I made the following solution more flexible. I target each part of the phone number and then standardize to the desired format. Also, since you seem to prefer fewer keystrokes, I used a local macro to reduce the "mess" a bit.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str17 phone1 str14 phone2 str16 phone3
    "(706)332-9739"     "735 578 - 674 " "765 60-6789"     
    "605 801 8928"      "(227) 385 3769" "(944)5239383"    
    "(647) 425120"      "(98 ) 246674"   "(522)829-8615"   
    "12  841 - 4162"    "(656) 919 8420" "(803) 436 - 483 "
    "(795)450-6874"     "1108218003"     "1308484948"      
    "(795) 450 - 68745" "1108218003"     "1308484948"      
    "1108218003"        "(795) 450 - "   "1308484948"      
    end
    
    g clnphone = regexs(1) if regexm(phone1 + phone2 + phone3, "(\([0-9][0-9][0-9]\) [0-9][0-9][0-9] - [0-9][0-9][0-9][0-9])")
    
    local d3 "[0-9][0-9][0-9]"
    gen firstgood = ""
    foreach v of varlist phone1 phone2 phone3 {
        replace firstgood = regexs(1) + " " + regexs(2) + " - " + regexs(3) ///
            if regexm(`v', "^(\(`d3'\)) *(`d3') *- *(`d3'[0-9])$") & mi(firstgood)
    }
    list
    Code:
    . list
    
         +---------------------------------------------------------------------------------------------+
         |            phone1           phone2             phone3           clnphone          firstgood |
         |---------------------------------------------------------------------------------------------|
      1. |     (706)332-9739   735 578 - 674         765 60-6789                      (706) 332 - 9739 |
      2. |      605 801 8928   (227) 385 3769       (944)5239383                                       |
      3. |      (647) 425120     (98 ) 246674      (522)829-8615                      (522) 829 - 8615 |
      4. |    12  841 - 4162   (656) 919 8420   (803) 436 - 483                                        |
      5. |     (795)450-6874       1108218003         1308484948                      (795) 450 - 6874 |
         |---------------------------------------------------------------------------------------------|
      6. | (795) 450 - 68745       1108218003         1308484948   (795) 450 - 6874                    |
      7. |        1108218003     (795) 450 -          1308484948   (795) 450 - 1308                    |
         +---------------------------------------------------------------------------------------------+
    As you can see, you need to refine your solution as it can find matches when there are extra digits (obs 6) and matches can extend into the next phone number (obs 7). You don't show how jregex would handle this problem so I can't comment on the benefit of a "single API that provides consistent access to a regular expression engine" versus a standard Stata solution.

    Yes, bringing the equivalent of the more powerful Stata 14 regular expression functions to pre-14 users is certainly a useful endeavor. At this point I'm bailing out of this thread as I have better things to do.



  • wbuchanan
    replied
    Robert Picard your regular expression doesn't match the business rules described in the previous example. A valid phone number must have the form (###) ### - #### and you must select the first occurrence of a valid phone number among the three variables. The simplest one line solution I can think of (using the simulated data from the example above) is:

    Code:
    g clnphone = ustrregexs(1) if ustrregexm(phone1 + phone2 + phone3, "(\([0-9]{3}\) [0-9]{3} - [0-9]{4})")
    If you aren't lucky enough to have Stata 14, the implementation starts to become fairly messy:

    Code:
    g clnphone2 = regexs(1) if regexm(phone1 + phone2 + phone3, "(\([0-9][0-9][0-9]\) [0-9][0-9][0-9] - [0-9][0-9][0-9][0-9])")
    So, having a single API that provides consistent access to a regular expression engine that can be made available to users of Stata 13 still provides value to members of the Stata community. Since a fair number of my colleagues have access to Stata 13, but not Stata 14 it is still a contribution that others find helpful and useful.
    Last edited by wbuchanan; 22 Feb 2016, 02:49.



  • Robert Picard
    replied
    I definitely sympathize with the call for additional validation at the data input stage.

    Unless I'm missing something, the phone number problem proposed in #18 is not a hard one and can be easily disposed of with a single line of code per variable.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str16(phone1 phone2 phone3)
    "(706)332-9739"  "735 578 - 674 " "765 60-6789"    
    "605 801 8928"   "(227) 385 3769" "(944)5239383"    
    "(647) 425120"   "(98 ) 246674"   "(522)829-8615"  
    "12  841 - 4162" "(656) 919 8420" "(803) 436 - 483 "
    "(795)450-6874"  "1108218003"     "1308484948"      
    end
    
    gen phone1_clean = ustrregexra(phone1,"[^0-9]","")
    gen phone2_clean = ustrregexra(phone2,"[^0-9]","")
    gen phone3_clean = ustrregexra(phone3,"[^0-9]","")

    Code:
    . list
    
         +-------------------------------------------------------------------------------------------+
         |         phone1           phone2             phone3   phone1_c~n   phone2_c~n   phone3_c~n |
         |-------------------------------------------------------------------------------------------|
      1. |  (706)332-9739   735 578 - 674         765 60-6789   7063329739    735578674    765606789 |
      2. |   605 801 8928   (227) 385 3769       (944)5239383   6058018928   2273853769   9445239383 |
      3. |   (647) 425120     (98 ) 246674      (522)829-8615    647425120     98246674   5228298615 |
      4. | 12  841 - 4162   (656) 919 8420   (803) 436 - 483     128414162   6569198420    803436483 |
      5. |  (795)450-6874       1108218003         1308484948   7954506874   1108218003   1308484948 |
         +-------------------------------------------------------------------------------------------+
    You can then strip the leading one and even match the data to known area codes if desired. The length of the phone number will also identify invalid numbers. All pretty easy to do in Stata. Of course no program will be able to correct phone numbers that have missing digits.
    Last edited by Robert Picard; 21 Feb 2016, 14:37.



  • wbuchanan
    replied
    Clyde Schechter at least I don't feel quite as bad about the quality of education data. I've come across cases in the data system where I work in which a phone number would be something along the lines of "(Call this person if child is sick and needs to be picked up from school)". Addresses and names are even worse (unfortunately). There are a few different natural language processing and search tools (Lucene is a good example of the latter) that are designed to make string processing and searching easier. Not sure if you use strLs regularly, but they'd be particularly well suited for operations on those types of data (we end up having a ton of data like that when there are open text fields used to describe behavioral incidents).



  • Clyde Schechter
    replied
    I have nothing to contribute to the dialog about the various approaches to handling this problem. All I have to add is that, from this user's perspective, it is a very pressing problem indeed. As an epidemiologist, I often work with data from clinical records. Before electronic health records (EHRs) became widespread, this data was typically abstracted from handwritten records. While that was labor-intensive for the research assistants, we would ordinarily have them input the data into a database that contained extensive front-end validation of entries, so by the time the data got to me it was usually in reasonable shape and needed only reasonable amounts of effort to clean up.

    Today, much clinical data is taken from the EHR databases. The condition of this data is deplorable, almost beyond description. While it is understandable that free-form text input is needed for the description of many things in an EHR, you might expect that standard things like drug names, routine laboratory tests, and units of measurement would be handled with drop-down menus or subject to front-end validation. You would be wrong. As a result, we now deal with data sets that are much larger than we could get before, but disproportionately more labor-intensive to clean. The kind of data that Bill Buchanan's code in #18 generates is quite typical of what we now routinely get from EHRs. Having tools that can clean this up more efficiently than Stata's current string-handling functions would be a blessing indeed!



  • wbuchanan
    replied
    Robert Picard as you quoted, I never suggested that the POSIX square bracket expressions for character classes were implemented, but just that the set of POSIX character classes are available; in either case, your suggestion on making it more explicit that Java uses \p{} to denote character classes instead of [::] is noted and valid (and I'll make changes accordingly to ensure this is clearer in the future). That said, there are flags that users can set that make the behavior in -jregex- a bit more transparent. For example, the "^" and "$" meta characters can have different meanings depending on the underlying engine's treatment of line terminating characters. In Stas's case - in particular - this could have meaningful consequences (e.g., would the "$" character stop matching at the end of the input stream or would it stop matching the first time it encounters a line terminating character?) and the references to the Javadocs for the Pattern class make this fairly clear:

    "By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^ matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence."

    Similarly, there are also implications for other metacharacters such as the "." character which some users may or may not expect to match any character (regardless of class), but have an option to limit this behavior to only characters that do not terminate a line:

    "The regular expression . matches any character except a line terminator unless the DOTALL flag is specified."

    The other implication in Stas's case would be the types of characters that are recognized as line terminating:

    "If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters."
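The MULTILINE and DOTALL behaviors quoted above can be seen with a minimal standalone Java sketch (the input string is illustrative only):

```java
import java.util.regex.Pattern;

public class FlagDemo {
    public static void main(String[] args) {
        String s = "one\ntwo";
        // Default: $ matches only at the end of the entire input sequence
        System.out.println(Pattern.compile("one$").matcher(s).find());                    // false
        // MULTILINE: $ also matches just before each line terminator
        System.out.println(Pattern.compile("one$", Pattern.MULTILINE).matcher(s).find()); // true
        // Default: . does not match a line terminator ...
        System.out.println(Pattern.compile("one.two").matcher(s).find());                 // false
        // ... unless DOTALL is specified
        System.out.println(Pattern.compile("one.two", Pattern.DOTALL).matcher(s).find()); // true
    }
}
```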

    There is a significant amount of overlap between the Java regular expression implementation and that made available by the ustrregex*() functions as noted on the ICU regular expression website:
    • ICU does not support UREGEX_CANON_EQ. See http://bugs.icu-project.org/trac/ticket/9111
    • The behavior of \cx (Control-X) different from Java when x is outside the range A-Z. See http://bugs.icu-project.org/trac/ticket/6068
    • Java allows quantifiers (*, +, etc) on zero length tests. ICU does not. Occurrences of these in patterns are most likely unintended user errors, but it is an incompatibility with Java. http://bugs.icu-project.org/trac/ticket/6080
    • ICU recognizes all Unicode properties known to ICU, which is all of them. Java is restricted to just a few.
    • ICU case insensitive matching works with all Unicode characters, and, within string literals, does full Unicode matching (where matching strings may be different lengths.) Java does ASCII only by default, with Unicode aware case folding available as an option.
    • ICU has an extended syntax for set [bracket] expressions, including additional operators. Added for improved compatibility with the original ICU implementation, which was based on ICU UnicodeSet pattern syntax.
    That said, the biggest difference (other than those noted immediately above) between the wrapper around the Java regular expressions and the native ustrregex*() functions is that the compilation flags that define some of the behaviors (e.g., case sensitive/insensitive matching) are exposed for the Java wrapper and masked in the native functions. This doesn't mean that comparable behavior can't be achieved, assuming the differences above are the sole differences. For example, the Java documentation mentions that case insensitive matching can be triggered using what it calls an embedded flag expression - in this case (?i) - in the regular expression.
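A minimal standalone Java sketch of the embedded-flag equivalence just mentioned (the pattern and input are illustrative only):

```java
import java.util.regex.Pattern;

public class EmbeddedFlag {
    public static void main(String[] args) {
        // (?i) inside the pattern is equivalent to compiling with CASE_INSENSITIVE
        System.out.println(Pattern.matches("(?i)stata", "STATA"));  // true
        System.out.println(Pattern.compile("stata", Pattern.CASE_INSENSITIVE)
                                  .matcher("STATA").matches());     // true
        // Without the flag, matching is case sensitive
        System.out.println(Pattern.matches("stata", "STATA"));      // false
    }
}
```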

    With regards to splitting strings, the moss program allows a single subexpression to be used/matched, which can absolutely be helpful in many cases. I've not had the chance to work on this yet, but I've attached some code below to generate a relatively messy dataset. Assuming the rule is that the first valid phone number is the true phone number and a true phone number is defined as (###) ### - ####, solving this problem currently requires resorting to a loop of some sort, generating temporary variables/Mata objects to store intermediate results, using nested condition functions, etc. The way I've been trying to design the interface for -jregex- is to allow users to perform operations like this in a single line of code that is capable of operating over one or more variables simultaneously (the development branch of the repository has some examples of this handling already).

    Code:
    clear
    set obs 1000
    set seed 7779311
    tempvar hasparen hasspace hashyphen area exchange extension areas exchanges extensions
    g `hasparen' = .
    g `hasspace' = .
    g `hashyphen' = .
    g `area' = .
    g `exchange' = .
    g `extension' = .
    g `areas' = ""
    g `exchanges' = ""
    g `extensions' = ""
    
    forv i = 1/3 {
        qui: replace `hasparen' = rbinomial(1, .5)
        qui: replace `hasspace' = rbinomial(1, .5)
        qui: replace `hashyphen' = rbinomial(1, .5)    
        
        qui: replace `area' = int(runiform(1, 999))
        qui: replace `exchange' = int(runiform(1, 999))
        qui: replace `extension' = int(runiform(1, 9999))
        
        qui: replace `areas' = cond(inrange(`area', 1, 9), "  " + strofreal(`area'), ///   
                               cond(inrange(`area', 10, 99), strofreal(`area') + " ", ///   
                               strofreal(`area')))
        qui: replace `areas' = "(" + `areas' + ")" if `hasparen' == 1
        
        qui: replace `exchanges' = cond(inrange(`exchange', 1, 9), " " + strofreal(`exchange') + " ", ///   
                               cond(inrange(`exchange', 10, 99), " " + strofreal(`exchange'), ///   
                               strofreal(`exchange')))
        
        qui: replace `extensions' = cond(inrange(`extension', 1, 9), " " + strofreal(`extension') + " ", ///   
                               cond(inrange(`extension', 10, 99), " " + strofreal(`extension'), ///   
                               cond(inrange(`extension', 100, 999), strofreal(`extension') + " ", ///   
                               strofreal(`extension'))))
        
        qui: g phone`i' = cond(`hasspace' == 1 & `hashyphen' == 1, ///   
                        `areas' + " " + `exchanges' + " - " + `extensions', ///   
                        cond(`hasspace' == 0 & `hashyphen' == 1, ///   
                        `areas' + `exchanges' + "-" + `extensions', ///   
                        cond(`hasspace' == 1 & `hashyphen' == 0, ///   
                        `areas' + " " + `exchanges' + " " + `extensions', ///   
                        `areas' + `exchanges' + `extensions')))
    }
    It is definitely an arbitrary example, but unfortunately this is still orders of magnitude cleaner than similar data stored in the data systems used by many school districts and state education agencies around the US. While I'd love to see validation rules added to the input side of these systems, in many cases they allow end users to push anything (especially in string fields) into the data system which in some cases will contain multiple columns across one or more tables that store the same/similar data. If your immediate thought is to handle it server side using regular expressions I completely agree...but SQL Server (which happens to be a favorite in the K-12 education sector in the US for some reason) doesn't implement regular expressions natively (at least in versions 2005 and 2008) so it becomes useful to both save keystrokes and implement the business logic consistently across multiple fields in a single pass. As I get around to adding additional functionality (which like the moss program is a wrapper of sorts) I'll make sure to be more explicit about differences between any functionality it provides and that of native Stata and user-written programs with which I am familiar.



  • wbuchanan
    replied
    Hua Peng (StataCorp) thanks for the additional info. As always, you've definitely shed needed light on a topic. Given what you mentioned, I started digging into the regular expression implementation in Java a bit more this morning. You definitely bring up a valid point about the conversion of the underlying bytes to a string representation with an encoding. That being the case, I'll end up adding some functionality to the classes/methods I use for interfacing with the Java API to handle String encoding/decoding, which as you state above depends on there being no loss of information when the values are encoded in UTF-8/UTF-16.



  • Hua Peng (StataCorp)
    replied
    Just to clarify, as I saw there might be some confusion. Before Stata 14, Stata had only one set of regular expression functions, regexm(), regexr(), and regexs(), which uses an implementation of Henry Spencer's NFA algorithm, which in turn is nearly a subset of the POSIX standard. In Stata 14, we added 4 Unicode regular expression functions: ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(). These 4 new functions use the ICU regular expression engine. If you are interested in a comparison of different regular expression engines, see the following wiki page: https://en.wikipedia.org/wiki/Compar...ession_engines

    The obvious question is why maintain two different sets of functions, especially given that the ICU engine is far superior. The short answer is that the two sets of functions are essentially incompatible. And I suspect Bill Buchanan's jregex will suffer the same issue. Let me explain.

    The regex*() functions treat a string as a byte stream and are encoding neutral, i.e., they do not assume any encoding of the string; they deal with bytes. On the other hand, the ICU-based ustrreg*() functions assume the string is UTF-8 encoded. This is because the ICU engine only works with UTF-16 encoded strings, hence a conversion of the original string must be performed before passing it to ICU. As with any conversion of a string to a particular encoding, you have to assume a source encoding. Since Stata 14 uses UTF-8 encoding, UTF-8 is assumed. The side effect is that if the assumption is wrong, for example, if your source string is encoded in Latin-1, then the new ustrreg*() functions will not work: the conversion will lose information from the original string. Another situation is that your original "string" is really a byte stream with no text meaning at all, for example, a byte stream of an image file; the new ustrreg*() functions will not work there either, since the conversion will almost surely destroy the string.

    Hence the two sets of functions basically deal with different cases: regex*() are for byte streams, ustrregex*() are for text. If your data is all ASCII (English a-z, A-Z, 0-9, and punctuation), both will work. But regex*() are faster, while ustrreg*() support standard regular expression syntax better.
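Purely as an illustration of the lossy decoding described above (this is not jregex code; the byte value is illustrative), a minimal standalone Java sketch:

```java
import java.nio.charset.StandardCharsets;

public class EncodingLoss {
    public static void main(String[] args) {
        // 0xE9 is 'é' in Latin-1, but on its own it is not a valid UTF-8 sequence
        byte[] latin1Bytes = {(byte) 0xE9};
        // Decoding under the wrong assumed encoding substitutes U+FFFD ...
        String decoded = new String(latin1Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("\uFFFD")); // true
        // ... so re-encoding cannot recover the original byte
        byte[] back = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(back.length); // 3 bytes: the UTF-8 form of U+FFFD
    }
}
```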

    Since Java uses UTF-16 as its internal String encoding as well, -jregex- will probably suffer the same issue if it uses the Java String class (this is purely a guess, as I have not had time to play with jregex).



  • William Lisowski
    replied
    Robert, many thanks. Somewhere I gained a misunderstanding of the workings of Unicode and thought it required a minimum of 2 bytes per character.



  • Robert Picard
    replied
    William, Unicode is a superset of plain ASCII so the new Unicode functions can be used without fear on plain ASCII strings in Stata 14. The old versions remain for backwards compatibility and will work as before on characters 128-255 and continue to match based on byte values. The new Unicode functions perform character-based matches (which involve multiple bytes in UTF-8 for any non-ASCII character, and each byte that forms the multi-byte character is > 127).



  • Robert Picard
    replied
    wbuchanan, unfortunately, you relied on an old FAQ (circa 2005) written long before Stata 14. You made a mistake when you assumed that the new Unicode regular expression functions would provide no additional functionality.

    As I said in my original post in #2:
    Stata's regular expressions are notoriously under-documented.
    I do not know of any reference that provides the details of Stata's new Unicode regular expression functions. Let me also make clear that I have no inside knowledge of the inner workings of Stata. And I am not employed or affiliated with StataCorp (nor have I ever been). I credit Dimitriy V. Masterov as the first who noticed the extra functionality on Stack Overflow in Sep. 2015.

    You appear to have misread my comments in #2. I said

    Stata uses the standard POSIX bracket expressions while Java uses the \p syntax.
    I could not have said that the full POSIX standard is implemented because I don't know what is implemented. All I did was replicate the examples in your help file using Stata's regular expression functions. How did I manage to do that without documentation? I went to Regular-Expressions.info, looked up the POSIX character classes, and tried them out on your examples.

    The bottom line is that you wrote a program to perform regular expressions in Java and I pointed out that your statement in #1

    The biggest difference that users may notice between how this program and the native Stata functions work is the ability to use the POSIX character classes (e.g., \p{Alpha}, \p{Punct}, etc...), conditional/counting meta characters (e.g., {2, 3} match at least twice but not more than 3 times)
    is incorrect.

    Since Java does not support POSIX bracket expressions (but does support POSIX character classes using the \p operator), I think it would be more accurate to say that jregex allows the user to specify a regular expression pattern using Java syntax. If your program can do something that can't be done as easily in Stata, that's fine too, but it's up to you to provide examples.

    With respect to your question on efficiency: since I believe that jregex is of limited utility, I have no interest in working for you to benchmark it. I've already spent more time than I care to trying to correct your mischaracterizations of Unicode and regular expression support in Stata, in this thread and in response to Stas yesterday. I comment only because I strongly believe that the language and style you used could easily misdirect Statalist readers away from a perfectly simple and correct solution in Stata. The burden is on you to explain the added value of a trip to Java.

    Finally, let me reiterate that if you intend to "improve" your program by including the capability of splitting strings based on regular expressions and/or retrieval of groups, this is functionality that has been available for years via moss (from SSC). I'm not saying that you can't go that way, just don't say that this would bring new functionality to Stata users.



  • wbuchanan
    replied
    You should still be able to use the Unicode regular expression functions with the ASCII character set without issues (other than using different function names). While I definitely like the new regular expression functions, breaking the regular expression API by defining new functions in a different namespace, rather than using an option to specify ASCII vs Unicode, makes it a bit more difficult to use the functionality within the context of existing code. If an additional parameter had been added to the regex API with a default value that retains the ASCII behavior, it would be nicer from the perspective of maintaining code. That said, one of the deliberate decisions I made was to make the jregex command the same for all cases. Different handling/mode options get toggled and have defaults set so users don't need to worry about setting the values explicitly. I have already started implementing functionality that operates over multiple variables simultaneously and will add an option to allow using the same interface to define sequential behavior (to reduce the need to call loops and/or concatenate strings manually). It's fine if this isn't something that others need/want to use, but it is an option that is available, and when I get around to refactoring some of the code I use to access the data it will be accessible to users of Stata 13 as well.

