  • #16
    Just to clarify, as I saw there might be some confusion. Before Stata 14, Stata had only one set of regular expression functions, regexm(), regexr(), and regexs(), which use an implementation of Henry Spencer's NFA algorithm, which in turn is nearly a subset of the POSIX standard. In Stata 14, we added four Unicode regular expression functions: ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs(). These four new functions use the ICU regular expression engine. If you are interested in a comparison of different regular expression engines, see the following wiki page: https://en.wikipedia.org/wiki/Compar...ession_engines

    The obvious question is why we maintain two different sets of functions, especially given that the ICU engine is far superior. The short answer is that the two sets of functions are essentially incompatible, and I suspect Bill Buchanan's -jregex- will suffer the same issue. Let me explain.

    The regex*() set of functions treats a string as a byte stream and is encoding neutral, i.e., it does not assume any encoding of the string; it deals with bytes. The ICU-based ustrregex*() functions, on the other hand, assume the string is UTF-8 encoded. This is because the ICU engine only works with UTF-16 encoded strings, so the original string must be converted before it is passed to ICU. As with any conversion of a string to a particular encoding, a source encoding has to be assumed, and since Stata 14 uses UTF-8, UTF-8 is assumed. The side effect is that if the assumption is wrong, for example if your source string is encoded in Latin-1, then the new ustrregex*() functions will not work: the conversion will lose information from the original string. Another situation is that your original "string" is really a byte stream with no textual meaning at all, for example the byte stream of an image file; the new ustrregex*() functions will not work there either, since the conversion will almost surely destroy the string.

    Hence the two sets of functions deal with different cases: regex*() is for byte streams, ustrregex*() is for text. If your data are all ASCII (English a-z, A-Z, 0-9, and punctuation), both will work, but regex*() is faster and ustrregex*() supports standard regular expression syntax better.
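
    As a rough sketch of this distinction (assuming Stata 14 and a UTF-8 encoded do-file; the example strings are purely illustrative), consider:

    Code:
    di regexm("café", "^caf.$")      // 0: "." matches a single byte, but the UTF-8 "é" is two bytes
    di ustrregexm("café", "^caf.$")  // 1: "." matches the single character "é"
    di regexm("abc", "^ab.$")        // 1: on plain ASCII the two families agree
    di ustrregexm("abc", "^ab.$")    // 1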

    Since Java also uses UTF-16 as its internal String encoding, -jregex- will probably suffer the same issue if it uses the Java String class (this is purely a guess, as I have not had time to play with -jregex-).

    • #17
      Hua Peng (StataCorp) thanks for the additional info. As always, you've shed needed light on the topic. Given what you mentioned, I started digging into the regular expression implementation in Java a bit more this morning. You bring up a valid point about the conversion of the underlying bytes to a string representation with an encoding. That being the case, I'll add some functionality to the classes/methods I use for interfacing with the Java API to handle String encoding/decoding, which, as you state above, depends on there being no loss of information when the values are encoded in UTF-8/UTF-16.

      • #18
        Robert Picard as you quoted, I never suggested that the POSIX square-bracket expressions for character classes were implemented, just that the set of POSIX character classes is available; in either case, your suggestion to make it more explicit that Java uses \p{} to denote character classes instead of [::] is noted and valid (and I'll make changes accordingly to ensure this is clearer in the future). That said, there are flags users can set that make the behavior of -jregex- a bit more transparent. For example, the "^" and "$" metacharacters can have different meanings depending on the underlying engine's treatment of line-terminating characters. In Stas's case in particular this could have meaningful consequences (e.g., would the "$" character stop matching at the end of the input stream, or would it stop matching the first time it encounters a line-terminating character?), and the Javadocs for the Pattern class make this fairly clear:

        "By default, the regular expressions ^ and $ ignore line terminators and only match at the beginning and the end, respectively, of the entire input sequence. If MULTILINE mode is activated then ^matches at the beginning of input and after any line terminator except at the end of input. When in MULTILINE mode $ matches just before a line terminator or the end of the input sequence."

        Similarly, there are also implications for other metacharacters, such as ".", which some users may or may not expect to match any character (regardless of class), but whose behavior can be limited to characters that do not terminate a line:

        "The regular expression . matches any character except a line terminator unless the DOTALL flag is specified."

        The other implication in Stas's case would be the types of characters that are recognized as line terminators:

        "If UNIX_LINES mode is activated, then the only line terminators recognized are newline characters."

        There is a significant amount of overlap between the Java regular expression implementation and that made available by the ustrregex*() functions as noted on the ICU regular expression website:
        • ICU does not support UREGEX_CANON_EQ. See http://bugs.icu-project.org/trac/ticket/9111
        • The behavior of \cx (Control-X) different from Java when x is outside the range A-Z. See http://bugs.icu-project.org/trac/ticket/6068
        • Java allows quantifiers (*, +, etc) on zero length tests. ICU does not. Occurrences of these in patterns are most likely unintended user errors, but it is an incompatibility with Java. http://bugs.icu-project.org/trac/ticket/6080
        • ICU recognizes all Unicode properties known to ICU, which is all of them. Java is restricted to just a few.
        • ICU case insensitive matching works with all Unicode characters, and, within string literals, does full Unicode matching (where matching strings may be different lengths.) Java does ASCII only by default, with Unicode aware case folding available as an option.
        • ICU has an extended syntax for set [bracket] expressions, including additional operators. Added for improved compatibility with the original ICU implementation, which was based on ICU UnicodeSet pattern syntax.
        That said, the biggest difference (other than those noted immediately above) between the wrapper around the Java regular expressions and the native ustrregex*() functions is that the compilation flags that define some of the behaviors (e.g., case-sensitive/insensitive matching) are exposed in the Java wrapper and masked in the native functions. This doesn't mean that comparable behavior can't be achieved, assuming the differences above are the sole differences. For example, the Java documentation mentions that case-insensitive matching can be triggered using what it calls an embedded flag expression, in this case (?i), in the regular expression.
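
        As a minimal sketch of that embedded flag (shown here with Stata 14's ICU-based ustrregexm(), since (?i) is common to the ICU and Java syntaxes):

        Code:
        di ustrregexm("Hello World", "hello")      // 0: matching is case sensitive by default
        di ustrregexm("Hello World", "(?i)hello")  // 1: (?i) switches on case-insensitive matching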

        With regards to splitting strings, the moss program allows a single subexpression to be used/matched, which can absolutely be helpful in many cases. I've not had the chance to work on this yet, but I've attached some code below to generate a relatively messy dataset. Assuming the rule is that the first valid phone number is the true phone number, and a true phone number is defined as (###) ### - ####, solving this problem currently requires resorting to a loop of some sort, generating temporary variables/Mata objects to store intermediate results, using nested condition functions, etc. The way I've been trying to design the interface for -jregex- is to allow users to perform operations like this in a single line of code that can operate over one or more variables simultaneously (the development branch of the repository has some examples of this handling already).

        Code:
        clear
        set obs 1000
        set seed 7779311
        tempvar hasparen hasspace hashyphen area exchange extension areas exchanges extensions
        g `hasparen' = .
        g `hasspace' = .
        g `hashyphen' = .
        g `area' = .
        g `exchange' = .
        g `extension' = .
        g `areas' = ""
        g `exchanges' = ""
        g `extensions' = ""
        
        forv i = 1/3 {
            qui: replace `hasparen' = rbinomial(1, .5)
            qui: replace `hasspace' = rbinomial(1, .5)
            qui: replace `hashyphen' = rbinomial(1, .5)    
            
            qui: replace `area' = int(runiform(1, 999))
            qui: replace `exchange' = int(runiform(1, 999))
            qui: replace `extension' = int(runiform(1, 9999))
            
            qui: replace `areas' = cond(inrange(`area', 1, 9), "  " + strofreal(`area'), ///   
                                   cond(inrange(`area', 10, 99), strofreal(`area') + " ", ///   
                                   strofreal(`area')))
            qui: replace `areas' = "(" + `areas' + ")" if `hasparen' == 1
            
            qui: replace `exchanges' = cond(inrange(`exchange', 1, 9), " " + strofreal(`exchange') + " ", ///   
                                   cond(inrange(`exchange', 10, 99), " " + strofreal(`exchange'), ///   
                                   strofreal(`exchange')))
            
            qui: replace `extensions' = cond(inrange(`extension', 1, 9), " " + strofreal(`extension') + " ", ///   
                                   cond(inrange(`extension', 10, 99), " " + strofreal(`extension'), ///   
                                   cond(inrange(`extension', 100, 999), strofreal(`extension') + " ", ///   
                                   strofreal(`extension'))))
            
            qui: g phone`i' = cond(`hasspace' == 1 & `hashyphen' == 1, ///   
                            `areas' + " " + `exchanges' + " - " + `extensions', ///   
                            cond(`hasspace' == 0 & `hashyphen' == 1, ///   
                            `areas' + `exchanges' + "-" + `extensions', ///   
                            cond(`hasspace' == 1 & `hashyphen' == 0, ///   
                            `areas' + " " + `exchanges' + " " + `extensions', ///   
                            `areas' + `exchanges' + `extensions')))
        }
        It is definitely an arbitrary example, but unfortunately it is still orders of magnitude cleaner than similar data stored in the data systems used by many school districts and state education agencies around the US. While I'd love to see validation rules added to the input side of these systems, in many cases they allow end users to push anything (especially in string fields) into the data system, which in some cases will contain multiple columns across one or more tables that store the same/similar data. If your immediate thought is to handle it server-side with regular expressions, I completely agree... but SQL Server (which happens to be a favorite in the K-12 education sector in the US for some reason) doesn't implement regular expressions natively (at least in versions 2005 and 2008), so it becomes useful to both save keystrokes and implement the business logic consistently across multiple fields in a single pass. As I get around to adding additional functionality (which, like the moss program, is a wrapper of sorts), I'll make sure to be more explicit about differences between any functionality it provides and that of native Stata and user-written programs with which I am familiar.

        • #19
          I have nothing to contribute to the dialog about the various approaches to handling this problem. All I have to add is that, from this user's perspective, it is a very pressing problem indeed. As an epidemiologist, I often work with data from clinical records. Before electronic health records (EHRs) became widespread, this data was typically abstracted from handwritten records. While that was labor-intensive for the research assistants, we would ordinarily have them input the data into a database that contained extensive front-end validation of entries, so by the time the data got to me it was usually in reasonable shape and needed only reasonable amounts of effort to clean up.

          Today, much clinical data is taken from EHR databases. The condition of this data is deplorable, almost beyond description. While it is understandable that free-form text input is needed to describe many things in an EHR, you might expect that standard items like drug names, routine laboratory tests, and units of measurement would be handled with drop-down menus or subject to front-end validation. You would be wrong. As a result, we now deal with data sets that are much larger than we could get before, but disproportionately more labor-intensive to clean. The kind of data that Bill Buchanan's code in #18 generates is quite typical of what we now routinely get from EHRs. Having tools that can clean this up more efficiently than Stata's current string-handling functions would be a blessing indeed!

          • #20
            Clyde Schechter at least I don't feel quite as bad about the quality of education data. I've come across cases in the data system where I work where a phone number would be something along the lines of "(Call this person if child is sick and needs to be picked up from school)". Addresses and names are even worse (unfortunately). There are a few different natural language processing and search algorithms (Lucene is a good example of the latter) that are designed to make string processing and searching easier. Not sure if you use strLs regularly, but they'd be particularly well suited to operations on those types of data (we end up having a ton of data like that when there are open text fields used to describe behavioral incidents).

            • #21
              I definitely sympathize with the call for additional validation at the data input stage.

              Unless I'm missing something, the phone number problem proposed in #18 is not a hard one and can be easily disposed of with a single line of code per variable.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input str16(phone1 phone2 phone3)
              "(706)332-9739"  "735 578 - 674 " "765 60-6789"    
              "605 801 8928"   "(227) 385 3769" "(944)5239383"    
              "(647) 425120"   "(98 ) 246674"   "(522)829-8615"  
              "12  841 - 4162" "(656) 919 8420" "(803) 436 - 483 "
              "(795)450-6874"  "1108218003"     "1308484948"      
              end
              
              gen phone1_clean = ustrregexra(phone1,"[^0-9]","")
              gen phone2_clean = ustrregexra(phone2,"[^0-9]","")
              gen phone3_clean = ustrregexra(phone3,"[^0-9]","")

              Code:
              . list
              
                   +-------------------------------------------------------------------------------------------+
                   |         phone1           phone2             phone3   phone1_c~n   phone2_c~n   phone3_c~n |
                   |-------------------------------------------------------------------------------------------|
                1. |  (706)332-9739   735 578 - 674         765 60-6789   7063329739    735578674    765606789 |
                2. |   605 801 8928   (227) 385 3769       (944)5239383   6058018928   2273853769   9445239383 |
                3. |   (647) 425120     (98 ) 246674      (522)829-8615    647425120     98246674   5228298615 |
                4. | 12  841 - 4162   (656) 919 8420   (803) 436 - 483     128414162   6569198420    803436483 |
                5. |  (795)450-6874       1108218003         1308484948   7954506874   1108218003   1308484948 |
                   +-------------------------------------------------------------------------------------------+
              You can then strip the leading one and even match the data to known area codes if desired. The length of the phone number will also identify invalid numbers. All pretty easy to do in Stata. Of course no program will be able to correct phone numbers that have missing digits.
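
              As a hedged sketch of those follow-up steps for one variable (the 10/11-digit rule below is an assumption about North American numbers, not something stated in this thread; repeat for phone2/phone3 as needed):

              Code:
              gen byte phone1_ok = strlen(phone1_clean) == 10 | ///
                  (strlen(phone1_clean) == 11 & substr(phone1_clean, 1, 1) == "1")
              replace phone1_clean = substr(phone1_clean, 2, .) ///
                  if strlen(phone1_clean) == 11 & substr(phone1_clean, 1, 1) == "1"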
              Last edited by Robert Picard; 21 Feb 2016, 14:37.

              • #22
                Robert Picard your regular expression doesn't match the business rules described in the previous example. A valid phone number must have the form (###) ### - ####, and you must select the first occurrence of a valid phone number among the three variables. The simplest one-line solution I can think of (using the simulated data from the example above) is:

                Code:
                g clnphone = ustrregexs(1) if ustrregexm(phone1 + phone2 + phone3, "(\([0-9]{3}\) [0-9]{3} - [0-9]{4})")
                If you aren't lucky enough to have Stata 14, the implementation starts to become fairly messy:

                Code:
                g clnphone2 = regexs(1) if regexm(phone1 + phone2 + phone3, "(\([0-9][0-9][0-9]\) [0-9][0-9][0-9] - [0-9][0-9][0-9][0-9])")
                So a single API that provides consistent access to a regular expression engine, and that can be made available to users of Stata 13, still provides value to members of the Stata community. Since a fair number of my colleagues have access to Stata 13 but not Stata 14, it is still a contribution that others should find helpful and useful.
                Last edited by wbuchanan; 22 Feb 2016, 02:49.

                • #23
                  Sorry, I completely misunderstood the exercise and the tight bounds on the solution. Here's how I would deal with it. I don't understand why you want to reject numbers because of missing/extra spaces, so I made the following solution more flexible: I target each part of the phone number and then standardize to the desired format. Also, since you seem to prefer fewer keystrokes, I used a local macro to reduce the "mess" a bit.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str17 phone1 str14 phone2 str16 phone3
                  "(706)332-9739"     "735 578 - 674 " "765 60-6789"     
                  "605 801 8928"      "(227) 385 3769" "(944)5239383"    
                  "(647) 425120"      "(98 ) 246674"   "(522)829-8615"   
                  "12  841 - 4162"    "(656) 919 8420" "(803) 436 - 483 "
                  "(795)450-6874"     "1108218003"     "1308484948"      
                  "(795) 450 - 68745" "1108218003"     "1308484948"      
                  "1108218003"        "(795) 450 - "   "1308484948"      
                  end
                  
                  g clnphone = regexs(1) if regexm(phone1 + phone2 + phone3, "(\([0-9][0-9][0-9]\) [0-9][0-9][0-9] - [0-9][0-9][0-9][0-9])")
                  
                  local d3 "[0-9][0-9][0-9]"
                  gen firstgood = ""
                  foreach v of varlist phone1 phone2 phone3 {
                      replace firstgood = regexs(1) + " " + regexs(2) + " - " + regexs(3) ///
                          if regexm(`v', "^(\(`d3'\)) *(`d3') *- *(`d3'[0-9])$") & mi(firstgood)
                  }
                  list
                  Code:
                  . list
                  
                       +---------------------------------------------------------------------------------------------+
                       |            phone1           phone2             phone3           clnphone          firstgood |
                       |---------------------------------------------------------------------------------------------|
                    1. |     (706)332-9739   735 578 - 674         765 60-6789                      (706) 332 - 9739 |
                    2. |      605 801 8928   (227) 385 3769       (944)5239383                                       |
                    3. |      (647) 425120     (98 ) 246674      (522)829-8615                      (522) 829 - 8615 |
                    4. |    12  841 - 4162   (656) 919 8420   (803) 436 - 483                                        |
                    5. |     (795)450-6874       1108218003         1308484948                      (795) 450 - 6874 |
                       |---------------------------------------------------------------------------------------------|
                    6. | (795) 450 - 68745       1108218003         1308484948   (795) 450 - 6874                    |
                    7. |        1108218003     (795) 450 -          1308484948   (795) 450 - 1308                    |
                       +---------------------------------------------------------------------------------------------+
                  As you can see, you need to refine your solution: it can find matches when there are extra digits (obs 6), and matches can extend into the next phone number (obs 7). You don't show how -jregex- would handle this problem, so I can't comment on the benefit of a "single API that provides consistent access to a regular expression engine" versus a standard Stata solution.
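
                  As a hedged illustration of one possible refinement of the concatenation-based one-liner (it assumes a separator such as ";" never occurs in the raw values), marking the variable boundaries keeps a match from spilling into the next phone number or accepting trailing extra digits:

                  Code:
                  g clnphone3 = ustrregexs(1) if ustrregexm(";" + phone1 + ";" + phone2 + ";" + phone3 + ";", ///
                      ";(\([0-9]{3}\) [0-9]{3} - [0-9]{4});")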

                  Yes, bringing the equivalent of the more powerful Stata 14 regular expression functions to pre-14 users is certainly a useful endeavor. At this point I'm bailing out of this thread as I have better things to do.

                  • #24
                    Originally posted by wbuchanan:
                    A valid phone number must have the form (###) ### - #### and you must select the first occurrence of a valid phone number among the three variables.
                    This may be off-topic, but I would be grateful if you could clarify something. Are you saying that a valid US phone number must be written as (###) ### - ####, or is this a purely hypothetical example? According to ITU recommendation E.123 ("Notation for national and international telephone numbers, e-mail addresses and web addresses"), hyphens should not be part of a telephone number.

                    Grouping of digits in a telephone number should be accomplished by means of spaces unless an agreed upon explicit symbol (e.g. hyphen) is necessary for procedural purposes.
                    Even if we ignore that recommendation, a number written as ### ### ####, for example, would certainly be valid.

                    • #25
                      Friedrich Huebler It is a hypothetical example, but not too dissimilar from other business rules that I've had to implement in the past. If you're pulling data from a large data system, then values not conforming to some predefined pattern could be indicative of bugs that caused the pattern not to be enforced. A phone number was also the quickest thing I could throw together into an example. The standard you referenced also indicates that hyphens are acceptable:

                      9.1 Grouping of digits in a telephone number should be accomplished by means of spaces unless an agreed upon explicit symbol (e.g. hyphen) is necessary for procedural purposes. Only spaces should be used in an international number.
                      My statement was less about any data standard and more about the enforcement of a business rule (e.g., operational definition) that could be simplified by operating over multiple variables simultaneously (and without having to explicitly run things in a loop or concatenate the data prior to calling the function/command).
