Help with extracting strings using regular expressions

Gary Richardson

Join Date: Jul 2019

Posts: 10
#1

14 Feb 2020, 23:15

Hi Statalist.

I am trying to use regexs and regexm to extract information from strings. A sample is below. The strings contain location or town names, which always come first, and firm names, which always come second. The town and firm names are usually separated by a comma, but could be separated by any punctuation character. Infrequently, punctuation appears before the town name. The command that I wrote extracts only part of the town name:

gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

Please let me know if you have an idea about how I could extract the entire town name.

Sample data, code, and output appear below.

Thanks

Gary

input str60 LocationFirm
"Albertville,First*................"
"Albertville,Albertville-"
"Anniston.Anniston—--"
"Anniston;Commercial-."
"Anniston^Blender-."
"Decatur,MorganCounty"
"..Decatur,Jupiter"
end
gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")
list city

. clear

. do "C:\Users\garyr\AppData\Local\Temp\STD462c_000000.tmp"

. input str60 LocationFirm

LocationFirm
1. "Albertville,First*................"
2. "Albertville,Albertville-"
3. "Anniston.Anniston—--"
4. "Anniston;Commercial-."
5. "Anniston^Blender-."
6. "Decatur,MorganCounty"
7. "..Decatur,Jupiter"
8. end

. gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

. list city

     +---------+
     |    city |
     |---------|
  1. |   Alber |
  2. |   Alber |
  3. | Annisto |
  4. | Annisto |
  5. | Annisto |
     |---------|
  6. |   Decat |
  7. |   Decat |
     +---------+

.
end of do-file

William Lisowski

Join Date: Dec 2014
Posts: 10150

15 Feb 2020, 07:58

I am running Stata 16 and was able to reproduce your problem - even typing in my own data in case there were invisible characters in your example data. The problem seems to involve your use of [:punct:] to match punctuation marks.

Code:

. gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([^a-zA-Z])")

. list city

     +-------------+
     |        city |
     |-------------|
  1. | Albertville |
  2. | Albertville |
  3. |    Anniston |
  4. |    Anniston |
  5. |    Anniston |
     |-------------|
  6. |     Decatur |
  7. |     Decatur |
     +-------------+

You've been caught by the lack of documentation of the regular expression syntax supported by these functions.

Stata 14 and later are Unicode aware, and there is a parallel set of Unicode-aware regular expression functions (as there is for most string functions). (Note that ASCII is a proper subset of Unicode and thus these functions perform as expected with ASCII strings.) Replacing your functions with those solves the problem, suggesting that the older functions do not in fact support the POSIX character class syntax.

Code:

. gen city = ustrregexs(1) if ustrregexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

. list 

     +--------------------------------------------------+
     |                       LocationFirm          city |
     |--------------------------------------------------|
  1. | Albertville,First*................   Albertville |
  2. |           Albertville,Albertville-   Albertville |
  3. |               Anniston.Anniston—--      Anniston |
  4. |              Anniston;Commercial-.      Anniston |
  5. |                 Anniston^Blender-.       Blender |
     |--------------------------------------------------|
  6. |               Decatur,MorganCounty       Decatur |
  7. |                  ..Decatur,Jupiter       Decatur |
     +--------------------------------------------------+
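
A possible explanation of the truncated names, inferred from the output above rather than from any documentation: if regexm() reads [:punct:] as an ordinary bracket expression holding the literal characters :, p, u, n, c and t, then the greedy [a-zA-Z]+ backs up until the next character is one of those letters. That would give Alber (before t), Annisto (before n) and Decat (before u), which is what the original listing shows. A minimal sketch to check this reading, assuming Stata 16 behavior as in this thread; the variable names city_orig and city_lit are made up for illustration.

```stata
clear
input str60 LocationFirm
"Albertville,First*................"
"Anniston;Commercial-."
"Decatur,MorganCounty"
end

* Original pattern: the POSIX class name written as in post #1
gen city_orig = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

* Assumed equivalent: a plain set of the literal characters : p u n c t
gen city_lit = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct])")

* Compare the two results side by side
list city_orig city_lit, clean noobs
```

If the two columns agree, the byte-oriented functions are treating the class name as plain characters rather than as a punctuation class.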

To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. For those of us who are facile with regular expressions, the Unicode-aware functions represent a great improvement.

A parting thought: using ustrregexm, you may want something more like the following.

Code:

. gen city = ustrregexs(1) if ustrregexm(LocationFirm, "([[:alpha:][:blank:]]+)([:punct:])")

. list, clean noobs

                          LocationFirm          city  
                      Las Vegas,Nevada     Las Vegas  
    Albertville,First*................   Albertville  
              Albertville,Albertville-   Albertville  
                  Anniston.Anniston—--      Anniston  
                 Anniston;Commercial-.      Anniston  
                    Anniston^Blender-.       Blender  
                  Decatur,MorganCounty       Decatur  
                     ..Decatur,Jupiter       Decatur
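
Note that Anniston^Blender-. still yields Blender in the listings above. That is consistent with ^ being classified in Unicode as a symbol rather than as punctuation, so [:punct:] does not match it and the first successful match starts later in the string. Here is a sketch of one possible workaround, assuming that symbols should also count as delimiters (that choice is an assumption, not something stated in the original question).

```stata
clear
input str60 LocationFirm
"Anniston^Blender-."
"Anniston;Commercial-."
"..Decatur,Jupiter"
end

* \p{L} = letters, \p{Z} = separators (spaces)
* \p{P} = punctuation, \p{S} = symbols such as ^ (delimiter set is an assumption)
gen city = ustrregexs(1) if ustrregexm(LocationFirm, "([\p{L}\p{Z}]+)[\p{P}\p{S}]")

list, clean noobs
```

The first capture group is the first run of letters and spaces that is followed directly by a punctuation or symbol character, so leading punctuation such as .. is skipped.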


Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#3

15 Feb 2020, 10:21

https://www.stata.com/support/faqs/d...r-expressions/ lists what is supported by Stata's regular expression parser (regex* functions); POSIX character classes are not among them.

ICU regular expressions (ustrregex* functions) support matching Unicode properties. When using Unicode regular expressions, it is better to use Unicode properties than POSIX character classes. The latter were defined before Unicode, do not cover all characters, and differ across implementations.

Code:

* https://www.regular-expressions.info/unicode.html
* \p{UNICODE PROPERTY NAME}
* \p{L} or \p{Letter}: any kind of letter from any language.
* \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
gen city = ustrregexs(0) if ustrregexm( LocationFirm , "[\p{L}\p{Z}]+" )
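
To illustrate the point about coverage, here is a small sketch with two made-up observations (the town and firm names are hypothetical, not from the original data) comparing an ASCII-only letter class with the Unicode properties used above. With the ASCII class the match should stop at the first accented letter, while \p{L} should carry on through it. The variable names city_ascii and city_uni are also only for illustration.

```stata
clear
input str60 LocationFirm
"São Paulo,Banco"
"Zürich;Kantonalbank"
end

* ASCII-only letter class: accented letters are not part of a-z or A-Z
gen city_ascii = ustrregexs(0) if ustrregexm(LocationFirm, "[a-zA-Z]+")

* Unicode properties: letters from any script, plus separators
gen city_uni = ustrregexs(0) if ustrregexm(LocationFirm, "[\p{L}\p{Z}]+")

list, clean noobs
```

Both patterns use ustrregexs(0), the whole match, so any difference between the two columns comes from the character classes alone.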
Gary Richardson

Join Date: Jul 2019

Posts: 10
#4

15 Feb 2020, 10:24

Hi William. Thanks for the advice and code. Your solution works great.
Gary Richardson

Join Date: Jul 2019

Posts: 10
#5

15 Feb 2020, 10:25

Hi Bjarte. Thanks for the solution and explanation. Gary