  • Help with extracting strings using regular expressions

    Hi Statalist.

    I am trying to use regexs and regexm to extract information from strings. A sample is below. The strings contain town names, which always come first, and firm names, which always come second. The town and firm names are usually separated by a comma, but could be separated by any punctuation character. Occasionally, punctuation appears before the town name. The command that I wrote extracts only part of the town name:

    gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

    Please let me know if you have an idea about how I could extract the entire town name.

    Sample data, code, and output appear below.



    input str60 LocationFirm
    gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")
    list city

    . clear

    . do "C:\Users\garyr\AppData\Local\Temp\STD462c_000000.tmp"

    . input str60 LocationFirm

    1. "Albertville,First*................"
    2. "Albertville,Albertville-"
    3. "Anniston.Anniston—--"
    4. "Anniston;Commercial-."
    5. "Anniston^Blender-."
    6. "Decatur,MorganCounty"
    7. "..Decatur,Jupiter"
    8. end

    . gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([:punct:])")

    . list city

    | city |
    1. | Alber |
    2. | Alber |
    3. | Annisto |
    4. | Annisto |
    5. | Annisto |
    6. | Decat |
    7. | Decat |

    end of do-file

  • #2
    I am running Stata 16 and was able to reproduce your problem, even after typing in my own data in case there were invisible characters in your example data. The problem seems to involve your use of [:punct:] to match punctuation marks. Matching any non-letter with [^a-zA-Z] instead returns the full town name:
    . gen city = regexs(1) if regexm(LocationFirm, "([a-zA-Z]+)([^a-zA-Z])")
    . list city
         |        city |
      1. | Albertville |
      2. | Albertville |
      3. |    Anniston |
      4. |    Anniston |
      5. |    Anniston |
      6. |     Decatur |
      7. |     Decatur |
    You've been caught by the lack of documentation of the regular expression syntax supported by these functions.
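
    One reading that fits the output in #1, offered as an inference from that output rather than documented behavior: the older functions appear to treat [:punct:] as an ordinary bracket expression, that is, a set of the literal characters ":", "p", "u", "n", "c", and "t". The run of letters then backtracks until the next character is one of those, which gives "Alber" (followed by "t"), "Annisto" (followed by "n"), and "Decat" (followed by "u"). A minimal check, assuming regexm() behaves as it does in the Stata 16 output above:

    ```stata
    * Assumption: regexm() behaves as in the Stata 16 output shown in #1.
    * If [:punct:] is read as the set {: p u n c t}, the first call should
    * report a match (1) and the second should not (0).
    display regexm("t", "[:punct:]")
    display regexm(",", "[:punct:]")
    ```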

    Stata 14 and later are Unicode aware, and there is a parallel set of Unicode-aware regular expression functions (as there is for most string functions). (Note that ASCII is a proper subset of Unicode, so these functions perform as expected with ASCII strings.) Replacing your functions with those solves the problem, suggesting that the older functions do not in fact support the POSIX character class syntax.
    . gen city = ustrregexs(1) if ustrregexm(LocationFirm, "([a-zA-Z]+)([:punct:])")
    . list 
         |                       LocationFirm          city |
      1. | Albertville,First*................   Albertville |
      2. |           Albertville,Albertville-   Albertville |
      3. |               Anniston.Anniston—--      Anniston |
      4. |              Anniston;Commercial-.      Anniston |
      5. |                 Anniston^Blender-.       Blender |
      6. |               Decatur,MorganCounty       Decatur |
      7. |                  ..Decatur,Jupiter       Decatur |
    To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine, which is documented by the ICU project. For those of us who are facile with regular expressions, the Unicode-aware functions represent a great improvement.

    A parting thought: using ustrregexm, you may want something more like the following, which also captures town names that contain spaces.
    . gen city = ustrregexs(1) if ustrregexm(LocationFirm, "([[:alpha:][:blank:]]+)([:punct:])")
    . list, clean noobs
                              LocationFirm          city  
                          Las Vegas,Nevada     Las Vegas  
        Albertville,First*................   Albertville  
                  Albertville,Albertville-   Albertville  
                      Anniston.Anniston—--      Anniston  
                     Anniston;Commercial-.      Anniston  
                        Anniston^Blender-.       Blender  
                      Decatur,MorganCounty       Decatur  
                         ..Decatur,Jupiter       Decatur
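
    Note that row "Anniston^Blender-." returns "Blender" rather than "Anniston". A likely explanation, stated here as an assumption about how the ICU engine reads [:punct:]: Unicode classifies "^" as a symbol rather than as punctuation, so [:punct:] does not match it, and the first letters-then-punctuation match is "Blender-". A sketch that accepts either Unicode punctuation (\p{P}) or symbols (\p{S}) as the separator; city2 is a hypothetical variable name:

    ```stata
    * Capture the first run of letters/separators that is followed by
    * a punctuation (\p{P}) or symbol (\p{S}) character.
    gen city2 = ustrregexs(1) if ustrregexm(LocationFirm, "([\p{L}\p{Z}]+)[\p{P}\p{S}]")
    list LocationFirm city2, clean noobs
    ```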


    • #3
      The documented list of what Stata's regular expression parser (regex* functions) supports does not include POSIX character classes.

      ICU regular expressions (ustrregex* functions) support matching Unicode properties. When using Unicode regex, it is better to use Unicode properties than "POSIX character classes". The latter were defined before Unicode, do not cover all characters, and differ between implementations.
      * \p{L} or \p{Letter}: any kind of letter from any language.
      * \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
      gen city = ustrregexs(0) if ustrregexm( LocationFirm , "[\p{L}\p{Z}]+" )
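
      For anyone who wants to run this end to end, here is a self-contained sketch using four of the sample strings from #1 (a comma, a semicolon, a caret, and leading punctuation). Because the pattern takes the first run of letters and separators, it should not depend on which character sits between town and firm:

      ```stata
      clear
      input str60 LocationFirm
      "Albertville,First*................"
      "Anniston;Commercial-."
      "Anniston^Blender-."
      "..Decatur,Jupiter"
      end
      * ustrregexs(0) returns the whole match, here the first run of
      * letters (\p{L}) and separators (\p{Z}) found in the string.
      gen city = ustrregexs(0) if ustrregexm(LocationFirm, "[\p{L}\p{Z}]+")
      list, clean noobs
      ```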


      • #4
        Hi William. Thanks for the advice and code. Your solution works great.


        • #5
          Hi Bjarte. Thanks for the solution and explanation. Gary