
  • Regular expression in Stata

    Dear Stata users,

    Regular expressions are used frequently nowadays. Stata's help file on the regular expression functions, that is -regexm-, -regexr-, and -regexs-, is too simple. Suppose there is a string variable which stores letters and numbers, and some of the letters are in lowercase. Now I want to convert them to uppercase. Surely I can use the string function -strupper()- to achieve that. But what is the regular expression equivalent?

    Code:
    . input str10 address
    
            address
      1. abcd1010
      2. BDTY0204
      3. TWcb0203
      4. jbtw0987
      5. jbFL0105
      6. xtbJ0108
      7. end
    
    . generate address_upper=strupper(address)
    
    . list
    
         +---------------------+
         |  address   addres~r |
         |---------------------|
      1. | abcd1010   ABCD1010 |
      2. | BDTY0204   BDTY0204 |
      3. | TWcb0203   TWCB0203 |
      4. | jbtw0987   JBTW0987 |
      5. | jbFL0105   JBFL0105 |
         |---------------------|
      6. | xtbJ0108   XTBJ0108 |
         +---------------------+

  • #2
    I would say that -strupper()- and its related functions such as -strlower()- (note also the Unicode versions) are most suitable in your case, since you want to change the case of the entire string. To your direct question: your data look more like license plate numbers, which tend to have very fixed patterns, than addresses. A regex approach would have to match a pattern, extract substrings, change the case of each substring, and recombine the substrings.

    Generally, regular expressions are primarily intended for matching patterns, not string manipulation. Add to this the complication that not all implementations of regular expression engines allow for character case conversion. The regex engines that Stata uses do not appear to be able to do this.
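
    The steps described above can be sketched as follows. This is only a minimal illustration, assuming every value is a run of ASCII letters followed by a run of digits, as in the example data; the variable name address_re is made up for this sketch. Note that the case change itself is still done by -strupper()-, not by the regular expression.

    Code:
    * match the pattern, capturing the letter run (1) and the digit run (2);
    * uppercase the letters, then recombine the two pieces
    generate address_re = strupper(regexs(1)) + regexs(2) if regexm(address, "^([a-zA-Z]+)([0-9]+)$")

    Observations that do not fit the assumed pattern are left empty, which is one more reason to prefer -strupper()- on the whole string.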



    • #3
      Chen Samulsion -

      Since you mention regexm, regexr, and regexs, I will mention that the Unicode regular expression functions introduced in Stata 14 (for example, ustregexra) have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at https://unicode-org.github.io/icu/us...gs/regexp.html. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

      With that said, an examination of the ICU regular expression engine does not reveal a way to transform lower case to upper case in a replacement string. Some web searching finds that there are regular expression parsers that contain metacharacters that will do this, but the ICU engine is not one of them.
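
      If a regular expression function must be used anyway, one possible workaround is to supply each uppercase letter as a literal replacement, one letter at a time. A minimal sketch, assuming only the ASCII letters a through z need converting and Stata 14 or later; address_u is a made-up variable name:

      Code:
      generate address_u = address
      * c(alpha) holds the lowercase letters a through z, separated by spaces
      foreach c in `c(alpha)' {
          * the uppercase letter comes from strupper(), not from the regex engine
          replace address_u = ustregexra(address_u, "`c'", strupper("`c'"))
      }

      This takes 26 passes over the data, so it only illustrates the limitation; it is not a replacement for -strupper()-.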



      • #4
        Thank you both so much, Leonardo Guizzetti and William Lisowski.
