Regular expression in Stata

Chen Samulsion

Join Date: Jan 2018

Posts: 926
#1

18 Sep 2021, 22:13

Dear Stata users,

Regular expressions are used frequently nowadays, but Stata's help file on the regular expression functions -regexm()-, -regexr()-, and -regexs()- is too brief. Suppose there is a string variable that stores letters and numbers, and some of the letters are in lowercase. I want to convert them to uppercase. I can of course use the string function -strupper()- to achieve that, but what is the regular expression equivalent?

Code:

. input str10 address

        address
  1. abcd1010
  2. BDTY0204
  3. TWcb0203
  4. jbtw0987
  5. jbFL0105
  6. xtbJ0108
  7. end

. generate address_upper=strupper(address)

. list

     +---------------------+
     |  address   addres~r |
     |---------------------|
  1. | abcd1010   ABCD1010 |
  2. | BDTY0204   BDTY0204 |
  3. | TWcb0203   TWCB0203 |
  4. | jbtw0987   JBTW0987 |
  5. | jbFL0105   JBFL0105 |
     |---------------------|
  6. | xtbJ0108   XTBJ0108 |
     +---------------------+
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#2

18 Sep 2021, 23:21

I would say that -strupper()- and its related functions (such as -strlower()-; note also the Unicode versions) are the most suitable in your case, since you want to change the case of the entire string. As for your direct question: your data look more like license plate numbers, which tend to have very fixed patterns, than addresses. A regex approach would have to match a pattern, extract substrings, change the case of each substring, and then recombine the substrings.
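The match, extract, convert, and recombine steps described above might look like the following minimal sketch. It assumes every value consists of letters followed by digits, as in the example data; the pattern and the variable name address_upper2 are illustrative assumptions, not something taken from the thread.

```stata
clear
input str10 address
"abcd1010"
"TWcb0203"
"jbFL0105"
end

* Group 1 captures the leading letters, group 2 the trailing digits.
* Only group 1 is passed through strupper(); the pieces are then rejoined.
* Values that do not match the assumed pattern are left as empty strings.
generate address_upper2 = strupper(ustregexs(1)) + ustregexs(2) if ustregexm(address, "^([A-Za-z]+)([0-9]+)$")

list
```

Note that the case change itself is still done by -strupper()-; the regular expression only locates the pieces, which is why calling -strupper()- on the whole string is simpler here.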

Generally, regular expressions are primarily intended for matching patterns, not string manipulation. Add to this the complication that not all implementations of regular expression engines allow for character case conversion. The regex engines that Stata uses do not appear to be able to do this.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

19 Sep 2021, 08:30

Chen Samulsion -

Since you mention regexm, regexr, and regexs, I will mention that the Unicode regular expression functions introduced in Stata 14 – for example, ustregexra – have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at https://unicode-org.github.io/icu/us...gs/regexp.html. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.
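As a small illustration of the richer syntax available in the Unicode functions, the sketch below uses the Unicode property class \p{Ll} (lowercase letter), which ICU regular expressions support, to flag values that still contain lowercase letters. The variable name has_lower is a hypothetical choice for this example.

```stata
clear
input str10 address
"abcd1010"
"BDTY0204"
"TWcb0203"
end

* \p{Ll} matches any lowercase letter, including non-ASCII ones.
* ustregexm() returns 1 when the pattern is found and 0 otherwise.
generate byte has_lower = ustregexm(address, "\p{Ll}")

list address if has_lower
```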

With that said, an examination of the ICU regular expression engine does not reveal a way to transform lower case to upper case in a replacement string. Some web searching finds that there are regular expression parsers that contain metacharacters that will do this, but the ICU engine is not one of them.
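Because the replacement string cannot change case, one possible workaround is to let Stata supply the uppercase text and call -ustregexra()- once per letter. This is only a sketch under the assumption that the strings contain plain ASCII letters; the variable name address_upper3 and the local macro name ch are illustrative.

```stata
clear
input str10 address
"abcd1010"
"TWcb0203"
"xtbJ0108"
end

generate address_upper3 = address

* c(alpha) holds the lowercase letters a through z, separated by spaces.
* For each letter, replace every occurrence with its uppercase form,
* which is computed by strupper() rather than by the regex engine.
foreach ch in `c(alpha)' {
    replace address_upper3 = ustregexra(address_upper3, "`ch'", strupper("`ch'"))
}

list
```

This makes 26 passes over the variable, so it is mainly of interest as a demonstration; a single call to -strupper()- gives the same result for ASCII text.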
Chen Samulsion

Join Date: Jan 2018

Posts: 926
#4

23 Sep 2021, 21:19

Thank you both so much, Leonardo Guizzetti and William Lisowski.