Extract substring between nth and (n+1)th commas in a variable

Ryan Zalla

Join Date: Feb 2019

Posts: 9
#1

30 Mar 2021, 19:20

How can I extract a substring between the nth and (n+1)th commas in a variable?

For example, consider ID = 3 and beta = "eight,nine,ten,eleven,twelve". How could I extract the substring between the 3rd and 4th commas? (Answer: "eleven")

Code:

clear
input ID strL beta
1 "one,two,three,four"
2 "five,six,seven"
3 "eight,nine,ten,eleven,twelve"
end

Please note this is a vastly simplified example of an 80,000+ observation dataset where I have as many as 1,000 commas in an observation of the variable beta. I am using Stata 16.1 on Windows 10.

Many thanks!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

30 Mar 2021, 19:58

Added in edit: I misread the question. The code below takes the token before comma number ID rather than the token after comma number ID. To get the token after comma number ID, remove the "-1" in each of the two places it occurs.

Code:

. generate wanted = ustrregexrf(beta, "([^,]*,){"+string(ID-1)+"}([^,]*).*", "$2") ///
>     if ustrregexm(beta, "([^,]*,){"+string(ID-1)+"}([^,]*).*")
(1 missing value generated)

. list

     +---------------------------------------------+
     |  ID                           beta   wanted |
     |---------------------------------------------|
  1. |   1             one,two,three,four      one |
  2. |   2                 five,six,seven      six |
  3. |   3   eight,nine,ten,eleven,twelve      ten |
  4. | 666                            one          |
     +---------------------------------------------+
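For reference, here is a minimal sketch of what the edit note describes, with the "-1" removed so that the token after comma number ID is returned. The variable names pattern and wanted_after are illustrative and not part of the original post; storing the pattern in a variable is only a convenience to avoid typing it twice.

```stata
clear
input ID strL beta
1 "one,two,three,four"
2 "five,six,seven"
3 "eight,nine,ten,eleven,twelve"
end

* Skip exactly ID comma-terminated tokens, then capture the next token.
* pattern and wanted_after are hypothetical names chosen for this sketch.
generate strL pattern = "([^,]*,){" + string(ID) + "}([^,]*).*"

* Replace the whole match with capture group 2; leave missing when no match.
generate strL wanted_after = ustrregexrf(beta, pattern, "$2") if ustrregexm(beta, pattern)

list ID beta wanted_after
```

With the example data this should give "two", "seven", and "eleven". When ID exceeds the number of commas in beta, the pattern does not match and the result is left missing.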

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

The functions for replacement support "capture group" references in the substitution string. Capture groups are surrounded with parentheses in the regular expression being matched and capture groups are referenced as $1, $2, ... .
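As a small illustration of capture group references, the sketch below swaps two comma-separated tokens. The input string "alpha,beta" and the local macro name swapped are made up for this example.

```stata
* $1 is the text matched by the first parenthesized group, $2 by the second.
local swapped = ustrregexrf("alpha,beta", "([^,]*),([^,]*)", "$2,$1")
display "`swapped'"
```

Here the first group captures "alpha" and the second captures "beta", so the replacement string "$2,$1" should produce beta,alpha.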

Last edited by William Lisowski; 30 Mar 2021, 20:06.
Ryan Zalla

Join Date: Feb 2019

Posts: 9
#3

30 Mar 2021, 21:20

William, thank you very much! I would have had a very tough time coming up with this solution myself. I appreciate your time and attention! Thanks again.