Extracting commas between specific characters

Nicolas Echeverry

Join Date: Feb 2019
Posts: 2

24 Feb 2019, 09:36

Good afternoon, Statalist users,

I have a database that looks like the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str100 Politician
"Arthur Alexander (Wilson Center), Hugh Patrick (Wilson Center)"
"Milt Drucker (Director Summit of the Americas, U.S. Department of State)"
"John Sequeira (State Department, Office of Southern Africa Affairs), Ltc Greg Saunders (DOD, Office "
"Mr. Nakamura (president, Kawasaki Steel America)"
"Steven A. Thompson (Professional Staff Member), Roger Smith (House National Security Committee (HNSC"
"S. Chami of Kenya (Ambassador ), Mr. W. Imakando (Embassy of Zambia), Mr. S. Dlamini (Embassy of Swa"
"Randi Sutton,Texas Governor George Bush, Cheryl Parker Rose, Florida Governor Lawton Chiles (D)."
"Matt Mcmanus (Acting Division Chief, Energy-Producer Country Affairs, U.S. Department of State)"
end

I need to separate the names in each string. My first thought was to split on commas, but as you can see, some commas fall inside parentheses, and splitting on those would break off pieces that are not names. So I need to replace the commas inside parentheses with another character and then split the string.

Does anyone know how to replace commas only when they are found inside parentheses?

I've tried this, but it didn't work:

gen var2 = trim(regexr(Politician," \( (,)+\) *",""))

Thank you very much.

Last edited by Nicolas Echeverry; 24 Feb 2019, 10:14.

Tags: regex string extract, string

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

24 Feb 2019, 12:07

Here's a solution in the direction of what I think you want for an end result, but which does not use regular expressions. (Nothing wrong with regex technique, but often it's not needed.) Rather than use your strategy, my approach is to repeatedly find and remove parenthesized material in the given string until there is no such material left in any observation. This code will:
1) Create a copy of the given string without the material between parens. What remains is assumed to be a comma-separated list of names, and so is -split- into separate name variables.
2) Save each chunk of parenthesized material in separate variables for whatever further use.
I don't know whether the following is entirely robust with respect to unmatched parentheses.

Code:

local done = 0
gen int pos1 = .
gen int pos2 = .
local i 1
gen s = Politician
// Successively find material between parens, save it, and remove it from consideration
while (`done' == 0) {
   replace pos1 = strpos(s, "(")
   replace pos2 = strpos(s, ")" )
   gen paren`i' = substr(s, pos1, pos2-pos1 + 1)
   replace s = subinstr(s, paren`i', "", 1) if pos1 > 0
   cap assert (pos2 == 0 ) | (pos1 == 0)
   local done = (_rc == 0)
   local ++i
}
// Save and remove any trailing material, as in observation 3
gen paren`i' = substr(s, pos1, .) if (pos1 > 0)
// What if an unpaired ")" ?
gen byte unpaired = (pos2 > 0)
replace s = subinstr(s, paren`i', "", .)
split s, parse(",") gen(name)
list name*
list paren*
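
On the question of unmatched parentheses: one way to see which observations are affected before running the loop is to compare the number of "(" and ")" characters in each string. The following is only a minimal sketch under my own assumptions. The variable names n_open, n_close and unbalanced are hypothetical, and the three input rows are shortened versions of the example data. It flags unequal counts only, so a string such as ")(" would still pass.

```stata
clear
input str60 Politician
"Arthur Alexander (Wilson Center), Hugh Patrick (Wilson Center)"
"Ltc Greg Saunders (DOD, Office "
"Mr. Nakamura (president, Kawasaki Steel America)"
end

* Count a character by comparing string lengths before and after removing it
* (n_open, n_close and unbalanced are hypothetical names)
gen int n_open  = length(Politician) - length(subinstr(Politician, "(", "", .))
gen int n_close = length(Politician) - length(subinstr(Politician, ")", "", .))

* Flag observations where the two counts differ, e.g. truncated strings
gen byte unbalanced = (n_open != n_close)

list Politician n_open n_close if unbalanced
```

Only the second row, which is cut off before its closing parenthesis, should be listed.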
Nicolas Echeverry

Join Date: Feb 2019

Posts: 2
#3

24 Feb 2019, 16:05

Dear Mike, thank you for your reply. The code works perfectly. Thank you very much.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#4

24 Feb 2019, 18:41

Since Stata 14, Stata has had a set of ICU-based regular expression functions: ustrregexm(), ustrregexrf(), and ustrregexra(). ICU regular expressions support the negative look-ahead assertion (?! ... ), which is true if the parenthesized pattern does not match at the current input position, and which does not advance the input position. You may use:

Code:

gen var2 = ustrregexra(Politician, ",(?![^\(\)]*\))", "|")

to replace all commas outside parentheses with "|" (assuming "|" does not appear in your data).

Then you may split var2 on "|". Something like

Code:

gen name1 = substr(var2, 1, strpos(var2, "|")-1)
gen name2 = substr(var2, strpos(var2, "|")+1, .)
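
The two substr() calls above separate the string only at the first "|". For observations with more than two names, a sketch along the following lines might be more convenient. It assumes that -split- with parse("|") is acceptable, and the variable names inside and name are my own choices, not part of the original suggestion. It also shows the mirror-image pattern, a positive look-ahead (?= ... ), which targets the commas inside parentheses, as the original question asked. Note that both patterns rely on a closing ")" being present, so commas inside a parenthesis that is never closed (as in the truncated strings) are treated as outside.

```stata
clear
input str80 Politician
"Arthur Alexander (Wilson Center), Hugh Patrick (Wilson Center)"
"Mr. Nakamura (president, Kawasaki Steel America)"
"Randi Sutton,Texas Governor George Bush, Cheryl Parker Rose"
end

* Commas NOT followed by a closing parenthesis (outside parens) become "|"
gen var2 = ustrregexra(Politician, ",(?![^\(\)]*\))", "|")

* Mirror image: commas that ARE followed by a closing parenthesis become ";"
* (inside is a hypothetical variable name)
gen inside = ustrregexra(Politician, ",(?=[^\(\)]*\))", ";")

* split creates name1, name2, ... for as many pieces as the longest row needs
split var2, parse("|") gen(name)

* Remove the blanks left around each piece
foreach v of varlist name* {
    replace `v' = strtrim(`v')
}

list name*
list inside
```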
Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#5

25 Feb 2019, 08:01

In addition, for your example data the following should work:

Code:

gen var3a = subinstr(Politician, "),",")|",.)
split var3a , parse("|")

or using a regex:

Code:

gen var3b = ustrregexra(Politician, "\)\s?," ,")|")
split var3b , parse("|")
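
Both variants key on a comma that directly follows a closing parenthesis, so a row whose names carry no parenthesized material (such as the "Randi Sutton" row in the example data) is left in one piece. A rough check for such rows could look like the sketch below. The variable names n_pieces and leftover are hypothetical, and the look-ahead pattern is borrowed from #4.

```stata
clear
input str80 Politician
"Arthur Alexander (Wilson Center), Hugh Patrick (Wilson Center)"
"Randi Sutton,Texas Governor George Bush, Cheryl Parker Rose"
end

gen var3a = subinstr(Politician, "),", ")|", .)

* Number of pieces = number of "|" separators + 1 (n_pieces is a hypothetical name)
gen byte n_pieces = 1 + length(var3a) - length(subinstr(var3a, "|", "", .))

* A comma that survives outside parentheses suggests names that were not separated
gen byte leftover = ustrregexm(var3a, ",(?![^\(\)]*\))")

list var3a n_pieces leftover
```

Rows with leftover equal to 1 may need the look-ahead replacement from #4 instead.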
