How to delete everything in different parentheses for a string variable

Ann Ng

Join Date: Apr 2017

Posts: 25
#1

10 Jun 2017, 21:51

Dear Statalists,
I want to delete everything inside each pair of parentheses in a string variable. I found this topic: http://www.stata.com/statalist/archi.../msg00312.html
However, the code there deletes everything after the first opening parenthesis. My string variable (Info) looks like this:
ID Info

1 carrot 4% (eating and growing), banana 6%, apple (eating, growing, etc.) 10%, orange 5%

2 banana 5%

3 pineapple 10% (not including xxx), others 5%

I want to remove all the additional information in parentheses from the Info variable before splitting that variable on commas (",").
Could anyone help me with this?
Thank you very much!

Last edited by Ann Ng; 10 Jun 2017, 21:54.

Chris Larkin

Join Date: Apr 2016
Posts: 296

10 Jun 2017, 22:33

Hi Ann,

The code below works. One quick thing: when sharing data on Statalist, it would be great if you could use the dataex command (which you can install by running ssc install dataex). It lets people reproduce your problem in Stata itself and work out code that helps you.

Code:

//FLAG: this is an example of how data is shared using dataex
clear
input byte id str87 info
1 "carrot 4% (eating and growing), banana 6%, apple (eating, growing, etc.) 10%, orange 5%"
2 "banana 5%"                                                                              
3 "pineapple 10% (not including xxx), others 5%"                                          
end

//this removes everything within (and including) the parentheses
//noccur() comes from egenmore; if the egen line doesn't run for you, type ssc install egenmore
tempvar n
egen `n' = noccur(info), string("(")
summ `n', meanonly
forvalues i = 1/`r(max)'{
    replace info = subinstr(info, substr(info, strpos(info, "("), strpos(info, ")")-(strpos(info, "("))+1), "",.)
}

//this splits the variable and removes leading and trailing spaces
split info, p(",")
forvalues i = 1/`r(nvars)'{
    replace info`i' = trim(info`i')
}
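
To make the replace line easier to follow, here is a minimal sketch of the same position arithmetic on a single string. The local macro names (s, open, close, chunk) are only for illustration and are not part of the code above.

```stata
* One example string held in a local macro (illustrative only)
local s "apple (eating, growing, etc.) 10%"

* Positions of the first "(" and the first ")"
local open  = strpos("`s'", "(")
local close = strpos("`s'", ")")

* The chunk from "(" through ")" inclusive
local chunk = substr("`s'", `open', `close' - `open' + 1)
display "`chunk'"

* Remove that chunk from the string
display subinstr("`s'", "`chunk'", "", .)
```

Each pass of the loop does this once per observation, which is why the loop runs as many times as the largest number of "(" found in any row.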

Last edited by Chris Larkin; 10 Jun 2017, 22:35.

Ann Ng

Join Date: Apr 2017

Posts: 25
#3

11 Jun 2017, 03:35

Dear Chris,
Thank you very much for the code, which works beautifully. I am sorry for not using dataex, although Nick also reminded me once before :D. I promise to use dataex next time.

Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#4

11 Jun 2017, 04:58

Here is an alternative using regular expressions:

Code:

gen re = "\([A-Za-z0-9, \.]+\)"
while ( sum(regexm(info,re))>0 ) {
    replace info = ltrim(regexr(info,re,""))
}
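
One caveat with a loop like this, as far as I understand Stata's while: a condition that involves variables is evaluated for the first observation only, so the loop may stop while later rows still contain parentheses. Below is a sketch of a variant that counts matches over all observations instead. The pattern "\([^)]*\)" is my own choice (anything between "(" and the next ")"), not taken from the code above.

```stata
clear
input byte id str87 info
1 "carrot 4% (eating and growing), banana 6%, apple (eating, growing, etc.) 10%, orange 5%"
2 "banana 5%"
3 "pineapple 10% (not including xxx), others 5%"
end

* Count the observations that still contain a "(...)" group
count if regexm(info, "\([^)]*\)")
while r(N) > 0 {
    * Remove the first group in each observation that has one
    replace info = regexr(info, "\([^)]*\)", "") if regexm(info, "\([^)]*\)")
    count if regexm(info, "\([^)]*\)")
}
replace info = trim(info)
list
```

Because the loop is driven by count, it does not matter which row has the most parentheses.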

Aside:
I tried Chris' code with version 12.1 (the Stata version on my personal computer), and it works only for the first row of the dataset, not for the others.
Chris' code does not take into account that the number of times "(" appears is not constant across observations, and it leaves the second and third rows empty.
It works fine with version 14.2.

I had to modify the following line to get it to work:

Code:

replace info = subinstr(info, substr(info, strpos(info, "("), ///
    strpos(info, ")")-(strpos(info, "("))+1), "",.) if strpos(info, "(") >0

Last edited by Christophe Kolodziejczyk; 11 Jun 2017, 05:09.

Robert Picard

Join Date: Mar 2014
Posts: 1536

11 Jun 2017, 08:41

There is a Unicode version of regexr(), ustrregexra(), that replaces all matches in one call (it requires Stata 14 or later). You will need a non-greedy quantifier so that each match stops at the nearest closing parenthesis.

Code:

clear
input byte id str87 info
1 "carrot 4% (eating and growing), banana 6%, apple (eating, growing, etc.) 10%, orange 5%"
2 "banana 5%"                                                                              
3 "pineapple 10% (not including xxx), others 5%"                                          
end

gen s = ustrregexra(info,"\(.+?\)","")
split s, parse(,) gen(stub_)

list stub_*

And the results:

Code:

. list stub_*

     +--------------------------------------------------------+
     |         stub_1       stub_2        stub_3       stub_4 |
     |--------------------------------------------------------|
  1. |     carrot 4%     banana 6%    apple  10%    orange 5% |
  2. |      banana 5%                                         |
  3. | pineapple 10%     others 5%                            |
     +--------------------------------------------------------+

.
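
To see why the non-greedy quantifier matters, here is a small sketch comparing the two patterns on one made-up string. The variable names s, greedy and lazy are only for illustration.

```stata
clear
input str40 s
"apple (a, b) 10%, pear (c) 5%"
end

* Greedy: .+ runs from the first "(" to the LAST ")" in the string
gen greedy = ustrregexra(s, "\(.+\)", "")

* Non-greedy: .+? stops at the nearest ")" so each pair is removed separately
gen lazy = ustrregexra(s, "\(.+?\)", "")

list greedy lazy
```

The greedy version also swallows the text between the two pairs of parentheses, while the non-greedy version keeps it.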

Chris Larkin

Join Date: Apr 2016
Posts: 296

11 Jun 2017, 11:22

Hi Christophe Kolodziejczyk,

Do you know why my code only works on the first row with version 12.1?

I'm on 14.2 and it works fine, producing this output:

Code:

. list info1-info4

     +----------------------------------------------------+
     |         info1       info2        info3       info4 |
     |----------------------------------------------------|
  1. |     carrot 4%   banana 6%   apple  10%   orange 5% |
  2. |     banana 5%                                      |
  3. | pineapple 10%   others 5%                          |
     +----------------------------------------------------+

So it definitely works across all rows.
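
Removing a parenthesized group from the middle of an item leaves two blanks behind, as in "apple  10%" in the listing. A minimal sketch for tidying that up, assuming itrim() is available in your version of Stata (newer versions document it as stritrim()); the two example values are made up:

```stata
clear
input str20 info3
"apple  10%"
" others 5% "
end

* trim() drops leading/trailing blanks; itrim() collapses repeated internal blanks
replace info3 = itrim(trim(info3))
list
```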
