remove text from multiple brackets

Ylenia Curci

Join Date: Sep 2017

Posts: 72
#1


29 Sep 2017, 05:29

Hello,

I have a variable called ist that looks like this:

[Rabhi, Sameh; Rabhi, Imen; Guizani-Tabbane, Lamia] Univ Tunis El Manar, Inst Pasteur Tunis, Lab Parasitol Med Biotechnol & Biomol, 13,Pl Pasteur BP 74, Tunis 1002, Tunisia; [Rabhi, Sameh] Univ Carthage, Ave Republ BP 77, Carthage 1054, Tunisia; [Rabhi, Imen] Univ Manouba, Biotechnol & Biogeo Resources Valorizat Lab LR11E, Sidi Thabet 2020, Ariana, Tunisia; [Rabhi, Imen] Univ Manouba, Higher Inst Biotechnol, Sidi Thabet 2020, Ariana, Tunisia; [Trentin, Bernadette; Piquemal, David] Acobiom Cap Delta Biopole Euromed II, 1682 Rue Valsiere, F-34184 Montpellier 4, France; [Regnault, Beatrice] Inst Pasteur Paris, Genopole, DNA Chip Platform, 25-28 Rue Dr Roux, F-75015 Paris, France; [Goyard, Sophie; Lang, Thierry] Inst Pasteur, Dept Infect & Epidemiol, Lab Proc Infect Trypanosomatides, 26 Rue Dr Roux, F-75724 Paris 15, France; [Descoteaux, Albert] INRS Inst Armand Frappier, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Descoteaux, Albert] Ctr Host Parasite Interact, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Enninga, Jost] Inst Pasteur, Dynam Host Pathogen Interact Unit, 25 Rue Dr Roux, F-75724 Paris, France

and I want to get rid of all the text in brackets, but when I use the following, I end up with empty cells

replace ist= substr(ist, 1, strpos(ist, "(") - 1)

Thank you for your help!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

29 Sep 2017, 06:41

Perhaps you want

Code:

replace ist = substr( ist, strpos(ist,"] ")+1, . )

which will delete everything from the start of the string through the first right bracket. The space that follows the bracket is kept, because the substring starts one character after the bracket.
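To see what the expression does, here is a small sketch on a made-up string (the string and the local macro name `s` are illustrative assumptions, not the original data):

```stata
* hypothetical example string, not the poster's data
local s "[Ann; Bob] Univ A; [Bob] Univ B"

* position of the first "] " (the "]" is character 10 here)
display strpos("`s'", "] ")

* everything after the first "]"; the result begins with a space
display substr("`s'", strpos("`s'", "] ") + 1, .)
```

Only the first bracketed group is removed and later ones remain. Rows without "] " are left unchanged, because strpos() returns 0 there and the substring then starts at character 1.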
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#3

30 Sep 2017, 04:07

Thank you William for your reply.
I have already tried that code too, but I need to get rid of everything that is in all the brackets in the string. I can use the code you suggested until I see 0 real changes made, but I was hoping to find a more straightforward solution.

Daniel Bela

Join Date: Apr 2014
Posts: 246

30 Sep 2017, 05:52

Dear Ylenia,

Please consider reading section 12 of the FAQ (and the rest of the FAQ as well, of course) on how to create a minimal (possibly imaginary) data example that represents your data structure using -dataex- (SSC). Such an example helps readers who are willing to answer your question enormously in understanding the problem. I assume the main reason your question has received only one answer so far is the absence of such a data example.

William Lisowski's answer is an absolutely correct fix to the code you presented; your code just did not match what you wanted to achieve.

I suggest taking a step back and cleaning up the data source before you import your data into Stata. What you presented looks very much as if several pieces of information (names and institutional addresses) are mixed together; possibly it was imported from an HTML file, as some issues with special characters seem to indicate. So my recommendation is: go back to the original parser, clean up, and then import into Stata.

That said, I created a minimal data example for you and wrote a short while loop to do what you (presumably) want to do:

Code:

version 14
clear
input str1136(data)
`"[Rabhi, Sameh; Rabhi, Imen; Guizani-Tabbane, Lamia] Univ Tunis El Manar, Inst Pasteur Tunis, Lab Parasitol Med Biotechnol &amp; Biomol, 13,Pl Pasteur BP 74, Tunis 1002, Tunisia; [Rabhi, Sameh] Univ Carthage, Ave Republ BP 77, Carthage 1054, Tunisia; [Rabhi, Imen] Univ Manouba, Biotechnol &amp; Biogeo Resources Valorizat Lab LR11E, Sidi Thabet 2020, Ariana, Tunisia; [Rabhi, Imen] Univ Manouba, Higher Inst Biotechnol, Sidi Thabet 2020, Ariana, Tunisia; [Trentin, Bernadette; Piquemal, David] Acobiom Cap Delta Biopole Euromed II, 1682 Rue Valsiere, F-34184 Montpellier 4, France; [Regnault, Beatrice] Inst Pasteur Paris, Genopole, DNA Chip Platform, 25-28 Rue Dr Roux, F-75015 Paris, France; [Goyard, Sophie; Lang, Thierry] Inst Pasteur, Dept Infect &amp; Epidemiol, Lab Proc Infect Trypanosomatides, 26 Rue Dr Roux, F-75724 Paris 15, France; [Descoteaux, Albert] INRS Inst Armand Frappier, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Descoteaux, Albert] Ctr Host Parasite Interact, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Enninga, Jost] Inst Pasteur, Dynam Host Pathogen Interact Unit, 25 Rue Dr Roux, F-75724 Paris, France"'
end

// count how many observations contain some text inside brackets
quietly : count if (ustrregexm(data,`"(.*)\[.*\] (.*)"'))
// repeat the following as long as the previous count is not 0
while r(N)>0 {
    // remove one bracket pair (and its content) from the string; because the leading (.*) is greedy, this is the last pair that is followed by a space
    replace data=ustrregexs(1)+ustrregexs(2) if (ustrregexm(data,`"(.*)\[.*\] (.*)"'))
    // count how many observations contain some text inside brackets (this step at the end of the while-loop is essential!)
    quietly : count if (ustrregexm(data,`"(.*)\[.*\] (.*)"'))
}    // <-- end of while-loop

I think that working with regular expressions to remove the bracketed text is the most straightforward way to do it; note that my code uses the Unicode-aware regular expression functions of Stata 14 or newer, hence the "version 14" statement at the beginning. To understand these regular expression functions, have a look at the corresponding help file (and the linked PDF documentation).
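As a hedged illustration of how the pattern used in the loop splits a string, the sketch below applies it to a made-up value (the local macro `s` and its contents are assumptions for the example). Because the leading `(.*)` is greedy, the bracket pair that gets matched is the last one in the string, so each pass of the loop strips one pair, working from the right:

```stata
version 14
* hypothetical example string, not the poster's data
local s "[Ann] Univ A; [Bob] Univ B"

* 1 if the pattern matches, 0 otherwise
display ustrregexm("`s'", "(.*)\[.*\] (.*)")

* text captured before the matched bracket pair
display ustrregexs(1)

* text captured after the matched bracket pair and its trailing space
display ustrregexs(2)

* what a single pass of the loop would keep
display ustrregexs(1) + ustrregexs(2)
```

After this pass the string still contains "[Ann]", which is why the count at the end of the loop is needed to trigger another pass.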

Regards
Bela

Last edited by Daniel Bela; 30 Sep 2017, 05:58. Reason: added link to Stata help for regular expressions


Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#5

30 Sep 2017, 07:31

ustrregexra() replaces all substrings within the string that match:

Code:

replace data = ustrregexra(data, "\[.*?\]" , "" ) if strpos(data,"[")
* maybe trimming blanks
replace data = trim(itrim(data))
compress

ICU's Regular Expressions: http://userguide.icu-project.org/strings/regexp
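The difference the "?" makes can be seen in a minimal sketch on a made-up string (the local macro `s` and its contents are assumptions, not the original data). With the lazy `.*?` each bracket pair is matched on its own; with the greedy `.*` a single match runs from the first "[" to the last "]" and takes the text in between with it:

```stata
version 14
* hypothetical example string, not the poster's data
local s "[Ann] Univ A; [Bob] Univ B"

* lazy quantifier: only the two bracketed groups are removed
display ustrregexra("`s'", "\[.*?\]", "")

* greedy quantifier: everything from the first "[" to the last "]" is removed
display ustrregexra("`s'", "\[.*\]", "")
```

The lazy version leaves the blanks that surrounded each bracket pair, which is what the optional trim(itrim()) step cleans up.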

Last edited by Bjarte Aagnes; 30 Sep 2017, 07:33.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

30 Sep 2017, 10:14

Bjarte Aagnes -

Many sincere thanks for the reference on ICU's regular expressions, upon which Stata's are based.

I apparently missed the topic you answered in early September, because the "*?" syntax you discussed here was, until now, unfamiliar to me. ICU is referenced nowhere in the official Stata documentation, sorry to say, nor is equivalent information given.

I have added the link you provided as a new post on the end of an earlier topic on learning to use regular expressions.
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#7

30 Sep 2017, 10:16

Hi Daniel, thank you for your suggestion; I will read the FAQ. I was surprised by the confusion, and then I realized that I had copied the wrong line of code into the first post. That line was the one used to get rid of the text in the last part of the string.

The data are txt, and because I have semicolons between both names and affiliations, I have to keep them in one variable. The special characters are fine in the data; something went wrong in my copy and paste into the forum.

Unfortunately, I am still stuck with Stata 13, so I guess I cannot run your code. I have a (most probably) naive question: why can't I use the subinstr function with the wildcard * and do something like this?

replace ist = subinstr(ist, "[*]", " ", .)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

30 Sep 2017, 11:47

Why can't I use the subinstr function with the wildcard * and do something like this? replace ist = subinstr(ist, "[*]", " ", .)

Because nothing in the documentation for the subinstr function found in the output of the help subinstr command suggests that any sort of wildcard is supported.

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

30 Sep 2017, 12:27

Ylenia, an answer to your question in post #7:

subinstr() does an exact string match, and the asterisk is treated like any other character.

Code:

display subinstr("abc[*]efg", "[*]", "_", 1)
abc_efg

Since you use Stata 13, I think you must make repeated use of strpos(), substr() and subinstr() to strip off the bracketed strings:

Code:

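version 13
clear all
set obs 3
gen str = "A[bbb]C[ddd]"
replace str = _n * str
gen new = str
// count first, so that r(N) is defined when the loop condition is first tested
quietly count if regexm(new, "\[.*\]")
while ( r(N) > 0 ) {
    // remove every copy of the first bracketed substring in each observation
    quietly replace new = subinstr(new,substr(new,strpos(new,"["),1+strpos(new,"]")-strpos(new,"[")),"",.)
    // recount; the loop ends when no observation contains a bracket pair
    quietly count if regexm(new, "\[.*\]")
}
list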

Code:

     +-----------------------------------------------+
     |                                  str      new |
     |-----------------------------------------------|
  1. |                         A[bbb]C[ddd]       AC |
  2. |             A[bbb]C[ddd]A[bbb]C[ddd]     ACAC |
  3. | A[bbb]C[ddd]A[bbb]C[ddd]A[bbb]C[ddd]   ACACAC |
     +-----------------------------------------------+
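A single pass of that approach can be traced by hand. The sketch below is only an illustration on one literal string; the local macro names `s`, `from` and `len` are assumptions introduced for the example:

```stata
version 13
* hypothetical single value, same shape as the first observation above
local s "A[bbb]C[ddd]"

* start of the first bracketed substring and its length, brackets included
local from = strpos("`s'", "[")
local len  = 1 + strpos("`s'", "]") - `from'

* the substring that will be removed in this pass
display substr("`s'", `from', `len')

* the string after one pass; a second pass would remove "[ddd]"
display subinstr("`s'", substr("`s'", `from', `len'), "", .)
```

Because subinstr() is called with "." as its last argument, every copy of the extracted substring is removed in the same pass, so repeated groups such as those in observations 2 and 3 need no extra iterations.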

Last edited by Bjarte Aagnes; 30 Sep 2017, 13:01.


Ylenia Curci

Join Date: Sep 2017
Posts: 72

#10

01 Oct 2017, 10:40

THANK YOU!!! This code works perfectly!

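Originally posted by Bjarte Aagnes

Code:

gen new = str
// count first, so that r(N) is defined when the loop condition is first tested
quietly count if regexm(new, "\[.*\]")
while ( r(N) > 0 ) {
    quietly replace new = subinstr(new,substr(new,strpos(new,"["),1+strpos(new,"]")-strpos(new,"[")),"",.)
    quietly count if regexm(new, "\[.*\]")
}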
