  • remove text from multiple brackets

    Hello,

    I have a variable called ist that looks like this:

    [Rabhi, Sameh; Rabhi, Imen; Guizani-Tabbane, Lamia] Univ Tunis El Manar, Inst Pasteur Tunis, Lab Parasitol Med Biotechnol & Biomol, 13,Pl Pasteur BP 74, Tunis 1002, Tunisia; [Rabhi, Sameh] Univ Carthage, Ave Republ BP 77, Carthage 1054, Tunisia; [Rabhi, Imen] Univ Manouba, Biotechnol & Biogeo Resources Valorizat Lab LR11E, Sidi Thabet 2020, Ariana, Tunisia; [Rabhi, Imen] Univ Manouba, Higher Inst Biotechnol, Sidi Thabet 2020, Ariana, Tunisia; [Trentin, Bernadette; Piquemal, David] Acobiom Cap Delta Biopole Euromed II, 1682 Rue Valsiere, F-34184 Montpellier 4, France; [Regnault, Beatrice] Inst Pasteur Paris, Genopole, DNA Chip Platform, 25-28 Rue Dr Roux, F-75015 Paris, France; [Goyard, Sophie; Lang, Thierry] Inst Pasteur, Dept Infect & Epidemiol, Lab Proc Infect Trypanosomatides, 26 Rue Dr Roux, F-75724 Paris 15, France; [Descoteaux, Albert] INRS Inst Armand Frappier, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Descoteaux, Albert] Ctr Host Parasite Interact, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Enninga, Jost] Inst Pasteur, Dynam Host Pathogen Interact Unit, 25 Rue Dr Roux, F-75724 Paris, France

    and I want to get rid of all the text in brackets, but when I use the following, I end up with empty cells

    replace ist= substr(ist, 1, strpos(ist, "(") - 1)

    Thank you for your help!


  • #2
    Perhaps you want
    Code:
    replace ist = substr( ist, strpos(ist,"] ")+2, . )
    which will delete everything from the start of the string through the first right bracket and the space that follows the bracket.
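
    As a quick check of the offset arithmetic (a minimal sketch on a made-up string, not the poster's data): strpos() returns the position of the "]" itself, so adding 1 starts the result at the space after the bracket, and adding 2 starts it at the first character after that space.

    ```stata
    * hypothetical test string; in "[a] b; [c] d" the first "] " starts at position 3
    display strpos("[a] b; [c] d", "] ")
    * +1 keeps the space that follows the bracket
    display "|" + substr("[a] b; [c] d", strpos("[a] b; [c] d", "] ") + 1, .) + "|"
    * +2 drops that space as well
    display "|" + substr("[a] b; [c] d", strpos("[a] b; [c] d", "] ") + 2, .) + "|"
    ```

    Either way only the first bracketed group is removed; "[c]" is still present in the result.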


    • #3
      Thank you William for your reply.
      I have already tried that code too, but I need to get rid of everything that is in all the brackets in the string. I can use the code you suggested until I see 0 real changes made, but I was hoping to find a more straightforward solution.


      • #4
        Dear Ylenia,

        please consider reading section 12 of the FAQ (and the rest of the FAQ as well, of course) on how to create a minimal (possibly imaginary) data example that represents your data structure using -dataex- (SSC). Such an example greatly helps readers who are willing to answer your question to understand the problem. I assume the main reason your question has received only one answer so far is the absence of such a data example.

        William Lisowski's answer is an absolutely correct fix to the code you presented; your code just did not match what you wanted to achieve.

        I suggest taking a step back and cleaning up the data source before you import your data into Stata; what you presented looks very much as if several pieces of information (names and institutional addresses) are mixed together; possibly it was imported from an HTML file, as some issues with special characters seem to indicate. So my recommendation is: go back to the original parser, clean up, and then import into Stata.

        That said, I created a minimal data example for you and wrote a short while-loop to do what you (presumably) want to do:
        Code:
        version 14
        clear
        input str1136(data)
        `"[Rabhi, Sameh; Rabhi, Imen; Guizani-Tabbane, Lamia] Univ Tunis El Manar, Inst Pasteur Tunis, Lab Parasitol Med Biotechnol & Biomol, 13,Pl Pasteur BP 74, Tunis 1002, Tunisia; [Rabhi, Sameh] Univ Carthage, Ave Republ BP 77, Carthage 1054, Tunisia; [Rabhi, Imen] Univ Manouba, Biotechnol & Biogeo Resources Valorizat Lab LR11E, Sidi Thabet 2020, Ariana, Tunisia; [Rabhi, Imen] Univ Manouba, Higher Inst Biotechnol, Sidi Thabet 2020, Ariana, Tunisia; [Trentin, Bernadette; Piquemal, David] Acobiom Cap Delta Biopole Euromed II, 1682 Rue Valsiere, F-34184 Montpellier 4, France; [Regnault, Beatrice] Inst Pasteur Paris, Genopole, DNA Chip Platform, 25-28 Rue Dr Roux, F-75015 Paris, France; [Goyard, Sophie; Lang, Thierry] Inst Pasteur, Dept Infect & Epidemiol, Lab Proc Infect Trypanosomatides, 26 Rue Dr Roux, F-75724 Paris 15, France; [Descoteaux, Albert] INRS Inst Armand Frappier, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Descoteaux, Albert] Ctr Host Parasite Interact, 531 Blvd Prairies, Laval, PQ H7V 1B7, Canada; [Enninga, Jost] Inst Pasteur, Dynam Host Pathogen Interact Unit, 25 Rue Dr Roux, F-75724 Paris, France"'
        end
        
        // count how many observations contain some text inside brackets
        quietly : count if (ustrregexm(data,`"(.*)\[.*\] (.*)"'))
        // repeat the following as long as the previous count is not 0
        while r(N)>0 {
            // remove one bracket-pair (and its content) from the string; because the leading (.*) is greedy, this is the last pair that is followed by a space
            replace data=ustrregexs(1)+ustrregexs(2) if (ustrregexm(data,`"(.*)\[.*\] (.*)"'))
            // count how many observations contain some text inside brackets (this step at the end of the while-loop is essential!)
            quietly : count if (ustrregexm(data,`"(.*)\[.*\] (.*)"'))
        }    // <-- end of while-loop
        I think that working with regular expressions to remove the bracket texts is the most straightforward way to do it; note that my code uses the unicode-aware regular expression functions of Stata 14 or newer, hence the "version 14" statement at the beginning. To understand these regular expression functions, have a look at the corresponding help file (and the linked PDF documentation).

        Regards
        Bela
        Last edited by Daniel Bela; 30 Sep 2017, 05:58. Reason: added link to Stata help for regular expressions


        • #5
          ustrregexra() replaces all substrings within the string that match:

          Code:
          replace data = ustrregexra(data, "\[.*?\]" , "" ) if strpos(data,"[")
          
          * maybe trimming blanks
          
          replace data = trim(itrim(data))
          compress
          ICU's Regular Expressions: http://userguide.icu-project.org/strings/regexp
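
          A small illustration of why the "?" matters, on a made-up string (a sketch that assumes Stata 14 or newer, where ustrregexra() is available): ".*?" is lazy and stops at the nearest "]", whereas ".*" is greedy and runs to the last "]" in the string.

          ```stata
          * lazy: each bracketed group is removed separately
          display ustrregexra("A[bbb]C[ddd]E", "\[.*?\]", "")
          * greedy: one match spans from the first "[" to the last "]", so "C" is lost too
          display ustrregexra("A[bbb]C[ddd]E", "\[.*\]", "")
          ```

          The first line should display ACE and the second AE.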
          Last edited by Bjarte Aagnes; 30 Sep 2017, 07:33.


          • #6
            Bjarte Aagnes -

            Many sincere thanks for the reference on ICU's regular expressions, upon which Stata's are based.

            I apparently missed the topic you answered in early September, because the "*?" syntax you discussed here was, until now, unfamiliar to me. ICU is referenced nowhere in the official Stata documentation, sorry to say, nor is equivalent information given.

            I have added the link you provided as a new post on the end of an earlier topic on learning to use regular expressions.


            • #7
              Hi Daniel, thank you for your suggestion, I will read the FAQ. I was surprised by the confusion, and then I realized that I had copied the wrong line of code into the first post. That line was the one I used to get rid of the text in the last part of the string. The data are in a txt file, and because semicolons separate both names and affiliations, I have to keep them in one variable. The special characters are fine in the data; something went wrong when I copied and pasted into the forum. Unfortunately, I am still stuck with Stata 13, so I guess I cannot run your code. I have a (most probably) naive question. Why can't I use the subinstr function with the wildcard * and do something like this? replace ist = subinstr(ist, "[*]", " ", .)


              • #8
                Why can't I use the subinstr function with the wildcard * and do something like this? replace ist = subinstr(ist, "[*]", " ", .)
                Because nothing in the documentation for the subinstr function found in the output of the help subinstr command suggests that any sort of wildcard is supported.


                • #9
                  Ylenia, An answer to your question in post #7:

                  subinstr() does an exact string match, and the asterisk is treated like any other character.
                  Code:
                  display subinstr("abc[*]efg", "[*]", "_", 1)
                  abc_efg
                  Since you use Stata 13, I think you must use strpos(), substr() and subinstr() repeatedly to strip off the bracketed strings:
                  Code:
                  version 13
                  clear all
                  set obs 3  
                  gen str = "A[bbb]C[ddd]"
                  replace str = _n * str
                  
                  gen new = str
                  
                  qui while ( r(N) > 0 ) {
                  
                      count if regexm( new , "\[.*\]")
                      
                      replace new = subinstr(new,substr(new,strpos(new,"["),1+strpos(new,"]")-strpos(new,"[")),"",.)
                  }
                  
                  list
                  Code:
                       +-----------------------------------------------+
                       |                                  str      new |
                       |-----------------------------------------------|
                    1. |                         A[bbb]C[ddd]       AC |
                    2. |             A[bbb]C[ddd]A[bbb]C[ddd]     ACAC |
                    3. | A[bbb]C[ddd]A[bbb]C[ddd]A[bbb]C[ddd]   ACACAC |
                       +-----------------------------------------------+
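
                  A possible variant of the loop above, offered as a sketch rather than a tested replacement: it sets the count explicitly before the first pass instead of relying on whatever r(N) holds on entry, and it only touches observations in which the first "]" comes after the first "[". The variable new2 and the local macro ok are hypothetical names introduced here, and the block is meant to run after the example data above has been created.

                  ```stata
                  * new2 and ok are illustrative names, not part of the code above
                  gen new2 = str
                  * condition: a "[" exists and the first "]" lies to its right
                  local ok strpos(new2, "[") > 0 & strpos(new2, "]") > strpos(new2, "[")
                  quietly count if `ok'
                  while r(N) > 0 {
                      * cut the substring from the first "[" through the first "]"
                      quietly replace new2 = subinstr(new2, substr(new2, strpos(new2, "["), 1 + strpos(new2, "]") - strpos(new2, "[")), "", .) if `ok'
                      * recount so the loop stops once no observation qualifies
                      quietly count if `ok'
                  }
                  list
                  ```

                  Strings in which a stray "]" precedes the first "[", such as "a]b[c]", are left unchanged by this variant instead of keeping the loop running.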
                  Last edited by Bjarte Aagnes; 30 Sep 2017, 13:01.


                  • #10
                    THANK YOU!!! This code works perfectly!


                    Originally posted by Bjarte Aagnes

                    Code:
                    gen new = str
                    
                    qui while ( r(N) > 0 ) {
                    
                    count if regexm( new , "\[.*\]")
                    
                    replace new = subinstr(new,substr(new,strpos(new,"["),1+strpos(new,"]")-strpos(new,"[")),"",.)
                    }
