  • How to delete everything in different parentheses for a string variable

    Dear Statalists,
    I want to delete everything inside the parentheses in a string variable, where an observation can contain several pairs of parentheses. I found this topic: http://www.stata.com/statalist/archi.../msg00312.html
    However, the code there deletes everything after the first parenthesis. My string variable (Info) looks like this:
    ID Info
    1 carrot 4% (eating and growing), banana 6%, apple (eating, growing, etc.) 10%, orange 5%
    2 banana 5%
    3 pineapple 10% (not including xxx), others 5%
    I want to remove all the additional information in parentheses from the Info variable before splitting it on commas (",").
    Could anyone help me with this?
    Thank you very much!
    Last edited by Ann Ng; 10 Jun 2017, 21:54.

  • #2
    Hi Ann,

    The code below works. One quick thing: when sharing data on Statalist, it'd be great if you could use the command dataex (which you can install by running ssc install dataex). It lets people replicate your problem in Stata itself and figure out code that helps you.

    Code:
    //FLAG: this is an example of how data is shared using dataex
    clear
    input byte id str87 info
    1 "carrot 4% (eating and growing), banana 6%, apple (eating, growing, etc.) 10%, orange 5%"
    2 "banana 5%"                                                                              
    3 "pineapple 10% (not including xxx), others 5%"                                          
    end
    
    //this removes everything within (and including) the parentheses
    //if the egen line doesn't run for you, you probably need to install egenmore (type ssc install egenmore)
    tempvar n
    egen `n' = noccur(info), string("(")
    summ `n', meanonly
    forvalues i = 1/`r(max)' {
        replace info = subinstr(info, substr(info, strpos(info, "("), strpos(info, ")")-(strpos(info, "("))+1), "",.)
    }
    
    //this splits the variable and removes leading and trailing spaces
    split info, p(",")
    forvalues i = 1/`r(nvars)' {
        replace info`i' = trim(info`i')
    }
    Last edited by Chris Larkin; 10 Jun 2017, 22:35.


    • #3
      Dear Chris,
      Thank you very much for the code, which works beautifully. I am sorry for not using dataex, although Nick also pointed this out to me once before :D. I promise to use dataex next time.


      • #4
        Here is an alternative using regular expressions:

        Code:
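        gen re = "\([A-Za-z0-9, \.]+\)"
        // a while condition is evaluated on the first observation only,
        // so count the matching rows across the whole dataset instead
        count if regexm(info, re)
        while r(N) > 0 {
            replace info = ltrim(regexr(info, re, "")) if regexm(info, re)
            count if regexm(info, re)
        }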
        Aside:
        I tried Chris's code with version 12.1 (the Stata version on my personal computer), and it works only for the first row of the dataset, not for the others.
        Chris's code does not take into account that the number of times "(" appears is not constant across observations, and it leaves the second and third rows empty.
        It works fine with version 14.2.

        I had to modify the following line to get it to work:
        Code:
        replace info = subinstr(info, substr(info, strpos(info, "("), ///
        strpos(info, ")")-(strpos(info, "("))+1), "",.) if strpos(info, "(") >0
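
        As a rough check of why the qualifier matters, here is a sketch that assumes the example data from #2, run before any parentheses have been removed (pos_open and pos_close are just illustrative variable names). strpos() returns 0 when the substring is not found, so for rows without parentheses both positions are 0 and the arguments passed to substr() are built from zeros:
        Code:
        // position of the first "(" and ")" in each row; 0 means "not found"
        gen byte pos_open  = strpos(info, "(")
        gen byte pos_close = strpos(info, ")")
        list id pos_open pos_close
        // rows listed with pos_open == 0 are the ones the -if- qualifier skips
        drop pos_open pos_close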
        Last edited by Christophe Kolodziejczyk; 11 Jun 2017, 05:09.


        • #5
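          There is a Unicode regular expression function, ustrregexra(), that replaces all matches in one call, unlike regexr(), which replaces only the first. You will need to use a non-greedy quantifier so that each replacement is limited to one matching pair of parentheses.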

          Code:
          clear
          input byte id str87 info
          1 "carrot 4% (eating and growing), banana 6%, apple (eating, growing, etc.) 10%, orange 5%"
          2 "banana 5%"                                                                              
          3 "pineapple 10% (not including xxx), others 5%"                                          
          end
          
          gen s = ustrregexra(info,"\(.+?\)","")
          split s, parse(,) gen(stub_)
          
          list stub_*
          and the results:
          Code:
          . list stub_*
          
               +--------------------------------------------------------+
               |         stub_1       stub_2        stub_3       stub_4 |
               |--------------------------------------------------------|
            1. |     carrot 4%     banana 6%    apple  10%    orange 5% |
            2. |      banana 5%                                         |
            3. | pineapple 10%     others 5%                            |
               +--------------------------------------------------------+
          
          .
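
          To illustrate the point about the quantifier, here is a small sketch on a made-up string (the string is an assumption for demonstration, not taken from the data above). It needs Stata 14 or later for ustrregexra(). The negated character class in the last line is another common way to stop at the first closing parenthesis, offered here as an alternative rather than as part of the solution above:
          Code:
          // greedy: .+ runs from the first "(" to the last ")", so " b " is removed as well
          display ustrregexra("a (x), b (y), c", "\(.+\)", "")
          // non-greedy: .+? stops at the nearest ")", so each pair is removed separately
          display ustrregexra("a (x), b (y), c", "\(.+?\)", "")
          // negated class: [^)]* cannot cross a ")", and * also matches an empty "()"
          display ustrregexra("a (x), b (y), c", "\([^)]*\)", "")
          Note that .+? requires at least one character between the parentheses, so an empty "()" would be left in place by the pattern with .+? but removed by the pattern with [^)]*.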


          • #6
            Hi Christophe Kolodziejczyk,

            Do you know why my code only works on the first row with version 12.1?

            I'm on 14.2 and it works fine, producing this output:

            Code:
            . list info1-info4
            
                 +----------------------------------------------------+
                 |         info1       info2        info3       info4 |
                 |----------------------------------------------------|
              1. |     carrot 4%   banana 6%   apple  10%   orange 5% |
              2. |     banana 5%                                      |
              3. | pineapple 10%   others 5%                          |
                 +----------------------------------------------------+
            So it definitely works across all rows.
