
  • Extracting strings

    Hi there!

    I am working on cleaning up some text data, and have a quick question. I have the following data, from which I am trying to extract the information inside brackets. I currently use the -strpos- and -strrpos- functions to find the first and last brackets in a row, and extract the information using -substr-. However, if there are more than two sets of brackets in a row, this approach doesn't work. Could you please advise how to handle such cases?

    Kindly let me know if something is unclear.

    Thanks very much!
    Krishna

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str25 string
    "(ab)cd(e)"                
    "abcd(efg)"                
    "(abc)d(efg)i(jkl)"        
    "(abc)defg)i(jkl)"         
    "(abc)(defg)i(jkl)mno(pqr)"
    end
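
The approach described above is not shown in the post. A minimal sketch of it, assuming one -strpos- call for the first "(" and one -strrpos- call for the last ")" (the variable names are hypothetical), shows where it breaks down:

```stata
* Sketch only: a guess at the first/last-bracket approach described above.
clear
input str25 string
"abcd(efg)"
"(abc)d(efg)i(jkl)"
end

gen first_open = strpos(string, "(")     // position of the first "("
gen last_close = strrpos(string, ")")    // position of the last ")"

* take everything strictly between those two positions
gen inside = substr(string, first_open + 1, last_close - first_open - 1)

list string inside
```

With a single pair of brackets this returns "efg". With several pairs it returns everything between the outermost brackets, here "abc)d(efg)i(jkl", which is the failure described.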

  • #2
    Code:
    gen stringcopy=string
    replace stringcopy = subinstr(stringcopy, "(", "_",.)
    replace stringcopy = subinstr(stringcopy, ")", "_",.)
    split stringcopy, parse("_")
    
    forvalues i = 1(2)50{
    cap drop stringcopy`i'
    }
    ren stringcopy# stringpart#, renumber
    Note: does not work perfectly with your example because line 4 of your data has a closing bracket following another closing bracket.
    Should that be taken into account?



    • #3
      This works with your example. moss is from SSC.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str25 string
      "(ab)cd(e)"                
      "abcd(efg)"                
      "(abc)d(efg)i(jkl)"        
      "(abc)defg)i(jkl)"         
      "(abc)(defg)i(jkl)mno(pqr)"
      end
      
      moss string, match("(\([a-z]*\))") regex 
      
      quietly foreach v of var _match* { 
          replace `v' = subinstr(`v', "(", "", .) 
          replace `v' = subinstr(`v', ")", "", .)
      }
      
      list string _match* 
      
           +-------------------------------------------------------------------+
           |                    string   _match1   _match2   _match3   _match4 |
           |-------------------------------------------------------------------|
        1. |                 (ab)cd(e)        ab         e                     |
        2. |                 abcd(efg)       efg                               |
        3. |         (abc)d(efg)i(jkl)       abc       efg       jkl           |
        4. |          (abc)defg)i(jkl)       abc       jkl                     |
        5. | (abc)(defg)i(jkl)mno(pqr)       abc      defg       jkl       pqr |
           +-------------------------------------------------------------------+



      • #4

        Using moss: the opening and closing parentheses can be placed outside the regex subexpression, so that only the text inside them is captured:
        Code:
        moss string, match("\(([a-z]+)\)") regex
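
The effect of moving the parentheses can be checked without -moss-, using the built-in regexm() and regexs() functions. This is only a sketch on two of the example rows, it returns just the first match per row, and the variable names are made up for illustration:

```stata
* Sketch: compare the two patterns on the first match only.
clear
input str25 string
"(ab)cd(e)"
"abcd(efg)"
end

* pattern from #3: the literal parentheses are inside the subexpression
gen with_parens = regexs(1) if regexm(string, "(\([a-z]*\))")

* pattern from #4: the literal parentheses are outside the subexpression
gen inside = regexs(1) if regexm(string, "\(([a-z]+)\)")

list string with_parens inside
```

Here with_parens holds "(ab)" and "(efg)", while inside holds "ab" and "efg", so the subinstr() clean-up step from #3 is not needed.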
        Last edited by Bjarte Aagnes; 11 Jul 2019, 13:31.



        • #5
          Bjarte Aagnes Good point!



          • #6
            These suggestions work great!

            Thanks so much, Bjarte Aagnes and Nick Cox -- really appreciate the help!



            • #7
              Nick Cox and Bjarte Aagnes, just a quick follow-up thought on the suggested solution. What if I were to use -moss- on a dataset that has non-English (or a mix of English and non-English) strings? In other words, how do I extract any information inside the brackets without specifying "[a-z]", "[0-9]", etc.? I tried using "*" but it doesn't work.

              Thanks!



              • #8
                No free lunch here. If any character is allowed, you won't get the individual strings extracted. Instead you'll get the maximal substring between the outermost parentheses, as running

                Code:
                moss string, match("\((.*)\)") regex
                will show. I would be silly to rule out an easy solution, but I can't think of one right now. What seems clear is that split can't help either, given junk in your string that you want to ignore.
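
The greedy behaviour can be reproduced with the built-in ustrregexm() and ustrregexs() functions (Stata 14 or later). The sketch below is an illustration only, not part of the original answer. It also tries a non-greedy quantifier, which ICU regular expressions support, to show that the match then stops at the first closing parenthesis. Variable names are hypothetical.

```stata
* Sketch: greedy versus non-greedy matching of "anything in parentheses".
version 14
clear
input str25 string
"(ab)cd(e)"
"(abc)d(efg)i(jkl)"
"(abc)defg)i(jkl)"
end

* greedy: .* runs on to the last ")" in the string
gen greedy = ustrregexs(1) if ustrregexm(string, "\((.*)\)")

* non-greedy: .*? stops at the first ")" that completes a match
gen lazy = ustrregexs(1) if ustrregexm(string, "\((.*?)\)")

list string greedy lazy
```

For the second row, greedy is "abc)d(efg)i(jkl" and lazy is "abc". Both functions return only the first match per row.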



                • #9
                  First, match any byte other than "(" or ")" by using ^ to negate the character class:
                  Code:
                  moss string, match("\(([^\)\(]+)\)") regex
                  Or, better, use the Unicode ICU regular expressions, which are also available in -moss-; the pattern below matches Unicode letters only:
                  Code:
                  moss string, match("\(([\p{Letter}]+)\)") regex unicode
                  Code:
                  clear
                  input str50 string
                  "(ab)cd(e)"                
                  "abcd(efg)"                
                  "(abc)d(efg)i(jkl)"        
                  "(abc)defg)i(jkl)"        
                  "(abc)(defg)i(jkl)mno(pqr)"
                  `"(.;!)("#$)i(zzz)"'
                  "(日本語)(简体中文)(ไทย)"
                  end
                  compress
                  
                  moss string, match("\(([^\)\(]+)\)") regex
                  
                  list string *match*
                  
                  keep string
                  
                  moss string, match("\(([\p{Letter}]+)\)") regex unicode
                  
                  list string *match*
                  Code:
                  . moss string, match("\(([^\)\(]+)\)") regex
                  
                  .
                  . list string *match*
                  
                       +--------------------------------------------------------------------+
                       |                    string   _match1    _match2   _match3   _match4 |
                       |--------------------------------------------------------------------|
                    1. |                 (ab)cd(e)        ab          e                     |
                    2. |                 abcd(efg)       efg                                |
                    3. |         (abc)d(efg)i(jkl)       abc        efg       jkl           |
                    4. |          (abc)defg)i(jkl)       abc        jkl                     |
                    5. | (abc)(defg)i(jkl)mno(pqr)       abc       defg       jkl       pqr |
                       |--------------------------------------------------------------------|
                    6. |          (.;!)("#$)i(zzz)       .;!        "#$       zzz           |
                    7. |   (日本語)(简体中文)(ไทย)    日本語   简体中文       ไทย           |
                       +--------------------------------------------------------------------+
                  
                  .
                  . keep string
                  
                  .
                  . moss string, match("\(([\p{Letter}]+)\)") regex unicode
                  
                  .
                  . list string *match*
                  
                       +--------------------------------------------------------------------+
                       |                    string   _match1    _match2   _match3   _match4 |
                       |--------------------------------------------------------------------|
                    1. |                 (ab)cd(e)        ab          e                     |
                    2. |                 abcd(efg)       efg                                |
                    3. |         (abc)d(efg)i(jkl)       abc        efg       jkl           |
                    4. |          (abc)defg)i(jkl)       abc        jkl                     |
                    5. | (abc)(defg)i(jkl)mno(pqr)       abc       defg       jkl       pqr |
                       |--------------------------------------------------------------------|
                    6. |          (.;!)("#$)i(zzz)       zzz                                |
                    7. |   (日本語)(简体中文)(ไทย)    日本語   简体中文       ไทย           |
                       +--------------------------------------------------------------------+
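
The difference between the two patterns can also be seen on the first match alone, using the built-in ustrregexm() and ustrregexs() functions. This is a sketch under assumptions: the second row is a simplified version of row 6 above (the quote and dollar sign are dropped to keep the input simple), and the variable names are made up.

```stata
* Sketch: "anything but parentheses" versus "Unicode letters only".
version 14
clear
input str30 string
"(ab)cd(e)"
"(.;!)i(zzz)"
"(日本語)(ไทย)"
end

* negated character class: accepts punctuation inside the parentheses
gen any_first = ustrregexs(1) if ustrregexm(string, "\(([^\)\(]+)\)")

* Unicode property Letter: skips groups that contain non-letters
gen letter_first = ustrregexs(1) if ustrregexm(string, "\(([\p{L}]+)\)")

list string any_first letter_first
```

On the second row any_first is ".;!" whereas letter_first is "zzz", which mirrors the difference in row 6 of the two listings above.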
                  Last edited by Bjarte Aagnes; 12 Jul 2019, 11:45.



                  • #10
                    Adding to #9, some regex references:
                    Stata regex support:
                    https://www.statalist.org/forums/for...79#post1327779
                    ICU regex:
                    http://userguide.icu-project.org/strings/regexp
                    Unicode regex:
                    https://www.regular-expressions.info/unicode.html


                    Extended example, including a regex that allows comments:
                    Code:
                    version 14
                    
                    clear
                    input str50 string
                    "(ab)cd(e)"                
                    "abcd(efg)"                
                    "(abc)d(efg)i(jkl)"        
                    "(abc)defg)i(jkl)"        
                    "(abc)(defg)i(jkl)mno(pqr)"
                    `"(.;!)("#$)i(zzz)"'
                    "(日本語)(简体中文)(ไทย)"
                    end
                    compress
                    
                    ********************************************************************************
                    * define string scalar with regex allowing comments
                    ********************************************************************************
                    
                    * define local for newline character(s)
                    
                    local nl = cond( c(os)=="Windows", char(13) + char(10) , char(10) ) 
                    
                    #delim;  /* newlines `nl' must be inserted */
                    
                    scalar sc_regxp_letters =
                    
                    "(?x)     # SET flag UREGEX_COMMENTS Allow white space and comments      `nl'
                              # PATTERN TO MATCH:                                            `nl'
                    \(        # match opening parenthesis                                    `nl'  
                      (       #  START Capturing Group (sub-expression)                      `nl' 
                      [\p{L}] #   match character class: Unicode property Letter             `nl'
                      +?      #   quantifier (+) one or more, (?) non-greedy                 `nl'
                      )       #  END Capturing Group (sub-expression)                        `nl' 
                    \)        # match closing parenthesis                                     
                    " 
                    
                    ;
                    #delim cr
                    
                    ********************************************************************************
                    * use regex to parse
                    ********************************************************************************
                    
                    tempvar ustring
                    gen `ustring' = string
                    
                    qui forvalues i = 1/1000 {
                    
                        gen m`i' = ustrregexs(1) if ustrregexm(`ustring', sc_regxp_letters )
                        
                        replace `ustring' = usubinstr(`ustring',  "(" + m`i' + ")", "", 1)
                        
                        if ( ustrregexs(1) == "" ) {
                        
                            continue, break
                        }
                    }
                    
                    list
                    
                    
                    ********************************************************************************
                    * use regex to parse using -moss-
                    ********************************************************************************
                    
                    * moss fails when given the pattern stored in scalar sc_regxp_letters,
                    * so strip the (?x) flag, the comments and the white space first
                    
                    local regxp_letters = subinstr(sc_regxp_letters,"(?x)","",1) 
                    local regxp_letters = ustrregexra(`"`regxp_letters'"', ///
                        /* (?m) set UREGEX_MULTILINE */ "(?m)#.+?$","")
                    local regxp_letters = ustrregexra(`"`regxp_letters'"',"[\s]+","")
                    
                    di _n as txt `"`regxp_letters'"'
                    
                    moss string, match(`"`regxp_letters'"') regex unicode
                    
                    list string _match*
                    
                    exit
                    Results:
                    Code:
                    . qui forvalues i = 1/1000 {
                    
                    . list
                    
                         +-------------------------------------------------------------------------+
                         |                    string      __000000       m1         m2    m3    m4 |
                         |-------------------------------------------------------------------------|
                      1. |                 (ab)cd(e)            cd       ab          e             |
                      2. |                 abcd(efg)          abcd      efg                        |
                      3. |         (abc)d(efg)i(jkl)            di      abc        efg   jkl       |
                      4. |          (abc)defg)i(jkl)        defg)i      abc        jkl             |
                      5. | (abc)(defg)i(jkl)mno(pqr)          imno      abc       defg   jkl   pqr |
                         |-------------------------------------------------------------------------|
                      6. |          (.;!)("#$)i(zzz)   (.;!)("#$)i      zzz                        |
                      7. |   (日本語)(简体中文)(ไทย)                 日本語   简体中文   ไทย       |
                         +-------------------------------------------------------------------------+
                    Code:
                    . di _n as txt `"`regxp_letters'"'
                    
                    \(([\p{L}]+?)\)
                    
                    . moss string, match(`"`regxp_letters'"') regex unicode
                    
                    . list string _match*
                    
                         +--------------------------------------------------------------------+
                         |                    string   _match1    _match2   _match3   _match4 |
                         |--------------------------------------------------------------------|
                      1. |                 (ab)cd(e)        ab          e                     |
                      2. |                 abcd(efg)       efg                                |
                      3. |         (abc)d(efg)i(jkl)       abc        efg       jkl           |
                      4. |          (abc)defg)i(jkl)       abc        jkl                     |
                      5. | (abc)(defg)i(jkl)mno(pqr)       abc       defg       jkl       pqr |
                         |--------------------------------------------------------------------|
                      6. |          (.;!)("#$)i(zzz)       zzz                                |
                      7. |   (日本語)(简体中文)(ไทย)    日本語   简体中文       ไทย           |
                         +--------------------------------------------------------------------+
                    
                    .
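
The extract-and-remove loop above can also be written so that it stops when no observation has a match left. The following is a variant sketch, not the code that produced the results above: it removes each match with ustrregexrf() instead of usubinstr(), and the names work, re and found are hypothetical.

```stata
* Sketch: repeat "take first match, then delete it" until nothing matches.
version 14
clear
input str30 string
"(ab)cd(e)"
"abcd(efg)"
"(abc)defg)i(jkl)"
end

local re "\(([\p{L}]+?)\)"      // letters inside balanced parentheses
gen work = string               // working copy that gets consumed
local i = 0
local found = 1

while `found' {
    local ++i
    quietly gen m`i' = ustrregexs(1) if ustrregexm(work, "`re'")
    quietly count if m`i' != ""
    local found = r(N) > 0
    if `found' {
        * delete the first match so the next pass finds the following one
        quietly replace work = ustrregexrf(work, "`re'", "")
    }
    else {
        drop m`i'               // last pass found nothing in any row
    }
}

drop work
list
```

On these three rows the loop leaves m1 and m2. The third row yields "abc" and "jkl", and the unbalanced "defg)" is skipped, as in the results above.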



                    • #11
                      Thanks so much, Bjarte Aagnes -- this is all super helpful! In my example, line 4 has a close bracket without its corresponding open bracket. These are cases where the brackets have been improperly specified in the data. Assuming I know exactly what to extract in such cases, how could one adapt your solution? For example, I want to extract "efg" (just like in row 3) from row 4 as well. The same is true for lines with an open bracket but no close bracket -- see the new row I have added, where I would like to pick up "uvw".

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input str50 string
                      "(ab)cd(e)"                           
                      "abcd(efg)"                           
                      "(abc)d(efg)i(jkl)"                   
                      "(abc)defg)i(jkl)"                    
                      "(abc)(defg)i(jkl)mno(pqr)"           
                      `"(.;!)("#$)i(zzz)"'                  
                      "(日本語)(简体中文)(ไทย)"
                      "mno(pq)rs(uvwx(yz)"                  
                      end



                      • #12
                        Below are two suggestions using moss that I think fit your current description and data example.

                        A) Runs moss on the balanced matches first, then again on the rest of the string. This does not keep the matched strings in their original order (but the order may be recovered from the _pos variables left by -moss-).
                        B) Uses a single new regex, (efg|uvw|\([\p{L}]+?\)). Here the balanced parentheses are part of the match and must be stripped off afterwards.

                        (I start vacation shortly, including a digital detox, so do not expect a fast follow-up from me in the next few weeks.)
                        Code:
                        version 14
                        
                        clear
                        input str50 string
                        "(ab)cd(e)"                
                        "abcd(efg)"                
                        "(abc)d(efg)i(jkl)"        
                        "(abc)defg)i(jkl)"        
                        "(abc)(defg)i(jkl)mno(pqr)"
                        `"(.;!)("#$)i(zzz)"'
                        "(日本語)(简体中文)(ไทย)"
                        "mno(pq)rs(uvw(yz)"
                        end
                        compress
                        
                        rename string stringvar
                        
                        ********************************************************************************
                        * store regex to match in scalars
                        ********************************************************************************
                        
                        scalar sc_rxp_letters_balanced    = `" "\(([\p{L}]+?)\)" "'
                        
                        scalar sc_rxp_string_opening      = `" "[\(](efg|uvw)[^\)]" "'    
                        
                        scalar sc_rxp_string_closing      = `" "[^\(](efg|uvw)[\)]"  "'  
                        
                        
                        * ad hoc: to avoid regex complications when matching strings at ends,
                        * add some chars at ends
                        
                        replace stringvar = "_" + stringvar + "_"
                        
                        
                        ********************************************************************************
                        * (A) moss repeat on rest of string left after first regex
                        ********************************************************************************
                        
                        moss stringvar , match(`=sc_rxp_letters_balanced') regex unicode pre(_0_)
                        
                        * create rest of string left after first regex
                        
                        gen rest = stringvar , after(stringvar)
                        local rest "rest"
                        
                        qui foreach v of varlist _0_m* {
                        
                            replace `rest' = subinstr(`rest', "(" + `v' + ")" ,"",1)
                        }
                        
                        moss `rest' , match(`=sc_rxp_string_opening')   regex unicode pre(_1_)
                        moss `rest' , match(`=sc_rxp_string_closing')   regex unicode pre(_2_)
                        
                        rename ?#?match# ?#?m#
                        list stringvar `rest' *m*
                        
                        ********************************************************************************
                        * (B) moss new regex "(efg|uvw|\([\p{L}]+?\))"
                        ********************************************************************************
                        
                        keep stringvar
                        
                        moss stringvar , match("(efg|uvw|\([\p{L}]+?\))") regex unicode pre(_n_)
                        
                        * list stringvar *m*
                        * strip off parentheses from matches like "(pqr)"    
                        
                        qui foreach v of varlist *match* {
                        
                            replace `v' = subinstr(`v', "(", "", .)
                            replace `v' = subinstr(`v', ")", "", .)
                        }
                        
                        rename ???match# ???m#
                        list stringvar *m*
                        
                        ********************************************************************************
                        exit
                        Results (A):
                        Code:
                        . list stringvar `rest' *m*
                        
                             +-------------------------------------------------------------------------------------------------+
                             |                   stringvar            rest    _0_m1      _0_m2   _0_m3   _0_m4   _1_m1   _2_m1 |
                             |-------------------------------------------------------------------------------------------------|
                          1. |                 _(ab)cd(e)_            _cd_       ab          e                                 |
                          2. |                 _abcd(efg)_          _abcd_      efg                                            |
                          3. |         _(abc)d(efg)i(jkl)_            _di_      abc        efg     jkl                         |
                          4. |          _(abc)defg)i(jkl)_        _defg)i_      abc        jkl                             efg |
                          5. | _(abc)(defg)i(jkl)mno(pqr)_          _imno_      abc       defg     jkl     pqr                 |
                             |-------------------------------------------------------------------------------------------------|
                          6. |          _(.;!)("#$)i(zzz)_   _(.;!)("#$)i_      zzz                                            |
                          7. |     _(日本語)(简体中文)(ไทย)_              __    日本語     简体中文     ไทย                         |
                          8. |         _mno(pq)rs(uvw(yz)_     _mnors(uvw_       pq         yz                     uvw         |
                             +-------------------------------------------------------------------------------------------------+
                        Results (B):
                        Code:
                        . list stringvar *m*
                        
                             +-----------------------------------------------------------------+
                             |                   stringvar    _n_m1      _n_m2   _n_m3   _n_m4 |
                             |-----------------------------------------------------------------|
                          1. |                 _(ab)cd(e)_       ab          e                 |
                          2. |                 _abcd(efg)_      efg                            |
                          3. |         _(abc)d(efg)i(jkl)_      abc        efg     jkl         |
                          4. |          _(abc)defg)i(jkl)_      abc        efg     jkl         |
                          5. | _(abc)(defg)i(jkl)mno(pqr)_      abc       defg     jkl     pqr |
                             |-----------------------------------------------------------------|
                          6. |          _(.;!)("#$)i(zzz)_      zzz                            |
                          7. |     _(日本語)(简体中文)(ไทย)_     日本語    简体中文     ไทย         |
                          8. |         _mno(pq)rs(uvw(yz)_       pq        uvw      yz         |
                             +-----------------------------------------------------------------+
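
The reason for padding the strings with "_" in (A) can be seen with the built-in ustrregexm() function: the pattern for an unbalanced closing parenthesis requires one character other than "(" in front of the target string, so a target at the very start of a string would otherwise be missed. The sketch below uses made-up input rows and a hypothetical variable name.

```stata
* Sketch: why (A) adds a character at each end before matching.
version 14
clear
input str20 s
"efg)i(jkl)"
"_efg)i(jkl)_"
end

* [^\(] must consume one character before efg or uvw
gen found = ustrregexs(1) if ustrregexm(s, "[^\(](efg|uvw)[\)]")

list s found
```

Without the padding, found is empty in the first row. With it, found is "efg" in the second row.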
                        Last edited by sladmin; 16 Jul 2019, 09:17. Reason: BBCode fix
