Extracting strings

Krishna Prasad Srinivasan

Join Date: Jul 2019

Posts: 13
#1


11 Jul 2019, 11:51

Hi there!

I am cleaning up some text data and have a quick question. I have the following data, from which I am trying to extract the information inside brackets. I currently use the -strpos- and -strrpos- functions to locate the brackets that appear first and last in a row, and then extract the information using -substr-. However, if there are more than two sets of brackets in a row, this approach doesn't work. Could you please suggest how to handle such cases?

Kindly let me know if something is unclear.

Thanks very much!
Krishna

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str25 string
"(ab)cd(e)"
"abcd(efg)"
"(abc)d(efg)i(jkl)"
"(abc)defg)i(jkl)"
"(abc)(defg)i(jkl)mno(pqr)"
end
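
For reference, the first-and-last approach described above might look roughly like the sketch below. This is only one reading of the description, not the poster's actual code; the variable names (o1, c1, o2, c2, first_piece, last_piece) are made up for illustration.

```stata
* Sketch only: extract the first and the last bracketed piece of each row
clear
input str25 string
"(ab)cd(e)"
"(abc)d(efg)i(jkl)"
end

* positions of the first "(" and first ")"
gen o1 = strpos(string, "(")
gen c1 = strpos(string, ")")
* positions of the last "(" and last ")"
gen o2 = strrpos(string, "(")
gen c2 = strrpos(string, ")")

* text between each pair of positions
gen first_piece = substr(string, o1 + 1, c1 - o1 - 1)
gen last_piece  = substr(string, o2 + 1, c2 - o2 - 1)

list string first_piece last_piece
```

In the second row the middle piece "efg" is never picked up, which is the limitation the question is about.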
Jorrit Gosens

Join Date: Jan 2015

Posts: 1019
#2

11 Jul 2019, 12:30

Code:

gen stringcopy = string
replace stringcopy = subinstr(stringcopy, "(", "_", .)
replace stringcopy = subinstr(stringcopy, ")", "_", .)
split stringcopy, parse("_")
forvalues i = 1(2)50 {
    cap drop stringcopy`i'
}
ren stringcopy# stringpart#, renumber

Note: this does not work perfectly with your example, because line 4 of your data has a closing bracket following another closing bracket.
Should that be taken into account?

Nick Cox

Join Date: Mar 2014
Posts: 35724
#3

11 Jul 2019, 12:50

This works with your example. moss is from SSC.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str25 string
"(ab)cd(e)"                
"abcd(efg)"                
"(abc)d(efg)i(jkl)"        
"(abc)defg)i(jkl)"         
"(abc)(defg)i(jkl)mno(pqr)"
end

moss string, match("(\([a-z]*\))") regex 

quietly foreach v of var _match* { 
    replace `v' = subinstr(`v', "(", "", .) 
    replace `v' = subinstr(`v', ")", "", .)
}

list string _match* 

     +-------------------------------------------------------------------+
     |                    string   _match1   _match2   _match3   _match4 |
     |-------------------------------------------------------------------|
  1. |                 (ab)cd(e)        ab         e                     |
  2. |                 abcd(efg)       efg                               |
  3. |         (abc)d(efg)i(jkl)       abc       efg       jkl           |
  4. |          (abc)defg)i(jkl)       abc       jkl                     |
  5. | (abc)(defg)i(jkl)mno(pqr)       abc      defg       jkl       pqr |
     +-------------------------------------------------------------------+
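
A small sketch of what else moss leaves behind, assuming the defaults described in its help file: besides the _match* variables, it creates _count (number of matches per row) and _pos* (match positions). The two-row dataset here is just a cut-down copy of the example above.

```stata
* moss is user-written; install it from SSC if it is not already present
capture ssc install moss

clear
input str25 string
"(ab)cd(e)"
"(abc)defg)i(jkl)"
end

moss string, match("(\([a-z]*\))") regex

* _count holds the number of matches found in each row
list string _count _pos* _match*
```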


Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#4

11 Jul 2019, 13:25

Using moss: the opening and closing parentheses can be placed outside the regex subexpression, so that only the text between them is captured:

Code:

moss string, match("\(([a-z]+)\)") regex
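
To see why this matters, here is a minimal sketch using Stata's built-in regexm() and regexs() rather than moss. The variable names whole and inner are illustrative only.

```stata
clear
input str25 string
"(ab)cd(e)"
end

* regexs(0) returns the whole match, parentheses included
gen whole = regexs(0) if regexm(string, "\(([a-z]+)\)")

* regexs(1) returns only the first subexpression, i.e. the text inside
gen inner = regexs(1) if regexm(string, "\(([a-z]+)\)")

list
```

With the literal parentheses escaped outside the capturing group, the captured text needs no further cleanup with subinstr().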

Last edited by Bjarte Aagnes; 11 Jul 2019, 13:31.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

11 Jul 2019, 13:52

Bjarte Aagnes Good point!
Krishna Prasad Srinivasan

Join Date: Jul 2019

Posts: 13
#6

11 Jul 2019, 19:06

These suggestions work great!

Thanks so much, Bjarte Aagnes and Nick Cox -- really appreciate the help!
Krishna Prasad Srinivasan

Join Date: Jul 2019

Posts: 13
#7

12 Jul 2019, 09:47

Nick Cox and Bjarte Aagnes, just a quick follow-up thought on the suggested solution. What if I were to use -moss- on a dataset that has non-English (or a mix of English and non-English) strings? In other words, how do I extract any information inside the brackets without specifying "[a-z]", "[0-9]", etc.? I tried using "*", but it doesn't work.

Thanks!
Nick Cox

Join Date: Mar 2014

Posts: 35724
#8

12 Jul 2019, 10:04

No free lunch here. If any character is allowed, you won't get specific strings extracted. You'll get the maximal substring between the outermost parentheses, as running

Code:

moss string, match("\((.*)\)") regex

will show. I would be silly to rule out an easy solution, but I can't think of one right now. What seems clear is that split can't help either, given junk in your string that you want to ignore.
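
The greedy behaviour described above can be checked without moss. This is a minimal sketch using the built-in regexm() and regexs(); the variable name greedy is made up for illustration.

```stata
clear
input str25 string
"(ab)cd(e)"
"(abc)d(efg)i(jkl)"
end

* .* is greedy: it runs from the first "(" to the last ")" in the row
gen greedy = regexs(1) if regexm(string, "\((.*)\)")

list
```

Each row yields one long substring spanning every bracket pair, not the separate pieces.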

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785
#9

12 Jul 2019, 10:49

First, match any byte other than ")" and "(" by using ^ to negate the character class:

Code:

moss string, match("\(([^\)\(]+)\)") regex

Or, better, use the Unicode ICU regular expressions that are also available in -moss-; the pattern below matches Unicode letters:

Code:

moss string, match("\(([\p{Letter}]+)\)") regex unicode

Code:

clear
input str50 string
"(ab)cd(e)"                
"abcd(efg)"                
"(abc)d(efg)i(jkl)"        
"(abc)defg)i(jkl)"        
"(abc)(defg)i(jkl)mno(pqr)"
`"(.;!)("#$)i(zzz)"'
"(日本語)(简体中文)(ไทย)"
end
compress

moss string, match("\(([^\)\(]+)\)") regex

list string *match*

keep string

moss string, match("\(([\p{Letter}]+)\)") regex unicode

list string *match*

Code:

. moss string, match("\(([^\)\(]+)\)") regex

.
. list string *match*

     +--------------------------------------------------------------------+
     |                    string   _match1    _match2   _match3   _match4 |
     |--------------------------------------------------------------------|
  1. |                 (ab)cd(e)        ab          e                     |
  2. |                 abcd(efg)       efg                                |
  3. |         (abc)d(efg)i(jkl)       abc        efg       jkl           |
  4. |          (abc)defg)i(jkl)       abc        jkl                     |
  5. | (abc)(defg)i(jkl)mno(pqr)       abc       defg       jkl       pqr |
     |--------------------------------------------------------------------|
  6. |          (.;!)("#$)i(zzz)       .;!        "#$       zzz           |
  7. |   (日本語)(简体中文)(ไทย)    日本語   简体中文       ไทย           |
     +--------------------------------------------------------------------+

.
. keep string

.
. moss string, match("\(([\p{Letter}]+)\)") regex unicode

.
. list string *match*

     +--------------------------------------------------------------------+
     |                    string   _match1    _match2   _match3   _match4 |
     |--------------------------------------------------------------------|
  1. |                 (ab)cd(e)        ab          e                     |
  2. |                 abcd(efg)       efg                                |
  3. |         (abc)d(efg)i(jkl)       abc        efg       jkl           |
  4. |          (abc)defg)i(jkl)       abc        jkl                     |
  5. | (abc)(defg)i(jkl)mno(pqr)       abc       defg       jkl       pqr |
     |--------------------------------------------------------------------|
  6. |          (.;!)("#$)i(zzz)       zzz                                |
  7. |   (日本語)(简体中文)(ไทย)    日本語   简体中文       ไทย           |
     +--------------------------------------------------------------------+
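
The two listings differ only in row 6, because the negated class accepts punctuation while \p{Letter} does not. A minimal sketch of the same contrast using the built-in ustrregexm() (Stata 14 or later) follows. The third pattern, which also accepts Unicode digits through \p{N}, is an extension that is not in the original post, and the one-row dataset and variable names are made up for illustration.

```stata
clear
input str25 string
"(.;!)(a1)(zzz)"
end

* anything except parentheses: the first match is the punctuation
gen any_but_parens = ustrregexs(1) if ustrregexm(string, "\(([^\)\(]+)\)")

* Unicode letters only: "(.;!)" and "(a1)" are skipped
gen letters_only = ustrregexs(1) if ustrregexm(string, "\(([\p{L}]+)\)")

* assumption for illustration: letters or digits
gen letters_digits = ustrregexs(1) if ustrregexm(string, "\(([\p{L}\p{N}]+)\)")

list
```

Note that ustrregexm() on its own returns only the first match per row; moss, or a loop, is still needed to collect all of them.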

Last edited by Bjarte Aagnes; 12 Jul 2019, 11:45.


Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

#10

13 Jul 2019, 09:42

Adding to #9, regex references:

On Stata regex support:

https://www.statalist.org/forums/for...79#post1327779

ICU regex:

http://userguide.icu-project.org/strings/regexp

Unicode regex:

https://www.regular-expressions.info/unicode.html

Extended example, including a regex that allows comments:

Code:

version 14

clear
input str50 string
"(ab)cd(e)"                
"abcd(efg)"                
"(abc)d(efg)i(jkl)"        
"(abc)defg)i(jkl)"        
"(abc)(defg)i(jkl)mno(pqr)"
`"(.;!)("#$)i(zzz)"'
"(日本語)(简体中文)(ไทย)"
end
compress

********************************************************************************
* define string scalar with regex allowing comments
********************************************************************************

* define local for newline character(s)

local nl = cond( c(os)=="Windows", char(13) + char(10) , char(10) ) 

#delim;  /* newlines `nl' must be inserted */

scalar sc_regxp_letters =

"(?x)     # SET flag UREGEX_COMMENTS Allow white space and comments      `nl'
          # PATTERN TO MATCH:                                            `nl'
\(        # match opening parenthesis                                    `nl'  
  (       #  START Capturing Group (sub-expression)                      `nl' 
  [\p{L}] #   match character class: Unicode property Letter             `nl'
  +?      #   quantifier (+) one or more, (?) non-greedy                 `nl'
  )       #  END Capturing Group (sub-expression)                        `nl' 
\)        # match closing parenthesis                                     
" 

;
#delim cr

********************************************************************************
* use regex to parse
********************************************************************************

tempvar ustring
gen `ustring' = string

qui forvalues i = 1/1000 {

    gen m`i' = ustrregexs(1) if ustrregexm(`ustring', sc_regxp_letters )
    
    replace `ustring' = usubinstr(`ustring',  "(" + m`i' + ")", "", 1)
    
    if ( ustrregexs(1) == "" ) {
    
        continue, break
    }
}

list


********************************************************************************
* use regex to parse using -moss-
********************************************************************************

* moss fails using the pattern stored in scalar sc_regxp_letters

local regxp_letters = subinstr(sc_regxp_letters,"(?x)","",1) 
local regxp_letters = ustrregexra(`"`regxp_letters'"', ///
    /* (?m) set UREGEX_MULTILINE */ "(?m)#.+?$","")
local regxp_letters = ustrregexra(`"`regxp_letters'"',"[\s]+","")

di _n as txt `"`regxp_letters'"'

moss string, match(`"`regxp_letters'"') regex unicode

list string _match*

exit

Results:

Code:

. qui forvalues i = 1/1000 {

. list

     +-------------------------------------------------------------------------+
     |                    string      __000000       m1         m2    m3    m4 |
     |-------------------------------------------------------------------------|
  1. |                 (ab)cd(e)            cd       ab          e             |
  2. |                 abcd(efg)          abcd      efg                        |
  3. |         (abc)d(efg)i(jkl)            di      abc        efg   jkl       |
  4. |          (abc)defg)i(jkl)        defg)i      abc        jkl             |
  5. | (abc)(defg)i(jkl)mno(pqr)          imno      abc       defg   jkl   pqr |
     |-------------------------------------------------------------------------|
  6. |          (.;!)("#$)i(zzz)   (.;!)("#$)i      zzz                        |
  7. |   (日本語)(简体中文)(ไทย)                 日本語   简体中文   ไทย       |
     +-------------------------------------------------------------------------+

Code:

. di _n as txt `"`regxp_letters'"'

\(([\p{L}]+?)\)

. moss string, match(`"`regxp_letters'"') regex unicode

. list string _match*

     +--------------------------------------------------------------------+
     |                    string   _match1    _match2   _match3   _match4 |
     |--------------------------------------------------------------------|
  1. |                 (ab)cd(e)        ab          e                     |
  2. |                 abcd(efg)       efg                                |
  3. |         (abc)d(efg)i(jkl)       abc        efg       jkl           |
  4. |          (abc)defg)i(jkl)       abc        jkl                     |
  5. | (abc)(defg)i(jkl)mno(pqr)       abc       defg       jkl       pqr |
     |--------------------------------------------------------------------|
  6. |          (.;!)("#$)i(zzz)       zzz                                |
  7. |   (日本語)(简体中文)(ไทย)    日本語   简体中文       ไทย           |
     +--------------------------------------------------------------------+

.


Krishna Prasad Srinivasan

Join Date: Jul 2019

Posts: 13
#11

14 Jul 2019, 12:13

Thanks so much, Bjarte Aagnes -- this is all super helpful! In my example, line 4 has a close bracket without its corresponding open bracket. These are cases where the brackets have been improperly specified in the data. Assuming I know exactly what to extract in such cases, how could one adapt your solution? For example, I want to extract "efg" (just like in row 3) from row 4 as well. The same is true for lines with an open bracket, but not close bracket -- check out the new row I have added where I would like to pick up on "uvw".

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 string
"(ab)cd(e)"
"abcd(efg)"
"(abc)d(efg)i(jkl)"
"(abc)defg)i(jkl)"
"(abc)(defg)i(jkl)mno(pqr)"
`"(.;!)("#$)i(zzz)"'
"(日本語)(简体中文)(ไทย)"
"mno(pq)rs(uvwx(yz)"
end

Bjarte Aagnes

Join Date: Apr 2014
Posts: 785

#12

15 Jul 2019, 08:58

Below are two suggestions using moss that I think fit your current description and data example.

A) Does not keep the ordering of the matched strings (but this may be recovered from the _pos variables left by -moss-).
B) Uses a new regex (efg|uvw|\([\p{L}]+?\)); balanced parentheses are part of the match and must be stripped off.

(I will shortly start my vacation, including a digital detox, so do not expect a fast follow-up on this from me in the next few weeks.)

Code:

version 14

clear
input str50 string
"(ab)cd(e)"                
"abcd(efg)"                
"(abc)d(efg)i(jkl)"        
"(abc)defg)i(jkl)"        
"(abc)(defg)i(jkl)mno(pqr)"
`"(.;!)("#$)i(zzz)"'
"(日本語)(简体中文)(ไทย)"
"mno(pq)rs(uvw(yz)"
end
compress

rename string stringvar

********************************************************************************
* store regex to match in scalars
********************************************************************************

scalar sc_rxp_letters_balanced    = `" "\(([\p{L}]+?)\)" "'

scalar sc_rxp_string_opening      = `" "[\(](efg|uvw)[^\)]" "'    

scalar sc_rxp_string_closing      = `" "[^\(](efg|uvw)[\)]"  "'  


* ad hoc: to avoid regex complications when matching strings at ends,
* add some chars at ends

replace stringvar = "_" + stringvar + "_"


********************************************************************************
* (A) moss repeat on rest of string left after first regex
********************************************************************************

moss stringvar , match(`=sc_rxp_letters_balanced') regex unicode pre(_0_)

* create rest of string left after first regex

gen rest = stringvar , after(stringvar)
local rest "rest"

qui foreach v of varlist _0_m* {

    replace `rest' = subinstr(`rest', "(" + `v' + ")" ,"",1)
}

moss `rest' , match(`=sc_rxp_string_opening')   regex unicode pre(_1_)
moss `rest' , match(`=sc_rxp_string_closing')   regex unicode pre(_2_)

rename ?#?match# ?#?m#
list stringvar `rest' *m*

********************************************************************************
* (B) moss new regex "(efg|uvw|\([\p{L}]+?\))"
********************************************************************************

keep stringvar

moss stringvar , match("(efg|uvw|\([\p{L}]+?\))") regex unicode pre(_n_)

* list stringvar *m*
* strip off parentheses from matches like "(pqr)"

qui foreach v of varlist *match* {

    replace `v' = subinstr(`v', "(", "", .)
    replace `v' = subinstr(`v', ")", "", .)
}

rename ???match# ???m#
list stringvar *m*

********************************************************************************
exit

Results (A):

Code:

. list stringvar `rest' *m*

     +-------------------------------------------------------------------------------------------------+
     |                   stringvar            rest    _0_m1      _0_m2   _0_m3   _0_m4   _1_m1   _2_m1 |
     |-------------------------------------------------------------------------------------------------|
  1. |                 _(ab)cd(e)_            _cd_       ab          e                                 |
  2. |                 _abcd(efg)_          _abcd_      efg                                            |
  3. |         _(abc)d(efg)i(jkl)_            _di_      abc        efg     jkl                         |
  4. |          _(abc)defg)i(jkl)_        _defg)i_      abc        jkl                             efg |
  5. | _(abc)(defg)i(jkl)mno(pqr)_          _imno_      abc       defg     jkl     pqr                 |
     |-------------------------------------------------------------------------------------------------|
  6. |          _(.;!)("#$)i(zzz)_   _(.;!)("#$)i_      zzz                                            |
  7. |     _(日本語)(简体中文)(ไทย)_              __    日本語     简体中文     ไทย                         |
  8. |         _mno(pq)rs(uvw(yz)_     _mnors(uvw_       pq         yz                     uvw         |
     +-------------------------------------------------------------------------------------------------+

Results (B):

Code:

. list stringvar *m*

     +-----------------------------------------------------------------+
     |                   stringvar    _n_m1      _n_m2   _n_m3   _n_m4 |
     |-----------------------------------------------------------------|
  1. |                 _(ab)cd(e)_       ab          e                 |
  2. |                 _abcd(efg)_      efg                            |
  3. |         _(abc)d(efg)i(jkl)_      abc        efg     jkl         |
  4. |          _(abc)defg)i(jkl)_      abc        efg     jkl         |
  5. | _(abc)(defg)i(jkl)mno(pqr)_      abc       defg     jkl     pqr |
     |-----------------------------------------------------------------|
  6. |          _(.;!)("#$)i(zzz)_      zzz                            |
  7. |     _(日本語)(简体中文)(ไทย)_     日本語    简体中文     ไทย         |
  8. |         _mno(pq)rs(uvw(yz)_       pq        uvw      yz         |
     +-----------------------------------------------------------------+

Last edited by sladmin; 16 Jul 2019, 09:17. Reason: BBCode fix
