
  • Extracting commas between specific characters

    Good afternoon, Statalist users,

    I have a database that looks like the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str100 Politician
    "Arthur Alexander (Wilson Center), Hugh Patrick (Wilson Center)"
    "Milt Drucker (Director Summit of the Americas, U.S. Department of State)"
    "John Sequeira (State Department, Office of Southern Africa Affairs), Ltc Greg Saunders (DOD, Office "
    "Mr. Nakamura (president, Kawasaki Steel America)"
    "Steven A. Thompson (Professional Staff Member), Roger Smith (House National Security Committee (HNSC"
    "S. Chami of Kenya (Ambassador ), Mr. W. Imakando (Embassy of Zambia), Mr. S. Dlamini (Embassy of Swa"
    "Randi Sutton,Texas Governor George Bush, Cheryl Parker Rose, Florida Governor Lawton Chiles (D)."
    "Matt Mcmanus (Acting Division Chief, Energy-Producer Country Affairs, U.S. Department of State)"
    end
    I need to separate the names in each string. My first thought was to split on commas, but as you can see, some commas fall inside parentheses, and splitting on those would break up text that is not a name. So I need to replace the commas inside parentheses with another character and then split the string.

    Does anyone know how to replace commas only when they are inside parentheses?

    I've tried this, but it didn't work:

    gen var2 = trim(regexr(Politician," \( (,)+\) *",""))



    Thank you very much.


    Last edited by Nicolas Echeverry; 24 Feb 2019, 10:14.

  • #2
    Here's a solution in the direction of what I think you want for an end result, but which does not use regular expressions. (Nothing wrong with the regex technique, but often it's not needed.) Rather than use your strategy, my approach is to repeatedly find and remove parenthesized material in the given string until there is no such material left in any observation. This code will:
    1) Create a copy of the given string with the material between parens removed. What remains is assumed to be a comma-separated list of names, and so is -split- into separate name variables.
    2) Save each chunk of parenthesized material in its own variable for whatever further use.
    I don't know if the following is entirely robust with respect to unmatched parentheses.

    Code:
    local done = 0
    gen int pos1 = .
    gen int pos2 = .
    local i 1
    gen s = Politician
    // Successively find material between parens, save it, and remove it from consideration
    while (`done' == 0) {
        replace pos1 = strpos(s, "(")
        replace pos2 = strpos(s, ")" )
        gen paren`i' = substr(s, pos1, pos2-pos1 + 1)
        replace s = subinstr(s, paren`i', "", 1) if pos1 > 0
        cap assert (pos2 == 0 )  | (pos1 == 0)
        local done = (_rc == 0)
        local ++i
    }
    // Save and remove any trailing material, as in observation 3
    gen paren`i' = substr(s, pos1, .) if (pos1 >  0)
    // What if an unpaired ")" ?
    gen byte unpaired = (pos2 > 0)
    replace s = subinstr(s, paren`i', "", .)
    split s,  parse(",") gen(name)
    list name*
    list paren*



    • #3
      Dear Mark, thank you for your reply. This code is working perfectly. Thank you very much.



      • #4
        Since Stata 14, Stata has had a set of ICU-based regular expression functions: ustrregexm(), ustrregexrf(), and ustrregexra(). ICU regular expressions support the negative look-ahead assertion (?! ... ), which succeeds if the parenthesized pattern does not match at the current input position, and does not advance the input position. You may use:

        Code:
        gen var2 = ustrregexra(Politician, ",(?![^\(\)]*\))", "|")
        This replaces all commas outside parentheses with "|" (assuming "|" does not appear in your data).
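
        To see what the look-ahead does, here is a minimal sketch on a single made-up string (the literal "A (x, y), B (z)" is only an illustration, not taken from the data). A comma is left alone when it is followed by non-parenthesis characters and then ")", which is how the pattern recognizes a comma inside parentheses:

        ```stata
        * Illustration only: the input string is hypothetical.
        * The comma inside "(x, y)" is followed by " y)" and so is kept;
        * the comma after ")" is followed by " B (" and so is replaced.
        display ustrregexra("A (x, y), B (z)", ",(?![^\(\)]*\))", "|")
        * displays: A (x, y)| B (z)
        ```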

        Then you can split var2 on "|", for example:

        Code:
        gen name1 = substr(var2, 1, strpos(var2, "|")-1)
        gen name2 = substr(var2, strpos(var2, "|")+1, .)
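
        The two lines above only separate the first name from the rest. If a string may hold more than two names, one possible alternative is to let -split- create as many variables as needed. This is a sketch that assumes var2 was created as above; the stub "part" is a hypothetical name chosen to avoid clashing with name1 and name2:

        ```stata
        * Sketch: split var2 at every "|" into part1, part2, ...
        * "part" is a hypothetical stub; pick any unused variable prefix.
        split var2, parse("|") generate(part)

        * Remove leading and trailing blanks left over from ", " separators
        foreach v of varlist part* {
            replace `v' = strtrim(`v')
        }
        list part*
        ```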



        • #5
          In addition, for your example data the following should work:
          Code:
          gen var3a = subinstr(Politician, "),",")|",.)
          split var3a , parse("|")
          or using a regex:
          Code:
          gen var3b = ustrregexra(Politician, "\)\s?," ,")|")
          split var3b , parse("|")
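
          Both variants rely on every name being followed by a parenthesized note. A rough way to check that assumption is to flag observations that still contain a comma outside parentheses, reusing the look-ahead pattern from #4. This is only a sketch; needs_review is a hypothetical variable name, and it assumes var3a exists as created above:

          ```stata
          * Sketch: flag rows where a comma remains outside parentheses
          * after "),"" has been replaced by ")|" (needs_review is hypothetical).
          gen byte needs_review = ustrregexm(var3a, ",(?![^\(\)]*\))")
          list Politician if needs_review
          ```

          Note that a comma inside a parenthesis that is never closed, as in the truncated third observation, is also flagged, because the pattern only treats a comma as inside parentheses when a ")" follows it.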

