  • Extract substring between nth and (n+1)th commas in a variable

    How can I extract a substring between the nth and (n+1)th commas in a variable?

    For example, consider ID = 3 and beta = "eight,nine,ten,eleven,twelve". How could I extract the substring between the 3rd and 4th commas? (Answer: "eleven")

    Code:
    clear
    input ID strL beta
    1 "one,two,three,four"
    2 "five,six,seven"
    3 "eight,nine,ten,eleven,twelve"
    end
    Please note this is a vastly simplified example of an 80,000+ observation dataset where I have as many as 1,000 commas in an observation of the variable beta. I am using Stata 16.1 on Windows 10.

    Many thanks!

  • #2
    Added in edit: I misread the question. The code below extracts the token before comma number ID rather than the token after comma number ID. To get the token after comma number ID, remove the "-1" in each of the two places it occurs.

    Code:
    . generate wanted = ustrregexrf(beta, "([^,]*,){"+string(ID-1)+"}([^,]*).*", "$2") ///
    >                 if ustrregexm(beta, "([^,]*,){"+string(ID-1)+"}([^,]*).*")
    (1 missing value generated)
    
    . list
    
         +---------------------------------------------+
         |  ID                           beta   wanted |
         |---------------------------------------------|
      1. |   1             one,two,three,four      one |
      2. |   2                 five,six,seven      six |
      3. |   3   eight,nine,ten,eleven,twelve      ten |
      4. | 666                            one          |
         +---------------------------------------------+
    The Unicode regular expression functions introduced in Stata 14 support a much more powerful regular expression syntax than the non-Unicode functions. To the best of my knowledge, only the Statalist post linked here documents that Stata's Unicode regular expression parser is the ICU regular expression engine described at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

    The replacement functions support "capture group" references in the substitution string. A capture group is a part of the regular expression enclosed in parentheses, and the text matched by each group is referenced in the substitution string as $1, $2, ... .
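
    Applying the correction from the edit note above (dropping the "-1"), a minimal sketch of the adjusted command might look like the following. The helper variable pattern is introduced here only for illustration and is not part of the original post. The data are the three observations from the question.

```stata
* Sketch: extract the token that follows comma number ID.
clear
input ID strL beta
1 "one,two,three,four"
2 "five,six,seven"
3 "eight,nine,ten,eleven,twelve"
end

* ([^,]*,){ID} consumes the first ID comma-terminated tokens;
* ([^,]*) is capture group 2, the token right after comma number ID.
* "pattern" is a hypothetical helper variable, built per observation.
generate pattern = "([^,]*,){" + string(ID) + "}([^,]*).*"

* Replace the whole match with capture group 2; observations without
* enough commas do not match and are left missing.
generate wanted = ustrregexrf(beta, pattern, "$2") if ustrregexm(beta, pattern)

list ID beta wanted
```

    For ID = 3 and beta = "eight,nine,ten,eleven,twelve" this yields "eleven", the answer requested in the original question. Building the pattern once in a variable avoids repeating the same expression in both function calls.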
    Last edited by William Lisowski; 30 Mar 2021, 20:06.


    • #3
      William, thank you very much! I would have had a very tough time coming up with this solution myself. I appreciate your time and attention! Thanks again.
