
  • regexm and regexs

    Hi everyone,

    I am trying to extract part of a text variable. My aim is to take the characters after "aa:" up to "f" (or up to a pre-specified word). Below is my simplified code:

    gen text = ""
    replace text = "aa: inffant bb: insp cc: 35 yrd old. dd: ee:acad" in 1
    replace text = "aa: infant ff:no fnote bb: insp cc: 35 yrd old. dd: ee:acad" in 2
    gen trial = ""
    replace trial = regexs(1) if regexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*)(f)")

    My output for this code is as follows:

    Row 1: inf
    Row 2: infant ff:no

    Why does this happen? I mean, why do I not get "in" for both of them, and why do I get "inf" for the first one but "infant ff:no" for the second?

    Thanks in advance for your help.

    Best,
    Ulas

  • #2
    Welcome to the Statalist forum. In the forum FAQ, there is good advice on how to post data, preferably using dataex and "CODE delimiters".

    In general, don't use regular expressions when they are not needed. With your data example and aim, you do not seem to need them. The subinstr(), substr(), and strpos() functions will do:
    Code:
    gen t1 = trim(subinstr(substr(text, 1, -1 + strpos(text, "f") ), "aa:", "", 1 ))
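    The same idea extends to stopping at a pre-specified word rather than at "f". What follows is only a minimal sketch: it assumes the variable text from the question is in memory, that the stop word is "bb:" (an illustrative choice), and that the stop word occurs after "aa:" in the row. The names stop, p0, p1, and t3 are hypothetical.

```stata
* Hypothetical stop word; change it to whatever marker ends the wanted text
local stop "bb:"

* Positions of the start marker and the stop word (0 if not found)
gen p0 = strpos(text, "aa:")
gen p1 = strpos(text, "`stop'")

* Keep what lies between "aa:" and the stop word, trimmed of blanks;
* rows without both markers in that order are left empty
gen t3 = trim(substr(text, p0 + 3, p1 - p0 - 3)) if p0 > 0 & p1 > p0
drop p0 p1
```

    Note that strpos() returns the position of the first occurrence only, so a stop word that also appears earlier in the string would need extra handling.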
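    I mean, why do I not get "in" for both of them
    Because "f" is included in the character class [a-z] and Stata's regexm(), regexr(), and regexs() functions are greedy, i.e. they match the longest possible string. The group ([a-z0-9.,&/: ]*) therefore runs as far as it can and backs off only to the last "f" that still lets the pattern match. This also explains your second question: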

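    why do I get "inf" for the first one but "infant ff:no" for the second?
    In row 1 the last "f" in the text is the second "f" of "inffant", so the greedy group returns "inf"; in row 2 the last "f" is the one that starts "fnote", so it returns "infant ff:no". From Stata 14 on, the new regular expression functions ustrregexm(), ustrregexrf(), ustrregexra(), and ustrregexs() (which use the ICU regular expression engine) support non-greedy regular expression operators: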

    *? Match 0 or more times. Match as few times as possible.
    +? Match 1 or more times. Match as few times as possible.
    So, your regular expression can be changed by using the non-greedy "*?":
    Code:
    gen t2 = ustrregexs(1) if ustrregexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*?)(f)")
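    To compare the two behaviours side by side, here is a minimal, self-contained sketch that rebuilds the two example rows and applies both patterns. The variable names greedy and lazy are illustrative, and the ustrregex functions require Stata 14 or later.

```stata
clear
input str80 text
"aa: inffant bb: insp cc: 35 yrd old. dd: ee:acad"
"aa: infant ff:no fnote bb: insp cc: 35 yrd old. dd: ee:acad"
end

* Greedy: the group grows as far as it can, then backs off to the last "f"
gen greedy = regexs(1) if regexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*)(f)")

* Non-greedy: the group stops before the first "f" (Stata 14+ only)
gen lazy = ustrregexs(1) if ustrregexm(text, "aa[. #:-]*([a-z0-9.,&/: ]*?)(f)")

list greedy lazy
```

    With these two rows, greedy should reproduce the "inf" and "infant ff:no" results reported in the question, while lazy should give "in" in both rows.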
    Also see this post on Stata's regular expressions.
    Last edited by Bjarte Aagnes; 10 Sep 2017, 08:23.
