  • parsing with variant length strings; overcoming ustrregexra greediness

    I'm currently trying to parse text from string variables like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1964 FieldLabel
    `"<div style="background:tomato; padding:5px; margin:5px;"><font color="black"> <font size="5">Verify alpha codes are equal, values are not the same </font></strong></div>"'
    `"<div style="background:lightgrey; padding:5px; margin:5px;"><font color="black"><font size="5">Please answer the following questions. </font>"'                                   
    end
    In a former post of mine about a similar issue, William Lisowski recommended using -ustrregexra- before -split- to parse on variable-length substrings that are bookended by matching symbols. The only problem with that strategy here is that -ustrregexra- is "greedy": if I use "<" and ">" to mark the text to replace with a new parsing symbol, the match runs from the first "<" to the last ">", so the entire string is replaced. To illustrate,

    using:
    Code:
    g newtext = ustrregexra(FieldLabel,"<.*>","!!split!!")
    I want:
    Code:
    !!split!!!!split!!!!split!!Please verify PID codes are equal, values are not the same !!split!!!!split!!!!split!!
    but I get:
    Code:
    !!split!!
    Can anyone recommend a solution to the greedy problem, or perhaps another strategy entirely?

    Thank you!

    -Reese

    v.14.2

  • #2
    Instead of matching any character following the "<" you need to match any character that isn't ">".
    Code:
    g newtext = ustrregexra(FieldLabel,"<[^>]*>","!!split!!")
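    For readers following along, here is a minimal sketch of how the rest of the workflow might look. It is not part of the original exchange: the variable names newtext2 and newtext3 and the stub "part" are made up for illustration, and it assumes that -ustrregexra- accepts the lazy quantifier ".*?" (it is documented as using ICU regular expressions, where that syntax is standard). The lazy form is an alternative to the negated character class above, and collapsing adjacent markers before -split- is just one way to avoid a run of empty pieces.
    Code:
    * Alternative to "<[^>]*>": the lazy quantifier stops at the first ">"
    g newtext2 = ustrregexra(FieldLabel, "<.*?>", "!!split!!")
    * Collapse runs of adjacent markers into a single marker
    g newtext3 = ustrregexra(newtext2, "(!!split!!)+", "!!split!!")
    * Split on the marker; the new variables are named part1, part2, ...
    split newtext3, parse("!!split!!") generate(part)
    Because each example string begins with a tag, the first new variable may well be empty, so check the pieces before relying on their positions.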



    • #3
      Beautiful, thank you William Lisowski !
