parsing with variant length strings; overcoming ustrregexra greediness

Reese Crispen

Join Date: Jul 2018

Posts: 55
#1

01 Mar 2019, 10:09

I'm currently trying to parse text from string variables like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1964 FieldLabel
`"<div style="background:tomato; padding:5px; margin:5px;"><font color="black"> <font size="5">Verify alpha codes are equal, values are not the same </font></strong></div>"'
`"<div style="background:lightgrey; padding:5px; margin:5px;"><font color="black"><font size="5">Please answer the following questions. </font>"'
end

In an earlier post of mine about a similar issue, William Lisowski recommended using -ustrregexra- before -split- to parse on variable-length substrings that are bookended by matching symbols. The only problem with that strategy here is that -ustrregexra- is "greedy": if I use "<" and ">" as the delimiters of the text to replace, the match runs from the first "<" to the last ">", so the entire string is replaced. To illustrate,

using:

Code:

g newtext = ustrregexra(FieldLabel,"<.*>","!!split!!")

I want:

Code:

!!split!!!!split!! !!split!!Verify alpha codes are equal, values are not the same !!split!!!!split!!!!split!!

but I get:

Code:

!!split!!

Can anyone recommend a solution to the greedy problem, or perhaps another strategy entirely?

Thank you!

-Reese

Stata v.14.2
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

01 Mar 2019, 11:50

Instead of matching any character following the "<", you need to match any character that isn't ">", so that each match stops at the first closing bracket.

Code:

g newtext = ustrregexra(FieldLabel,"<[^>]*>","!!split!!")
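
To see the difference on something small, here is a minimal sketch using a made-up one-line dataset. The variable names (s, greedy, negated, lazy) and the "|" replacement are only for illustration and are not part of the original data. The third line assumes the lazy quantifier ".*?" is accepted by the regex engine behind -ustrregexra-, which I believe it is, but check on your version.

```stata
* Toy data: one short string with four tags (hypothetical example)
clear
input str60 s
"<b>one</b><i>two</i>"
end

* Greedy: .* runs on to the LAST ">" so the whole string is one match
generate greedy  = ustrregexra(s, "<.*>", "|")

* Negated class: [^>]* cannot cross a ">", so each tag matches separately
generate negated = ustrregexra(s, "<[^>]*>", "|")

* Lazy quantifier (assumed supported): .*? stops at the first ">" it reaches
generate lazy    = ustrregexra(s, "<.*?>", "|")

list greedy negated lazy
```

The negated class is usually the safer choice of the two non-greedy forms, since it says directly which characters a tag may contain rather than relying on how the quantifier backtracks.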
Reese Crispen

Join Date: Jul 2018

Posts: 55
#3

01 Mar 2019, 12:15

Beautiful, thank you William Lisowski!