how to use regular expressions to tease out host's words?

Fred Lee

Join Date: Nov 2017

Posts: 488
#1


07 Nov 2022, 18:44

There are two cases: either the Host is the last speaker, or a Presenter speaks after the Host. How can I use regular expressions to tease out the Host's words? Here is an example:
I want to tease out "Host 00:00 please begin" and "Host 01:00 that’s Ok. Your part is ending." Thanks a ton!

Code:

replace prstText = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*Presenter", "")

This also deletes "Presenter", which I don't want.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input strL prstText
"Host 00:00 please begin Presenter1 02:03 Ok I will"
"Host 01:00 that’s Ok. Your part is ending."
end

Last edited by Fred Lee; 07 Nov 2022, 18:50.

William Lisowski

Join Date: Dec 2014
Posts: 10150

07 Nov 2022, 19:18

Think of the problem as "remove the Presenter's text".

Code:

. generate hostText = ustrregexra(prstText, " Presenter.*", "")

. 
. list hostText, clean

                                         hostText  
  1.                      Host 00:00 please begin  
  2.   Host 01:00 that's Ok. Your part is ending.


Fred Lee

Join Date: Nov 2017
Posts: 488

07 Nov 2022, 19:41

Sorry, my example was too simple. Here is an updated one:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input strL prstText
"Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
"Presenter 3 how about you? Host 01:00 that’s Ok. Your part is ending."      
end


William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

08 Nov 2022, 04:46

I now wonder if "tease out" does not mean to you what it means to me. Perhaps a more technical term, like "extract" or "remove", would have been clearer, but at this point please tell us the results you expect.

Fred Lee

Join Date: Nov 2017
Posts: 488

08 Nov 2022, 06:55

Thanks, William Lisowski. I want to remove what the Host says.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input strL prstText
"Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
"Presenter 3 how about you? Host 01:00 that’s Ok. Your part is ending."      
end

The results I want are:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input strL prstText
"Presenter 2 hello world Presenter 1 02:03 Ok I will"
"Presenter 3 how about you?"      
end


William Lisowski

Join Date: Dec 2014
Posts: 10150

08 Nov 2022, 07:11

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input strL prstText
"Host 00:00 please begin Presenter1 02:03 Ok I will"
"Host 01:00 that's Ok. Your part is ending."      
"Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
"Presenter 3 how about you? Host 01:00 that's Ok. Your part is ending."      
"Host 05:05 welcome Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
end

replace prstText = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*?(Presenter|$)", "$1")

list, clean noobs

Code:

. list, clean noobs

                                               prstText  
                             Presenter1 02:03 Ok I will  
                                                        
    Presenter 2 hello world Presenter 1 02:03 Ok I will  
                            Presenter 3 how about you?  
    Presenter 2 hello world Presenter 1 02:03 Ok I will  

.
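
A possible alternative, offered only as a sketch and not taken from the post above: the ICU engine also supports lookahead assertions, so the pattern can check for "Presenter" or the end of the string without consuming it, and the replacement can then be an empty string. The variable names viaGroup and viaLookahead are made up here for illustration.

```stata
* Illustration only: viaGroup and viaLookahead are hypothetical variable names
clear
input strL prstText
"Host 00:00 please begin Presenter1 02:03 Ok I will"
"Host 01:00 that's Ok. Your part is ending."
"Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
"Presenter 3 how about you? Host 01:00 that's Ok. Your part is ending."
"Host 05:05 welcome Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
end

* Capture-group form from post #6: put the matched "Presenter" back with $1
generate strL viaGroup = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*?(Presenter|$)", "$1")

* Lookahead form: (?=...) tests what follows but leaves it out of the match
generate strL viaLookahead = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*?(?=Presenter|$)", "")

* Stops with an error if the two approaches ever disagree on these observations
assert viaGroup == viaLookahead
```

Both forms remove the same text on these five observations; which one reads better is a matter of taste.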


Fred Lee

Join Date: Nov 2017

Posts: 488
#7

08 Nov 2022, 07:39

Thanks, William Lisowski.
Can you explain what "(Presenter|$)" and "$1" mean?
Or where can I learn about this?

William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

08 Nov 2022, 10:48

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques on his GitHub blog.

Breaking down the regular expression in post #6:
Host\s\d{2}:\d{2} matches "Host" followed by a whitespace character, two digits, a colon, and two more digits

.*? matches the shortest sequence of characters before the next item is matched (without the ? the .* would match the longest sequence)

Presenter|$ matches either the "Presenter" immediately after the Host, or the end of the string if the Host is the last item in the string

(Presenter|$) remembers what was matched. Since it is the first set of enclosing parentheses, the matched content can be referred to by $1 in the replacement string, so the Host material is deleted but the Presenter is replaced with itself and thus retained. When the end of the string is matched instead, $1 is empty and nothing is put back.
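
A minimal sketch of the difference between .*? and .* described above, using the last observation from post #6. The variable names lazy and greedy are chosen here for illustration and are not from the thread.

```stata
* Illustration only: lazy and greedy are hypothetical variable names
clear
input strL prstText
"Host 05:05 welcome Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
end

* .*? stops at the first "Presenter" (or the end of the string) after each Host
generate strL lazy = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*?(Presenter|$)", "$1")

* .* runs as far as it can, so the match extends from the first Host to the end of the string
generate strL greedy = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*(Presenter|$)", "$1")

list lazy greedy, clean noobs
```

With the lazy quantifier both Presenter segments survive. With the greedy one, .* consumes the rest of the string, the $ alternative matches at the end, and the whole observation is replaced with an empty $1.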