Using regular expressions to remove specific portions of a string variable

Nate Tamment

Join Date: Jun 2020

Posts: 19
#1

23 May 2022, 07:32

Dear Statalisters,

I am trying to use regular expressions to remove certain words/phrases (always in brackets) from long strings. Here is some example data.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 speaker str152 statement
"Speaker 1" "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
"Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"
"Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"
end

I found code on this site that removes all text contained in brackets:

Code:

gen clean = ustrregexra(statement,"\[.+?\]","")

However, I'd only like to remove some of the text in brackets. In the example data, I would like to remove "[xxx stands]" and "[Clapping]", but keep "[Speaker of the House ends the meeting]".

I haven't figured out the proper syntax in regular expressions. The following code seeks to remove all instances of "[Clapping]", but it ends up cutting out all the preceding text:

Code:

gen clean = ustrregexra(statement,"^\[.+?\Clapping]","")

Can anyone help with the proper way to set this up using regular expressions so I can specify words or word fragments within brackets?

Thanks!

Andrew Musau

Join Date: Oct 2014
Posts: 10213

23 May 2022, 11:01

You can search and delete the defined keywords, positively looking behind for the opening bracket ("[") and positively looking ahead for the closing bracket ("]").

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 speaker str152 statement
"Speaker 1"            "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
"Speaker 2"            "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"              
"Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"                                                                                      
end

gen wanted=ustrregexra(statement, "(?<=\[?.)(Clapping|stands)(?=\]?.)","")

Result:

Code:

. l wanted, notrim

                                                                                                                                           wanted  
  1.   [Speaker 1 ] The Secretary of State knows that the cost of food has gotten much higher under the current government. [] What will you do?   
  2.   As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]  
  3.                                                                                  We will reconvene tomorrow morning. [Speaker of the House ]
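
A minimal sketch of why the brackets survive in the listing above: lookbehind and lookahead are zero-width, so they check the surrounding text without making it part of the match, and only the keyword itself is deleted. The toy string below is made up for illustration and is not from the example data.

```stata
* Sketch only: "a [Clapping] b" is a made-up test string.
* Lookarounds are zero-width, so the brackets stay in the result.
display ustrregexra("a [Clapping] b", "(?<=\[)Clapping(?=\])", "")

* Putting the brackets inside the match removes them as well.
display ustrregexra("a [Clapping] b", "\[Clapping\]", "")
```

The first call should display "a [] b" and the second "a  b" (two blanks, because the spaces on either side of the bracket are kept).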

Last edited by Andrew Musau; 23 May 2022, 11:03.


Nate Tamment

Join Date: Jun 2020

Posts: 19
#3

23 May 2022, 12:13

Hi Andrew,

Thanks for your reply!

To elaborate a bit, I would like to delete all text within a pair of brackets whenever the bracket includes a fragment of the target word. The reason is that the text within the brackets sometimes varies quite a bit. For example, [Speaker 1 stands], [Speaker of the House stands], [Secretary stands], etc. I would like to cover all of these different cases. Using pseudocode based on what you wrote, it might be something like

Code:

gen wanted2=ustrregexra(statement, "(?<=\[?.)(*ands)(?=\]?.)","")

I suppose the other way to do this would be some sort of if statement using strpos, but I am not sure how to construct that either.

Thanks,
Nate

Andrew Musau

Join Date: Oct 2014
Posts: 10213

23 May 2022, 13:47

Maybe try to split and isolate the sequences.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 speaker str152 statement
"Speaker 1"            "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
"Speaker 2"            "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"              
"Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"                                                                                      
end

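* Insert the delimiter Ø before each whitespace-plus-"[" and after each "]"-plus-whitespace
g tosplit= ustrregexra(ustrregexra( statement, "(\s\[)", " Ø$1"), "(\]\s)", "$1Ø")
* Split on the delimiter so each bracketed sequence lands in its own variable
split tosplit, p(Ø) g(wanted)
* Blank out any piece that contains both brackets and one of the target words
foreach var of varlist wanted*{
    replace `var'= "" if ustrregexm(`var',"\[" ) & ustrregexm(`var', "(Clapping|stands)") & ustrregexm(`var',"\]" )
}
* Glue the remaining pieces back together
egen wanted= concat(wanted*)
keep speaker statement wanted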

Result:

Code:

. l wanted, notrim

                                                                                                                                            wanted  
  1.                     The Secretary of State knows that the cost of food has gotten much higher under the current government. What will you do?  
  2.   As our Secretary has said, the next meeting will focus specifically on the issue of food security.  [Speaker of the House ends the meeting]  
  3.                                                                                                           We will reconvene tomorrow morning.
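
As a possible alternative to splitting, here is a minimal single-pass sketch that is not from the posts above. It relies on the fact that `.` also matches "]", so a lazy `.+?` can run across several brackets, whereas the negated class `[^\]]*` cannot leave the bracket it started in. The variable name wanted2 and the whitespace tidy-up are assumptions for illustration.

```stata
* Sketch only: same example data as in the thread.
clear
input str20 speaker str152 statement
"Speaker 1" "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
"Speaker 2" "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"
"Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"
end

* \[ and \] match literal brackets; [^\]]* matches any run of characters
* except "]", so a match can never extend past its own closing bracket.
* wanted2 is a hypothetical variable name.
gen wanted2 = ustrregexra(statement, "\[[^\]]*(Clapping|stands)[^\]]*\]", "")

* Optional tidy-up (an assumption about the desired output): collapse the
* double blanks left behind and trim leading and trailing blanks.
replace wanted2 = strtrim(stritrim(wanted2))

list wanted2, notrim
```

The alternation can hold word fragments too, for example (Clapp|ands). Note that a fragment matches anywhere inside the brackets, so a short fragment may remove more bracketed notes than intended.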

Last edited by Andrew Musau; 23 May 2022, 13:50.


Nate Tamment

Join Date: Jun 2020

Posts: 19
#5

24 May 2022, 07:59

Thanks Andrew, this is very helpful!
