  • Using regular expressions to remove specific portions of a string variable

    Dear Statalisters,


    I am trying to use regular expressions to remove certain words/phrases (always in brackets) from long strings. Here is some example data.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str20 speaker str152 statement
    "Speaker 1"            "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
    "Speaker 2"            "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"              
    "Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"                                                                                       
    end

    I found code on this site that removes all text contained in brackets:

    Code:
     gen clean = ustrregexra(statement,"\[.+?\]","")
    However, I'd only like to remove some of the text in brackets. In the example data, I would like to remove "[xxx stands]" and "[Clapping]", but keep "[Speaker of the House ends the meeting]".

    I haven't figured out the proper syntax in regular expressions. The following code seeks to remove all instances of "[Clapping]", but it ends up cutting out all the preceding text:

    Code:
    gen clean = ustrregexra(statement,"^\[.+?\Clapping]","")

    Can anyone help with the proper way to set this up using regular expressions so I can specify words or word fragments within brackets?

    Thanks!


  • #2
    You can search for and delete the defined keywords, using a positive lookbehind for the opening bracket ("[") and a positive lookahead for the closing bracket ("]").

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str20 speaker str152 statement
    "Speaker 1"            "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
    "Speaker 2"            "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"              
    "Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"                                                                                      
    end
    
    gen wanted=ustrregexra(statement, "(?<=\[?.)(Clapping|stands)(?=\]?.)","")
    Result:

    Code:
    . l wanted, notrim
    
                                                                                                                                               wanted  
      1.   [Speaker 1 ] The Secretary of State knows that the cost of food has gotten much higher under the current government. [] What will you do?   
      2.   As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]  
      3.                                                                                  We will reconvene tomorrow morning. [Speaker of the House ]
    Last edited by Andrew Musau; 23 May 2022, 11:03.



    • #3
      Hi Andrew,

      Thanks for your reply!

      To elaborate a bit, I would like to delete all text within a pair of brackets whenever the bracketed text includes a fragment of the target word. The reason is that the text within the brackets sometimes varies quite a bit. For example, [Speaker 1 stands], [Speaker of the House stands], [Secretary stands], etc. I would like to cover all of these different cases. Using pseudo code based on what you wrote, it might be something like
      Code:
      gen wanted2=ustrregexra(statement, "(?<=\[?.)(*ands)(?=\]?.)","")
      I suppose the other way to do this would be some sort of if statement using strpos, but I am not sure how to construct that either.

      Thanks,
      Nate
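
In a regular expression, `*` is a quantifier rather than a stand-alone wildcard, so the `(*ands)` in the pseudo code above would not work as written. One possible way to express "a bracket pair containing the fragment" is a negated character class, `[^\]]*`, which matches any run of characters that are not a closing bracket. The following is only a sketch under that assumption; the variable name `wanted2` comes from the pseudo code, and the fragment `ands` would also match words such as "hands" or "expands".

```stata
clear
input str20 speaker str152 statement
"Speaker 1"            "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
"Speaker 2"            "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"
"Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"
end

* \[          literal opening bracket
* [^\]]*      any characters except a closing bracket (cannot run past the pair)
* ands        the target fragment
* [^\]]*\]    the rest of the bracketed text and the closing bracket
gen wanted2 = ustrregexra(statement, "\[[^\]]*ands[^\]]*\]", "")
list wanted2, notrim
```

Because the match can never cross a `]`, text outside the brackets and other bracketed notes such as "[Clapping]" are left alone.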





      • #4
        Maybe try to split and isolate the sequences.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str20 speaker str152 statement
        "Speaker 1"            "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
        "Speaker 2"            "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"              
        "Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"                                                                                      
        end
        
        g tosplit= ustrregexra(ustrregexra( statement, "(\s\[)", " Ø$1"), "(\]\s)", "$1Ø")
        split tosplit, p(Ø) g(wanted)
        foreach var of varlist wanted*{
            replace `var'= "" if ustrregexm(`var',"\[" ) & ustrregexm(`var', "(Clapping|stands)") & ustrregexm(`var',"\]" )
        }
        egen wanted= concat(wanted*)
        keep speaker statement wanted
        Result:

        Code:
        . l wanted, notrim
        
                                                                                                                                                    wanted  
          1.                     The Secretary of State knows that the cost of food has gotten much higher under the current government. What will you do?  
          2.   As our Secretary has said, the next meeting will focus specifically on the issue of food security.  [Speaker of the House ends the meeting]  
          3.                                                                                                           We will reconvene tomorrow morning.
        Last edited by Andrew Musau; 23 May 2022, 13:50.
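
As a possible alternative to splitting and concatenating, the same result might be obtained with a single `ustrregexra()` call. This is a sketch, not a tested replacement for the approach above: the local macro `keys` and the variable `wanted3` are names introduced here for illustration, `\b` restricts the match to whole words, and the leading `\s*` together with `strtrim()` and `stritrim()` is there to tidy up the spaces left behind.

```stata
clear
input str20 speaker str152 statement
"Speaker 1"            "[Speaker 1 stands] The Secretary of State knows that the cost of food has gotten much higher under the current government. [Clapping] What will you do? "
"Speaker 2"            "As our Secretary has said, the next meeting will focus specifically on the issue of food security. [Speaker of the House ends the meeting]"
"Speaker of the House" "We will reconvene tomorrow morning. [Speaker of the House stands]"
end

* Hypothetical keyword list; add further alternatives separated by |
local keys "Clapping|stands"

* Remove any bracket pair whose contents include one of the keywords as a
* whole word, along with any whitespace immediately before the bracket
gen wanted3 = ustrregexra(statement, "\s*\[[^\]]*\b(`keys')\b[^\]]*\]", "")

* Trim leading/trailing blanks and collapse repeated internal blanks
replace wanted3 = stritrim(strtrim(wanted3))
list wanted3, notrim
```

With the example data, observations 1 and 3 lose their "[... stands]" and "[Clapping]" notes, while "[Speaker of the House ends the meeting]" in observation 2 is kept because it contains neither keyword.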



        • #5
          Thanks Andrew, this is very helpful!
