  • How to use regular expressions to tease out the host's words?

    There are two cases: either the Host is the last speaker, or a Presenter speaks after the Host. How can I use regular expressions to tease out the Host's words? An example is given below.
    I want to tease out "Host 00:00 please begin" and "Host 01:00 that’s Ok. Your part is ending." Thanks a ton!
    Code:
    replace prstText = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*Presenter", "")
    This also deletes "Presenter", which I don't want.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input strL prstText
    "Host 00:00 please begin Presenter1 02:03 Ok I will"
    "Host 01:00 that’s Ok. Your part is ending."      
    end
    Last edited by Fred Lee; 07 Nov 2022, 18:50.

  • #2
    Think of the problem as "remove the Presenter's text".
    Code:
    . generate hostText = ustrregexra(prstText, " Presenter.*", "")
    
    . 
    . list hostText, clean
    
                                             hostText  
      1.                      Host 00:00 please begin  
      2.   Host 01:00 that's Ok. Your part is ending.


    • #3
      Sorry, the example was too simple. Here is an updated one:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input strL prstText
      "Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
      "Presenter 3 how about you? Host 01:00 that’s Ok. Your part is ending."      
      end


      • #4
        I now wonder if "tease out" does not mean to you what it means to me. Perhaps a more technical term, like "extract" or "remove", would have been clearer, but at this point please tell us the results you expect.


        • #5
          Thanks, William Lisowski. I want to remove what the Host says.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input strL prstText
          "Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
          "Presenter 3 how about you? Host 01:00 that’s Ok. Your part is ending."      
          end
          The results I want are:
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input strL prstText
          "Presenter 2 hello world Presenter 1 02:03 Ok I will"
          "Presenter 3 how about you?"      
          end


          • #6
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input strL prstText
            "Host 00:00 please begin Presenter1 02:03 Ok I will"
            "Host 01:00 that's Ok. Your part is ending."      
            "Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
            "Presenter 3 how about you? Host 01:00 that's Ok. Your part is ending."      
            "Host 05:05 welcome Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
            end
            
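            * remove each Host segment up to the next "Presenter" (which is kept via $1) or the end of the string
            replace prstText = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*?(Presenter|$)", "$1")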
            
            list, clean noobs
            Code:
            . list, clean noobs
            
                                                           prstText  
                                         Presenter1 02:03 Ok I will  
                                                                    
                Presenter 2 hello world Presenter 1 02:03 Ok I will  
                                        Presenter 3 how about you?  
                Presenter 2 hello world Presenter 1 02:03 Ok I will  
            
            .


            • #7
              Thanks, William Lisowski.
              Can you explain what "(Presenter|$)", "$1" means?
              Or where can I learn about this?


              • #8
                The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

                A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his github blog.

                Breaking down the regular expression in post #6:
                • Host\s\d{2}:\d{2} matches "Host" followed by a whitespace character, two digits, a colon, and two more digits
                • .*? matches the shortest sequence of characters before the next item is matched (without the ? the .* would match the longest sequence)
                • Presenter|$ matches either the "Presenter" immediately after the Host, or the end of the string if the Host is the last item in the string
                • (Presenter|$) remembers what was matched and, since it is the first set of enclosing parentheses, the matched content can then be referred to by $1 in the replacement string, so that the Host material is deleted but the Presenter is replaced with itself and thus retained.
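
                To see the effect of the ? in .*? on its own, here is a minimal sketch that runs the pattern from post #6 with and without it on the last example row. It is not part of the original posts; the variable names lazy and greedy are made up for illustration.

                ```stata
                * Sketch: lazy versus greedy matching on one row from post #6
                clear
                input strL prstText
                "Host 05:05 welcome Presenter 2 hello world Host 00:00 please begin Presenter 1 02:03 Ok I will"
                end

                * lazy .*? stops at the first "Presenter" after each Host
                generate strL lazy = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*?(Presenter|$)", "$1")

                * greedy .* takes as much as it can before (Presenter|$) has to match
                generate strL greedy = ustrregexra(prstText, "Host\s\d{2}:\d{2}.*(Presenter|$)", "$1")

                list lazy greedy, clean noobs
                ```

                The lazy version should reproduce the last row of the output in post #6. The greedy version should come back empty for this row: .* runs from the first Host to the end of the string, where $ matches, so $1 is empty and the Presenters' words are lost as well.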
