  • Using regexm for a list of expressions

    Hey,
    I am relatively new to Stata and I have some problems using the regexm() function.

    Below is an excerpt of my string variable. I want to search it for the following terms and generate a new variable that equals 1 if a title contains them:

    "CEO" or "Chief Executive Officer" but not "CEO of" or "Chief Executive Officer of".

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str250 title
    "Chairman & CEO"                                      
    "Vice Chairman"                                        
    "Vice President of Commercial Development"            
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment"
    "Chairman & CEO"                                      
    "Vice Chairman"                                        
    "Vice President of Commercial Development"            
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment"
    "Chairman & CEO"                                      
    "Vice Chairman"                                        
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment"
    "VP, General Counsel & Secretary"                      
    "CEO of Power Systems"                                      
    "Vice Chairman"                                        
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment"
    "VP, General Counsel & Secretary"                      
    end
    Thanks in advance!
    Last edited by Steffen Weisthoff; 19 Oct 2022, 10:57.

  • #2
    Note that this solution is case-sensitive.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str250 title
    "Chairman & CEO"                                       
    "Vice Chairman"                                        
    "Vice President of Commercial Development"             
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment" 
    "Chairman & CEO"                                       
    "Vice Chairman"                                        
    "Vice President of Commercial Development"             
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment" 
    "Chairman & CEO"                                       
    "Vice Chairman"                                        
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment" 
    "VP, General Counsel & Secretary"                      
    "CEO of Power Systems"                                 
    "Vice Chairman"                                        
    "Chief Financial Officer, Vice President and Treasurer"
    "Group Vice President of Structures & Systems Segment" 
    "VP, General Counsel & Secretary"                      
    end
    
    
    g wanted= ustrregexm(title, "\bCEO\b|\bChief Executive Officer\b") & ///
    !ustrregexm(title, "\bCEO\s+[o][f]\b|\bChief Executive Officer\s+[o][f]\b")
    Res.:

    Code:
    . l, sep(0)
    
         +----------------------------------------------------------------+
         |                                                 title   wanted |
         |----------------------------------------------------------------|
      1. |                                        Chairman & CEO        1 |
      2. |                                         Vice Chairman        0 |
      3. |              Vice President of Commercial Development        0 |
      4. | Chief Financial Officer, Vice President and Treasurer        0 |
      5. |  Group Vice President of Structures & Systems Segment        0 |
      6. |                                        Chairman & CEO        1 |
      7. |                                         Vice Chairman        0 |
      8. |              Vice President of Commercial Development        0 |
      9. | Chief Financial Officer, Vice President and Treasurer        0 |
     10. |  Group Vice President of Structures & Systems Segment        0 |
     11. |                                        Chairman & CEO        1 |
     12. |                                         Vice Chairman        0 |
     13. | Chief Financial Officer, Vice President and Treasurer        0 |
     14. |  Group Vice President of Structures & Systems Segment        0 |
     15. |                       VP, General Counsel & Secretary        0 |
     16. |                                  CEO of Power Systems        0 |
     17. |                                         Vice Chairman        0 |
     18. | Chief Financial Officer, Vice President and Treasurer        0 |
     19. |  Group Vice President of Structures & Systems Segment        0 |
     20. |                       VP, General Counsel & Secretary        0 |
         +----------------------------------------------------------------+
    
    .



    • #3
      Code:
      gen byte CEO = ustrregexm(title, "(?i)(CEO|Chief Executive Officer)(?!\sof)")
      https://unicode-org.github.io/icu/us...gs/regexp.html
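
      A minimal sketch of how the negative lookahead in this pattern behaves, assuming Stata 14 or later (needed for ustrregexm()). The three titles and the variable name ceo_demo are made up for illustration:

      ```stata
      clear
      input str40 title
      "Chairman & CEO"
      "CEO of Power Systems"
      "chief executive officer"
      end

      * (?i) makes the whole pattern case-insensitive;
      * (?!\sof) rejects a tag word that is immediately followed
      * by one whitespace character and "of"
      gen byte ceo_demo = ustrregexm(title, "(?i)(CEO|Chief Executive Officer)(?!\sof)")
      list, sep(0)
      ```

      Only the second title should be flagged 0, because its "CEO" is directly followed by " of". The pattern has no \b anchors, so it would also match "ceo" inside a longer word.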



      • #4
        Thanks both of you for your answer.

        @Andrew: I tried your code suggestion and it worked. In a second step, I tried to adjust the code so that the following expressions are also not marked with 1: "Interim-CEO", "Assistent to the CEO" and "Co-CEO".

        Based on your code suggestion, I tried to adjust the code as follows for the prefix "Interim":


        Code:
        gen ceo= ustrregexm(title, "\bCEO\b|\bChief Executive Officer\b") & !ustrregexm(title, "\bCEO\s+[o][f]\b|\bChief Executive Officer\s+[o][f]\b|\s[interim]+\bChief Executive Officer\b|\s[interim]+\bCEO\b")


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str250 title
        "Chairman & CEO"                                      
        "Vice Chairman"                                        
        "Vice President of Commercial Development"            
        "Chief Financial Officer, Vice President and Treasurer"
        "Group Vice President of Structures & Systems Segment"
        "Chairman & CEO"                                      
        "Vice Chairman"                                        
        "Vice President of Commercial Development"            
        "Chief Financial Officer, Vice President and Treasurer"
        "Group Vice President of Structures & Systems Segment"
        "Chairman & CEO"                                      
        "Assistent to the CEO"                                        
        "Chief Financial Officer, Vice President and Treasurer"
        "Group Vice President of Structures & Systems Segment"
        "VP, General Counsel & Secretary"                      
        "Chairman & Co-CEO"                                      
        "Vice Chairman"                                        
        "Chief Financial Officer, Vice President and Treasurer"
        "Interim-CEO"
        "CEO of Power Systems"                      
        end
        Do you also know where I can find documentation that shows when I have to use \b or \s, or why "of" is written as [o][f]? And do you know whether your code matches terms in both capital and small letters? E.g. is it possible to mark both "CEO" and "ceo" with 1?

        Thanks a lot in advance!
        Last edited by Steffen Weisthoff; 20 Oct 2022, 09:36.



        • #5
          In a second step, I tried to adjust the code so that the following expressions are also not marked with 1: "Interim-CEO", "Assistent to the CEO" and "Co-CEO".
          You just add these exclusions as they appear in the text. In these cases, the extra text precedes the tag words.

          Do you also know where I can find documentation that shows when I have to use \b or \s, or why "of" is written as [o][f]?
          You do not have to write "of" as "[o][f]". Brackets define a character class, i.e. a set of characters any one of which may match at that position, so "[o]" matches exactly what "o" matches. \b matches a word boundary and \s matches a whitespace character. See the ICU regex manual for a description of these: https://unicode-org.github.io/icu/us...gs/regexp.html
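
          A small sketch of these three pieces. The titles and the variable names plain, bracket and no_boundary are made up for illustration:

          ```stata
          clear
          input str40 title
          "CEO of Power Systems"
          "CEO  of Power Systems"
          "CEO office manager"
          end

          * "of" and "[o][f]" are the same pattern: a one-character
          * class such as [o] matches only the letter "o"
          gen byte plain   = ustrregexm(title, "CEO\s+of\b")
          gen byte bracket = ustrregexm(title, "CEO\s+[o][f]\b")
          assert plain == bracket

          * \s+ accepts one or more whitespace characters (second title);
          * without \b, "of" also matches the start of "office" (third title)
          gen byte no_boundary = ustrregexm(title, "CEO\s+of")
          list, sep(0)
          ```

          The third title is the one where \b matters: plain is 0 there, while no_boundary is 1.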

          And do you know whether your code matches terms in both capital and small letters? E.g. is it possible to mark both "CEO" and "ceo" with 1?
          For case-insensitive matching, see the (?i) flag at the start of each pattern in the code below (note the misspelling of "assistant" in the data, which the pattern allows for):

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str250 title
          "Chairman & ceo"                                      
          "Vice Chairman"                                        
          "Vice President of Commercial Development"            
          "Chief Financial Officer, Vice President and Treasurer"
          "Group Vice President of Structures & Systems Segment"
          "Chairman & CeO"                                      
          "Vice Chairman"                                        
          "Vice President of Commercial Development"            
          "Chief Financial Officer, Vice President and Treasurer"
          "Group Vice President of Structures & Systems Segment"
          "Chairman & CEO"                                      
          "Assistent to the CEO"                                
          "Chief Financial Officer, Vice President and Treasurer"
          "Group Vice President of Structures & Systems Segment"
          "VP, General Counsel & Secretary"                      
          "Chairman & Co-CEO"                                    
          "Vice Chairman"                                        
          "Chief Financial Officer, Vice President and Treasurer"
          "Interim-CEO"                                          
          "CEO of Power Systems"                                
          end
          
          
          gen wanted= ustrregexm(title, "(?i)(\bCEO\b|\bChief Executive Officer\b)") & ///
          !ustrregexm(title, "(?i)(\bCEO\s+of\b|\bChief Executive Officer\s+of\b|Interim-CEO|Assist[ae]nt to the CEO|Co-CEO)")
          As the list of exclusions gets longer, Bjarte's solution in #3 is more efficient because you avoid repeating the tag words, provided that the same preceding and following text is to be excluded for every tag word.

          Res.:

          Code:
          . l, sep(0)
          
               +----------------------------------------------------------------+
               |                                                 title   wanted |
               |----------------------------------------------------------------|
            1. |                                        Chairman & ceo        1 |
            2. |                                         Vice Chairman        0 |
            3. |              Vice President of Commercial Development        0 |
            4. | Chief Financial Officer, Vice President and Treasurer        0 |
            5. |  Group Vice President of Structures & Systems Segment        0 |
            6. |                                        Chairman & CeO        1 |
            7. |                                         Vice Chairman        0 |
            8. |              Vice President of Commercial Development        0 |
            9. | Chief Financial Officer, Vice President and Treasurer        0 |
           10. |  Group Vice President of Structures & Systems Segment        0 |
           11. |                                        Chairman & CEO        1 |
           12. |                                  Assistent to the CEO        0 |
           13. | Chief Financial Officer, Vice President and Treasurer        0 |
           14. |  Group Vice President of Structures & Systems Segment        0 |
           15. |                       VP, General Counsel & Secretary        0 |
           16. |                                     Chairman & Co-CEO        0 |
           17. |                                         Vice Chairman        0 |
           18. | Chief Financial Officer, Vice President and Treasurer        0 |
           19. |                                           Interim-CEO        0 |
           20. |                                  CEO of Power Systems        0 |
               +----------------------------------------------------------------+
          
          .
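
          As a sketch of the remark about #3 scaling better, the exclusions can be folded into a single pattern: negative lookbehinds for text that precedes the tag words and a negative lookahead for text that follows them. This is an illustration, not tested against the full data. The shortened title list and the variable name wanted2 are made up here, and each lookbehind is kept fixed-length so it stays within what ICU accepts:

          ```stata
          clear
          input str40 title
          "Chairman & ceo"
          "Assistent to the CEO"
          "Chairman & Co-CEO"
          "Interim-CEO"
          "CEO of Power Systems"
          "Chief Executive Officer"
          end

          * (?<!...) rejects a tag word preceded by an excluded prefix;
          * (?!\s+of\b) rejects a tag word followed by whitespace and "of"
          gen byte wanted2 = ustrregexm(title, ///
              "(?i)(?<!Interim-)(?<!Co-)(?<!Assist[ae]nt to the )\b(CEO|Chief Executive Officer)\b(?!\s+of\b)")
          list, sep(0)
          ```

          Each tag word appears only once, so a new exclusion needs one more lookbehind or one more lookahead alternative rather than a copy per tag word.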
          Last edited by Andrew Musau; 20 Oct 2022, 13:36.
