
  • Counting the number of occurrences of a word inside the observation

    Hello everyone,

    I have been trying for some time to count the occurrences of the word 'and' within each observation, but I have not yet found a command that does this. I am working with a string variable that holds corporate position titles, and I want to know how many times the word 'and' appears in each title. I would be very grateful for any advice on the code.

    Below is a sample of the variable:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str250 Positions
    ""                                                                           
    ""                                                                           
    ""                                                                           
    ""                                                                           
    "chairman AND president and CEO"                                                  
    "executive vp & chief finance officer"                                       
    "president & CEO"                                                            
    "president & chief operating officer"                                        
    "senior vp; president-Integrated Solutions Group"                            
    "vp-worldwide operations"                                                    
    ""                                                                           
    ""                                                                           
    ""                                                                           
    ""                                                                           
    ""                                                                           
    ""                                                                           
    "chairman, president & CEO"                                                  
    "executive vp & chief finance officer"                                       
    "vp & chief marketing officer"                                               
    "vp; president-software systems"                                             
    "vp; president-systems integration"                                          
    ""                                                                           
    ""                                                                           
    "chairman, president & CEO"                                                  
    "president & CEO"                                                            
    "senior vp-global sales, marketing and customer service"                     
    "vp; president-IP cable business unit"                                       
    "vp; president-connectivity business unit"                                   
    "vp; president-systems integration and software systems business unit"       
    "vp; president-wireline business unit"                                       
    ""                                                                           
    ""                                                                           
    "president and CEO and CFO"                                                            
    "vp & chief finance officer"                                                 
    "vp; president-global connectivity solutions"                                
    "vp; president-global connectivity solutions"                                
    "vp; president-professional services"                                        
    "vp; president-wireless"                                                     
    "vp; president-wireless and wireline"                                        
    ""                                                                           
    ""                                                                           
    "president & CEO"                                                            
    "vp & chief finance officer"                                                 
    "vp; president-global connectivity solutions"                                
    "vp; president-professional services"                                        
    "vp; president-wireless and wireline"                                        
    "president & CEO"                                                            
    "vp & chief finance officer"                                                 
    "vp, general counsel & secretary"                                            
    "vp; president-global connectivity solutions"                                
    "vp;president-professional services"                                         
    ""                                                                           
    "president & CEO"                                                            
    "vp & chief administrative officer"                                          
    "vp & chief finance officer"                                                 
    "vp & chief finance officer"                                                 
    "vp, general counsel & secretary"                                            
    "vp; president-global connectivity solutions"                                
    ""                                                                           
    ""                                                                           
    "president & CEO"                                                            
    "vp & chief administrative officer"                                          
    "vp & chief finance officer"                                                 
    "vp, general counsel & secretary"                                            
    "vp; president-global connectivity solutions"                                
    ""                                                                           
    "Vice President, Secretary and General Counsel"                              
    "chairman, president & CEO"                                                  
    "vp & chief administrative officer"                                          
    "vp & chief finance officer"                                                 
    "vp & president-network solutions"                                           
    "vp; president-global connectivity solutions"                                
    "Chairman, Chief Executive Officer and President"                            
    "Chief Financial Officer and Vice President"                                 
    "Vice President and President of Global Connectivity Solutions Business Unit"
    "Vice President of Global Go-to-Market"                                      
    "Vice President, General Counsel and Secretary"                              
    ""                                                                           
    ""                                                                           
    ""                                                                           
    ""                                                                           
    "chairman & CEO"                                                             
    "executive vp; executive vp-customer service-American"                       
    "executive vp; executive vp-marketing & planning-American"                   
    "executive vp; executive vp-operations-American"                             
    "vice chairman"                                                              
    ""                                                                           
    ""                                                                           
    ""                                                                           
    "chairman & CEO"                                                             
    "executive vp; executive vp-customer service-American"                       
    "executive vp; executive vp-marketing & planning-American"                   
    "president & chief operating officer"                                        
    "senior vp & general counsel"                                                
    "senior vp-government affairs-American"                                      
    "vice chairman"                                                              
    ""                                                                           
    ""                                                                           
    "chairman & CEO"                                                             
    "executive vp; executive vp-marketing-American"                              
    end

  • #2
    Install egenmore from SSC, then use its noccur() function:

    Code:
    ssc install egenmore
    egen ampersand = noccur(Positions) , string(&)


    • #3
      See also https://journals.sagepub.com/doi/pdf...6867X221141068

      Code:
      gen count = (strlen(Positions) - strlen(subinstr(lower(Positions), "and", "", .))) / 3
      Count the number of characters in your variable.

      Count the number of characters that you would have if each occurrence of "and" were deleted from a lower-cased version.

      Work out the difference and divide by 3, the length of "and".
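
      The arithmetic can be checked by hand on a single title. This is only a sketch: the title is taken from the example data in #1, and the local macro names (s, before, after) are made up for illustration. Run it from a do-file so that the locals stay defined from one line to the next.

      ```stata
      * One title taken from the example data in #1
      local s "chairman AND president and CEO"

      * Length before and after deleting every "and" from the lower-cased title
      local before = strlen("`s'")
      local after  = strlen(subinstr(lower("`s'"), "and", "", .))

      * Each deleted "and" removes 3 characters
      * displays: before = 30, after = 24, count = 2
      display "before = `before', after = `after', count = " (`before' - `after') / 3
      ```

      Lower-casing does not change the length of the string, so comparing against strlen() of the original variable is safe here.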


      • #4
        While I think the solutions in #2 and #3 each contain the gist of a piece of the problem, I think both are incomplete in that they will fail in certain edge cases.
        1. I assume that the O.P. wants a count of both & and "and" occurrences; neither solution gives both.
        2. Nick's solution will overcount if part of the title contains "and" internal to a word, e.g. "CEO of Land Division."
        3. To deal with point 2, one must then also consider the possibility of "and" occurring at the very beginning or end of the position, where it is not surrounded by blanks on both sides.
        So I propose the following solution, which I believe is more general:
        Code:
        gen working = " " + lower(Positions) + " "
        gen current_length = strlen(working)
        replace working = subinstr(working, "&", " & ", .)
        replace working = subinstr(working, " & ", "", .)
        gen ampersands = current_length - strlen(working)
        replace current_length = strlen(working)
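        * match " and " with its surrounding blanks so that word-internal "and" (e.g. "Land") is not counted
        replace working = subinstr(working, " and ", "", .)
        gen ands = (current_length - strlen(working))/5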
        gen ands_and_ampersands = ands + ampersands
        With regard to problem 3, this may be slightly overkill, as it is unlikely that & or and would appear in either initial or final position. But, you never know, and this solution is robust to such a situation should it arise.
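
        A different route to the same goal, not proposed by anyone in this thread, is to let a regular expression do the whole-word matching. The sketch below assumes Stata 14 or later (for ustrregexra() and ustrlen()); the toy data and the variable names clean, and_words, ampersands and total are hypothetical. The \b anchors match word boundaries, so "Land" is left alone and an "and" at the very start or end of the title still counts. The final argument 1 asks for a case-insensitive match.

        ```stata
        * Hypothetical toy data, not the poster's dataset
        clear
        input str60 Positions
        "chairman AND president and CEO"
        "CEO of Land Division"
        "and"
        "president & CEO"
        ""
        end

        * Delete whole-word "and" (case-insensitive) and compare lengths
        gen clean = ustrregexra(Positions, "\band\b", "", 1)
        gen and_words = (ustrlen(Positions) - ustrlen(clean)) / 3

        * Count ampersands the same way (one character each)
        gen ampersands = ustrlen(Positions) - ustrlen(subinstr(Positions, "&", "", .))

        gen total = and_words + ampersands
        list Positions and_words ampersands total
        ```

        Unlike matching on surrounding blanks, this also picks up an "and" that sits next to punctuation, which may or may not be what is wanted.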
        Last edited by Clyde Schechter; 21 Sep 2023, 15:40.


        • #5
          Thank you both very much, it was very helpful!


          • #6
            Originally posted by Clyde Schechter
            With regard to problem 3, this solution may be slightly overkill, as it also contemplates & appearing at the very beginning or end of Position, which seems improbable, but I don't think it does any harm as it involves no additional coding to do that, and, in fact, one would have to add code to exclude the possibility.
            Thank you Clyde Schechter for this very important input!


            • #7
              Taking Clyde's thoughts, this might work. Note that some of your "and" are "AND"; subinstr() is case-sensitive, so the code lower-cases the variable first.

              Code:
              g ptemp = lower(Positions)
              replace ptemp = subinstr(ptemp,"and"," and ",.)
              replace ptemp = subinstr(ptemp," and ","&",.)
              egen and_count = noccur(ptemp) , string(&)


              • #8
                I think -replace ptemp = subinstr(ptemp,"and"," and ",.)- will lead to incorrect results if a word-internal "and" appears in the job title, as in example 2 that I gave in #4.
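
                A small check of this point, as a sketch on two made-up observations. The variable n_and is hypothetical, and the last step counts "&" with strlen() differences rather than noccur(), so egenmore is not needed to run it.

                ```stata
                * Two hypothetical titles: one with a word-internal "and", one with a real "and"
                clear
                input str40 Positions
                "CEO of Land Division"
                "president and CEO"
                end

                * The two replacement steps from #7
                gen ptemp = lower(Positions)
                replace ptemp = subinstr(ptemp, "and", " and ", .)
                replace ptemp = subinstr(ptemp, " and ", "&", .)

                * Count the resulting ampersands
                gen n_and = strlen(ptemp) - strlen(subinstr(ptemp, "&", "", .))
                list Positions ptemp n_and
                ```

                The first title becomes "ceo of l& division", so n_and is 1 even though the title contains no separate word "and"; the second title is counted correctly as 1.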


                • #9
                  I want to flag that the reference I cited in #3 goes much further than my code.
