  • Remove multiple substrings from string

    Hi,

    I use the following code (found here) to remove certain legal forms from company names:

    Code:
    clear
    input str58 Acq_Name
    "Atlas Pipeline Partners LP"   
    "Heartware International Inc"  
    "Thomson Reuters Corp"         
    "Patheon Inc"                  
    "Lions Gate Entertainment Corp"
    "Kodiak Oil & Gas Corp"        
    "SXC Health Solutions Corp"    
    "Catamaran Corp"               
    "Energy Fuels Inc"             
    "Domtar Corp"                  
    end
    
    local to_remove Inc Corp LP Ltd LLC Holdings Trust Partners & PLC Co Group
    gen rname = reverse(Acq_Name)
    foreach t of local to_remove {
        local trev = reverse(`"`t'"')
        replace Acq_Name = reverse(subinword(rname, `"`trev'"', "", 1))  ///
            if strpos(rname, `"`trev'"') == 1
    }

    This code works fine in most cases. However, if a company name carries several legal forms (e.g., "Test Company Corp Inc"), only the last one is removed, which is "Inc" in this example. To handle these cases, I have to remove the trailing spaces that were produced when the legal form was removed:
    Code:
    replace Acq_Name = strrtrim(Acq_Name)
    and then run the commands again.
    Is there a way to handle this so that I don't have to rerun the commands until no further legal forms are removed? I tried putting strrtrim() inside the loop, but that does not give the desired result.

    Any help is much appreciated.
    Thank you.


  • #2
    Code:
    clear
    input str10 Acq_Name
    "A Inc"   
    "B Corp"
    "C Corp Inc"
    end
    
    gen wanted = trim(ustrregexra(Acq_Name,"inc|corp","",.))


    • #3
      Note that the elegant solution provided in #2 will still not remove the ampersand (&) symbol because this symbol is interpreted by Stata as the logical "and."

      Edit: Actually, I was wrong. Just escape the ampersand in the pattern.

      Code:
      gen wanted = trim(ustrregexra(Acq_Name,"inc|corp|\&","",.))
      Last edited by Daniel Schaefer; 14 Oct 2022, 11:39.
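
      A small check of both points, offered as a sketch rather than a definitive statement: in the ICU-style regular expressions behind Stata's ustrregex*() functions, & is an ordinary character, so the backslash is harmless but not required. A nonzero (or missing) fourth argument requests a case-insensitive match, which is why the lowercase pattern also matches "Inc" and "Corp". The test strings are made up for illustration.

      ```stata
      * & needs no escaping: both calls remove the ampersand
      display ustrregexra("Kodiak Oil & Gas", "&", "")
      display ustrregexra("Kodiak Oil & Gas", "\&", "")

      * fourth argument 0: case-sensitive, so lowercase "inc" does not match "Inc"
      display ustrregexra("A Inc", "inc|corp", "", 0)

      * fourth argument nonzero: case-insensitive, so "Inc" is removed
      display trim(ustrregexra("A Inc", "inc|corp", "", 1))
      ```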


      • #4
        Also the solution in #2 will remove the key letters from within whole words:

        Code:
        . input str18 Acq_Name
        
                       Acq_Name
          1. "A Inc"   
          2. "B Corp"
          3. "C Corp Inc"
          4. "D Income Corp"
          5. "Invincible Corp"
          6. "Extracorporeal Inc"
          7. end
        
        . 
        . gen wanted = trim(ustrregexra(Acq_Name,"inc|corp","",.))
        
        . list
        
             +---------------------------------+
             |           Acq_Name       wanted |
             |---------------------------------|
          1. |              A Inc            A |
          2. |             B Corp            B |
          3. |         C Corp Inc            C |
          4. |      D Income Corp        D ome |
          5. |    Invincible Corp      Invible |
             |---------------------------------|
          6. | Extracorporeal Inc   Extraoreal |
             +---------------------------------+
        Here is one way, using macro list subtraction:

        Code:
        clear
        input str58 Acq_Name
        "Atlas Pipeline Partners LP"   
        "Heartware International Inc"  
        "Thomson Reuters Corp"         
        "Patheon Inc"                  
        "Lions Gate Entertainment Corp"
        "Kodiak Oil & Gas Corp"        
        "SXC Health Solutions Corp"    
        "Catamaran Corp"               
        "Energy Fuels Inc"             
        "Domtar Corp"  
        "Test Company Corp Inc"    
        "Invincible Corp"
        "Extracorporeal Inc"     
        end
        
        local to_remove ="Inc Corp LP Ltd LLC Holdings Trust Partners & PLC Co Group"
        forv i = 1/`=_N' {    // loop over every observation (this example has 13)
            local name  "`=Acq_Name[`i']'"
            local name2: list name - to_remove
            replace Acq_Name = "`name2'" in `i'
            
        }
        list


        • #5
          One problem that the code in #4 does not solve: it removes the listed words wherever they appear, not only at the end of the string. E.g. "Vintage LP Records Inc" --> "Vintage Records"
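
          To make this concrete, here is a minimal sketch using the name from the example above and a shortened, hypothetical removal list. Macro list subtraction drops every matching word, wherever it sits in the name.

          ```stata
          * shortened removal list, for illustration only
          local to_remove Inc Corp LP
          local name Vintage LP Records Inc
          * list subtraction removes LP from the middle as well as Inc from the end
          local name2 : list name - to_remove
          * displays: Vintage Records
          display "`name2'"
          ```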


          • #6
            Thank you for all the suggestions. Some of you already spotted what I forgot to mention (sorry for that): the code should only remove the legal forms when they appear at the end of the string. E.g. "The Corp Company Inc" should result in "The Corp Company". That's why I use the code in #1 in the first place instead of a simple regex solution.


            • #7
              Code:
              clear
              input str58 Acq_Name
              "Atlas Pipeline Partners LP"   
              "Heartware International Inc"  
              "Thomson Reuters Corp"         
              "Patheon Inc"                  
              "Lions Gate Entertainment Corp"
              "Kodiak Oil & Gas Corp"        
              "SXC Health Solutions Corp"    
              "Catamaran Corp"               
              "Energy Fuels Inc"             
              "Domtar Corp"  
              "Test Company Corp Inc"    
              "Invincible Corp"
              "Extracorporeal Inc" 
              "The Corp Company Inc"
              "Hutton & Partners PLC"    
              end
              
              local not_converged 1
              
              gen wanted = Acq_Name
              gen wanted2 = ""
              while `not_converged' {    
                  // strip one trailing legal form, then a dangling " &" if one is left behind
                  replace wanted2 = trim(ustrregexrf(wanted,"\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group)\b$",""))
                  replace wanted2 = trim(ustrregexrf(wanted2,"( &)$"," "))
                  // keep looping while at least one name still changed in this pass
                  count if wanted2 != wanted
                  local not_converged = (r(N) > 0)
                  replace wanted = wanted2
                  }
              drop wanted2
              which produces:
              Code:
              . list, noobs sep(0)
              
                +----------------------------------------------------------+
                |                      Acq_Name                     wanted |
                |----------------------------------------------------------|
                |    Atlas Pipeline Partners LP             Atlas Pipeline |
                |   Heartware International Inc    Heartware International |
                |          Thomson Reuters Corp            Thomson Reuters |
                |                   Patheon Inc                    Patheon |
                | Lions Gate Entertainment Corp   Lions Gate Entertainment |
                |         Kodiak Oil & Gas Corp           Kodiak Oil & Gas |
                |     SXC Health Solutions Corp       SXC Health Solutions |
                |                Catamaran Corp                  Catamaran |
                |              Energy Fuels Inc               Energy Fuels |
                |                   Domtar Corp                     Domtar |
                |         Test Company Corp Inc               Test Company |
                |               Invincible Corp                 Invincible |
                |            Extracorporeal Inc             Extracorporeal |
                |          The Corp Company Inc           The Corp Company |
                |         Hutton & Partners PLC                     Hutton |
                +----------------------------------------------------------+


              • #8
                A loopless approach.
                Code:
                gen wanted3 = ustrregexra(wanted,"\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group|\&)\b$","")
                replace wanted3 = trim(stritrim(wanted3))
                Code:
                . list, clean noobs
                
                                         Acq_Name                     wanted                    wanted3  
                       Atlas Pipeline Partners LP             Atlas Pipeline             Atlas Pipeline  
                      Heartware International Inc    Heartware International    Heartware International  
                             Thomson Reuters Corp            Thomson Reuters            Thomson Reuters  
                                      Patheon Inc                    Patheon                    Patheon  
                    Lions Gate Entertainment Corp   Lions Gate Entertainment   Lions Gate Entertainment  
                            Kodiak Oil & Gas Corp           Kodiak Oil & Gas           Kodiak Oil & Gas  
                        SXC Health Solutions Corp       SXC Health Solutions       SXC Health Solutions  
                                   Catamaran Corp                  Catamaran                  Catamaran  
                                 Energy Fuels Inc               Energy Fuels               Energy Fuels  
                                      Domtar Corp                     Domtar                     Domtar  
                            Test Company Corp Inc               Test Company               Test Company  
                                  Invincible Corp                 Invincible                 Invincible  
                               Extracorporeal Inc             Extracorporeal             Extracorporeal  
                             The Corp Company Inc           The Corp Company           The Corp Company  
                            Hutton & Partners PLC                     Hutton                     Hutton


                • #9
                  William Lisowski, the function in #8 is operating on wanted, not on Acq_Name. Operating on the original strings gives

                  Code:
                  gen wanted3 = ustrregexra(Acq_Name,"\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group|\&)\b$","")
                  li, noobs sep(0)
                    +--------------------------------------------------------------------------------------+
                    |                      Acq_Name                     wanted                     wanted3 |
                    |--------------------------------------------------------------------------------------|
                    |    Atlas Pipeline Partners LP             Atlas Pipeline    Atlas Pipeline Partners  |
                    |   Heartware International Inc    Heartware International    Heartware International  |
                    |          Thomson Reuters Corp            Thomson Reuters            Thomson Reuters  |
                    |                   Patheon Inc                    Patheon                    Patheon  |
                    | Lions Gate Entertainment Corp   Lions Gate Entertainment   Lions Gate Entertainment  |
                    |         Kodiak Oil & Gas Corp           Kodiak Oil & Gas           Kodiak Oil & Gas  |
                    |     SXC Health Solutions Corp       SXC Health Solutions       SXC Health Solutions  |
                    |                Catamaran Corp                  Catamaran                  Catamaran  |
                    |              Energy Fuels Inc               Energy Fuels               Energy Fuels  |
                    |                   Domtar Corp                     Domtar                     Domtar  |
                    |         Test Company Corp Inc               Test Company          Test Company Corp  |
                    |               Invincible Corp                 Invincible                 Invincible  |
                    |            Extracorporeal Inc             Extracorporeal             Extracorporeal  |
                    |          The Corp Company Inc           The Corp Company           The Corp Company  |
                    |         Hutton & Partners PLC                     Hutton          Hutton & Partners  |
                    +--------------------------------------------------------------------------------------+
                  Using ustrregexra instead of ustrregexrf makes no difference to the result, because the end-of-string anchor $ allows at most one match per string.

                  Would you know how to make recursive substitution work in a regular expression?
                  Last edited by Hemanshu Kumar; 14 Oct 2022, 18:50.
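
                  A quick way to check this on a single string (a sketch with a shortened pattern, not part of the output above): because the pattern is anchored by $, replacing the first match and replacing all matches give the same result, and only the last legal form is removed.

                  ```stata
                  * both calls remove only the final "Inc" and leave "Corp" in place
                  display ustrregexrf("Test Company Corp Inc", "\b(Inc|Corp)\b$", "")
                  display ustrregexra("Test Company Corp Inc", "\b(Inc|Corp)\b$", "")
                  ```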


                  • #10
                    Also, I somehow cannot get the regular expression to work properly on the ampersand:

                    Code:
                    . dis ustrregexra("Heroes &","\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group|\&)\b$","")
                    Heroes &
                    
                    . dis ustrregexra("Heroes &","\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group|&)\b$","")
                    Heroes &
                    Any thoughts?


                    • #11
                      With regard to getting the regular expression to work on the ampersand, some experimentation led me to the following.
                      Code:
                      . dis ustrregexra("Heroes &","&|\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group)\b$","")
                      Heroes
                      but I can't say for certain why that works. My guess is that requiring a word boundary next to a character that is not itself part of a word caused the failure.
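
                      This guess is consistent with the usual definition of \b in ICU-style regular expressions: a word boundary exists only where a word character meets a non-word character or the edge of the string. A trailing ampersand has a space before it and the end of the string after it, so there is no word character on either side and neither \b can match. The sketch below, with made-up strings, illustrates this. It also shows a side effect of the pattern above that may matter: the & alternative is not anchored by $, so ustrregexra removes ampersands anywhere in the name.

                      ```stata
                      * 0: a space and & are both non-word characters, so no boundary before &
                      display ustrregexm("Heroes &", "\b&$")
                      * 0: no boundary between & and the end of the string
                      display ustrregexm("Heroes &", "&\b$")
                      * 1: without \b the trailing & is found
                      display ustrregexm("Heroes &", "&$")
                      * 1: Co consists of word characters, so both boundaries exist
                      display ustrregexm("Heroes Co", "\bCo\b$")

                      * the unanchored & alternative also removes the ampersand in the middle
                      display ustrregexra("Kodiak Oil & Gas Corp", "&|\b(Corp)\b$", "")
                      ```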

                      With regard to the major oversight on my part, for which I apologize, the following provides loopless code matching your output starting with Acq_Name.
                      Code:
                      gen wanted3 = ustrregexra(Acq_Name,"([& ]*|\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group)\b)*$","")
                      replace wanted3 = trim(stritrim(wanted3))
                      Code:
                      . gen wanted3 = ustrregexra(Acq_Name,"([& ]*|\b(Inc|Corp|LP|Ltd|LLC|Holdings|Trust|Partners|PLC|Co|Group)\b)*$","")
                      
                      . replace wanted3 = trim(stritrim(wanted3))
                      (0 real changes made)
                      
                      . 
                      . list, clean noobs
                      
                                               Acq_Name                     wanted                    wanted3  
                             Atlas Pipeline Partners LP             Atlas Pipeline             Atlas Pipeline  
                            Heartware International Inc    Heartware International    Heartware International  
                                   Thomson Reuters Corp            Thomson Reuters            Thomson Reuters  
                                            Patheon Inc                    Patheon                    Patheon  
                          Lions Gate Entertainment Corp   Lions Gate Entertainment   Lions Gate Entertainment  
                                  Kodiak Oil & Gas Corp           Kodiak Oil & Gas           Kodiak Oil & Gas  
                              SXC Health Solutions Corp       SXC Health Solutions       SXC Health Solutions  
                                         Catamaran Corp                  Catamaran                  Catamaran  
                                       Energy Fuels Inc               Energy Fuels               Energy Fuels  
                                            Domtar Corp                     Domtar                     Domtar  
                                  Test Company Corp Inc               Test Company               Test Company  
                                        Invincible Corp                 Invincible                 Invincible  
                                     Extracorporeal Inc             Extracorporeal             Extracorporeal  
                                   The Corp Company Inc           The Corp Company           The Corp Company  
                                  Hutton & Partners PLC                     Hutton                     Hutton  
                      
                      .
                      If I were to express the intent of the regular expression in English, it is to replace, with an empty string, the longest possible terminal string consisting of spaces, ampersands, and members of the list of legal forms to be removed. I think the replace command is no longer needed.
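
                      As a sketch of how this could be kept maintainable, the alternation can be built from a word list like the one in #1, so the legal forms live in one place. The local name alts and the three-form test name are introduced here for illustration and are not from the posts above.

                      ```stata
                      clear
                      input str40 Acq_Name
                      "Hutton & Partners PLC"
                      "The Corp Company Inc"
                      "Test Company Corp & Co Inc"
                      end

                      * legal forms to strip; the ampersand is covered by [& ]* in the pattern
                      local to_remove Inc Corp LP Ltd LLC Holdings Trust Partners PLC Co Group
                      * turn the space-separated list into a regex alternation: Inc|Corp|LP|...
                      local alts : subinstr local to_remove " " "|", all

                      * remove the longest trailing run of spaces, ampersands, and listed words
                      gen wanted3 = ustrregexra(Acq_Name, "([& ]*|\b(`alts')\b)*$", "")
                      list, noobs sep(0)
                      ```

                      One caveat worth checking in real data: a name made up only of listed words (for example, a firm simply called "Trust") would come back as an empty string.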


                      • #12
                        Excellent, thank you! I don't know why in my head I kept thinking this would need recursion.

                        Incidentally, again ustrregexra and ustrregexrf produce the same output here.


                        • #13
                          Thank you very much guys. I really appreciate the effort you put into every single one of your answers. This helps a lot, thank you.
