Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • removing some parts of a string variable

    Hi everyone,

    I have a variable that contains company names. We have multiple variations for a single company since these names are written in different formats. To make them as unique and consistent as possible, I need to remove quotation marks at the beginning and end of the names, as well as terms like "INC," "INC.," "CO," "CO.," "LLC," and "CORP." However, I do not want to remove these terms if they appear in the middle of the company names. Could you help me with how to solve this problem? Is it possible to do that without without using a loop?

    Thanks

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str69 Co_Name str4 Co_Code
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
    `""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
    `""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
    `""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
    `ING PILGRIM INVESTMENTS, INC.'       "PLG"
    `ING PILGRIM INVESTMENTS, INC.'       "PLG"
    `ING PILGRIM INVESTMENTS, INC.'       "PLG"
    `ING PILGRIM INVESTMENTS, INC.'       "PLG"
    `ING PILGRIM INVESTMENTS, INC.'       "PLG"
    `ING PILGRIM INVESTMENTS, INC.'       "PLG"
    `JAMES C. EDWARDS & CO., CORP'        "JCE"
    `JAMES C. EDWARDS & CO., CORP'        "JCE"
    `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
    `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
    `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
    `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
    `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
    `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
    end

  • #2
    Hi Fimi,

    Stata allows regular expressions. Try these:

    Code:
    // remove the " INC." (beginning with a space) in the end of the name string
    gen s1 = ustrregexra(Co_Name, "\x20INC\.$",  "")
    
    // then remove the " INC" in the end of the var s1
    gen s2 = ustrregexra(s1, "\x20INC$",  "")
    
    // or you can combine these...
    gen s3 = ustrregexra(ustrregexra(Co_Name, "\x20INC\.$",  ""), "\x20INC$",  "")

    Comment


    • #3
      Hi Fimi,
      unfortunately, I'm having a bit of trouble reading in your example data correctly.
      For example, most of the input values ​​of Co_name are embedded in compound double quotes AND a single quote, which suggests that your strings actually start and end with quotes. However, this does not apply to lines 43-50 (ING PILGRIM INVESTMENTS, INC). Your sample data will not be read in correctly without adjustments. I'm assuming that these strings also have quotation marks and corrected the input accordingly.
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input str69 Co_Name str4 Co_Code
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
      `""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
      `""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
      `""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
      `""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
      `""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
      `""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
      `""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
      `""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
      `""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
      `""JAMES C. EDWARDS & CO., CORP""'        "JCE"
      `""JAMES C. EDWARDS & CO., CORP""'        "JCE"
      `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
      `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
      `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
      `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
      `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
      `""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
      end
      If I understood you correctly, you want to (1) remove quotation marks at the beginning and at the end and (2) remove everything in the company name like INC LTD PLG etc. Using Xinya Hao's syntax you could now go to the trouble of manually putting all the relevant strings to be removed into a local, write some loop and remove those. However, this would be very error-prone and you would have to repeat this complex process every time the data changes. But there may be an easier way: Looking at it the other way around, you want to keep everything BEFORE the last comma, right?
      If that is true then you could simply:

      Code:
      gen s1 = ustrregexra(Co_Name, "(.*),(.*)$",  "$1")
      replace s1 = subinstr(s1, `"""',"",.)
      The result would be:
      Code:
      duplicates drop
      list
           +---------------------------------------------------------------------------------+
           |                               Co_Name   Co_Code                              s1 |
           |---------------------------------------------------------------------------------|
        1. |     "DEUTSCHE ASSET MANAGEMENT, INC."       MGC       DEUTSCHE ASSET MANAGEMENT |
        2. |    "DUNHILL INVESTMENT ADVISORS, LTD"       DNH     DUNHILL INVESTMENT ADVISORS |
        3. |       "ING PILGRIM INVESTMENTS, INC."       PLG         ING PILGRIM INVESTMENTS |
        4. |        "JAMES C. EDWARDS & CO., CORP"       JCE          JAMES C. EDWARDS & CO. |
        5. | "LOWE, BROCKENBROUGH & COMPANY, INC."       LBT   LOWE, BROCKENBROUGH & COMPANY |
           +---------------------------------------------------------------------------------+
      To be honest, I can't say for sure why only the string before the last comma is selected here; we would probably need an expert in the field of regular expressions. But I added a few extra commas randomly into the data to test the syntax and sure enough the last comma is always selected correctly. So I would probably continue working with it and still critically examine it with the entire data set.


      All the best!
      Last edited by Benno Schoenberger; 19 Sep 2024, 02:15.

      Comment


      • #4
        Thanks Xinya. When I run your command, it just generate 3 new variables which are the same as my original variable, without any change:


        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str69(Co_Name s1 s2 s3)
        `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'
        `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'
        `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'
        `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'


        Comment


        • #5
          Hi Beno,

          Thanks for your suggestion and explanations. The names of companies in my dataset are written in different styles. I am trying to show you some of them in the following table:

          1- Some names are between a quotation at the beginning and a quotation at the end (I don't know why Dataex put an extra quotation. For example, the company in the first row of the table is written "DEUTSCHE ASSET MANAGEMENT, INC." in my dataset. I run your code and it just removed "".
          2- I want to remove all quotation marks whether beginning or end of names.
          3- I also need to remove INC / INC. / , INC. / CORP / CORP. / LLC / , LLC. / L.P. / I II III IV / and any (text) in the end or middle of names.
          4- I want to keep CO if its after & like SELIGMAN, J&W & CO


          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str41 Co_Name str3 Co_Code
          `""DEUTSCHE ASSET MANAGEMENT, INC.""'       "MGC"
          `""ING PILGRIM INVESTMENTS, INC.""'         "PLG"
          `""METAMARKETS INVESTMENTS, LLC""'          "MTM"
          `""SELIGMAN, J&W & CO""'                    "SLG"
          `""SG COWEN ASSET MANAGMENT, INC.""'        "COW"
          `""STEWART, W.P. & CO INC""'                "WPS"
          "1492 Capital Management LLC"               "FTN"
          "1838 INVESTMENT ADVISORS L.P."             "ETE"
          "1ST SOURCE CORPORATION INV ADVISORS INC"   "FSB"
          "2480 SECURITIES LLC"                       ""   
          "ABN AMRO ASSET MANAGEMENT (USA) LLC"       "ABN"
          "AC Funds, LLC"                             "ACU"
          "ACCESSOR CAPITAL MGMT LP"                  "ACS"
          "AETNA LIFE INS & ANNUITY CO"               "AET"
          "AGF Investments LLC"                       "FFM"
          "AIG SUNAMERICA ASSET MANAGEMENT CORP"      "SAM"
          "AIM DISTRIBUTORS INC."                     ""   
          "AIM Growth Series (Invesco Growth Series)" "AIM"
          "ALLEGIANT ASSET MANAGEMENT COMPANY"        "NCC"
          "ALLIANCE CAPITAL MANAGEMENT CORP."         ""   
          "ALLIANCE CAPITAL MGMT LP"                  "ALL"
          "ALPHA ANALYTICS INVESTMENT GROUP LLC"      "AAI"
          "AMERICAN EXPRESS FINANCIAL ADVISORS"       ""   
          "VANTAGEPOINT INVESTMENT ADVISERS LLC"      "VNP"
          "VARIABLE ANNUITY LIFE INS CO"              "VLC"
          "ARTISAN PARTNERS LP"                       "ART"
          "SELECTED/VENTURE ADVISERS L.P."            "SVA"
          "PRUDENTIAL INVESTMENT PORTFOLIOS, INC. 10" "BHB"
          "MFS SERIES TRUST IV"                       "MFS"
          "MFS SERIES TRUST III"                      "MFS"
          "MFS SERIES TRUST X"                        "MFS"
          "MFS SERIES TRUST I"                        "MFS"
          "MFS SERIES TRUST III"                      "MFS"
          end

          Comment


          • #6
            Because your data have quotation mark as the last character of the string. The ustrregexra() function replace the string end with "INC" (no quotation marks). But your string actually end with a quotation mark.

            You can first remove all the quotation marks in the strings then remove the " INC." and "INC" and "CORP" etc in the end of the strings.

            Code:
            // remove all the quotation marks in the Co_Name and generate s1
            gen s1 = subinstr(Co_Name, `"""',"",.)
            
            
            // remove the "CORP" (no quotation marks) in the end of s1. the "$" indicates only matching CORP in the end of the string.
            replace s1 = ustrregexra(s1 , "CORP$", "")

            Comment


            • #7
              Thanks Xinya, It works perfectly.

              Comment


              • #8
                Could I ask you how I can tell stata that keep CO in the end of all names if it has & before it " & CO" but remove "CO" and "CO."

                Comment


                • #9
                  Sure. Here are some example data and codes for you.

                  But my more general advice is:
                  1. Google regular expressions and try to learn and apply them. Regular expressions are useful when dealing with strings, and they also work not only in Stata but also in other languages, ​​such as R or Python.
                  2. The easier way is to ask ChatGPT and let the bot generate the regular expressions you need for you. That usually works.


                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input str9 s
                  `""ABC CO""'
                  `""Firm CO.""' 
                  `""FirmTCO""'  
                  `""AFF4CO""'  
                  `""AFF &CO""'  
                  `""FBC &CO.""' 
                  `""A11& CO""' 
                  `""A12 & CO.""'
                  end
                  
                  
                  
                  * remove all quation marks in s and gen s1
                  gen s1 = subinstr(s, `"""',"",.)
                  
                  * remove if ending with -CO- or -CO.-, but remain unchanged if ending with -& CO- or -&CO- or -& CO.- or -&CO.-
                  replace s1 = ustrregexra(s1, "(?<!&\x20?)CO\.?$", "")
                  // \x20 means a space
                  // ? means the character can appear 0 or multiple times
                  // ! means the logic NOT
                  // (?<!&\x20?) means: matching the strings NOT beginning with "&" or "& " or "&"+ multiple spaces
                  // CO meas CO
                  // . means any character in regular expressions, but \. just means the "."
                  // \.? means 0 or multile dots (in other words, ".")
                  // $ means: matching the strings ending with the given pattern
                  
                  // thus, "(?<!&\x20?)CO\.?$" means...(a combination of the above pattern)
                  // the ustrregexra() function will try to find the given pattern and replace them with "" (noting).
                  
                  * remove the beginning or ending spaces (trim the s1 variable)
                  replace s1 = trim(s1)
                  
                  list *
                  /*
                  
                       +----------------------+
                       |         s         s1 |
                       |----------------------|
                    1. |  "ABC CO"       ABC  |
                    2. | "Firm CO.      Firm  |
                    3. | "FirmTCO"      FirmT |
                    4. |  "AFF4CO"       AFF4 |
                    5. | "AFF &CO"    AFF &CO |
                       |----------------------|
                    6. | "FBC &CO.   FBC &CO. |
                    7. | "A11& CO"    A11& CO |
                    8. | "A12 & CO   A12 & CO |
                       +----------------------+
                  
                  */

                  Comment


                  • #10
                    Thank you so much for your help and very helpful advice.

                    Comment

                    Working...
                    X