removing some parts of a string variable

Fimi Karimi

Join Date: Jun 2023
Posts: 71

removing some parts of a string variable

18 Sep 2024, 22:19

Hi everyone,

I have a variable that contains company names. We have multiple variations for a single company since these names are written in different formats. To make them as unique and consistent as possible, I need to remove quotation marks at the beginning and end of the names, as well as terms like "INC," "INC.," "CO," "CO.," "LLC," and "CORP." However, I do not want to remove these terms if they appear in the middle of the company names. Could you help me with how to solve this problem? Is it possible to do that without without using a loop?

Thanks

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str69 Co_Name str4 Co_Code
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
`""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
`""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
`ING PILGRIM INVESTMENTS, INC.'       "PLG"
`ING PILGRIM INVESTMENTS, INC.'       "PLG"
`ING PILGRIM INVESTMENTS, INC.'       "PLG"
`ING PILGRIM INVESTMENTS, INC.'       "PLG"
`ING PILGRIM INVESTMENTS, INC.'       "PLG"
`ING PILGRIM INVESTMENTS, INC.'       "PLG"
`JAMES C. EDWARDS & CO., CORP'        "JCE"
`JAMES C. EDWARDS & CO., CORP'        "JCE"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
end

Tags: None

Xinya Hao

Join Date: Sep 2022
Posts: 48

18 Sep 2024, 23:11

Hi Fimi,

Stata allows regular expressions. Try these:

Code:

// remove the " INC." (beginning with a space) in the end of the name string
gen s1 = ustrregexra(Co_Name, "\x20INC\.$",  "")

// then remove the " INC" in the end of the var s1
gen s2 = ustrregexra(s1, "\x20INC$",  "")

// or you can combine these...
gen s3 = ustrregexra(ustrregexra(Co_Name, "\x20INC\.$",  ""), "\x20INC$",  "")

Comment

Benno Schoenberger

Join Date: Apr 2024
Posts: 61

19 Sep 2024, 02:08

Hi Fimi,
unfortunately, I'm having a bit of trouble reading in your example data correctly.
For example, most of the input values of Co_name are embedded in compound double quotes AND a single quote, which suggests that your strings actually start and end with quotes. However, this does not apply to lines 43-50 (ING PILGRIM INVESTMENTS, INC). Your sample data will not be read in correctly without adjustments. I'm assuming that these strings also have quotation marks and corrected the input accordingly.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str69 Co_Name str4 Co_Code
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DEUTSCHE ASSET MANAGEMENT, INC.""'     "MGC"
`""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
`""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
`""DUNHILL INVESTMENT ADVISORS, LTD""'    "DNH"
`""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
`""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
`""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
`""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
`""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
`""ING PILGRIM INVESTMENTS, INC.""'       "PLG"
`""JAMES C. EDWARDS & CO., CORP""'        "JCE"
`""JAMES C. EDWARDS & CO., CORP""'        "JCE"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
`""LOWE, BROCKENBROUGH & COMPANY, INC.""' "LBT"
end

If I understood you correctly, you want to (1) remove quotation marks at the beginning and at the end and (2) remove everything in the company name like INC LTD PLG etc. Using Xinya Hao's syntax you could now go to the trouble of manually putting all the relevant strings to be removed into a local, write some loop and remove those. However, this would be very error-prone and you would have to repeat this complex process every time the data changes. But there may be an easier way: Looking at it the other way around, you want to keep everything BEFORE the last comma, right?
If that is true then you could simply:

Code:

gen s1 = ustrregexra(Co_Name, "(.*),(.*)$",  "$1")
replace s1 = subinstr(s1, `"""',"",.)

The result would be:

Code:

duplicates drop
list
     +---------------------------------------------------------------------------------+
     |                               Co_Name   Co_Code                              s1 |
     |---------------------------------------------------------------------------------|
  1. |     "DEUTSCHE ASSET MANAGEMENT, INC."       MGC       DEUTSCHE ASSET MANAGEMENT |
  2. |    "DUNHILL INVESTMENT ADVISORS, LTD"       DNH     DUNHILL INVESTMENT ADVISORS |
  3. |       "ING PILGRIM INVESTMENTS, INC."       PLG         ING PILGRIM INVESTMENTS |
  4. |        "JAMES C. EDWARDS & CO., CORP"       JCE          JAMES C. EDWARDS & CO. |
  5. | "LOWE, BROCKENBROUGH & COMPANY, INC."       LBT   LOWE, BROCKENBROUGH & COMPANY |
     +---------------------------------------------------------------------------------+

To be honest, I can't say for sure why only the string before the last comma is selected here; we would probably need an expert in the field of regular expressions. But I added a few extra commas randomly into the data to test the syntax and sure enough the last comma is always selected correctly. So I would probably continue working with it and still critically examine it with the entire data set.

All the best!

Last edited by Benno Schoenberger; 19 Sep 2024, 02:15.

Comment

Fimi Karimi

Join Date: Jun 2023

Posts: 71
#4

19 Sep 2024, 09:13

Thanks Xinya. When I run your command, it just generate 3 new variables which are the same as my original variable, without any change:

* Example generated by -dataex-. For more info, type help dataex
clear
input str69(Co_Name s1 s2 s3)
`""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'
`""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'
`""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'
`""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""' `""DEUTSCHE ASSET MANAGEMENT, INC.""'
Comment

Fimi Karimi

Join Date: Jun 2023
Posts: 71

19 Sep 2024, 09:40

Hi Beno,

Thanks for your suggestion and explanations. The names of companies in my dataset are written in different styles. I am trying to show you some of them in the following table:

1- Some names are between a quotation at the beginning and a quotation at the end (I don't know why Dataex put an extra quotation. For example, the company in the first row of the table is written "DEUTSCHE ASSET MANAGEMENT, INC." in my dataset. I run your code and it just removed "".
2- I want to remove all quotation marks whether beginning or end of names.
3- I also need to remove INC / INC. / , INC. / CORP / CORP. / LLC / , LLC. / L.P. / I II III IV / and any (text) in the end or middle of names.
4- I want to keep CO if its after & like SELIGMAN, J&W & CO

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str41 Co_Name str3 Co_Code
`""DEUTSCHE ASSET MANAGEMENT, INC.""'       "MGC"
`""ING PILGRIM INVESTMENTS, INC.""'         "PLG"
`""METAMARKETS INVESTMENTS, LLC""'          "MTM"
`""SELIGMAN, J&W & CO""'                    "SLG"
`""SG COWEN ASSET MANAGMENT, INC.""'        "COW"
`""STEWART, W.P. & CO INC""'                "WPS"
"1492 Capital Management LLC"               "FTN"
"1838 INVESTMENT ADVISORS L.P."             "ETE"
"1ST SOURCE CORPORATION INV ADVISORS INC"   "FSB"
"2480 SECURITIES LLC"                       ""   
"ABN AMRO ASSET MANAGEMENT (USA) LLC"       "ABN"
"AC Funds, LLC"                             "ACU"
"ACCESSOR CAPITAL MGMT LP"                  "ACS"
"AETNA LIFE INS & ANNUITY CO"               "AET"
"AGF Investments LLC"                       "FFM"
"AIG SUNAMERICA ASSET MANAGEMENT CORP"      "SAM"
"AIM DISTRIBUTORS INC."                     ""   
"AIM Growth Series (Invesco Growth Series)" "AIM"
"ALLEGIANT ASSET MANAGEMENT COMPANY"        "NCC"
"ALLIANCE CAPITAL MANAGEMENT CORP."         ""   
"ALLIANCE CAPITAL MGMT LP"                  "ALL"
"ALPHA ANALYTICS INVESTMENT GROUP LLC"      "AAI"
"AMERICAN EXPRESS FINANCIAL ADVISORS"       ""   
"VANTAGEPOINT INVESTMENT ADVISERS LLC"      "VNP"
"VARIABLE ANNUITY LIFE INS CO"              "VLC"
"ARTISAN PARTNERS LP"                       "ART"
"SELECTED/VENTURE ADVISERS L.P."            "SVA"
"PRUDENTIAL INVESTMENT PORTFOLIOS, INC. 10" "BHB"
"MFS SERIES TRUST IV"                       "MFS"
"MFS SERIES TRUST III"                      "MFS"
"MFS SERIES TRUST X"                        "MFS"
"MFS SERIES TRUST I"                        "MFS"
"MFS SERIES TRUST III"                      "MFS"
end

Comment

Xinya Hao

Join Date: Sep 2022

Posts: 48
#6

19 Sep 2024, 09:57

Because your data have quotation mark as the last character of the string. The ustrregexra() function replace the string end with "INC" (no quotation marks). But your string actually end with a quotation mark.

You can first remove all the quotation marks in the strings then remove the " INC." and "INC" and "CORP" etc in the end of the strings.

Code:

// remove all the quotation marks in the Co_Name and generate s1 gen s1 = subinstr(Co_Name, `"""',"",.) // remove the "CORP" (no quotation marks) in the end of s1. the "$" indicates only matching CORP in the end of the string. replace s1 = ustrregexra(s1 , "CORP$", "")
1 like
Comment
Fimi Karimi

Join Date: Jun 2023

Posts: 71
#7

19 Sep 2024, 11:30

Thanks Xinya, It works perfectly.
Comment
Fimi Karimi

Join Date: Jun 2023

Posts: 71
#8

19 Sep 2024, 13:24

Could I ask you how I can tell stata that keep CO in the end of all names if it has & before it " & CO" but remove "CO" and "CO."
Comment

Xinya Hao

Join Date: Sep 2022
Posts: 48

19 Sep 2024, 19:01

Sure. Here are some example data and codes for you.

But my more general advice is:
1. Google regular expressions and try to learn and apply them. Regular expressions are useful when dealing with strings, and they also work not only in Stata but also in other languages, such as R or Python.
2. The easier way is to ask ChatGPT and let the bot generate the regular expressions you need for you. That usually works.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str9 s
`""ABC CO""'
`""Firm CO.""' 
`""FirmTCO""'  
`""AFF4CO""'  
`""AFF &CO""'  
`""FBC &CO.""' 
`""A11& CO""' 
`""A12 & CO.""'
end



* remove all quation marks in s and gen s1
gen s1 = subinstr(s, `"""',"",.)

* remove if ending with -CO- or -CO.-, but remain unchanged if ending with -& CO- or -&CO- or -& CO.- or -&CO.-
replace s1 = ustrregexra(s1, "(?<!&\x20?)CO\.?$", "")
// \x20 means a space
// ? means the character can appear 0 or multiple times
// ! means the logic NOT
// (?<!&\x20?) means: matching the strings NOT beginning with "&" or "& " or "&"+ multiple spaces
// CO meas CO
// . means any character in regular expressions, but \. just means the "."
// \.? means 0 or multile dots (in other words, ".")
// $ means: matching the strings ending with the given pattern

// thus, "(?<!&\x20?)CO\.?$" means...(a combination of the above pattern)
// the ustrregexra() function will try to find the given pattern and replace them with "" (noting).

* remove the beginning or ending spaces (trim the s1 variable)
replace s1 = trim(s1)

list *
/*

     +----------------------+
     |         s         s1 |
     |----------------------|
  1. |  "ABC CO"       ABC  |
  2. | "Firm CO.      Firm  |
  3. | "FirmTCO"      FirmT |
  4. |  "AFF4CO"       AFF4 |
  5. | "AFF &CO"    AFF &CO |
     |----------------------|
  6. | "FBC &CO.   FBC &CO. |
  7. | "A11& CO"    A11& CO |
  8. | "A12 & CO   A12 & CO |
     +----------------------+

*/

Comment

Fimi Karimi

Join Date: Jun 2023

Posts: 71
#10

21 Sep 2024, 07:52

Thank you so much for your help and very helpful advice.
Comment

Announcement

removing some parts of a string variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment