Help for replacing parts of a string variable

Elio Bolliger

Join Date: Aug 2017
Posts: 9

Help for replacing parts of a string variable

24 Mar 2020, 08:04

Dear community,

I want to clean a string variable that contains the name of a bank. Sometimes, it also includes some numeric content (a percentage indication) as the example from dataex shows below:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str79 Lender str20 LenderCountry_clean
"ABN AMRO BIBF"                 ""          
"ABN AMRO BIBF"                 ""          
"ABN AMRO BIBF"                 ""          
"ABN AMRO BIBF"                 ""          
"ABN AMRO BIBF"                 "Thailand"  
"ABN AMRO BIBF 10.00%"          ""          
"ABN AMRO BIBF 5.40%"           ""          
"ABN AMRO BIBF 6.00%"           ""          
"ABN AMRO BIBF 6.13%"           ""          
"ABN AMRO BIBF 6.29%"           ""          
"ABN AMRO BIBF 7.42%"           ""          
"ABN AMRO Bank Bhd"             ""          
"ABN AMRO Bank Bhd"             ""          
"ABN AMRO Bank Bhd"             ""          
"ABN AMRO Bank Bhd"             ""          
"ABN AMRO Bank Bhd"             ""          
"ABN AMRO Bank Bhd"             ""          
"ABN AMRO Bank Bhd"             ""          
"ABN AMRO Bank Bhd 1.50%"       ""          
"ABN AMRO Bank Bhd 1.50%"       ""          
"ABN AMRO Bank NV"              ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""            
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        ""          
"ABN AMRO Bank NV [RBS]"        "Netherlands"
"ABN AMRO Bank NV [RBS]"        "Netherlands"
"ABN AMRO Bank NV [RBS]"        "Netherlands"
"ABN AMRO Bank NV [RBS] 1.10%"  ""          
"ABN AMRO Bank NV [RBS] 20.00%" ""          
"ABN AMRO Bank NV [RBS] 32.62%" ""          
"ABN AMRO Bank NV [RBS] 5.00%"  ""          
"ABN AMRO Bank NV [RBS] 6.67%"  ""          
"ABN AMRO Inc"                  ""          
"ABN AMRO Inc"                  ""          
"ANZ Grindlays Bank Plc"        ""          
end

My objective is to complete the observations of the variable LenderCountry_clean if the bank name is the same. Now, what I did so far was to use the following command to fill up the observations:

Code:

bysort Lender (LenderCountry_clean) : replace LenderCountry_clean = LenderCountry_clean[_N] if missing(LenderCountry_clean) & _n < _N

However, the problem are now the observations that contain some values (percentages), i.e. 1.10%, 20.00%. In this case, the bank name is not exactly the same and I cannot fill up the observations as I would like to. For example, for the lender "ABN AMRO Bank NV [RBS] 1.10%" I would like to have the information "Netherlands" for the variable LenderCountry_clean.

Hence, my question is if there is a way to "clean" the variable Lender to get rid of the percentages. Note that these percentage indications, if there are any, are always at the end of the string. Is there a way to clean the variable Lender such that I can fill up the observations with the information from LenderCountry_clean ?

I greatly appreciate any help or comments and thank you in advance.

Best wishes,
Elio

Tags: data, data matching, string

Clyde Schechter

Join Date: Apr 2014

Posts: 28542
#2

24 Mar 2020, 09:55

There is little doubt in my mind that this can be done as a one-liner by using regular expressions. But regular expressions syntax has always eluded me. So here's a four-line solution using "ordinary" Stata commands.

Code:

gen to_erase = word(reverse(Lender), 1) replace to_erase = "" if substr(to_erase, 1, 1) != "%" replace Lender = subinstr(Lender, reverse(to_erase), "", .) if !missing(to_erase) replace Lender = trim(Lender)
1 like
Comment
Elio Bolliger

Join Date: Aug 2017

Posts: 9
#3

24 Mar 2020, 10:53

Thank you very much, this works beautifully! 4 lines or one line, what works is best. Thank you.
Comment

Announcement

Help for replacing parts of a string variable

Comment

Comment