Replacing variable value if it starts with certain number

Abdan Syakura

Join Date: Nov 2018
Posts: 58

Replacing variable value if it starts with certain number

18 Aug 2021, 17:40

Dear all,

Please kindly help. I would like to replace the value of variable

Code:

sector

with 1 if the value of variable

Code:

FTAG5

(8 digit) starts with 45, 6010, or 502060.

After browsing the forum, I think I can do it by using the following code:

Code:

replace sector = 1 if substr(FTAG5, 1, 2) == "45"
replace sector = 1 if substr(FTAG5, 1, 4) == "6010"
replace sector = 1 if substr(FTAG5, 1, 6) == "502060"

Is there any way to make the code more concise? Here is the sample data. Thank you for your help.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long firm float year str8 FTAG5 float sector
1 1990 "45102010" .
1 1991 "45102010" .
1 1992 "45102010" .
1 1993 "45102010" .
1 1994 "45102010" .
1 1995 "45102010" .
1 1996 "45102010" .
1 1997 "45102010" .
1 1998 "45102010" .
1 1999 "45102010" .
1 2000 "45102010" .
1 2001 "45102010" .
1 2002 "45102010" .
1 2003 "45102010" .
1 2004 "45102010" .
1 2005 "45102010" .
1 2006 "45102010" .
1 2007 "45102010" .
1 2008 "45102010" .
1 2009 "45102010" .
1 2010 "45102010" .
1 2011 "45102010" .
1 2012 "45102010" .
1 2013 "45102010" .
1 2014 "45102010" .
1 2015 "45102010" .
1 2016 "45102010" .
1 2017 "45102010" .
1 2018 "45102010" .
1 2019 "45102010" .
1 2020 "45102010" .
2 1990 "45101020" .
2 1991 "45101020" .
2 1992 "45101020" .
2 1993 "45101020" .
2 1994 "45101020" .
2 1995 "45101020" .
2 1996 "45101020" .
2 1997 "45101020" .
2 1998 "45101020" .
2 1999 "45101020" .
2 2000 "45101020" .
2 2001 "45101020" .
2 2002 "45101020" .
2 2003 "45101020" .
2 2004 "45101020" .
2 2005 "45101020" .
2 2006 "45101020" .
2 2007 "45101020" .
2 2008 "45101020" .
2 2009 "45101020" .
2 2010 "45101020" .
2 2011 "45101020" .
2 2012 "45101020" .
2 2013 "45101020" .
2 2014 "45101020" .
2 2015 "45101020" .
2 2016 "45101020" .
2 2017 "45101020" .
2 2018 "45101020" .
2 2019 "45101020" .
2 2020 "45101020" .
3 1990 "60101040" .
3 1991 "60101040" .
3 1992 "60101040" .
3 1993 "60101040" .
3 1994 "60101040" .
3 1995 "60101040" .
3 1996 "60101040" .
3 1997 "60101040" .
3 1998 "60101040" .
3 1999 "60101040" .
3 2000 "60101040" .
3 2001 "60101040" .
3 2002 "60101040" .
3 2003 "60101040" .
3 2004 "60101040" .
3 2005 "60101040" .
3 2006 "60101040" .
3 2007 "60101040" .
3 2008 "60101040" .
3 2009 "60101040" .
3 2010 "60101040" .
3 2011 "60101040" .
3 2012 "60101040" .
3 2013 "60101040" .
3 2014 "60101040" .
3 2015 "60101040" .
3 2016 "60101040" .
3 2017 "60101040" .
3 2018 "60101040" .
3 2019 "60101040" .
3 2020 "60101040" .
end
label values firm firm
label def firm 1 "ID:AAL", modify
label def firm 2 "ID:AAP", modify
label def firm 3 "ID:ABM", modify

Best,

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

18 Aug 2021, 18:37

The magic of "regular expressions", a feature of may popular programming languages, including Stata.

Code:

replace sector = 1 if ustrregexm(FTAG5,"^(45|6010|502060)")

The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statlist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at https://unicode-org.github.io/icu/us...gs/regexp.html. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35720
#3

19 Aug 2021, 05:10

William Lisowski gave an excellent answer, but for completeness let's just mention

Code:

replace sector = 1 if substr(FTAG5, 1, 2) == "45" | substr(FTAG5, 1, 4) == "6010" | substr(FTAG5, 1, 6) == "502060"

I have a suspicion that this followed

Code:

gen sector = 0

as code to create an indicator variable (some say "dummy variable", wording better avoided in my view). If so, know that

Code:

gen sector = substr(FTAG5, 1, 2) == "45" | substr(FTAG5, 1, 4) == "6010" | substr(FTAG5, 1, 6) == "502060"

gets you there in one, and there is naturally an equivalent one-liner from William's code

Code:

gen sector = ustrregexm(FTAG5,"^(45|6010|502060)")

More at https://www.stata-journal.com/articl...article=dm0099
1 like
Comment
Abdan Syakura

Join Date: Nov 2018

Posts: 58
#4

20 Aug 2021, 12:37

Thank you for the comprehensive answer, William and Nick!

Out of curiosity, actually I have 6 codes (e.g. 5510, 60, 45102030, 45102035, etc.) that belong to category 1. But another 25 codes for each category 2 and 3. Currently I am inputing it manually to William's one line code above, e.g. for category 1:

Code:

replace sector = 1 if ustrregexm(FTAG5,"^(5510|60|45102030|45102035|45102010|451030)")

Is there a smarter way to input it, e.g. by inserting excel file maybe, to reduce human error in entering the 25 codes? Thank you very much.

Best regards,

Last edited by Abdan Syakura; 20 Aug 2021, 12:51.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35720
#5

21 Aug 2021, 03:42

I don't know what you mean exactly by an Excel file. See https://www.stata.com/support/faqs/d...s-for-subsets/ for an approach when there are many possible codes.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

21 Aug 2021, 07:07

To make three lines of code concise, as post #1 defined the problem, suggests a one-line solution is what is sought.

To make 25 lines of code concise, as post #4 now redefines the problem, suggests that a readable four-line solution is appropriately concise.

Code:

local list 45 6010 502060 foreach s of local list { replace sector = 1 if substr(FTAG5, 1, length("`s'")) == "`s'" }

I think this strikes a decent balance of concision, readability and maintainability, since the list of codes are included in the do-file directly rather than in a separate file that needs to be kept track of.

I like to keep the list of values on their own command rather than list 25 values on the foreach command.
1 like
Comment

Announcement

Replacing variable value if it starts with certain number

Comment

Comment

Comment

Comment

Comment