Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing variable value if it starts with certain number

    Dear all,

    Please kindly help. I would like to replace the value of variable
    Code:
    sector
    with 1 if the value of variable
    Code:
    FTAG5
    (8 digit) starts with 45, 6010, or 502060.

    After browsing the forum, I think I can do it by using the following code:

    Code:
    replace sector = 1 if substr(FTAG5, 1, 2) == "45"
    replace sector = 1 if substr(FTAG5, 1, 4) == "6010"
    replace sector = 1 if substr(FTAG5, 1, 6) == "502060"

    Is there any way to make the code more concise? Here is the sample data. Thank you for your help.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long firm float year str8 FTAG5 float sector
    1 1990 "45102010" .
    1 1991 "45102010" .
    1 1992 "45102010" .
    1 1993 "45102010" .
    1 1994 "45102010" .
    1 1995 "45102010" .
    1 1996 "45102010" .
    1 1997 "45102010" .
    1 1998 "45102010" .
    1 1999 "45102010" .
    1 2000 "45102010" .
    1 2001 "45102010" .
    1 2002 "45102010" .
    1 2003 "45102010" .
    1 2004 "45102010" .
    1 2005 "45102010" .
    1 2006 "45102010" .
    1 2007 "45102010" .
    1 2008 "45102010" .
    1 2009 "45102010" .
    1 2010 "45102010" .
    1 2011 "45102010" .
    1 2012 "45102010" .
    1 2013 "45102010" .
    1 2014 "45102010" .
    1 2015 "45102010" .
    1 2016 "45102010" .
    1 2017 "45102010" .
    1 2018 "45102010" .
    1 2019 "45102010" .
    1 2020 "45102010" .
    2 1990 "45101020" .
    2 1991 "45101020" .
    2 1992 "45101020" .
    2 1993 "45101020" .
    2 1994 "45101020" .
    2 1995 "45101020" .
    2 1996 "45101020" .
    2 1997 "45101020" .
    2 1998 "45101020" .
    2 1999 "45101020" .
    2 2000 "45101020" .
    2 2001 "45101020" .
    2 2002 "45101020" .
    2 2003 "45101020" .
    2 2004 "45101020" .
    2 2005 "45101020" .
    2 2006 "45101020" .
    2 2007 "45101020" .
    2 2008 "45101020" .
    2 2009 "45101020" .
    2 2010 "45101020" .
    2 2011 "45101020" .
    2 2012 "45101020" .
    2 2013 "45101020" .
    2 2014 "45101020" .
    2 2015 "45101020" .
    2 2016 "45101020" .
    2 2017 "45101020" .
    2 2018 "45101020" .
    2 2019 "45101020" .
    2 2020 "45101020" .
    3 1990 "60101040" .
    3 1991 "60101040" .
    3 1992 "60101040" .
    3 1993 "60101040" .
    3 1994 "60101040" .
    3 1995 "60101040" .
    3 1996 "60101040" .
    3 1997 "60101040" .
    3 1998 "60101040" .
    3 1999 "60101040" .
    3 2000 "60101040" .
    3 2001 "60101040" .
    3 2002 "60101040" .
    3 2003 "60101040" .
    3 2004 "60101040" .
    3 2005 "60101040" .
    3 2006 "60101040" .
    3 2007 "60101040" .
    3 2008 "60101040" .
    3 2009 "60101040" .
    3 2010 "60101040" .
    3 2011 "60101040" .
    3 2012 "60101040" .
    3 2013 "60101040" .
    3 2014 "60101040" .
    3 2015 "60101040" .
    3 2016 "60101040" .
    3 2017 "60101040" .
    3 2018 "60101040" .
    3 2019 "60101040" .
    3 2020 "60101040" .
    end
    label values firm firm
    label def firm 1 "ID:AAL", modify
    label def firm 2 "ID:AAP", modify
    label def firm 3 "ID:ABM", modify

    Best,


  • #2
    The magic of "regular expressions", a feature of may popular programming languages, including Stata.
    Code:
    replace sector = 1 if ustrregexm(FTAG5,"^(45|6010|502060)")
    The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statlist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at https://unicode-org.github.io/icu/us...gs/regexp.html. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

    Comment


    • #3
      William Lisowski gave an excellent answer, but for completeness let's just mention


      Code:
       
       replace sector = 1 if substr(FTAG5, 1, 2) == "45" | substr(FTAG5, 1, 4) == "6010" | substr(FTAG5, 1, 6) == "502060"
      I have a suspicion that this followed

      Code:
      gen sector = 0
      as code to create an indicator variable (some say "dummy variable", wording better avoided in my view). If so, know that


      Code:
       
       gen sector = substr(FTAG5, 1, 2) == "45" | substr(FTAG5, 1, 4) == "6010" | substr(FTAG5, 1, 6) == "502060"
      gets you there in one, and there is naturally an equivalent one-liner from William's code

      Code:
       
       gen sector = ustrregexm(FTAG5,"^(45|6010|502060)")

      More at https://www.stata-journal.com/articl...article=dm0099







      Comment


      • #4
        Thank you for the comprehensive answer, William and Nick!

        Out of curiosity, actually I have 6 codes (e.g. 5510, 60, 45102030, 45102035, etc.) that belong to category 1. But another 25 codes for each category 2 and 3. Currently I am inputing it manually to William's one line code above, e.g. for category 1:

        Code:
        replace sector = 1 if ustrregexm(FTAG5,"^(5510|60|45102030|45102035|45102010|451030)")
        Is there a smarter way to input it, e.g. by inserting excel file maybe, to reduce human error in entering the 25 codes? Thank you very much.


        Best regards,
        Last edited by Abdan Syakura; 20 Aug 2021, 12:51.

        Comment


        • #5
          I don't know what you mean exactly by an Excel file. See https://www.stata.com/support/faqs/d...s-for-subsets/ for an approach when there are many possible codes.

          Comment


          • #6
            To make three lines of code concise, as post #1 defined the problem, suggests a one-line solution is what is sought.

            To make 25 lines of code concise, as post #4 now redefines the problem, suggests that a readable four-line solution is appropriately concise.
            Code:
            local list 45 6010 502060
            foreach s of local list {
                replace sector = 1 if substr(FTAG5, 1, length("`s'")) == "`s'"
            }
            I think this strikes a decent balance of concision, readability and maintainability, since the list of codes are included in the do-file directly rather than in a separate file that needs to be kept track of.

            I like to keep the list of values on their own command rather than list 25 values on the foreach command.

            Comment

            Working...
            X