Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Remove part of string variable if it contains a word

    Dear Statalist members,

    I am trying to remove a part of my name variable if it contains 'DEAD'. This is an example of my data:

    name
    AAH DEAD
    AB AIRLINES DEAD
    AB CONSULTING
    AB DYNAMICS
    AB ELTN.PRDS.
    ABACUS GROUP DEAD

    Desired result:
    name
    AAH
    AB AIRLINES
    AB CONSULTING
    AB DYNAMICS
    AB ELTN.PRDS.
    ABACUS GROUP

    Essentially, I need a code that removes ' DEAD' if an observation contains it. I have tried the code below, but that removes the entire name of the company and results in empty observations for companies that had 'DEAD' in their names.

    replace name = substr(name, 1, strpos(name,"DEAD")-1)


    Thank you for your help!

  • #2
    If you have not already, try looking at the entries in -help string functions- to learn about the various functions that would help with problems related to strings For this question, you will find -subinstr()- to be useful:
    Code:
    replace name = subinstr(name,"DEAD", .)

    Comment


    • #3
      Code:
      replace name = subinstr(name, " DEAD", "", .)
      Note: This will remove " DEAD" wherever it occurs--even if that's not the end of the name. Also, are you sure that all values of name are upper case in your data? If a string contains " dead" this code will fail.

      The reason your code removes all the strings that do not contain "DEAD" is that, when "DEAD" does not appear in name, strpos(name, "DEAD") is 0. The code then replaces name by the substring of name beginning at character 1 with length 0-1. A string with negative length is an empty string!

      Added: Crossed with #2, which, I believe, has an error because it omits the "" argument.

      Comment


      • #4
        Entering this yields an 'Invalid syntax r(198)'

        However, if I use

        Code:
        replace name = subinstr(name,"DEAD","",.)
        it works, but leaves a spacebar after the company name (e.g. 'AAH ' instead of 'AAH'). How would I go about taking care of this spacebar?

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          Code:
          replace name = subinstr(name, " DEAD", "", .)
          Note: This will remove " DEAD" wherever it occurs--even if that's not the end of the name. Also, are you sure that all values of name are upper case in your data? If a string contains " dead" this code will fail.

          The reason your code removes all the strings that do not contain "DEAD" is that, when "DEAD" does not appear in name, strpos(name, "DEAD") is 0. The code then replaces name by the substring of name beginning at character 1 with length 0-1. A string with negative length is an empty string!

          Added: Crossed with #2, which, I believe, has an error because it omits the "" argument.
          Dear Clyde,

          This worked! However, in order to be safe, the 'DEAD' part should only be removed at the end of the name. Is that possible too? I.e. that only the last 5 characters are removed (spacebar+DEAD) if spacebar+DEAD are the last 5 characters.

          Comment


          • #6
            So, to be fully general, we need the code to remove " DEAD" when it appears at the end of the string, and, I presume, if a string contains DEAD both at the end and somewhere else, it should remove only the one at the end. That requires a few little tricks:

            Code:
            clear
            input str18 name
            "AAH DEAD"        
            "AB AIRLINES DEAD"
            "AB CONSULTING"    
            "AB DYNAMICS"      
            "AB ELTN.PRDS."    
            "ABACUS GROUP DEAD"
            "GRATEFUL DEAD INC"
            "ONE DEAD1 TWO DEAD"
            end
            
            replace name = reverse(subinstr(reverse(name), "DAED ", "", 1)) ///
                if substr(name, length(name)-4, 5) == " DEAD"
            
            list, noobs clean
            Notice by the way that "DAED " in the -replace name =...- line is not a typo. It is " DEAD" typed backwards, which is exactly what is needed here.

            By the way, in the future, please post example data using the -dataex- command. While what you have shown was easy enough to import to Stata, more complicated example data sets would have been troublesome. You can get the -dataex- command by running -ssc install dataex- and then you can run -help dataex- to see the instructions for using it. When you use -dataex-, you make it possible for those who want to help you to create a completely faithful replica of your Stata example with a simple copy/paste operation.

            Added: There is probably a really short simple way to do this with regular expressions, but I always struggle with them, perhaps because I use them too infrequently to ever really internalize their syntax.
            Last edited by Clyde Schechter; 15 Jun 2017, 09:36.

            Comment


            • #7
              Rising to Clyde's challenge, but only for expository purposes, and to demonstrate technique that removes any excess blanks before the trailing "DEAD".
              Code:
              clear
              input str18 name
              "AAH DEAD"        
              "AB AIRLINES DEAD"
              "AB CONSULTING"    
              "AB DYNAMICS"      
              "AB ELTN.PRDS."    
              "ABACUS GROUP DEAD"
              "GRATEFUL DEAD INC"
              "ONE DEAD1 TWO DEAD"
              "THE UNDEAD"
              "TWO BLANKS  DEAD"
              end
              
              clonevar rename = name
              replace name = reverse(subinstr(reverse(name), "DAED ", "", 1)) ///
                  if substr(name, length(name)-4, 5) == " DEAD"
              replace rename = regexr(rename," +DEAD$","")
              
              list, noobs clean
              Code:
              . list, noobs clean
              
                               name              rename  
                                AAH                 AAH  
                        AB AIRLINES         AB AIRLINES  
                      AB CONSULTING       AB CONSULTING  
                        AB DYNAMICS         AB DYNAMICS  
                      AB ELTN.PRDS.       AB ELTN.PRDS.  
                       ABACUS GROUP        ABACUS GROUP  
                  GRATEFUL DEAD INC   GRATEFUL DEAD INC  
                      ONE DEAD1 TWO       ONE DEAD1 TWO  
                         THE UNDEAD          THE UNDEAD  
                        TWO BLANKS           TWO BLANKS

              Comment


              • #8
                Hi!
                What about if you have a list of words you want to remove?
                So instead of just having "DEAD" you also want to remove "ALIVE" and "DON'T KNOW".
                For example:
                Code:
                name
                AAH DEAD
                AB AIRLINES DEAD
                AB CONSULTING ALIVE
                AB DYNAMICS DON'T KNOW
                AB ELTN.PRDS.
                ABACUS GROUP DEAD
                I have a long list of string i would like to remove so i'm trying to use a foreach loop but haven't had luck yet.
                So far my code looks something like this
                Code:
                foreach scenario in ALIVE DEAD "DON'T KNOW"{    
                    replace name="`scenario'" if strpos(name, "`scenario'") >0    
                   }
                My problem here is that code will leave "AAH DEAD" unchanged since it doesn't contain DON'T KNOW (last loop), and so ignores the replace done in the previous loops.
                Thanks in advance!

                Comment


                • #9
                  I don't believe your loop does what you say it does, but it also doesn't do what you want. Try this:

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str22 name
                  "AAH DEAD"              
                  "AB AIRLINES DEAD"      
                  "AB CONSULTING ALIVE"   
                  "AB DYNAMICS DON'T KNOW"
                  "AB ELTN.PRDS."         
                  "ABACUS GROUP DEAD"     
                  end
                  
                  foreach scenario in DEAD ALIVE "DON'T KNOW"{
                      local target = reverse(`"`scenario'"')
                      local length = length(`"`target'"')
                      replace name = reverse(subinstr(reverse(name), "`target' ", "", 1)) ///
                          if substr(name, length(name)-`length', `=`length'+1') == `" `scenario'"'
                  }
                  In the future, please use the -dataex- command to show example data, as I have done here. It greatly simplifies the process of replicating your Stata example in another person's Stata, so that code can be tested on it. If you are running version 15.1, -dataex- is part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it.

                  Comment


                  • #10
                    Hi all,

                    A follow-up question to this old thread, which I found very helpful: What about if you have a list of words you want to keep, rather than drop?

                    My data consists in a list of different financial instruments:

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input str28 instrument
                    "4 % cons. cum. stock"              
                    "6 1/2 % Electric Power debs."      
                    "6 % debs."    
                    "5 1/4 % stock"      
                    "5 % stock org. cons."    
                    "5 % stock (1923)"    
                    "5 1/4 % stock"          
                    end
                    My goal is to harmonize the names of the instruments. To do this, I want to keep the interest rate + a list of words (cons. cum. and stock.) and drop all words that are not part of this list (In this case, I want to drop "Electric Power", "org." and "(1923)" ) How can I do that?

                    Code:
                    * desired result:
                    clear
                    input str25 instrument
                    "4 % cons. cum. stock"              
                    "6 1/2 % debs."      
                    "6 % debs."    
                    "5 1/4 % stock"      
                    "5 % stock cons."    
                    "5 % stock"    
                    "5 1/4 % stock"          
                    end
                    Many thanks in advance for any ideas/leads!
                    Last edited by Maia DEBS; 27 Apr 2023, 08:37.

                    Comment


                    • #11
                      This may help with #10

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input str28 instrument
                      "4 % cons. cum. stock"              
                      "6 1/2 % Electric Power debs."      
                      "6 % debs."    
                      "5 1/4 % stock"      
                      "5 % stock org. cons."    
                      "5 % stock (1923)"    
                      "5 1/4 % stock"          
                      end
                      
                      gen first = substr(instrument, 1, strpos(instrument, "%")) 
                      
                      gen work = trim(substr(instrument, strpos(instrument, "%") + 1, .))  
                      
                      gen second = "" 
                      
                      foreach w in cons cum stock {
                        replace second = second + "`w' " if strpos(work, "`w'")
                      } 
                      
                      replace second = trim(second) 
                      
                      list instrument first second 
                      
                           +---------------------------------------------------------+
                           |                   instrument     first           second |
                           |---------------------------------------------------------|
                        1. |         4 % cons. cum. stock       4 %   cons cum stock |
                        2. | 6 1/2 % Electric Power debs.   6 1/2 %                  |
                        3. |                    6 % debs.       6 %                  |
                        4. |                5 1/4 % stock   5 1/4 %            stock |
                        5. |         5 % stock org. cons.       5 %       cons stock |
                           |---------------------------------------------------------|
                        6. |             5 % stock (1923)       5 %            stock |
                        7. |                5 1/4 % stock   5 1/4 %            stock |
                           +---------------------------------------------------------+

                      Comment


                      • #12
                        Apologies about the slow reply, that did the job perfectly, many thanks!

                        Comment


                        • #13
                          A follow-up question. I have the variable CompanyInstrument, which combines company names and financial instruments. I want to split it in two variables: Company and Instrument.

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input str40 CompanyInstrument
                          "Hull Corporatn. 3 1/2 %, red. 1943"
                          "Harrisons and Crosfield 5 1/2 % cum. prf."
                          "Green (J. W.) 4 % deb. red."
                          "Delaware and Hudson. $100 shares"
                          "Gas Light and Coke (10 % with gas at 3/9)"
                          "Lynn & Boston cum. bonds 1924"
                          "Argentine Land pref. shares 10% guaranteed"
                          "Birmingham Gas cum. shares (max., 9 pr ct.)"
                          end

                          a) In most cases, the name of the financial instrument starts with the interest rate. How can I adapt Nick's code to obtain the following result:
                          Company Instrument
                          Hull Corporatn 3 1/2 %, red. 1943
                          Harrisons and Crosfield 5 1/2 % cum. prf.
                          Green (J. W.) 4 % deb. red.
                          b) In other cases it is less straightforward, because the instrument's name doesn't start with the interest rate. It begins with the characteristics of the instrument (cum. , pref.) or with a $ sign, or with a parenthesis. How can I tell Stata to extract the instrument name starting the first of these words that appear in CompanyInterest?

                          Desired result is:
                          Company Instrument
                          Delaware and Hudson. $100 shares
                          Gas Light and Coke (10 % with gas at 3/9)
                          Lynn & Boston cum. Bonds 1924
                          Argentine Land L. pref. shares 10% guaranteed
                          Birmingham Gas cum. shares (max., 9 pr ct.)
                          I am struggling to write a code that takes into account both scenarios a) and b). Many thanks for your suggestions!
                          Last edited by Maia DEBS; 12 May 2023, 08:41.

                          Comment


                          • #14
                            Any automated solution requires you to specify a specific set of conditions. If you are finding it difficult to spell out your rules, then it also becomes difficult to write code. For example, there appears no obvious way to specify that the parentheses

                            Green (J. W.) 4% deb. red.
                            belong to "company" whereas the parentheses

                            Gas Light and Coke (10 % with gas at 3/9)
                            do not. Unless you come up with some specific rule, e.g., only parentheses that are placed at the end of the sentence belong to "instrument".

                            Comment


                            • #15
                              Hi Andrew. Right, thank you for pointing it out. The rule goes as follows: the code should extract the sentence between the parentheses only if the first word that immediately follows the opening parenthesis is either a percentage (e.g. 10%) or the word 'max'.
                              Last edited by Maia DEBS; 14 May 2023, 22:50.

                              Comment

                              Working...
                              X