Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Special Characters in String Vars

    Hi everyone--

    I have a string var that is riddled with special characters, which is ultimately precluding me to complete a fuzzy match on two data sets. I am trying to strip this string var of all special characters (i.e. "" , . - () [] etc.), but have not been successful in finding a way to get rid of all characters. I have tried using commands such as:

    replace shortcounterpartyname=regexr(shortcounterpartyname , "`" "," "." "'" "inc." " inc." "/" " - " "-" " . " "[0-9][0-9][0-9][0-9]" " s.a. " "llc" " llc" " ltd." "(" ")", "")
    replace shortcounterpartyname=subinstr(shortcounterpartyna me, "(","",.)

    However, I have yet to find a way to get rid of quotation marks (even after reading a plethora of other help articles on this same issue including http://www.stata.com/statalist/archi.../msg00179.html). Additionally, I have found that Stata is dropping the first letter of some names, even if that observation doesn't have any special characters within its name.

    If anyone has any suggestions regarding this issue, please let me know!

    Thanks,

    Rebecca

  • #2
    I'm sure there is some relatively simple way to do this using regular expressions, but I think the time it will take you to figure that out, if you are not really proficient at them, is not worth the trouble when you can do it fairly simply:

    Code:
    gen clean_string = dirty_string
    fovalues i = 0/255 {
        if !inrange(`i', 48, 57) /// DIGITS
            & !inrange(`i', 65, 90) /// UPPER CASE LETTERS
            & !inrange(`i', 97, 122) { // LOWER CASE LETTERS
                    replace clean_string = subinstr(clean_string, `"`=char(`i')'"', "", .)
        }
    }
    Note: Not tested. Beware of typos.

    The above code will strip out everything except alphanumeric characters from dirty_string to create clean_string. If you want to retain certain punctuation marks and spaces, you can add a & !inlist( `i',....) term to the -if- command, listing the ascii codes for the particular characters you want to save. If you don't know the numeric ascii codes, you can find many tables available on google. But more convenient is to download Nick Cox's -asciiplot- command from SSC. It produces a table of numeric codes for ASCII characters in a Graph window.

    Comment


    • #3
      You can use -charlist- to get the list of ascii codes that you could replace.

      Comment


      • #4
        Here's some simple technique using a Stata 14 function that you may be able to build on.

        Code:
        . generate str20 clean = ustrregexra(test,`"[^a-zA-Z0-9]"',"")
        
        . list, clean
        
                   test    clean  
          1.    abc.efg   abcefg  
          2.   aB!@#hij    aBhij  
          3.      ab"7d     ab7d  
        
        .

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          I'm sure there is some relatively simple way to do this using regular expressions, but I think the time it will take you to figure that out, if you are not really proficient at them, is not worth the trouble when you can do it fairly simply:

          Code:
          gen clean_string = dirty_string
          fovalues i = 0/255 {
          if !inrange(`i', 48, 57) /// DIGITS
          & !inrange(`i', 65, 90) /// UPPER CASE LETTERS
          & !inrange(`i', 97, 122) { // LOWER CASE LETTERS
          replace clean_string = subinstr(clean_string, `"`=char(`i')'"', "", .)
          }
          }
          Note: Not tested. Beware of typos.

          The above code will strip out everything except alphanumeric characters from dirty_string to create clean_string. If you want to retain certain punctuation marks and spaces, you can add a & !inlist( `i',....) term to the -if- command, listing the ascii codes for the particular characters you want to save. If you don't know the numeric ascii codes, you can find many tables available on google. But more convenient is to download Nick Cox's -asciiplot- command from SSC. It produces a table of numeric codes for ASCII characters in a Graph window.
          This loop for the most part works great! However, I am receiving an error message "too few quotes". I have combed through the code a few times now, and cannot find where the missing quote should be.

          Comment


          • #6
            Originally posted by William Lisowski View Post
            Here's some simple technique using a Stata 14 function that you may be able to build on.

            Code:
            . generate str20 clean = ustrregexra(test,`"[^a-zA-Z0-9]"',"")
            
            . list, clean
            
            test clean
            1. abc.efg abcefg
            2. aB!@#hij aBhij
            3. ab"7d ab7d
            
            .
            I am using Stata 13.1, is there a similar function available for this version?

            Comment


            • #7
              There is an ancient egen function sieve() in egenmore (SSC) that will do this.

              *! 1.0.0 NJC 23 Sept 2002
              program define _gsieve
              version 7.0

              I certainly haven't tested it on anything exotic, but it passes William's test

              Code:
              clear
              input str8 test 
              `"abc.efg"'    
              `"aB!@#hij"'    
              `"ab"7d"' 
              end 
              
              egen wanted = sieve(test), keep(a n)
              
              list 
              
                   +-------------------+
                   |     test   wanted |
                   |-------------------|
                1. |  abc.efg   abcefg |
                2. | aB!@#hij    aBhij |
                3. |    ab"7d     ab7d |
                   +-------------------+
              But note the general Advice already given you:

              11. What should I say about the version of Stata I use?

              The current version of Stata is 14.1. Please specify if you are using an earlier version; otherwise, the answer to your question may refer to commands or features unavailable to you. Moreover, as bug fixes and new features are issued frequently by StataCorp, make sure that you update your Stata before posting a query, as your problem may already have been solved.

              Comment


              • #8
                Notwithstanding what -sieve- can do, I'd be curious to see a solution along the lines of Clyde's suggestion, as I also have in similar context encountered the "too few quotes" error message while (I think) dealing with strings containing embedded quotes. I suppose this amounts to understand just how -sieve- successfully manages to escape the quote character.

                Comment


                • #9
                  *!!! Warning: in-joke ahead.

                  If you can break the sieve magic, you have to face the dragons.

                  Comment


                  • #10
                    OK, I don't get the in joke.

                    Looking at the code for -egen, sieve()-, it appears that when used with the keep() option, rather than looping over all possible characters, it builds up its result by going one character at a time through the source string and accepting each character only if it matches the list of "keepers." So it never really has to deal with the expression `"`"', which is what breaks my code in #5 when it tries to purge all instances of the ` character.

                    But the magic can be broken. If you ask -egen, sieve()- to omit(`), it fails.

                    So, where are these dragons I must face now?

                    Comment


                    • #11
                      Clyde's code can be coerced into doing what was intended.
                      Code:
                      gen clean_string = dirty_string
                      forvalues i = 0/255 {
                          if !inrange(`i', 48, 57) /// DIGITS
                              & !inrange(`i', 65, 90) /// UPPER CASE LETTERS
                              & !inrange(`i', 97, 122) { // LOWER CASE LETTERS
                                  quietly replace clean_string = subinstr(clean_string, `= `"char(`i')"' ', "", .)
                          }
                      }
                      Code:
                      . list, clean
                      
                             dirty_s~g   clean_~g  
                        1.     abc.efg     abcefg  
                        2.   aB!@#hi j      aBhij  
                        3.       ab"7d       ab7d  
                        4.        xy`z        xyz

                      Comment


                      • #12

                        The in-joke hinges on "sieve" being read as another way of saying "goblet".

                        On the technical point: Clyde is correct. sieve() fails on being asked explicitly to omit single left quotes. But for the problem stated, that doesn't bite, as the choice is (still) to keep alphabetic and numeric.

                        Code:
                        clear
                        input str8 test
                        `"abc.efg"'    
                        `"aB!@#hij"'    
                        `"ab"7d"'
                        `"xy`z"'
                        end
                        
                        egen wanted = sieve(test), keep(a n) 
                        
                        list
                        
                             +-------------------+
                             |     test   wanted |
                             |-------------------|
                          1. |  abc.efg   abcefg |
                          2. | aB!@#hij    aBhij |
                          3. |    ab"7d     ab7d |
                          4. |     xy`z      xyz |
                             +-------------------+

                        Comment


                        • #13
                          I have a follow-up question: My string variable has text in Persian language. When I apply the code recommended by Clyde Schechter , all except some values that are English get removed. How do I modify the code so I can keep the values of interest? My data is in Persian.

                          Comment

                          Working...
                          X