
  • Foreach vs. Forvalues when using char() function to remove special characters in a string variable

    Hello all,

    Using Stata 15.1/IC

    I need to submit a bulk file with a string variable ("NAME" variable in this example) that is required to have no special characters besides ampersand and dash. I am able to accomplish this using the following series of commands:


    charlist NAME //shows which characters are in my string var NAME
    "&',-./01234689ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnop qrstuvwxyz



    egen NEWNAME= sieve(NAME), omit(,./`"""'`"'"') // generates new variable with the special characters omitted but retains & and -

    Results:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str86(NAME NEWNAME)
    "Single-Benefits, Inc."                                       "Single-Benefits Inc"                                              
    "Superstar, LLC"                                              "Superstar LLC"                                                  
    "RML Agency, Inc."                                            "RML Agency Inc"                                        
    "A & M Company, Inc."                                         "A & M Company Inc"
    end
    While this approach works as intended, I wanted to be able to use a command that is not dependent on the specific characters to be omitted, which could change between datasets (e.g. a character like "+" or "@" would not be excluded in a string variable that had them with my code--I'd have to manually update the command). Plus, the way you have to set off double- and single quote marks makes it hard to read in the log file.

    I thought I could generalize the command by using the char() function with the integer values associated with ASCII characters in a forvalues loop (under the assumption that I will not run into any non-ASCII special characters), but I get the following error:

    . forvalues i = 33/37 39/44 46/47 58/64 91/96 123/126 {
    2. replace NAME = subinstr(NAME, char(`i'), "", .)
    3. }
    invalid syntax
    r(198);


    I am, however, able to use the foreach command without error:
    . foreach i in 33 34 35 36 37 39 40 41 42 43 44 46 47 58 59 60 61 62 63 64 91 92 93 94 95 96 123 124 125 126 {
    2. replace NAME = subinstr(NAME, char(`i'), "", .)
    3. }


    My question is why the forvalues command doesn't work. My presupposition is that I just did something wrong in the command syntax-wise, but I also wondered if Stata treats values in the char() function differently than I thought when used with forvalues.

    Of course, if there is an even better way to accomplish the elimination of all special characters besides ampersands and dashes, I am all ears. Thanks for any advice.



  • #2
    Read -help forvalues- and you will see that the kind of multiple range number list you are using in your -forvalues- command is not supported. You can, however, get that with:
    Code:
    foreach i of numlist 33/37 39/44 46/47 58/64 91/96 123/126 {
        something involving `i'
    }
    -forvalues- is for use with simpler number lists and single ranges.
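
    Applied to the loop from #1, that could look like the sketch below. The two sample rows are taken from the -dataex- listing in #1, the str30 width is only chosen for illustration, and the sketch assumes the data hold ASCII characters only. Codes 38 (&) and 45 (-) are left out of the number list, so those two characters are kept.

    ```stata
    * Sketch: strip ASCII punctuation except & (38) and - (45)
    clear
    input str30 NAME
    "Single-Benefits, Inc."
    "A & M Company, Inc."
    end

    * -foreach ... of numlist- accepts several ranges in one list
    foreach i of numlist 33/37 39/44 46/47 58/64 91/96 123/126 {
        quietly replace NAME = subinstr(NAME, char(`i'), "", .)
    }

    list, clean
    ```

    Note that this overwrites NAME in place, as the loop in #1 does; generate a copy first if the original values are still needed.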



    • #3
      I only read the help page 4x and still didn't catch the error--thanks very much Clyde Schechter!



      • #4
        I will again mount my regular expression hobby horse and demonstrate code that accepts a regular expression containing the list of acceptable characters and deletes all others, and which has the fringe benefit of working on unicode text as well as the ASCII characters 0-127. Requires Stata 14 or later.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str49 dirty
        "ABC"                                              
        "!XYZ"                                            
        "très bien"                                      
        "P Q"                                              
        "ABCabc123"                                        
        "12,345"                                          
        "每股收益EPS基本报告期2015年报币种原"
        end
        generate str clean =  ustrregexra(dirty,"[^a-zA-Z0-9]","")
        list, clean
        Code:
         generate str clean =  ustrregexra(dirty,"[^a-zA-Z0-9]","")
        
        . list, clean
        
                                             dirty       clean  
          1.                                   ABC         ABC  
          2.                                  !XYZ         XYZ  
          3.                             très bien     trsbien  
          4.                                   P Q          PQ  
          5.                             ABCabc123   ABCabc123  
          6.                                12,345       12345  
          7.        每股收益EPS基本报告期2015年报币种原     EPS2015



        • #5
          William Lisowski, thanks--that is a lot faster. But can the command you presented be adjusted to retain spaces, ampersands, and hyphens (but none of the other special characters)? I was able to retain ampersands and spaces by using the function ustrregexra(dirty,"[^a-zA-Z0-9& ]",""), but I can't figure out where to put the hyphen so that it is retained.



          • #6
            Putting the hyphen in the front of the list, so that it doesn't look like part of a character range like a-z, does the trick.
            Code:
            generate str clean =  ustrregexra(dirty,"[^-& a-zA-Z0-9]","")
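
            For the NAME variable from #1, a minimal sketch using the same pattern might be the following. The two sample rows come from the -dataex- listing in #1; the str30 width and the new variable name clean are only illustrative.

            ```stata
            * Sketch: keep letters, digits, spaces, & and -; delete everything else
            clear
            input str30 NAME
            "Single-Benefits, Inc."
            "A & M Company, Inc."
            end

            * The hyphen comes first inside the brackets so it is read literally
            generate str clean = ustrregexra(NAME, "[^-& a-zA-Z0-9]", "")

            list, clean
            ```

            Because the pattern lists what to keep rather than what to drop, it should not need updating if a new character such as "+" or "@" turns up in another dataset.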



            • #7
              Thanks very much William Lisowski



              • #8
                #1 For future threads please note the permanent request (FAQ Advice #12) to explain where community-contributed commands you use come from. charlist and the egen function sieve() are community-contributed on SSC (the latter in the egenmore package).
