
  • How to drop all chars *except* 0123456789 and abc...xyz?

    I'm trying to figure out how to purge a subset of variables of all characters that are not 1234567890 and abc....xyz. I want to keep only those 36 characters.

    Over time, the various characters beyond those 36 basics in the strings of interest may change, and I'd like to write code that adapts to that issue. I've used charlist, but I believe it requires me to specify which characters I want to drop, not those that I want to retain. I've also used cleanvars but encountered a similar problem.

    Thank you.

  • #2
    charlist is from SSC. I can't find cleanvars using search; please tell the forum where it comes from. On citing provenance please see FAQ Advice Section 12.

    For a string variable strvar I would do this to keep only 0..9 and a..z as characters.

    Code:
     
    
    gen newstrvar = "" 
    
    gen length = length(strvar) 
    su length, meanonly 
    
    qui forval j = 1/`r(max)' { 
         replace newstrvar = newstrvar + substr(strvar, `j', 1) if inrange(substr(strvar, `j', 1), "0", "9") | inrange(substr(strvar, `j', 1), "a", "z") 
    }
    You may well be able to skip the calculation of string length, or alternatively just look up the precise storage type of each string variable.

    This exploits the facts that "0" .. "9" and "a" .. "z" are contiguous blocks in terms of ASCII code and that Stata resolves inequalities in string values by comparing characters in ASCII order.

    In fact a similar but more general logic is implemented in sieve() from egenmore (SSC).

    There are also solutions using regular expression syntax.
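
A minimal sketch of one regular-expression route, assuming Stata 14 or later, where ustrregexra() is available (it was not available in the Stata versions current when this thread was written). The toy data and the names strvar and newstrvar are illustrative only.

```stata
* Assumption: Stata 14+ (ustrregexra() replaces ALL matches of a pattern)
clear
input str20 strvar
"abc-123"
"Hello, World!"
"9 lives"
end

* Remove every character that is not 0-9 or a-z
gen newstrvar = ustrregexra(strvar, "[^0-9a-z]", "")
list
```

The negated character class [^0-9a-z] states what to keep rather than what to drop, so the code needs no change when new unwanted characters turn up in the data.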



    • #3
      How about using charlist to get the list of characters, removing the ones you want to keep from that list, and then fixing the variable? Something like:

      Code:
      sysuse auto
      gen newmake = make
      charlist make
      local tokill `r(sepchars)'
      local good `c(alpha)' 0 1 2 3 4 5 6 7 8 9 
      local tokill : list tokill - good
      foreach ll of local tokill {
          replace newmake = subinstr(newmake,`"`ll'"',"",.)
      }
      // space will be missed
      replace newmake = subinstr(newmake," ","",.)
      If you want to keep upper-case letters as well, just include `c(ALPHA)' in the definition of local good.



      • #4
        Actually, I just realized my solution does not work if the variable includes the ` character. That can be dealt with individually before the charlist command:

        Code:
        replace newmake = subinstr(newmake,"\`","",.)



        • #5
          charlist *! NJC 1.3.0 28 Feb 2014 includes code to cope with that character, char(96). Check to see that you have the most recent version. If you note other problems, email the author with a reproducible example. No doubt he will try to fit you in to his hectic social schedule.



          • #6
            If the longest string in your variable does not exceed 32 characters, strtoname() might be a good starting point.

            Best
            Daniel



            • #7
              Daniel: That's an interesting idea, but others should note both false negatives (strings with 0 .. 9 as first character will be rejected) and false positives (A..Z and _ are acceptable as characters in Stata names).



              • #8
                Thanks for the reminder, Nick.

                Perhaps I should have been more explicit about the term "starting point". I had in mind something like

                Code:
                g newvar = strtoname(oldvar, 0)
                Note the (optional) second argument, which allows the first character to be a number. For those who wonder, local macro names may start with a number (as they are internally prefixed with an underscore anyway).

                However, not much is gained by specifying this argument, as strtoname() will prefix strings starting with a number with an underscore if the second argument is omitted (or set to 1).

                In a next step we will have to fix the underscores (inserted by strtoname() or present in the original string), using subinstr(), as shown earlier. Up to this point no loop is needed.

                The uppercase characters still need to be dealt with, so loops might be unavoidable.

                Regular expressions might be another way to go here, but I have no clear idea what that would look like.

                Best
                Daniel
                Last edited by daniel klein; 27 Aug 2014, 13:17.
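
The steps described in #8 can be sketched as follows. This is only an illustration under stated assumptions: the toy data and the names oldvar and newvar are made up, all strings are plain ASCII, and none is longer than the 32 characters that strtoname() retains.

```stata
clear
input str20 oldvar
"abc-123"
"Hello, World!"
"9 lives"
end

* Step 1: strtoname() turns every character not allowed in a Stata name
* into an underscore; the second argument 0 permits a leading digit
gen newvar = strtoname(oldvar, 0)

* Step 2: remove the underscores (inserted or originally present)
replace newvar = subinstr(newvar, "_", "", .)

* Step 3: remove upper-case letters, which are legal in names
foreach L in `c(ALPHA)' {
    replace newvar = subinstr(newvar, "`L'", "", .)
}
list
```

Only the last step needs a loop, and it loops over the 26 upper-case letters rather than over character positions.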



                • #9
                  Here is a slightly different approach. Note that he said that the list of acceptable characters may change, and this accommodates that. Nick's code was nice but, depending on how complicated the list gets, it could need a bunch of conditionals. This will run slower, but it doesn't count on the acceptable characters being contiguous:

                  Code:
                  clear
                  set obs 100
                  
                  *=======create random strings, 20 characters long
                  gen str20 x = ""
                  forvalues i=1/20 {
                  replace x=x + char(int(uniform()*256))
                  }
                  
                  *====list of acceptable characters
                  local goodchars="1234567890abcdefghijklmnopqrstuvwxyz"
                  
                  
                  gen str20 xclean=""
                  *======outer loop: going width of variable
                  forvalues i=1/20 {
                  *======inner loop: going width of list of good characters
                      forvalues j=1/36 {
                      replace xclean=xclean+substr(x,`i',1) if substr(x,`i',1)==substr("`goodchars'",`j',1)
                      }
                  }
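
A possible variant of the same idea, offered as a sketch rather than a replacement: a single loop over character positions, with strpos() testing whether each character occurs anywhere in the list of acceptable characters, so the hard-coded 36 and 20 are no longer needed. The toy data and the names len and maxlen are illustrative assumptions. The explicit check against the empty string is a cautious guard for positions past the end of shorter strings.

```stata
clear
input str20 x
"abc-123"
"Hello, World!"
"9 lives"
end

* List of acceptable characters; edit freely, order does not matter
local goodchars "1234567890abcdefghijklmnopqrstuvwxyz"

* Longest string in x determines how many positions to visit
gen len = length(x)
summarize len, meanonly
local maxlen = r(max)

gen xclean = ""
forvalues i = 1/`maxlen' {
    * keep character i only if it appears in the list of good characters
    replace xclean = xclean + substr(x, `i', 1) ///
        if substr(x, `i', 1) != "" & strpos("`goodchars'", substr(x, `i', 1)) > 0
}
list x xclean
```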



                  • #10
                    Thank you, all. I ended up using sieve() from egenmore (SSC) and it worked perfectly.
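
For readers who want to try the same route, here is a minimal sketch. It assumes egenmore has already been installed from SSC (ssc install egenmore) and uses the char() option of sieve(), which lists the characters to be kept. The toy data and variable names are illustrative, not the original poster's.

```stata
* Assumption: egenmore is installed (ssc install egenmore)
clear
input str20 strvar
"abc-123"
"Hello, World!"
"9 lives"
end

* Keep only the 36 listed characters; everything else is dropped
egen newstrvar = sieve(strvar), char(0123456789abcdefghijklmnopqrstuvwxyz)
list
```

If upper-case letters should be kept too, keep(alphabetic numeric) is the shorter specification, since sieve() treats both a-z and A-Z as alphabetic.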



                    • #11
                      Nick: I have the latest charlist, which works fine with `. The problem is the subinstr() function, which needs the ` prefixed with a backslash.



                      • #12
                        Nick: OK, and thanks for the clarification. An alternative approach is

                        Code:
                         
                         replace newmake = subinstr(newmake,char(96),"",.)



                        • #13
                          "The problem is the subinstr() function, which needs the ` prefixed with a backslash."
                          Not so.

                          Code:
                          . di subinstr("foo`bar", "`", "42")
                          foo42bar
                          The problem is that Stata gets confused if compound quotes are used around a string containing the left single quote.

                          Code:
                          . di subinstr("foo`bar", `"`"', "42")
                          too few quotes
                          r(132);
                          Best
                          Daniel

