
  • strkeep added to SSC

    Hello all,

    Thanks to Kit Baum for his help in adding strkeep to SSC.

    strkeep is a small data-management command that cleans string variables and varlists by keeping only whitelisted characters.

    I created this command after spending too much time using subinstr() to find and replace individual special characters in variables containing first and last names. Instead of finding characters that you want to remove, strkeep finds characters that you want to keep and removes all other characters.

    For example, instead of doing the following to try to clean two variables so that they only have letters in them...
    Code:
    replace firstname = subinstr(firstname,".","",.)
    replace firstname = subinstr(firstname,"'","",.)
    replace firstname = subinstr(firstname," ","",.)
    replace firstname = subinstr(firstname,"`","",.)
    replace firstname = subinstr(firstname,"-","",.)
    replace lastname = subinstr(lastname,".","",.)
    replace lastname = subinstr(lastname,"'","",.)
    replace lastname = subinstr(lastname," ","",.)
    replace lastname = subinstr(lastname,"`","",.)
    replace lastname = subinstr(lastname,"-","",.)
    You could do...
    Code:
    strkeep firstname lastname, replace alpha
    In essence, strkeep takes a "whitelist" approach, while cleaning variables with a function like subinstr() is a "blacklist" approach.

    Please let me know if you have any questions or suggestions. I hope this command proves useful to people other than myself!

    Best,
    Roger

  • #2
    Roger Chu couldn't you use the regular expression functions to handle this a bit more easily?


    • #3
      While I have no experience with it myself, I believe this can also be done with the egenmore subfunction sieve, written by Nick Cox if I'm not mistaken. I think it only takes one variable at a time though.


      • #4
        wbuchanan I'm not sure that regular expressions would do what I want, because it's not clear to me that Stata's regex engine allows for a "not" operator. If Stata does have a "not" operator, could you share it and how to use it?

        Assuming that Stata's regex engine doesn't allow for a "not" operator, using regexr() leaves me with much the same problem as subinstr(): I still have to identify every individual non-alphabetic character that I don't want in stringvar. regexr() would save time over subinstr(), but any non-letter character I fail to name would slip through. With strkeep, you only need to name the characters you do want.

        However, if Stata's regex does have a "not" operator, then yes, regexr() may be a simpler solution.


        Jesse Wursten I looked up the egenmore function, and yes, it looks very similar. Thanks for pointing it out! I'll have to try it out myself.


        • #5
          Thanks for the mention of egenmore, but note that sieve() was written for Stata 7 in 2002, so itself contains absolutely no functionality to work with anything introduced later.


          • #6
            You would want to look at the Unicode-based regex functions. [^...] is the equivalent of a "not" operator, but the end user could also define what they wanted to keep as well. It's only one variable at a time, but if you're using Stata 13, I had put something together a while ago using the Java API for a similar purpose; it also allowed more robust regular expressions than the regex* functions that existed for Stata 13.
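
            A minimal sketch of the [^...] approach, assuming Stata 14 or later (where Unicode regex functions such as ustrregexra() are available) and the firstname and lastname variables from the first post. Keeping only unaccented ASCII letters is an assumption made for illustration, not necessarily what strkeep's alpha option does.

            ```stata
            * Sketch only: keep ASCII letters a-z and A-Z, drop every other character.
            * Assumes Stata 14+ and string variables named firstname and lastname.
            foreach v of varlist firstname lastname {
                * [^a-zA-Z] matches any single character that is NOT a letter;
                * ustrregexra() replaces all such matches with the empty string.
                replace `v' = ustrregexra(`v', "[^a-zA-Z]", "")
            }
            ```

            Because the character class is negated, anything not listed is removed, so the unwanted characters never need to be enumerated. Widening the class (for example "[^a-zA-Z ]" to also keep spaces) changes what is kept, and the loop handles the one-variable-at-a-time limitation.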
