  • Text processing - Replace remaining non-printable characters, ascii < 32 or > 126 by ""


    Hi, I am trying to clean up a large list of names before trying to fuzzy match it with another dataset. I would like to replace remaining non-printable characters, ascii < 32 or > 126 by "". How can I do that? Thanks!

  • #2
    Code:
    help filefilter


    • #3
      I would use some type of regular expression.

      Code:
      replace x = ustrregexra(x, `"[^\d\w\s\.\?\*\+=<>_\[\]\(\):;\-'",!&]"', "")
      There is likely a metacharacter for punctuation marks that you could use, but that is at least the basic gist of the approach I would take.
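
A narrower variant of the same idea targets exactly the range asked about in #1: a negated character class that keeps bytes 32 through 126 and deletes everything else. This is only a sketch. It assumes Stata 14 or later (for -ustrregexra()-), and the variable name `name` and the sample value are hypothetical. It has not been checked against strings containing bytes that are not valid UTF-8, where the Unicode regex functions may behave differently.

```stata
* Hypothetical one-observation example: "Jos", e-acute, a tab, then " Smith"
clear
set obs 1
gen str30 name = "Jos" + uchar(233) + char(9) + " Smith"

* Keep only printable ASCII (hex 20 through 7E); replace anything else with ""
replace name = ustrregexra(name, "[^\x20-\x7E]", "")
list name
```

Because the class is negated, letters, digits, spaces and ordinary punctuation all survive, so there is no need to list the allowed punctuation marks one by one.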


      • #4
        Here's a simple approach using the -fileread()- and -filewrite()- functions, of which I am a fan. No regular expressions required. This approach is not very efficient, but on my pretty ordinary Windows machine, it handles about 10,000 characters a second of the input file. My approach completely ignores any structure of the input file, and does not assume it is in Stata format.

        A second version, for the case where the names are already a variable in a Stata data file, appears at the end. (I have not tested or checked the following code.)

        Code:
        // Names are not a variable in a Stata data file
        clear
        set obs 1
        gen strL s = fileread("my_infile")
        local len = length(s)
        di `len'
        gen str1 c = ""
        local low = char(32)
        local high = char(126)
        // Scan from the last byte to the first: each removal shortens s, and a
        // forward scan would skip the byte that shifts into position `i'
        forval i = `len'(-1)1 {
           if (mod(`i', 10000) == 0) {
             di "At byte #`i'"
           }
           qui replace c = substr(s, `i', 1)
           // c is empty once `i' runs past the (shortened) end of s
           qui replace s = subinstr(s, c, "", .) if (c != "") & ((c < "`low'") | (c > "`high'"))
        }
        gen b = filewrite("my_outfile", s)
        //
        // Alternative if names are a variable in a Stata file
        gen namelen = length(name)
        qui summarize namelen   // leaves the longest length in r(max)
        local maxlen = r(max)
        local low = char(32)
        local high = char(126)
        gen str1 c = ""
        // Same backward scan, applied to every observation at once
        forval i = `maxlen'(-1)1 {
           qui replace c = substr(name, `i', 1)
           qui replace name = subinstr(name, c, "", .) if (c != "") & ((c < "`low'") | (c > "`high'"))
        }
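
Since -subinstr()- removes every occurrence of a byte in one call, a related sketch loops over the unwanted byte codes instead of over positions in the string. It still needs no regular expressions, and the number of passes is fixed regardless of how long the names are. It assumes the names are in a string variable called `name` (a hypothetical name) and that -char()- returns a single byte for codes 128 through 255, as it does in Stata 14 and later. Code 0 is deliberately left out because binary zero is treated specially in str# variables. Like the code above, this has not been tested.

```stata
* Remove control characters (codes 1 through 31)
forvalues n = 1/31 {
    quietly replace name = subinstr(name, char(`n'), "", .)
}
* Remove DEL and every byte above 126 (codes 127 through 255)
forvalues n = 127/255 {
    quietly replace name = subinstr(name, char(`n'), "", .)
}
```

Every byte of a multibyte UTF-8 character is 128 or higher, so accented letters are dropped whole rather than left as stray bytes.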


        • #5
          I agree with wbuchanan.

          My usual (and very aggressive) approach for names is the following:
          Code:
          // for people
          gen cleanname=upper(trim(itrim(ustrregexra(ustrto(ustrnormalize(rawname, "nfd"), "ascii", 2), "^[a-z ]", " ",1))))
          
          // for entities
          gen cleanname=upper(trim(itrim(ustrregexra(ustrto(ustrnormalize(rawname, "nfd"), "ascii", 2), "^[a-z0-9 ]", " ",1))))
          Best,

          J.


          • #6
            PS: I had a typo in the regex. It should say:

            Code:
            // for people
            gen cleanname=upper(trim(itrim(ustrregexra(ustrto(ustrnormalize(rawname, "nfd"), "ascii", 2), "[^a-z ]", " ",1))))
            // for entities
            gen cleanname=upper(trim(itrim(ustrregexra(ustrto(ustrnormalize(rawname, "nfd"), "ascii", 2), "[^a-z0-9 ]", " ",1))))


            • #7
              If the data are in a text file, you could insheet the file into a single variable, or load the file into a column vector in Mata, and handle things along the lines of what Julio and I have suggested. While filefilter isn’t a bad solution, with a lot of data you’ll feel a performance penalty from the overhead of reading and writing a file multiple times to accomplish your goal. Another option is a tool like -sed- or -awk-, which handles a regular expression replacement from the command line much as filefilter would, except that sed streams the data from the file and makes the modifications in a single pass over the data.
