Efficiently cleaning strings of unwanted characters and spaces

Ben Hoen

Join Date: May 2014
Posts: 85

Efficiently cleaning strings of unwanted characters and spaces

24 Sep 2020, 12:17

Hi all,

I have 40 string variables in a dataset with 4 million cases. The str variables have many characters that I do not want that therefore I wish to replace them with "". I am currently doing it via a loop. This is OK but takes about 30 minutes to run.

I was not able to find a better solution on the forum. I found this and this, but they are similar to my code, a one at a time iteration through the characters.

Anyone have any ideas on how one might do a wholesale substitution? E.g., if any one of the following characters are present, replace them with a "". (I realize that is pseudo-code but hopefully you understand what I am hoping to find.)

Here is my code using a sample dataset:

Code:

clear all
input str30 start str10 finish
"!  he%L(L&^o::   {[ ])"    "hello"
"  T_-h))er!`E< - - &*" "there"    
" $%m#y     f+*INe "    "my fine"
"Fu,.R??RY_  _    " "furry"
"fr$%#@ei``n--__DS<> ><" "friends"
end

* this was my first version to be clear what each step is doing
foreach var of varlist start {
g `var'_1=strtrim(lower(`var'))
g `var'_2=`var'_1
    foreach char in "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "=" "?" "@" "[" "\" "_" "`" "{" "}" "~" "]" "! " "!" "~" "<" ">" "^" {
    replace `var'_2=subinstr(`var'_2,"`char'","",.)
    }
g `var'_3=stritrim(`var'_2)
g `var'_4=strtrim(`var'_3)
}


* this is my final version except I have 30 variables following the foreach statement.
foreach var of varlist start {
g `var'_clean=lower(`var')
    foreach char in "#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "=" "?" "@" "[" "\" "_" "`" "{" "}" "~" "]" "! " "!" "~" "<" ">" "^" {
    replace `var'_clean=subinstr(`var'_clean,"`char'","",.)
    }
replace `var'_clean=stritrim(`var'_clean)
replace `var'_clean=strtrim(`var'_clean)
}


order start* finish

Thanks for whatever advice you can offer.

Ben

Tags: None

daniel klein

Join Date: Mar 2014
Posts: 3859

24 Sep 2020, 13:25

Code:

foreach var of varlist ... {
    generate `var'_clean = ///
        strtrim(stritrim(ustrregexra(strlower(`var'), "[^a-z ]", "")))
}

Comment

Andrew Musau

Join Date: Oct 2014

Posts: 10213
#3

24 Sep 2020, 13:33

Code:

gen wanted= ustrregexra(lower(start), "\W|_", "", .)
Comment
Ben Hoen

Join Date: May 2014

Posts: 85
#4

24 Sep 2020, 15:04

Thanks, Daniel and Andrew. This will help a lot! I also find it interesting how your two approaches are similar but not the same (e.g., "strlower(..." vs. "lower(..." and "[^a-z ]" vs. "\W|_").

And Andrew's solution seems more efficient (i.e., shorter). It will be interesting if that actually makes a difference.

I am running MP15 on a 6-core machine

Thanks you two.

Ben
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10213
#5

24 Sep 2020, 16:02

I think daniel klein interpreted your question as asking how this could be applied across a range of variables, hence the loop. His solution takes care of the spaces between words, whereas mine does not. lower() and strlower() are equivalent, and are needed here to take care of capitalization. "\W" in my code implies any non-word character (so I am eliminating all non-word characters). So Daniel's code applied to your example in #1:

Code:

gen wanted= trim(itrim(ustrregexra(lower(start), "[^a-z ]", "")))

Last edited by Andrew Musau; 24 Sep 2020, 16:12.
Comment
Ben Hoen

Join Date: May 2014

Posts: 85
#6

24 Sep 2020, 16:08

OK thanks for the clarification. I tested them both and yours, if I applied it correctly, eliminates the spaces between words, which was not desirable. So I have used Daniel's for the solution.

They are both MUCH faster!

Thank you.
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3050

25 Sep 2020, 02:32

The problem in this thread has been resolved to Original Poster's satisfaction, so I am probably just spamming right now.

However, Nick Cox -egenmore, msub- function is supposed to do this:

Code:

. egen end = msub(start), find("#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "=" "?" "@" "[" "\" "_" "`" "{" "}" "~" "]" "! " "!" "~" "<" ">" "^") replace("")
{ required
r(100);

but it does not. What bothers me is that I cannot understand the error message (with or without -trace on-, and that by cutting certain symbols, I can generate another error message

Code:

. egen end = msub(start), find("#" "$" "%" "&" "'" "(" ")" "*" "+" "," "-" "." "/" ":" ";" "=" "?" ) replace("")
type mismatch
r(109);

or by cutting even more symbols it starts to work:

Code:

. egen end = msub(start), find("#" "$" "%" "&" "'" "(" ")" "*") replace("")

. list, sep(0)

     +--------------------------------------------------------+
     |                  start    finish                   end |
     |--------------------------------------------------------|
  1. | !  he%L(L&^o::   {[ ])     hello    !  heLL^o::   {[ ] |
  2. |    T_-h))er!`E< - - &*     there       T_-her!`E< - -  |
  3. |      $%m#y     f+*INe    my fine         my     f+INe  |
  4. |      Fu,.R??RY_  _         furry     Fu,.R??RY_  _     |
  5. | fr$%#@ei``n--__DS<> ><   friends   fr@ei``n--__DS<> >< |
     +--------------------------------------------------------+

Comment

daniel klein

Join Date: Mar 2014

Posts: 3859
#8

25 Sep 2020, 03:39

A quick glance at the code of _gmsub.ado suggests that this appears to be a problem with parsing the string in option find(). Most likely the problem is with the single left quote (`), which often causes problems when working with strings.

Because speed seems to be a major concern here, egen is most likely the worst alternative. egen is implemented as an ado-file which, in turn, usually calls other ado-files, which will ultimately implement some kind of loop similar to the one outlined in the initial post. So if you know how to write the loop and if spelling it out is equally cumbersome (which is arguably the case here, as you need to list all the elements that are supposed to be removed either way), the direct approach is almost always preferable. In this particular case, _gmsub.ado works with Stata 6 and uses a while loop, which (in Stata, not Mata) is even slower than the modern foreach/forvalues approach.

Do not get me wrong: there are situations where egen reduces your typing significantly. Also, some egen functions might implement a faster approach than the one you have in mind. However, generally speaking, egen is not really about speeding execution time.
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#9

25 Sep 2020, 04:41

Daniel, philosophically speaking I do not think that the problem Original Poster (OP) posed can possibly constitute a speed of execution time problem. I clean my data once, so if I were in OP's place, I write something in the lines of the loops he wrote, and if it takes 30min on 4mil observations, I go to have a beer while Stata is calculating, when I am back Stata is done. Or more likely if it takes 8 hours (because OP has an expensive advanced MP15 on a 6-core machine, and I am working on plain vanilla Stata 15 SE), I set the code to run in the evening, I go to sleep, and in the morning it is done.

The point being is that for tasks that I do once, I do not care much whether it takes 1sec or 24 hours.

Of course OP would have a problem if he receives more than 48 such files each day, and he needs to clean them up. Then the speed of execution would be of highest importance, and in fact he would not be able to carry out the task with his code taking 30min to complete.

In any case I did not have on my mind speed of execution. I just wondered "How would I have done this if I had to", and the answer was "I would have done it as OP originally did it, with two loops one over variables and one over symbols" but then I remembered that I have seen a function like this somewhere in egenmore, it seemed to me that it would be even easier than writing two loops, and I was curious to see what would be faster, the double loop, or the egen. So I tried it. And it did not work.

Originally posted by daniel klein View Post

A quick glance at the code of _gmsub.ado suggests that this appears to be a problem with parsing the string in option find(). Most likely the problem is with the single left quote (`), which often causes problems when working with strings.

Because speed seems to be a major concern here, egen is most likely the worst alternative. egen is implemented as an ado-file which, in turn, usually calls other ado-files, which will ultimately implement some kind of loop similar to the one outlined in the initial post. So if you know how to write the loop and if spelling it out is equally cumbersome (which is arguably the case here, as you need to list all the elements that are supposed to be removed either way), the direct approach is almost always preferable. In this particular case, _gmsub.ado works with Stata 6 and uses a while loop, which (in Stata, not Mata) is even slower than the modern foreach/forvalues approach.

Do not get me wrong: there are situations where egen reduces your typing significantly. Also, some egen functions might implement a faster approach than the one you have in mind. However, generally speaking, egen is not really about speeding execution time.
1 like
Comment
daniel klein

Join Date: Mar 2014

Posts: 3859
#10

25 Sep 2020, 05:12

Philihilosophically (and personally), I tend to agree. However, the OP presents a working solution to start with. Therefore, and also because of the title, I assume that, at least in the OP's mind, the question indeed centers around execution time. egen is not very fast and often slow. That is all I wanted to say.
Comment
Ben Hoen

Join Date: May 2014

Posts: 85
#11

25 Sep 2020, 06:37

I like this discussion, as it points out the balance all programmers have between getting something to work now to solve a problem vs. finding a solution that takes longer to discover but which saves time over a longer period. This is analogous to the "Teach a man to fish..." parable. In my case, I had solved the problem, as Daniel pointed out. And I might never have gone back to it (and to this list), but I now find that I will be getting these string data repeatedly and therefore will need to run this code repeatedly. As well, this code necessarily falls in the middle of a few programming steps. So as I have been tweaking the code to develop my analysis dataset I had to run it more than a few times. Each time I would hit run, and go grab lunch, or minimize it and work on something else. So, after doing that for, maybe, the 10th time, I decided I should just learn to fish.

Just finding a way to ask the question took more time than it took to originally write the code. But, it paid off. I now have an elegant and fast solution. Further, I learned a few other tidbits (as often happens when you all chime in).

So thank you all!

PS: Maybe if I do have to wait for code to run I should grab a beer as Joro does. That would be a lot more fun!

Last edited by Ben Hoen; 25 Sep 2020, 06:44.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35708

#12

25 Sep 2020, 06:59

#4 #5 #6

Joro Kolev is one up on me in remembering the msub() function for egen -- which I would have to strain to do. Here is the beast entire:

Code:

*! 1.0.0 NJC 11 December 2000
program define _gmsub
    version 6.0
        
    gettoken type 0 : 0
    gettoken g    0 : 0
    gettoken eqs  0 : 0

    syntax varlist(max=1 string) [if] [in], /*
    */ Find(str asis) [ Replace(str asis) N(int -1) Word ]

           local fcn = cond("`word'" != "", "subinword", "subinstr")
    
    * doesn't work => user needs to update Stata    
    capture local bar = `fcn'("foo","foo","bar",.)
    if _rc {
        di in r "Your version of Stata doesn't recognise `fcn'( )."
        di in r "I guess that you need to update."
        exit 198
    }    

    local nfind : word count `find'
    local nrepl : word count `replace'
    
    if `nrepl' == 0 { /* no replacement => delete */
        local nrepl = `nfind'
    }    
    else if `nrepl' == 1 { /* many to one replacements allowed */
            local nrepl = `nfind'
        local replace : di _dup(`nfind') `"`replace' "'  
    }
    else if `nfind' != `nrepl' {
            di in r "number of find and replace arguments not equal"
        exit 198
    }
    
    marksample touse, strok
    local type "str1" /* ignores type passed from -egen- */
    local n = cond(`n' == -1, ., `n')

    quietly {
            gen `type' `g' = ""
            replace `g' = `varlist' if `touse'
        
            local i = 1
        while `i' <= `nfind' {
            local f : word `i' of `find'
            local r : word `i' of `replace'
                    replace `g' = `fcn'(`g', `"`f'"', `"`r'"', `n')
                    local i = `i' + 1
        }    
        }
end

The clear implication of its remaining posted on SSC as part of egenmore is that it works. So, a problem report means that I should think about looking at the code to see if I can fix the problem behind #4. I tend otherwise to agree with daniel klein while pointing out that its main device is a loop over awkward characters, which is precisely what is done in #1.

Announcement

Efficiently cleaning strings of unwanted characters and spaces

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment