Special Characters in String Vars

Rebecca Sansale

Join Date: May 2015

Posts: 18
#1

Special Characters in String Vars

24 Aug 2016, 13:34

Hi everyone--

I have a string var that is riddled with special characters, which is ultimately precluding me to complete a fuzzy match on two data sets. I am trying to strip this string var of all special characters (i.e. "" , . - () [] etc.), but have not been successful in finding a way to get rid of all characters. I have tried using commands such as:

replace shortcounterpartyname=regexr(shortcounterpartyname , "`" "," "." "'" "inc." " inc." "/" " - " "-" " . " "[0-9][0-9][0-9][0-9]" " s.a. " "llc" " llc" " ltd." "(" ")", "")
replace shortcounterpartyname=subinstr(shortcounterpartyna me, "(","",.)

However, I have yet to find a way to get rid of quotation marks (even after reading a plethora of other help articles on this same issue including http://www.stata.com/statalist/archi.../msg00179.html). Additionally, I have found that Stata is dropping the first letter of some names, even if that observation doesn't have any special characters within its name.

If anyone has any suggestions regarding this issue, please let me know!

Thanks,

Rebecca
Tags: fuzzy match, string
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

24 Aug 2016, 13:58

I'm sure there is some relatively simple way to do this using regular expressions, but I think the time it will take you to figure that out, if you are not really proficient at them, is not worth the trouble when you can do it fairly simply:

Code:

gen clean_string = dirty_string fovalues i = 0/255 { if !inrange(`i', 48, 57) /// DIGITS & !inrange(`i', 65, 90) /// UPPER CASE LETTERS & !inrange(`i', 97, 122) { // LOWER CASE LETTERS replace clean_string = subinstr(clean_string, `"`=char(`i')'"', "", .) } }

Note: Not tested. Beware of typos.

The above code will strip out everything except alphanumeric characters from dirty_string to create clean_string. If you want to retain certain punctuation marks and spaces, you can add a & !inlist( `i',....) term to the -if- command, listing the ascii codes for the particular characters you want to save. If you don't know the numeric ascii codes, you can find many tables available on google. But more convenient is to download Nick Cox's -asciiplot- command from SSC. It produces a table of numeric codes for ASCII characters in a Graph window.
1 like
Comment
Dimitriy V. Masterov

Join Date: Mar 2014

Posts: 609
#3

24 Aug 2016, 15:08

You can use -charlist- to get the list of ascii codes that you could replace.
1 like
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

24 Aug 2016, 16:41

Here's some simple technique using a Stata 14 function that you may be able to build on.

Code:

. generate str20 clean = ustrregexra(test,`"[^a-zA-Z0-9]"',"")

. list, clean

           test    clean  
  1.    abc.efg   abcefg  
  2.   aB!@#hij    aBhij  
  3.      ab"7d     ab7d  

.

Comment

Rebecca Sansale

Join Date: May 2015

Posts: 18
#5

25 Aug 2016, 07:30

Originally posted by Clyde Schechter View Post

I'm sure there is some relatively simple way to do this using regular expressions, but I think the time it will take you to figure that out, if you are not really proficient at them, is not worth the trouble when you can do it fairly simply:

Code:

gen clean_string = dirty_string fovalues i = 0/255 { if !inrange(`i', 48, 57) /// DIGITS & !inrange(`i', 65, 90) /// UPPER CASE LETTERS & !inrange(`i', 97, 122) { // LOWER CASE LETTERS replace clean_string = subinstr(clean_string, `"`=char(`i')'"', "", .) } }

Note: Not tested. Beware of typos.

The above code will strip out everything except alphanumeric characters from dirty_string to create clean_string. If you want to retain certain punctuation marks and spaces, you can add a & !inlist( `i',....) term to the -if- command, listing the ascii codes for the particular characters you want to save. If you don't know the numeric ascii codes, you can find many tables available on google. But more convenient is to download Nick Cox's -asciiplot- command from SSC. It produces a table of numeric codes for ASCII characters in a Graph window.

This loop for the most part works great! However, I am receiving an error message "too few quotes". I have combed through the code a few times now, and cannot find where the missing quote should be.
Comment
Rebecca Sansale

Join Date: May 2015

Posts: 18
#6

25 Aug 2016, 07:31

Originally posted by William Lisowski View Post

Here's some simple technique using a Stata 14 function that you may be able to build on.

Code:

. generate str20 clean = ustrregexra(test,`"[^a-zA-Z0-9]"',"") . list, clean test clean 1. abc.efg abcefg 2. aB!@#hij aBhij 3. ab"7d ab7d .

I am using Stata 13.1, is there a similar function available for this version?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

25 Aug 2016, 07:48

There is an ancient egen function sieve() in egenmore (SSC) that will do this.

*! 1.0.0 NJC 23 Sept 2002
program define _gsieve
version 7.0

I certainly haven't tested it on anything exotic, but it passes William's test

Code:

clear input str8 test `"abc.efg"' `"aB!@#hij"' `"ab"7d"' end egen wanted = sieve(test), keep(a n) list +-------------------+ | test wanted | |-------------------| 1. | abc.efg abcefg | 2. | aB!@#hij aBhij | 3. | ab"7d ab7d | +-------------------+

But note the general Advice already given you:

11. What should I say about the version of Stata I use?

The current version of Stata is 14.1. Please specify if you are using an earlier version; otherwise, the answer to your question may refer to commands or features unavailable to you. Moreover, as bug fixes and new features are issued frequently by StataCorp, make sure that you update your Stata before posting a query, as your problem may already have been solved.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

25 Aug 2016, 08:15

Notwithstanding what -sieve- can do, I'd be curious to see a solution along the lines of Clyde's suggestion, as I also have in similar context encountered the "too few quotes" error message while (I think) dealing with strings containing embedded quotes. I suppose this amounts to understand just how -sieve- successfully manages to escape the quote character.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#9

25 Aug 2016, 08:27

*!!! Warning: in-joke ahead.

If you can break the sieve magic, you have to face the dragons.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

25 Aug 2016, 08:51

OK, I don't get the in joke.

Looking at the code for -egen, sieve()-, it appears that when used with the keep() option, rather than looping over all possible characters, it builds up its result by going one character at a time through the source string and accepting each character only if it matches the list of "keepers." So it never really has to deal with the expression `"`"', which is what breaks my code in #5 when it tries to purge all instances of the ` character.

But the magic can be broken. If you ask -egen, sieve()- to omit(`), it fails.

So, where are these dragons I must face now?
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

#11

25 Aug 2016, 09:20

Clyde's code can be coerced into doing what was intended.

Code:

gen clean_string = dirty_string
forvalues i = 0/255 {
    if !inrange(`i', 48, 57) /// DIGITS
        & !inrange(`i', 65, 90) /// UPPER CASE LETTERS
        & !inrange(`i', 97, 122) { // LOWER CASE LETTERS
            quietly replace clean_string = subinstr(clean_string, `= `"char(`i')"' ', "", .)
    }
}

Code:

. list, clean

       dirty_s~g   clean_~g  
  1.     abc.efg     abcefg  
  2.   aB!@#hi j      aBhij  
  3.       ab"7d       ab7d  
  4.        xy`z        xyz

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#12

25 Aug 2016, 09:42

The in-joke hinges on "sieve" being read as another way of saying "goblet".

On the technical point: Clyde is correct. sieve() fails on being asked explicitly to omit single left quotes. But for the problem stated, that doesn't bite, as the choice is (still) to keep alphabetic and numeric.

Code:

clear input str8 test `"abc.efg"' `"aB!@#hij"' `"ab"7d"' `"xy`z"' end egen wanted = sieve(test), keep(a n) list +-------------------+ | test wanted | |-------------------| 1. | abc.efg abcefg | 2. | aB!@#hij aBhij | 3. | ab"7d ab7d | 4. | xy`z xyz | +-------------------+
Comment
Ahmad Mobariz

Join Date: Oct 2019

Posts: 26
#13

25 Mar 2020, 10:49

I have a follow-up question: My string variable has text in Persian language. When I apply the code recommended by Clyde Schechter , all except some values that are English get removed. How do I modify the code so I can keep the values of interest? My data is in Persian.
Comment

Announcement

Special Characters in String Vars

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment