How to identify observations with special characters, numbers and whitespaces (expect leading and trailing spaces) - string var

Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#1

How to identify observations with special characters, numbers and whitespaces (expect leading and trailing spaces) - string var

17 Oct 2020, 10:23

Dear all,

I am working with string data and I would like to identify observations that have special characters ("/" "?" "*" "(" "." ")" and others), numbers and whitespace (except leading and trailing spaces) in a string variable.

The dataset is like

string_var

E.V.A

F7010P

175/70R14

C B U Q

*E.D.T.A

(S6X60D)

COLA

In particular, I would like the dataset to be as the following

string_var has_special_char

E.V.A 1

F7010P 1

175/70R14 1

C B U Q 1

*E.D.T.A 1

(S6X60D) 1

COLA 0

where 'has_special_char' is a dummy for whether the observation has special characters, numbers or whitespace (except leading and trailing spaces) in the variable 'string_var'.

Could you help me to find a solution for that?

Thank you very much!

Below I provide the code for importing the example dataset into Stata:

clear
input str10 string_var
"E.V.A"
"F7010P"
"175/70R14"
"C B U Q"
"*E.D.T.A"
"(S6X60D)"
"COLA"
end
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30083
#2

17 Oct 2020, 10:49

Code:

gen byte wanted = regexm(string_var, "[^a-zA-Z0-9]")

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

17 Oct 2020, 11:42

Clyde's code overlooked your desire to accept leading and trailing spaces. My approach would be to trim leading and trailing spaces from your variable before doing anything else: I can't see a good reason to keep them, and they can cause no end of confusion.

Code:

replace string_var = trim(string_var)

But if you're unwilling to go that far,

Code:

gen byte wanted = regexm(trim(string_var), "[^a-zA-Z0-9]")

will enhance Clyde's code to ignore leading and trailing spaces.
Comment
Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#4

17 Oct 2020, 12:34

Thank you Clyde Schechter and William Lisowski !

Yes, Clyde, dataex is much better! I will use it in the future!

Just one remaining question: I ran the code that you suggested, i.e.,

replace string_var = trim(string_var)
gen byte wanted = regexm(trim(string_var), "[^a-zA-Z0-9]")

and I noted that the value of the dummy variable for the second observation (which is such that string_var == "F7010P") is zero and not one as it was meant to be.

Can you help me to figure out what is the point of that?

Thank you!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30083
#5

17 Oct 2020, 12:56

Oh, sorry, I misread your original request. I didn't think numbers were also to be marked as 1 along with special characters. So just remove the 0-9 from inside the brackets in the code:

Code:

replace string_var = trim(string_var) gen byte wanted = regexm(trim(string_var), "[^a-zA-Z]")

That will flag any observation that contains anything other than letters of the alphabet.
Comment
Otavio Conceicao

Join Date: Feb 2017

Posts: 65
#6

17 Oct 2020, 13:01

Perfect! Now it works smoothly! Thank you very much, Clyde Schechter !
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#7

17 Oct 2020, 14:59

Code:

assert regexm("Conceição", "[^a-zA-Z]") != ustrregexm("Conceição", "\P{L}")

https://www.regular-expressions.info/unicode.html
Comment

string_var	has_special_char
E.V.A	1
F7010P	1
175/70R14	1
C B U Q	1
*E.D.T.A	1
(S6X60D)	1
COLA	0

Announcement

How to identify observations with special characters, numbers and whitespaces (expect leading and trailing spaces) - string var

Comment

Comment

Comment

Comment

Comment

Comment