Keep only string variables with character . in the string

Andrew Breazeale

Join Date: Jan 2015

Posts: 15
#1


18 Jan 2015, 01:51

Hey Statalist Community!

This is my first time posting, so if I commit any faux pas, please let me know.

Anyway, I'm working with a dataset that includes (among many other things) company names. Many of these names have non-alphanumeric characters (e.g. - / .). Below is code that I have been using to isolate those names that include non-alphanumeric characters:

BEGIN CODE
gen slash = regexm(cname, "/")
keep if slash == 1
drop slash

duplicates drop
export excel using location/excelfile.xlsx, sheet("Slash") sheetmodify cell(B2)
END CODE

Where "cname" refers to the company name, and "location/excelfile.xlsx" is some arbitrary excel file that I've been exporting to.

This code has worked well for all of the characters with the exception of the period. Whenever I use gen period = regexm(cname, "."), every entry is tagged, not just those that have a period in the name. I presume this occurs because of Stata's default interpretation of ".", but I'm not sure what to do next.

Any suggestions would be welcome.

Thank you,

Andrew
daniel klein

Join Date: Mar 2014

Posts: 3862
#2

18 Jan 2015, 02:44

Advice depends on your ultimate goal here, which is not clear to me.

I would go with simple string functions (help string functions) instead of regexm here. Probably something like

Code:

keep if strpos(cname, "/")

However, depending on what you want to do, there might be better ways.
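For the period specifically, the same idea can be sketched as follows. The company names below are made up for illustration, and cname stands in for the variable from the original post. strpos() searches for a literal substring, so the period needs no special handling.

```stata
* Minimal sketch with hypothetical company names
clear
input str20 cname
"Acme Inc."
"Smith/Jones"
"Plain Name"
"A.B. Co"
end

* strpos() returns the position of the first match, or 0 if there is none
gen pos = strpos(cname, ".")
list cname pos

* keep only the names that contain a literal period
keep if strpos(cname, ".") > 0
list cname
```

Writing the condition as `> 0` is optional, since any nonzero position counts as true, but it makes the intent explicit.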

Best
Daniel
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

18 Jan 2015, 06:18

I agree with Daniel: if you can use a simple string function to accomplish your ultimate objective, it's going to be a lot clearer.

For those who, like me, follow along on these forums to learn from others, let me add the answer to Andrew's specific question of why searching for the period returns every entry. Regular expressions support complex pattern matching, and that means some characters have a specific meaning in constructing a pattern to match. A single period matches any single character, so as long as the entry is at least one character long, it will match.
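A small sketch of that behavior, using made-up data (the names and the variables any_char and literal are hypothetical): the unescaped period matches every non-empty string, while the escaped period matches only strings that actually contain one.

```stata
* Hypothetical data: one name with a period, one without, one empty string
clear
input str20 cname
"Acme Inc."
"Smith/Jones"
""
end

* "." as a pattern means "any single character", so every non-empty string matches
gen any_char = regexm(cname, ".")

* "\." means a literal period
gen literal = regexm(cname, "\.")

list cname any_char literal
```

In this sketch only the empty string should fail the first test, which is why every entry in a real dataset of company names gets tagged.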

I did a -search regular expression- and one of the results it returned was this Stata FAQ on regular expressions. I wasn't able to find anything in the help files or documentation PDFs.
daniel klein

Join Date: Mar 2014

Posts: 3862
#4

18 Jan 2015, 06:57

You can use the backslash character to escape the period if you insist on regexm.

Code:

gen period = regexm(varname, "\.")
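As a side note, a bracket expression is another common way to make the period literal in a regular expression. The sketch below uses hypothetical data and variable names (period_esc, period_class) and simply checks that the two spellings agree.

```stata
* Sketch with hypothetical data; cname stands in for the company-name variable
clear
input str20 cname
"Acme Inc."
"Smith/Jones"
"A.B. Co"
end

* escaped period
gen period_esc = regexm(cname, "\.")

* period inside a bracket expression, where it is treated literally
gen period_class = regexm(cname, "[.]")

* the two flags should agree on every observation
assert period_esc == period_class
count if period_esc
```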

Best
Daniel
Andrew Breazeale

Join Date: Jan 2015

Posts: 15
#5

18 Jan 2015, 16:38

Thank you Daniel and William!

Daniel: Both pieces of code that you provided accomplish exactly what I wanted. After looking into strpos(), I must agree that it is a much simpler, more direct way to accomplish my goals for this project.

William: Thank you for the explanation as to why my previous code matched all entries. I knew that Stata used "." to represent missing values, but I didn't realize that it's also the ... universal character (for lack of a better descriptor). I've bookmarked that link to reference for future complex matching patterns.

Thank you again,

Andrew
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

07 Mar 2015, 11:55

Above I wrote:

"I did a -search regular expression- and one of the results it returned was this Stata FAQ on regular expressions. I wasn't able to find anything in the help files or documentation PDFs."

I later learned that this FAQ is out of date and does not reflect the subsequently enhanced capabilities of Stata regular expressions. See here for more on documentation of regular expressions as implemented in Stata.

Last edited by William Lisowski; 07 Mar 2015, 12:02.