Removing all capitalized letters from a string

Peter Goff

Join Date: Dec 2014

Posts: 1
#1


12 May 2015, 13:53

Hi All,
I have a variable that contains a 200-500 word description of each individual's perspective on leadership. Many observations contain unwanted characters (e.g., # or %) and proper nouns. I'm using these data in a correlated topic model and would like to remove both the odd characters (anything that is not a letter) and all proper nouns. Any direction would be greatly appreciated!
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

12 May 2015, 18:37

If you export the data to a delimited text file, you can apply -filefilter- to remove the special characters. Then import the filtered file back into Stata.
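A minimal sketch of that round trip, assuming the only characters to strip are # and %. The file names (raw.csv, step1.csv, clean.csv) and the toy observation are hypothetical, and the empty to("") pattern is assumed to delete each match; check -help filefilter- before relying on it.

```stata
* Sketch only: file names and toy data are hypothetical placeholders.
clear
input str40 description
"Leadership is 100% about #trust"
end

* Write the data to a delimited text file
export delimited using "raw.csv", replace

* -filefilter- takes one pattern per pass, so chain one call per character
filefilter "raw.csv" "step1.csv", from("#") to("") replace
filefilter "step1.csv" "clean.csv", from("%") to("") replace

* Read the cleaned file back in, taking variable names from the first row
import delimited using "clean.csv", varnames(1) clear
list description
```

Removing everything that is not a letter this way would need one pass per unwanted character, so it is only practical when the list of offending characters is short.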

Proper nouns are harder. Let me assume your variable is called description, and that any group of blank-delimited (or initial or terminal) characters that begins with an upper case letter is a proper noun. I also assume that you have no variables whose names begin with w. (If you do, pick a different stub in the -split- command that won't clash with any existing variable names.)

Code:

// BREAK DESCRIPTION INTO WORDS
split description, gen(w)

// IF A WORD BEGINS WITH AN UPPER CASE LETTER
// REPLACE IT WITH THE NULL STRING
foreach v of varlist w* {
    replace `v' = "" if strpos("`c(ALPHA)'", substr(`v', 1, 1))
}

// PUT THE PIECES BACK TOGETHER
egen new_description = concat(w*), punct(" ")

The limitations of this approach are:
1. It will delete things that may not be proper nouns if they happen to be groups of characters starting with an upper case letter.
2. If the whitespace in the variable description includes tabs or non-printing spaces they will not be picked up as delimiters of a word, which may result in missing a "proper noun" or falsely deleting additional material attached to a "proper noun."
3. I have no idea how it will respond to Unicode characters, particularly if they occur at the beginning of a "word."
4. The removed "proper nouns" vanish without a trace. You might prefer to put some placeholder characters in there to mark the deletion sites, particularly if (as I understand to be the case) this variable contains text that should read like narrative. If so, just modify the -replace- command accordingly.
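Limitations 2 and 4 can be partly addressed with small changes. The sketch below is one possible variant, not a definitive fix: it assumes tabs are the only non-blank whitespace present, and the [NAME] placeholder and the toy observation are hypothetical.

```stata
* Sketch only: toy data and the [NAME] placeholder are hypothetical.
clear
input str60 description
"we met Alice near the old Boston office"
end

* Limitation 2: turn tabs into ordinary blanks before splitting
replace description = subinstr(description, char(9), " ", .)

* Break the text into words w1, w2, ...
split description, gen(w)

foreach v of varlist w* {
    * Limitation 4: mark the deletion site instead of blanking the word
    replace `v' = "[NAME]" if strpos("`c(ALPHA)'", substr(`v', 1, 1)) > 0 & `v' != ""
}

* Put the pieces back together
egen new_description = concat(w*), punct(" ")
list new_description
```

The toy sentence deliberately starts with a lower case word. A sentence-initial capital would be replaced as well, which is limitation 1 and is not addressed here.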

But it's a start.

By the way, the description of what you want to do in your post is rather different from what you say in the title. If what you want to do is just remove upper case letters, not the words they begin with, then -filefilter- will handle that for you as well.
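If only the upper case letters themselves should go, an in-memory alternative to -filefilter- is to loop over the letters with subinstr(). This swaps in a different technique from the one named above, and the toy observation is hypothetical.

```stata
* Sketch only: an in-memory alternative to -filefilter-; toy data are hypothetical.
clear
input str40 description
"Hello World ABC def"
end

* c(ALPHA) holds the 26 upper case letters separated by blanks
foreach L in `c(ALPHA)' {
    * Delete every occurrence of this letter
    replace description = subinstr(description, "`L'", "", .)
}
list description
```

Only the letters are deleted, so the remainder of each word and the surrounding blanks stay in place.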