  • Removing all capitalized letters from a string

    Hi All,
    I have a variable that contains 200-500 word descriptions of individuals' perspectives on leadership. Many observations contain unwanted characters (e.g., # or %) and proper nouns. I'm using this data in a correlated topic model and would like to remove both the odd characters (anything that is not a letter) and all proper nouns. Any direction would be greatly appreciated!

  • #2
    If you export the data to a delimited text file, you can apply -filefilter- to remove the special characters. Then re-import the cleaned file into Stata.
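
    If you would rather stay inside Stata, a regular expression is one possible alternative to the -filefilter- round trip. This is only a sketch, not the -filefilter- approach itself: it assumes Stata 14 or later (for ustrregexra() and stritrim()) and a string variable named description, and clean_description is a made-up name for the result.
    Code:
    // KEEP ONLY UNACCENTED LETTERS AND BLANKS; DROP EVERYTHING ELSE
    generate clean_description = ustrregexra(description, "[^A-Za-z ]", "")
    
    // COLLAPSE RUNS OF BLANKS LEFT BEHIND AND TRIM THE ENDS
    replace clean_description = stritrim(strtrim(clean_description))
    Note that this also strips digits, punctuation, tabs, and accented letters, which may or may not be what you want for a topic model.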

    Proper nouns are harder. Let me assume your variable is called description, and that any group of blank-delimited (or initial or terminal) characters that begins with an upper case letter is a proper noun. I also assume that you have no variables whose names begin with w. (If you do, pick a different stub in the -split- command that won't clash with any existing variable names.)
    Code:
    // BREAK DESCRIPTION INTO WORDS
    split description, gen(w)
    
    // IF A WORD BEGINS WITH AN UPPER CASE LETTER
    // REPLACE IT WITH THE NULL STRING
    foreach v of varlist w* {
        replace `v' = "" if strpos("`c(ALPHA)'", substr(`v', 1, 1))
    }
    
    // PUT THE PIECES BACK TOGETHER
    egen new_description = concat(w*), punct(" ")
    The limitations of this approach are:
    1. It will delete things that may not be proper nouns if they happen to be groups of characters starting with an upper case letter.
    2. If the whitespace in the variable description includes tabs or non-printing spaces they will not be picked up as delimiters of a word, which may result in missing a "proper noun" or falsely deleting additional material attached to a "proper noun."
    3. I have no idea how it will respond to Unicode characters, particularly if they occur at the beginning of a "word."
    4. The removed "proper nouns" vanish without a trace. You might prefer to put some placeholder characters in there to mark the deletion sites, particularly if (as I understand to be the case) this variable contains text that should read like narrative. If so, just modify the -replace- command accordingly.
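
    Limitations 2 and 4 could be softened along the following lines. Treat this as a sketch under assumptions: it handles only tab characters (not other non-printing whitespace), the "[NAME]" marker is an arbitrary placeholder chosen for illustration, and the final clean-up line assumes Stata 14 or later for stritrim().
    Code:
    // LIMITATION 2: TURN TABS (ASCII 9) INTO BLANKS SO -split- SEES THEM
    replace description = subinstr(description, char(9), " ", .)
    
    split description, gen(w)
    
    // LIMITATION 4: MARK EACH DELETION SITE INSTEAD OF BLANKING IT
    foreach v of varlist w* {
        replace `v' = "[NAME]" if `v' != "" & strpos("`c(ALPHA)'", substr(`v', 1, 1))
    }
    
    egen new_description = concat(w*), punct(" ")
    
    // REMOVE THE PADDING THAT EMPTY w* VARIABLES LEAVE BEHIND
    replace new_description = stritrim(strtrim(new_description))
    The extra `v' != "" condition just keeps the placeholder out of the empty w* variables that -split- creates for shorter observations.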

    But it's a start.

    By the way, the description of what you want to do in your post is rather different from what you say in the title. If what you want to do is just remove upper case letters, not the words they begin with, then -filefilter- will handle that for you as well.
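
    For that narrower task, an in-memory alternative to -filefilter- would be to loop over the 26 upper case letters with subinstr(). This is a sketch, not the -filefilter- route: it assumes the variable is called description, and no_caps is a made-up name for the result.
    Code:
    // WORK ON A COPY SO THE ORIGINAL TEXT IS PRESERVED
    generate no_caps = description
    
    // c(ALPHA) HOLDS "A B C ... Z"; DELETE EVERY OCCURRENCE OF EACH LETTER
    foreach L in `c(ALPHA)' {
        replace no_caps = subinstr(no_caps, "`L'", "", .)
    }
    This removes only the unaccented letters A-Z and leaves the rest of each word in place.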