Delete string portion in one field based on contents of other fields

David Torres

Join Date: Jun 2014

Posts: 45
#1

Delete string portion in one field based on contents of other fields

25 Apr 2016, 06:47

So I'm using Stata 13.1 and have a question. How can I delete string portions in a field based on the contents of another field?

I am using a dataset that includes a variable called comments that includes brief write-ups of student disciplinary issues (this is individual-level data). Some comments may include a student's name and I need to de-identify those comments before the data are sharable. Can I use the variables for first, middle, and last names to search the comments field to get rid of identifying information?

I realize there's probably at least two ways to accomplish what I'm asking.

1. The first is to take the data one row at a time.
2. The other is to search each row's comment field and remove name strings based on the complete list of names available in the data.

The second way is preferable, if doable.
Tags: None
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#2

25 Apr 2016, 06:58

The simplest way to do this is to use subinstr() and replace any name found in the comment variable with a null string, effectively erasing it. This is effectively your first approach.

Code:

replace Comment=subinstr(Comment, first_name, "", .) replace Comment=subinstr(Comment, middle_name, "", .) replace Comment=subinstr(Comment, last_name, "",.)

You will have a problem if names are misspelled or if abbreviated names are used in the comments (Bob, Will, Sam, Liz, etc).

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
David Torres

Join Date: Jun 2014

Posts: 45
#3

25 Apr 2016, 07:22

Thanks, Carole. Would your code take into account differences in case (capitalized, not capitalized)? There are so many difficulties that can present themselves; this is just one.
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#4

25 Apr 2016, 07:34

No, the previous code require the cases to match. If you want to search for names in Comment that are lowercase versions of the name variables, you would use this:

Code:

replace Comment=subinstr(Comment, lower(first_name) , "" , .)

You can search for both upper & lower simultaneously with regular expressions. See help regex.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
David Torres

Join Date: Jun 2014

Posts: 45
#5

25 Apr 2016, 07:45

My problem is that the names fields are input in all caps, but the comments are 1) all caps, 2) all lower case, or 3) only first letter capitalized. Most are in the latter category. The first two I have taken care of using your code.

Another issue I've noticed is that some names from other rows show up in comments. Since the data I'm working with are real world disciplinary data, you have cases where students get in trouble as a group. In these cases, the same write-up might be used and all students' names included. Is there no way to handle that?
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#6

25 Apr 2016, 07:53

How large are the datasets (how many names)?

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
David Torres

Join Date: Jun 2014

Posts: 45
#7

25 Apr 2016, 07:59

There are about 60,000 rows of data representing 20,000 students. As you can imagine, some students get in trouble more than once in an academic year.

By the way, I figured out how to convert my all caps strings into a string where only the first letter is capitalized, so I've got all three possible formats I mentioned above. I just want to make sure that I exclude student names from comments that might come from another row in the data.

Thanks so much for your help thus far. You've made what I thought would be a challenging project quite easy.
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#8

25 Apr 2016, 08:14

You can also use

Code:

replace Comment=subinstr(Comment, proper(first_name) , "" , .)

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#9

25 Apr 2016, 08:27

This may take a long time, but will remove any specified name from your Comment variable:

Code:

forvalues i=1/`=_N' { forvalues j=1/`=_N' { replace var3=subinstr(Comment[`i'], lower(first_name[`j']) , "" , .) in `i' replace var3=subinstr(Comment[`i'], lower(last_name[`j']) , "" , .) in `i' **and all other replace commands; add-- in `i' --at the end of the replace command local ++j } local ++i }

This loops through each Comment (i) and then loops through each name (j) to check whether it is in Comment. There is probably a much faster way to do this, but I can't think of it at the moment.

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2415

#10

25 Apr 2016, 08:41

Here's a Mata-based approach, which avoids looping over observations, but which might not be quicker (and which could have errors, as I have not tested it).

Code:

preserve
// get list of nonduplicating lower-case names as a vector in Mata
keep first last middle
stack first last middle, into(name) clear
replace name = lower(name)
duplicates drop name, force
local N_names = _N
putmata name = name
restore
// Substitute "" for any name in comment.  Using subinword() will prevent names that are part of a
// word in the comment field from being removed.  Think "jan" and "janitor."
replace comment = lower(comment) 
forval i = 1/`N_names' {
   mata: st_local("eachname", name[`i'])
   replace comment = subinword (comment, "`eachname'", "", .)
}

Comment

Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#11

25 Apr 2016, 08:52

Mike is correct, using subinword is a much preferred option to subinstr.
If I am reading Mike's code correctly, this will replace your Comment variable with all lowercase characters. If this is a problem, I think you can avoid this by removing the replace comment=lower(comment) line and using lower(), upper(), proper() within multiple replace comment for the last lines. Note that there should be no space between subinword and the open parentheses "(".

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Comment
David Torres

Join Date: Jun 2014

Posts: 45
#12

25 Apr 2016, 11:10

Sorry, I was away from my desk. I will attempt what you've given here as soon as I can and let you know what I was able to accomplish. Thanks.
Comment
David Torres

Join Date: Jun 2014

Posts: 45
#13

26 Apr 2016, 08:47

I'm just now getting to the code Mike provided (thanks, sir), but I'm getting an error message that the syntax is invalid.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35691
#14

26 Apr 2016, 08:51

That really doesn't allow us to explain what you are doing wrong. Perhaps you typed something incorrectly. Perhaps your Stata is not up-to-date. There are other possibilities too. Where does the code fail? Which statement preceded the error?
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#15

26 Apr 2016, 09:36

Here's yet another approach that shows some techniques I like to use:

Code:

clear
input str6(first middle lastname)
"Robert" ""       "Picard"
"David"  ""       "Torres"
"Carole" "J."     "Wilson"
"John"   "Quincy" "Adams" 
end
gen initials = substr(first,1,1) + substr(middle,1,1) + substr(lastname,1,1)
save "students.dta", replace

clear
input str44 comments
"John is a good student"                      
"The dog ate her homework according to Carole"
"Robert has a tendency to daydream"           
"Adams stole Torres' lunch money"             
"JQA said something snotty to the lunchlady"  
"The lunchlady is not mad at JQA."  
end

* put the names and the comments side-by-side in memory
merge 1:1 _n using "students.dta", nogen

* work on a copy of the original comments, add spaces to match on words
gen comments_fixed = " " + lower(comments) + " "

* pad punctuation characters to match by word
replace comments_fixed = subinstr(comments_fixed,"'"," ' ",.)
replace comments_fixed = subinstr(comments_fixed,"."," . ",.)
replace comments_fixed = subinstr(comments_fixed,","," , ",.)

* text is always easier to read when left justified (leftalign if from SSC)
leftalign

* loop over each name 
gen id = _n if !mi(lastname)
sum id, meanonly
local n = r(N)
forvalues i = 1/`n' {
    foreach v of varlist first middle lastname initials {
        local s = `v'[`i']
        local s = " " + lower("`s'") + " "
        replace comments_fixed = subinstr(comments_fixed, "`s'", " ", .)
    }
}

Announcement

Delete string portion in one field based on contents of other fields

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment