Using Contents of String Variable in a Regular Expression

Gary Coyne

Join Date: Jul 2023

Posts: 2
#1

31 Jul 2023, 18:07

I am working on a research project where my data (text message transcripts) contain readily identifiable information I'd like to remove. The transcripts are all stored in one string variable (transcript), and each student's first and last name are stored in another variable in the same data set. So the exact pattern of characters I want to remove is nicely isolated in the same data set; it just changes from row to row. Is there any way to use the contents of one string variable in a regular expression to replace parts of another string? Something like the following:

gen deidentified = regexr(lower(transcript), lower(studentname), "NAME")

where transcript is a string variable containing the contents of a text message exchange, studentname is the student's name I'd like to remove (e.g., Frank, Jose, Susan), and NAME is just the generic string I want to substitute for the student's actual name.

Being able to remove names would de-identify the data to a point where I would be comfortable sharing it with colleagues who have more interest in and time to analyze this data set than I do but who should not have access to such readily identifiable data.

Many thanks!

Last edited by Gary Coyne; 31 Jul 2023, 18:10.
Tags: data cleaning, regexr, Regular expressions
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

31 Jul 2023, 18:34

Perhaps a data example (with made-up data) would make it clearer, but I don't see why you need to use regular expressions here. Why won't

Code:

replace transcript = subinstr(lower(transcript), lower(name), "NAME", .)

do what you need?
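A minimal sketch of that suggestion on made-up data, assuming the name variable is called studentname as in post #1 (the three observations are invented for illustration). Note that lower() lowercases the whole transcript and that subinstr() matches substrings, so "Frank" would also be replaced inside "frankly". The last line is a possible alternative, not part of the suggestion above: assuming Stata 14 or later, ustrregexra() accepts an expression built from another variable as its pattern, so it can match whole words without regard to case and leave the rest of the text unchanged. It assumes the names contain no regular-expression metacharacters.

Code:

```stata
* Made-up example data; variable names follow post #1
clear
input str60 transcript str10 studentname
"Hi Frank, your form is due Friday" "Frank"
"Thanks! This is JOSE by the way" "Jose"
"Is Susan coming? susan said yes" "Susan"
end

* Lowercase both sides so the match ignores case, then replace
* every occurrence of the name in that row (the . means "all")
gen deidentified = subinstr(lower(transcript), lower(studentname), "NAME", .)

* Alternative (assumes Stata 14+): build the pattern from the name variable,
* match on word boundaries, and ignore case via the final argument (1),
* which leaves the case of the remaining text as it was
gen deidentified2 = ustrregexra(transcript, "\b" + studentname + "\b", "NAME", 1)

list deidentified deidentified2, clean noobs
```

Either way, it is worth browsing the result before sharing it, since nicknames or misspelled names in the transcripts will not match the stored name.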
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2405
#3

31 Jul 2023, 18:53

Transcripts are usually fairly structured documents, so you may be able to take advantage of this to remove names without needing to rely on matching (sub)strings like a name. This could be useful if you have legal vs given names to deal with. Is there always a fixed string that appears immediately after the student name?
Gary Coyne

Join Date: Jul 2023

Posts: 2
#4

01 Aug 2023, 10:26

Thanks Clyde! I didn't know about the subinstr() function, but that was what I needed.