I am working on a research project where my data (text message transcripts) contain readily identifiable information that I'd like to remove. The transcripts are all stored in one string variable (transcript), and each student's first and last name are stored in another variable in the same data set. So the exact pattern of characters I want to remove is nicely isolated in the same data set; it just changes from row to row. Is there any way to use the contents of one string variable in a regular expression to replace parts of another string? Something like the following:
gen deidentified = regexr(lower(transcript), lower(studentname), "NAME")
where transcript is a string variable containing the contents of a text message exchange, studentname is the student's name I'd like to remove (e.g., Frank, Jose, Susan), and NAME is just the generic string I want to substitute for the student's actual name.
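To make the goal concrete, here is a minimal sketch of the intended result on made-up data. The two rows are hypothetical, and the sketch assumes that a literal (non-regex) match on the name is enough, so it uses subinstr() in place of regexr(); the "." as the last argument asks for every occurrence to be replaced.

```stata
* Hypothetical toy data: one transcript and one name per row
clear
input str60 transcript str10 studentname
"Hi Frank, are you coming to class today?" "Frank"
"Thanks Susan! Susan, see you at 3." "Susan"
end

* Assumption: a literal match is sufficient, so subinstr() stands in for regexr().
* The second argument is evaluated row by row, so each row uses its own name.
gen deidentified = subinstr(lower(transcript), lower(studentname), "NAME", .)

list deidentified, clean
```

Because both strings are lowercased before matching, the output is all lowercase apart from the inserted NAME placeholder.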
Being able to remove names would de-identify the data to a point where I would be comfortable sharing it with colleagues who have more interest in and time to analyze this data set than I do but who should not have access to such readily identifiable data.
Many thanks!