  • Using Contents of String Variable in a Regular Expression

    I am working on a research project where my data (text message transcripts) contain readily identifiable information I'd like to remove. The transcripts are all stored in one string variable (transcript), and each student's first and last name are stored in another column of the same data set. So the exact pattern of characters I want to remove is nicely isolated in the same data set; it just changes from row to row. Is there any way to use the contents of one string variable in a regular expression to replace parts of another string? Something like the following:

    gen deidentified = regexr(lower(transcript), lower(studentname), "NAME")

    where transcript is a string variable containing the contents of a text message exchange, studentname is the student's name I'd like to remove (e.g., Frank, Jose, Susan), and NAME is just the generic string I want to substitute for the student's actual name.

    Being able to remove names would de-identify the data to a point where I would be comfortable sharing it with colleagues who have more interest in and time to analyze this data set than I do but who should not have access to such readily identifiable data.

    Many thanks!
    Last edited by Gary Coyne; 31 Jul 2023, 18:10.

  • #2
    Perhaps a data example (with made-up data) would make it clearer, but I don't see why you need to use regular expressions here. Why won't
    Code:
    replace transcript = subinstr(lower(transcript), lower(studentname), "NAME", .)
    do what you need?
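
    A minimal sketch of that suggestion on made-up data may help. The three rows below are invented for illustration; the variable names transcript and studentname and the new variable deidentified follow the original post. Because lower() is applied to the whole transcript, the result is entirely lowercase apart from the inserted NAME.

    ```stata
    * Made-up example data (not from the actual study)
    clear
    input str40 transcript str10 studentname
    "Hi Frank, are you coming? Frank?" "Frank"
    "this is jose, running late"       "Jose"
    "Susan said she will call"         "Susan"
    end

    * subinstr() accepts string expressions, so the text to replace
    * can come from another variable and differ from row to row.
    * The final "." asks for every occurrence to be replaced.
    gen deidentified = subinstr(lower(transcript), lower(studentname), "NAME", .)

    list deidentified, clean noobs
    ```

    Generating a new variable rather than using replace keeps the original transcript available for checking the result before it is dropped.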


    • #3
      Transcripts are usually fairly structured documents, so you may be able to take advantage of this to remove names without needing to rely on matching (sub)strings like a name. This could be useful if you have legal vs given names to deal with. Is there always a fixed string that appears immediately after the student name?
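
      If plain substring matching turns out to be too blunt (for example, "Sue" also sits inside "issue"), one possible variation is a regular expression built from the name variable with word boundaries around it. This is only a sketch: it assumes Stata 14 or later for ustrregexra(), the single row of data is made up, and a name containing regex metacharacters such as a period would be read as a pattern rather than literal text.

      ```stata
      * Made-up example row (not from the actual study)
      clear
      input str40 transcript str10 studentname
      "Sue had an issue, so SUE left" "Sue"
      end

      * Build the pattern from the variable: \b marks a word boundary,
      * so the "sue" inside "issue" is left alone.
      * The last argument (1) requests case-insensitive matching,
      * which avoids lowercasing the whole transcript.
      gen deidentified = ustrregexra(transcript, "\b" + studentname + "\b", "NAME", 1)

      list deidentified, clean noobs
      ```

      Unlike regexr(), which replaces only the first match, ustrregexra() replaces every match in the string.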


      • #4

        Thanks Clyde! I didn't know about the subinstr() function, but that was what I needed.
