Need to anonymize data, how to generate a new unique identifier to replace previous participant ID number?

Matt Price

Join Date: Aug 2023

Posts: 43
#1


22 Mar 2024, 16:54

Hi everyone,
I'm a bit stuck with a problem. I have a data set that I will need to make available as part of publishing our study results. To do this, I will remove the study-assigned ID variable, and replace it with something else in a non-systematic way (i.e., so that it cannot be traced back to an original study ID number). Participants typically have 6 or 7 study visits, and each record corresponds to one study visit; records are uniquely identified by a participant id and the study visit date. Participant ID is a string, with letters and numbers.

And I can't quite figure out how to do this. I don't want to sort participant id ascending or descending, and assign a new, consecutive number to identify participants (e.g., participant 1, participant 2, etc) because that might make it possible to trace back to original study id numbers.

Is there a command in Stata to sort randomly? (i.e., keep all visits from a person ordered together, but randomly sort them rather than sort them ascending or descending) Then I could replace participant id with a consecutive number that would therefore not be traceable back to the original id?

Is there some way to assign a random, new number to each individual?

How might you go about tackling this... I'm really stuck!

Thanks in advance.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

22 Mar 2024, 20:06

Code:

by participant_id (visit_num), sort: gen double shuffle1 = runiform()
by participant_id (visit_num): replace shuffle1 = shuffle1[1]
by shuffle1 (visit_num), sort: gen `c(obs_t)' deidentified_id = (_n == 1)
replace deidentified_id = sum(deidentified_id)

This will sort the participant IDs into random order, but it will give each participant_id's observations all the same deidentified_id.

Don't forget to set your random number seed before you start, and to save a crosswalk between the deidentified_id and the participant_id after you're done so you can link them if it becomes necessary later on.
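A minimal sketch of the seed and crosswalk steps, assuming the variable names used in the code above. The seed value and the file names crosswalk and deidentified_data are arbitrary placeholders, not anything from the original data set.

```stata
* Set the seed BEFORE generating any random numbers so the result is reproducible.
* 12345 is an arbitrary placeholder; pick your own and record it privately.
set seed 12345

* ... run the code above that creates shuffle1 and deidentified_id ...

* Save one row per participant linking the old and new IDs.
* Keep this file secure; it must not be released with the public data.
preserve
keep participant_id deidentified_id
duplicates drop
save crosswalk, replace          // hypothetical file name
restore

* Remove the original ID and the helper variable before sharing.
drop participant_id shuffle1
save deidentified_data, replace  // hypothetical file name
```

Anyone holding the crosswalk can re-identify participants, so it should be stored under the same protections as the original data.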

Note: If you have only a million observations or so, this will work fine. If you have substantially more than that, it is possible that the same random value will be assigned to different participants, which, of course, is unacceptable because those participants would then share a deidentified_id. So if your data set is larger than that, create two random doubles, shuffle1 and shuffle2, and in the final -by- command use both shuffle variables.
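For the larger-data case, the two-variable version might look like the sketch below. It assumes the same variable names as above and numbers participants by taking a running count of each group's first observation. The two assert lines at the end are an extra check I have added, not part of the original suggestion.

```stata
* Draw two random doubles per participant and copy the first draw to every visit.
by participant_id (visit_num), sort: gen double shuffle1 = runiform()
by participant_id (visit_num): gen double shuffle2 = runiform()
by participant_id (visit_num): replace shuffle1 = shuffle1[1]
by participant_id (visit_num): replace shuffle2 = shuffle2[1]

* Order participants by the pair of random values; flag each group's first row,
* then take a running sum so every participant gets one consecutive number.
by shuffle1 shuffle2 (visit_num), sort: gen `c(obs_t)' deidentified_id = (_n == 1)
replace deidentified_id = sum(deidentified_id)

* Check the mapping is one-to-one in both directions.
by participant_id (deidentified_id), sort: assert deidentified_id[1] == deidentified_id[_N]
by deidentified_id (participant_id), sort: assert participant_id[1] == participant_id[_N]
```

If either assert fails, two participants have collided on the same random values (or one participant has been split), and the IDs should be regenerated with a different seed.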