Generate unique persistent person identifiers (PID) with Stata

Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#1

18 Aug 2017, 07:40

I would like to know if anyone out there has implemented a way to create unique persistent person identifiers in Stata.
My idea is that the identifier contains a mix of person characteristics (e.g. the first letters of the first and last name) and a six-digit random number.
This task should not be too difficult to implement, but my database grows over time: in addition to today's 30,000 identities, I will have to create new ones in half a year's time. So is there a way to avoid identical random numbers in a later run of the syntax?
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

18 Aug 2017, 07:49

Is there a combination of characteristics that you are reasonably sure will always be unique per person? If so, you could concatenate their values and use Mata's hash1() function to generate a unique ID.
Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#3

18 Aug 2017, 08:05

Jesse, no, unfortunately we only have the first and last name to identify the persons. All other attributes might not be known or are not unique.
In principle, we try to get external identifiers for these persons, but at the moment good external identifiers are not common, and matching each person to the correct external identifier would be an enormous task. Thus, a lot of persons do not have identifiers in our system.
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#4

18 Aug 2017, 09:21

I'd say that you then have a fundamental issue, unrelated to any programming concerns. If the data doesn't allow you to identify unique persons, then how can Stata generate IDs for them?
Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#5

18 Aug 2017, 09:37

Maybe I was not precise enough: the assignment of an identifier should follow a process of deduplication. We have persons with the same first and last name. We look at the identifiers we know (but only a few persons have any) and at the subject areas of their activities. After that we decide whether they get merged or should each get a unique identifier, so that next time we know that these two persons have separate identities. Some manual work is involved in looking persons up.
Later a third person with the same first and last name may be entered in our database, and we should be able to assign them another identifier if they are not one of the earlier two.
And this identifier must not have been used before.

Why not use a running number? Because we might use this identifier for external representation, and an identifier should be opaque to at least some degree. (That is why we might drop the idea above of using the first letters of the first and last name.)
ben earnhart

Join Date: May 2014

Posts: 1027
#6

18 Aug 2017, 09:47

You could pre-generate a table of random numbers, de-duplicate it, and then use it in place of a running number. It might not be feasible if you have multiple people assigning IDs, but you'd have the same problem in the case of a running number anyway. So just pre-generate a few million (or billion) ids, de-duplicate, and cross each one off the list as it gets used.
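
A minimal sketch of this pool idea, assuming six-digit numeric IDs and Stata 14 or newer (for runiformint()). The seed, the pool size, the variable names draw_order and used, and the file name idpool.dta are all illustrative assumptions, not something settled in this thread.

```stata
* Sketch: pre-generate random IDs once, de-duplicate, then cross them off.
clear
set seed 20170818                          // assumption: any fixed seed
set obs 200000                             // assumption: draws before de-duplication
gen long id = runiformint(100000, 999999)  // random six-digit numbers
gen long draw_order = _n                   // remember the random draw order
duplicates drop id, force                  // keep one copy of every number
isid id                                    // stops with an error if any id repeats
gen byte used = 0                          // 0 = still free, 1 = crossed off

* Hand out the next k free IDs and cross them off the list.
local k = 3
sort used draw_order                       // free IDs first, in draw order
replace used = 1 in 1/`k'
list id used in 1/5

* Keep the pool between runs so that a later run continues where this one stopped.
save "idpool.dta", replace                 // hypothetical file name
```

In a later run one would load idpool.dta instead of regenerating it, repeat the sort and replace step for the new persons, and save again. Because the pool is only ever consumed, a number handed out today cannot be drawn again in half a year.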

Jesse Wursten

Join Date: Jan 2016

Posts: 915
#7

18 Aug 2017, 10:19

But then you can just use what I suggested in the first place, or am I missing something? The code below generates a unique hash for the combination of fname, sname and area (in this case generated as monthnames, weekdays and letters). Note that observations that share the same characteristics get awarded the same id. There's also no randomisation involved, so if you run this again tomorrow with more people, you will get the same hashes. You can also run it on the subset of new people separately and then append those.

Code:

clear
set obs 12
local months = "`c(Months)'"
local wdays = "`c(Weekdays)' `c(Weekdays)'"
local letters = "`c(alpha)'"

di "`months' `wdays' `letters'"

gen fname = ""
gen sname = ""
gen area = ""
forvalues i = 1/12 {
    replace fname = word("`months'", `i') in `i'
    replace sname = word("`wdays'", `i') in `i'
    replace area = word("`letters'", `i') in `i'
}
expand 2

mata:
    info = .
    st_sview(info, ., "fname sname area")
    personid  = J(rows(info), 1, .)
    for(i=1; i<=rows(info); i++) {
        personid[i, 1] = hash1(invtokens(info[i, .], "#"), ., 2)
    }
    st_addvar("double", "personid")
    st_store(., "personid", personid)
end

format personid %10.0f
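
One caveat worth checking: with its default range, hash1() returns values between 0 and 4,294,967,295, so two different name/area combinations can in principle receive the same value. If the hashes behaved like independent uniform draws, the chance of at least one such collision among 30,000 distinct combinations would be roughly 1 - exp(-30000^2 / (2 * 2^32)), i.e. about 10%. The sketch below shows one way to test for this. The toy data and the helper variable combo are assumptions made for illustration only.

```stata
* Sketch: check that no two distinct fname/sname/area combinations share a hash.
clear
input str10 fname str10 sname str5 area
"Anna"  "Meier"   "a"
"Anna"  "Meier"   "b"
"Jonas" "Schmidt" "a"
"Anna"  "Meier"   "a"
end

mata:
    info = .
    st_sview(info, ., "fname sname area")
    personid = J(rows(info), 1, .)
    for (i = 1; i <= rows(info); i++) {
        personid[i] = hash1(invtokens(info[i, .], "#"), ., 2)
    }
    st_store(., st_addvar("double", "personid"), personid)
end

* combo numbers the distinct combinations 1, 2, 3, ...
egen long combo = group(fname sname area)

* Each hash may belong to one combination only; assert stops on a collision.
bysort personid (combo): assert combo[1] == combo[_N]

* Conversely, identical combinations must share one hash.
bysort combo (personid): assert personid[1] == personid[_N]
```

If the first assert fails, the colliding rows could be given a different separator or an extra attribute before hashing, or be handled manually.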


Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#8

21 Aug 2017, 02:30

Hi Jesse, thanks for the code. I have to think about this a bit more because in principle I have two persons with the same basic characteristics that I want to separate. To have something persistent I cannot rely on changing attributes. But of course there are ways to preserve the attributes used to generate the hashes.

A slight practical issue is that I get 9- and 10-digit hashes from your code. I am not sure yet if that is problematic or not.
Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#9

21 Aug 2017, 02:34

Hi Ben, thanks for your comment. I think your suggestion is a good alternative to Jesse's.
Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#10

23 Aug 2017, 03:33

ben earnhart, Jesse Wursten: Thanks again for your input, which helped me make some decisions about my persistent unique person identifier problem:
1. We decided that the identifier should be randomly chosen and not be based on hashes, which need characteristics that are not unique or may change.
2. The identifier should be as opaque as possible. We keep the idea of using the first letters of the first and last name, because the identifier might be used by external users and so this could serve as a rough validity check. (We know people change their names etc.)
3. We decided to use a mix of two letters, two numerals, two letters, two numerals, which gives us roughly 6 billion unique identifiers. As this is far too many, we decided that the pairs of numerals should not start with 0 and that the second pair of letters is a consonant-vowel pair (with q, x, y and z excluded from the consonant list). This leaves us with roughly 500 million identifiers.
Now we are removing politically delicate letter-number combinations.
4. The next step is to randomly choose 10,000 identifiers per pair of first letters of the first and last name.
5. From this far smaller set of identifiers we randomly choose one per person.
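
Steps 3 and 4 can be sketched as follows, under assumptions that the post does not confirm: a 26-letter alphabet, upper-case letters, the 17 consonants listed in the code, and digit pairs running from 10 to 99. With a 26-letter alphabet the unrestricted count would be 26^4 * 10^4, about 4.6 billion (somewhat below the figure quoted above), and the restricted count 26^2 * 90 * 85 * 90, about 465 million. The initials "MK", the seed and all variable names are illustrative.

```stata
* Sketch: enumerate every identifier for one pair of initials, then sample 10,000.
clear
set seed 20170823                                   // assumption: any fixed seed
local consonants "B C D F G H J K L M N P R S T V W" // 17 consonants, no Q X Y Z
local vowels     "A E I O U"

* Size of the whole identifier space under the assumptions above
display %15.0fc 26^2 * 90 * 17 * 5 * 90             // 465,426,000

* One row per identifier with fixed initials: 90 * 17 * 5 * 90 = 688,500 rows
set obs 688500
gen long k  = _n - 1
gen int  d2 = mod(k, 90) + 10                       // second digit pair, 10..99
gen int  v  = mod(floor(k / 90), 5) + 1             // index into the vowel list
gen int  c  = mod(floor(k / 450), 17) + 1           // index into the consonant list
gen int  d1 = floor(k / 7650) + 10                  // first digit pair, 10..99

gen str8 pid = "MK" + string(d1) + word("`consonants'", c) ///
    + word("`vowels'", v) + string(d2)
isid pid                                            // distinct by construction

* Step 4: keep a random 10,000 of them (sampling without replacement)
gen double u = runiform()
sort u pid
keep in 1/10000
keep pid
list in 1/5
```

Because every candidate is enumerated exactly once before sampling, the 10,000 identifiers cannot contain duplicates. Step 5 then reduces to handing out the next unused row of this list, and removing delicate combinations is a matter of dropping rows before the sampling step.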
