Is there an easy-to-use command that creates one-way hashes that are hard to reverse-engineer? Something like SHA-1 or MD5?
I have personally identifiable data that I need to de-identify. New observations will be added regularly. Individuals appear repeatedly in the data at irregular intervals (it's arrest, criminal history, and corrections data). I'm linking very different data sources, so I need an ID variable that is insensitive to sort order. That is, the usual suspects, such as obs numbers or by: egen group, won't work.
The original data has an ID number that uniquely identifies persons. I could just use that unaltered, but the ID number is itself sensitive information (rather like a social security number, but for the state's criminal justice systems). I would prefer to apply a one-way hash to the original ID, then drop the original. Newly added obs with the same original ID would get the same hash value, so linkage would be maintained even though the original ID is not retained.
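The linkage property described above can be illustrated with a minimal sketch. It is written in Python rather than Stata, purely for illustration, and it uses SHA-256 from the standard `hashlib` module rather than SHA-1 or MD5. The function name `hash_id`, the whitespace trimming, and the sample IDs are all assumptions made for the example.

```python
import hashlib

def hash_id(original_id: str) -> str:
    """Return the SHA-256 hex digest of an ID (hypothetical helper)."""
    # Assumption: IDs are strings, and stray surrounding whitespace should be ignored.
    cleaned = original_id.strip()
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()

# Hypothetical IDs: the same persons appear in two batches, in a different order.
batch_1 = ["A123", "B456", "C789"]
batch_2 = ["C789", "A123"]

hashed_1 = {orig: hash_id(orig) for orig in batch_1}
hashed_2 = {orig: hash_id(orig) for orig in batch_2}

# The digest depends only on the ID value, not on row position or batch.
assert hashed_1["A123"] == hashed_2["A123"]
assert hashed_1["C789"] == hashed_2["C789"]
```

Because the digest is a pure function of the ID, later batches link to earlier ones without keeping the original. One caveat: if the space of possible IDs is small, anyone can hash every candidate ID and match the results, so an unkeyed digest gives limited protection on its own.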
I know this doesn't "anonymize" the data, and that's not my goal. I'm mostly trying to avoid storing sensitive data when I don't need it for the analysis at hand, without breaking the ability to link many, many other data sources together. I don't need the hash to be cryptographically secure against sophisticated attacks, but it would be nice if it could survive trivial attempts at reverse engineering.
Searching Statalist didn't turn up much on this: there's this sorta vague thread and a few workarounds that avoided creating hashes at all. But I did find this post by William Matsuoka that uses Mata to do HMAC-SHA-1: http://www.wmatsuoka.com/stata/hmac-sha1-in-stata
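For readers unfamiliar with HMAC, here is a rough sketch of the keyed-hash idea. It is not Matsuoka's Mata code: it is Python, it uses the standard `hmac` and `hashlib` modules, and it swaps in SHA-256 for SHA-1. The helper name `keyed_hash_id`, the placeholder key, and the sample ID are hypothetical.

```python
import hashlib
import hmac

def keyed_hash_id(original_id: str, secret_key: bytes) -> str:
    """Return the HMAC-SHA-256 hex digest of an ID under a secret key."""
    message = original_id.strip().encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

# Placeholder key for illustration only. A real key would be generated randomly
# (for example with secrets.token_bytes) and stored apart from the data.
SECRET_KEY = b"replace-with-a-random-secret"

first = keyed_hash_id("A123", SECRET_KEY)
later = keyed_hash_id("A123", SECRET_KEY)

# Same ID and same key give the same digest, so new observations still link.
assert first == later
# A different key gives a different digest for the same ID.
assert first != keyed_hash_id("A123", b"some-other-key")
```

The key is what separates this from a plain digest: a lookup table of hashed candidate IDs is useless to someone who does not hold the key. The trade-off is that the key must be kept, and kept secret, for as long as new data needs to be linked.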
I think I can use the SHA bit of his example Mata code to get done what I need to do, but I wanted to make sure there wasn't an easier/better way in common use by the smart folks here before I adapted Matsuoka's code for my use.