Encryption in Stata, a fork from "...hashing ..."

Mike Lacy

Join Date: Apr 2014

Posts: 2413
#1


06 Jul 2023, 15:13

After following the recent interesting thread about hashing an ID string, I wanted to post something of a discussion query about encryption in Stata. I'd be interested to hear what others do about encryption, and get some comment about whether having a user-written module for encryption might be convenient.

I've met my own past needs for encryption by sorting the data file into random order, assigning a sequential id from that order, removing the true id, saving a linkage file containing the actual and random ids, and encrypting that linkage file with some non-Stata utility program. Once, I worked with some U.S. federal government educational data in which we were required to store some of the data on separate media, in encrypted form, and only use it with a "decrypt on the fly" utility.
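The random-renumbering workflow described above can be sketched roughly as follows. This is only an illustration in Python, not the Stata code actually used; the record layout and the names `build_linkage`, `records`, and `released` are hypothetical.

```python
import random

def build_linkage(true_ids):
    """Map each true id to a sequential id assigned in random order."""
    rng = random.SystemRandom()      # OS-level randomness, not a seeded generator
    shuffled = list(true_ids)
    rng.shuffle(shuffled)            # "sort the data file into random order"
    # Assign 1..n following the shuffled order.
    return {true_id: new_id for new_id, true_id in enumerate(shuffled, start=1)}

# Hypothetical records; in practice these would be the rows of the data set.
records = [
    {"id": "A17", "score": 3},
    {"id": "B02", "score": 5},
    {"id": "C44", "score": 4},
]

linkage = build_linkage(r["id"] for r in records)

# The released file carries only the random id, and is ordered by it so that
# the original row order does not leak anything about the true ids.
released = sorted(
    ({"pid": linkage[r["id"]], "score": r["score"]} for r in records),
    key=lambda row: row["pid"],
)
```

Here `linkage` plays the role of the linkage file, which is the only piece that would then need to be encrypted and stored separately.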

Despite the existence of these solutions, I've had the thought that having some kind of encryption capacity as a command within Stata would be convenient. For my own amusement, I wrote a small and simple program (using Mata) to use Stata to encrypt an entire data set using a pad file containing random bytes as long as the file to be encrypted, which was fast and presumably quite hard to break. A similar program could presumably be used to encrypt a single variable. I'd also think that it would be possible to use Stata to call some external program (e.g., a compression utility with AES-256 encryption capacity), but I haven't looked at that seriously. I'd also suppose one could use some existing Python code to do encryption, giving a platform-independent feature from within Stata, but I'm not knowledgeable about that.
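For readers unfamiliar with the pad-file idea, here is a minimal Python sketch of XOR encryption against a random pad as long as the data. It is not the Mata program mentioned above, and the names `make_pad` and `xor_bytes` are made up for illustration. The scheme is only as strong as the pad: the pad has to come from a good random source, stay secret, and never be reused.

```python
import secrets

def make_pad(n_bytes):
    """Return n_bytes of cryptographically strong random bytes."""
    return secrets.token_bytes(n_bytes)

def xor_bytes(data, pad):
    """XOR data with the pad; applying it twice restores the original."""
    if len(pad) < len(data):
        raise ValueError("pad must be at least as long as the data")
    return bytes(d ^ p for d, p in zip(data, pad))

# Hypothetical file contents standing in for a data set read as raw bytes.
plaintext = b"id,score\nA17,3\nB02,5\n"
pad = make_pad(len(plaintext))

ciphertext = xor_bytes(plaintext, pad)   # encrypt
recovered = xor_bytes(ciphertext, pad)   # decrypt with the same pad
```

Because XOR is its own inverse, the same function serves for both encryption and decryption.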

So, I'm posting to hear what people think, and possibly to stimulate someone with expertise to think about creating a user-written module, if that would be relevant.
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#2

06 Jul 2023, 16:13

I asked something similar a while ago, and the big takeaway for me was that operating-system-level whole-disk encryption is a pretty good solution for whole-file encryption at rest. I've used file-system-level encryption at rest before and it seems to work well, although keep in mind that EncFS may no longer be secure. Getting a unique, one-way cryptographic hash of a relatively short string is a different matter, because shorter strings are more susceptible to rainbow table attacks. These strings should be salted (meaning that cryptographically generated random data should be appended to the string) and the hash should be "stretched", meaning that the hash function should be iterated on the string roughly 200 to 1000 times, depending on how worried you are about denial-of-service attacks. In most Stata contexts, I'd go closer to 1000.
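As a rough illustration of salting and stretching, the sketch below uses PBKDF2-HMAC-SHA256 from Python's standard library, which is a standard construction that combines a salt with repeated iteration rather than literally appending the salt and re-hashing. The function name `pseudonym` is hypothetical, the 1000 iterations simply follow the figure above, and using one project-wide secret salt (so the same id always maps to the same value) is an assumption about the pseudonymization use case.

```python
import hashlib
import secrets

def pseudonym(identifier, salt, iterations=1000):
    """Salted, iterated hash of an id string, returned as hex."""
    digest = hashlib.pbkdf2_hmac(
        "sha256",                    # underlying hash function
        identifier.encode("utf-8"),  # the short string to protect
        salt,                        # secret random bytes
        iterations,                  # number of stretching rounds
    )
    return digest.hex()

# Assumption: a single secret salt for the whole project, stored securely.
salt = secrets.token_bytes(16)

hashed_a17 = pseudonym("A17", salt)
hashed_a18 = pseudonym("A18", salt)
```

With the same salt, `pseudonym("A17", salt)` is reproducible, which is what allows the hashed value to serve as a stable id; anyone without the salt cannot rebuild the mapping by hashing candidate ids.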

This one time pad encryption sounds pretty cool! It would be fun to study that if I ever find the time to get back into this kind of thing. I'd love to learn more about the details of public/private key encryption as well.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2400
#3

06 Jul 2023, 17:01

I just wanted to clarify that hashing and encryption are two distinct concepts, and are not mutually exclusive. Encryption is a cryptographic technique that can be easily reversed, given the right algorithm and key(s), and its intent here is to completely protect the dataset. Hashing, on the other hand, is a one-way function whose intent is to represent some data with a difficult-to-reverse and arbitrary value, and here is meant as a means of pseudonymizing datasets, but the rest of the dataset remains "in the clear".

I've been involved in creating pseudonymous datasets in the past. For these it was good enough to randomly renumber participant ids and site numbers as well as fuzz age and race, and completely discard some other information. If these data were stolen, then they were in clear text and anyone could use it, but they would be (reasonably) anonymized such that they couldn't be linked back to specific trial participants. This is the standard used for sharing clinical trial data repositories.

It's now become standard to use full-disk encryption as a fail-safe: if the computer is lost or stolen, the data are essentially completely protected. I've also used other solutions. In the past, I've used TrueCrypt (now VeraCrypt, I think), which offered very strong encrypted containers that could be mounted like a disk and that remained in an encrypted state when not in use. I recall someone else on this forum being a frequent user of TrueCrypt. The others were commercial solutions: USB keys or hard drives with physical encryption mechanisms built in, so they could only be unlocked by first entering a PIN code. All of these would suffice for protecting personally identifying information while the data is "at rest". The major concern with encryption is trusting that the encryption algorithm is correctly implemented, which may be why there is a lack of encryption packages for Stata, with users instead relying on more commonly available tools such as those above.
Mike Lacy

Join Date: Apr 2014

Posts: 2413
#4

06 Jul 2023, 17:38

Yes, entirely correct point from Leonardo re encryption vs. hashing. The original hashing question appeared (not entirely clear to me) to arise from the desire to anonymize an id variable, which is what prompted my thinking in this direction.

I did use the old TrueCrypt as Leonardo describes, but I was tentatively thinking that something smaller/quicker might be relevant, and I'd agree that using existing code or calling an existing external program from within Stata would be desirable.
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#5

06 Jul 2023, 18:19

Hashing, on the other hand, is a one-way function whose intent is to represent some data with a difficult-to-reverse and arbitrary value, and here is meant as a means of pseudonymizing datasets, but the rest of the dataset remains "in the clear".

This is a reasonable description of a cryptographic hash, but hashing is a more general term. For example, one more general use of hashing is to map a piece of data (like a string) to a unique location in memory, usually in an array. These hash functions are used to create hash-set or hash-table data structures, which are useful for constant (rather than O(n)) time lookup. If you've worked with a dictionary before, the underlying data structure is usually implemented with hashes, so if you look up the value of an element with a given key, you find the value in constant time rather than O(n) time. In many cases it doesn't really matter if a hash is reversible, but it matters a great deal for cryptographic hashes. There is also another important feature of cryptographic hashes, which is that they need to make collisions (situations where different strings map to the same hash) very unlikely. Collisions are a general problem with hash functions, but create special difficulties in cryptography.
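To make the contrast concrete, here is a toy Python sketch of a hash table built on a deliberately weak hash function, next to SHA-256. The weak hash (sum of byte values modulo the table size) and the names `bucket_index`, `table`, and `lookup` are invented for illustration; real dictionary implementations use far better functions.

```python
import hashlib

N_BUCKETS = 8

def bucket_index(key):
    """Toy non-cryptographic hash: sum of the key's bytes modulo table size."""
    return sum(key.encode("utf-8")) % N_BUCKETS

def crypto_digest(key):
    """Cryptographic hash of the same key, as a hex string."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# A small hash table that resolves collisions by chaining within each bucket.
table = [[] for _ in range(N_BUCKETS)]
for key, value in [("ab", 1), ("ba", 2), ("zz", 3)]:
    table[bucket_index(key)].append((key, value))

def lookup(key):
    """Jump straight to the key's bucket, then scan only that bucket."""
    for stored_key, value in table[bucket_index(key)]:
        if stored_key == key:
            return value
    raise KeyError(key)
```

The keys "ab" and "ba" contain the same bytes, so the toy hash sends them to the same bucket. That collision is harmless for a lookup table, since chaining handles it, but it would be unacceptable for a hash used as a pseudonymous id; `crypto_digest` gives the two keys different digests.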
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#6

06 Jul 2023, 18:25

As a practical matter, cryptographic hashes are also fairly slow, and most hash functions are actually reversible. This is why the distinction between general hashing and cryptographic hashing is so important. Any old hash function will not do if you want to ensure that the hash is not easily reversible.
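One concrete example of a reversible hash is multiplicative hashing of integers: multiplying by an odd constant modulo 2^32 is a one-to-one mapping, so it can be undone with the modular inverse. The sketch below assumes 32-bit integer ids and uses a constant often quoted for multiplicative hashing; the names `mult_hash` and `unhash` are hypothetical.

```python
MODULUS = 2**32
MULTIPLIER = 2654435761   # odd constant, chosen here purely for illustration

def mult_hash(x):
    """Scramble a 32-bit integer id by modular multiplication."""
    return (x * MULTIPLIER) % MODULUS

# Any odd multiplier has an inverse modulo 2**32 (three-argument pow with a
# negative exponent needs Python 3.8 or later).
MULTIPLIER_INVERSE = pow(MULTIPLIER, -1, MODULUS)

def unhash(h):
    """Recover the original id from its hash."""
    return (h * MULTIPLIER_INVERSE) % MODULUS

original_id = 12345
hashed_id = mult_hash(original_id)
recovered_id = unhash(hashed_id)
```

The hashed values look arbitrary, yet anyone who knows or guesses the multiplier can recover every id, which is why a function like this is fine for a lookup table but not for pseudonymizing data.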
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2400
#7

06 Jul 2023, 18:48

Originally posted by Daniel Schaefer

As a practical matter, cryptographic hashes are also fairly slow, and most hash functions are actually reversible. This is why the distinction between general hashing and cryptographic hashing is so important. Any old hash function will not do if you want to ensure that the hash is not easily reversible.

What you've outlined in #5 and #6 is entirely correct. I leaned more heavily on the cryptographic definition because it is more suitable for the purposes of this discussion. To pseudonymize data, the hashes must not be reversible and must avoid collisions, while also being useful as a lookup value. (Digression: Mata has an associative array object, which is a type of dictionary.)