One-way hash creation in Stata?

Troy Payne

Join Date: Aug 2014

Posts: 34
#1

One-way hash creation in Stata?

03 May 2018, 01:13

Is there an easy to use command that creates one-way hashes that are hard to reverse-engineer? Something like SHA-1 or MD5?

I have personally-identifiable data that I need to de-identify. New observations will be added regularly. Individuals appear repeatedly in the data at irregular intervals — it's arrest, criminal history, and corrections data. I'm linking very different data, so I need an ID variable that is insensitive to the sort order. That is, the usual suspects using obs numbers or by: egen group won't work.

The original data has an ID number that uniquely identifies persons. I could just use that unaltered, but the ID number is itself sensitive information (rather like a social security number but for criminal justice systems in the state). I would prefer to use a one-way hash on the original ID, then drop the original. Newly added obs with the same original ID would have the same one-way hash, so linkage could be maintained even if the original ID is not retained after computing the one-way hashed value of the original ID.

I know this doesn't "anonymize" the data, and that's not my goal. I'm mostly trying to avoid storing sensitive data when I don't need to for the analysis at hand. And to do that without breaking the ability to link many, many other data sources together. I don't need the hash to be cryptographically secure against sophisticated attacks, but it would be nice if it could survive trivial attempts at reverse engineering.

Searches turned up not a lot about this on Statalist, with this sorta vague thread and a few workarounds that avoided creating hashes at all. But I have found this post by William Matsuoka that uses Mata to do HMAC-SHA-1: http://www.wmatsuoka.com/stata/hmac-sha1-in-stata

I think I can use the SHA bit of his example Mata code to get done what I need to do, but I wanted to make sure there wasn't an easier/better way in common use by the smart folks here before I adapted Matsuoka's code for my use.
Tags: data, hash, sha-1, string
wbuchanan

Join Date: Mar 2014

Posts: 1362
#2

03 May 2018, 04:31

Sergio Correia
Didn’t you do something involving hashing for ftools?
Comment

Troy Payne

Join Date: Aug 2014
Posts: 34

11 May 2018, 00:47

Posting for others who might need to do something similar entirely in Stata/Mata. It's not very refined — I probably should make this a proper program — but it gets the job done. Feel free to adapt as needed.

It adds a salt in line 21 to add a little extra protection against brute force attacks on the hashed values. It's worth mentioning that 1) I am not a crypto expert; and 2) SHA-1 has been considered insecure against sophisticated attacks for quite some time now, so use with caution in truly sensitive situations.

Personally, if I need to move to one of the more modern secure hashing methods, I'll probably just use Python's libraries and deal with the hassle of mixing that into my Stata data analysis workflow.

Code:

// May 10, 2018
// SHA-1 
// Mata adapted from William Matsuoka, http://www.wmatsuoka.com/stata/hmac-sha1-in-stata

quietly {

noisily display "Started $S_TIME  $S_DATE"


// ==============================================
// 
// replace "make" below with your original variable name
// replace "hashed" below with your new variable name
//
// ==============================================

local originalvar make
local newvar hashed

tempvar tohash
gen `tohash' = `originalvar' + "12345" // adds salt, replace 12345 with a value of your choosing


gen str40 `newvar'=""

// SHA1

cap mata:mata clear
mata:
string scalar sha1(string scalar message)
{
    // Find Blocks and Padding
    m_a     = strlen(message) * 8
    message = message + char(128)
    m_l     = strlen(message) * 8
    m_l_b   = padbit(inbase(2, m_a), 64)
    blocks  = trunc((m_l + 64)/512) + 1
    total   = blocks * 512
    
    bit = inbase(2, ascii(message))
    bit = "0" :* (8 :- strlen(bit)) :+ bit
    bit = invtokens(bit, "")

    message = bit + "0"*(total - m_l - 64) + m_l_b
    
    // Initialize
    h0 = "01100111010001010010001100000001"
    h1 = "11101111110011011010101110001001"
    h2 = "10011000101110101101110011111110"
    h3 = "00010000001100100101010001110110"
    h4 = "11000011110100101110000111110000"
    
    for (i=1; i<=blocks; i++) {
        x = substr(message, (i-1)*512 + 1, 512)
        sha1_proc(x, h0, h1, h2, h3, h4)
    }
    
    hh = pb6to2(h0) + pb6to2(h1) + pb6to2(h2) + pb6to2(h3) + pb6to2(h4)
    return (hh)
}
void sha1_proc(string scalar bit, string scalar h0, string scalar h1,
    string scalar h2, string scalar h3, string scalar h4)
{
    // break word into 32 bit chunks
    w = J(1, 80, "")
    for (i = 1; i<= 16; i++) {
        w[i] = substr(bit, (i-1)*32 + 1, 32)
    }
    
    // extend rest of words
    for (i=17; i<=80; i++) {
        w[i] = bitxor( bitxor( bitxor(w[i-3] , w[i-8]) , w[i-14]) , w[i-16] )
        w[i] = leftrotate(w[i],1)
    }
    
    // initalize hash
    a = h0
    b = h1
    c = h2
    d = h3
    e = h4

    for (i=1; i<=80; i++) {
        if (i <= 20) {
            f = bitor(bitand(b, c) , bitand(bitnot(b), d))
            k = "01011010100000100111100110011001"
        }
        else if (i <= 40) {
            f = bitxor(bitxor(b, c), d)
            k = "01101110110110011110101110100001"
        }
        else if (i <= 60) {
            f = bitor(bitor(bitand(b, c) , bitand(b,d)) , bitand(c, d))
            k = "10001111000110111011110011011100"
        }
        else {
            f = bitxor(bitxor(b, c), d)
            k = "11001010011000101100000111010110"
        }

        temp = inbase(2,
                    frombase(2, leftrotate(a, 5)) + 
                    frombase(2, f) + frombase(2, e) + 
                    frombase(2, k) + frombase(2, w[i]) )
        e = overflow(d)
        d = overflow(c)
        c = overflow(leftrotate(b, 30))
        b = overflow(a)
        a = overflow(temp)
    }
    
    h0 = overflow(inbase(2, frombase(2, h0) + frombase(2, a)))
    h1 = overflow(inbase(2, frombase(2, h1) + frombase(2, b)))
    h2 = overflow(inbase(2, frombase(2, h2) + frombase(2, c)))
    h3 = overflow(inbase(2, frombase(2, h3) + frombase(2, d)))
    h4 = overflow(inbase(2, frombase(2, h4) + frombase(2, e)))

}
string scalar pb6to2(string scalar x)
{
    x = inbase(16, frombase(2, x))
    return(padbit(x, 8))
}
string scalar bitxor(string scalar bit1, string scalar bit2) 
{
    if (strlen(bit1) != strlen(bit2)) {
        errprintf("Your two strings must be of the same length\n")
        exit(198)
    }
    b1 = b2 = b3 = J(1, strlen(bit1), "")
    for (i=1; i<=strlen(bit1); i++) {
        b1[i] = substr(bit1, i, 1)
        b2[i] = substr(bit2, i, 1)
        b3[i] = strofreal(b1[i] != b2[i])
    }
    t = invtokens(b3, "")
    return(t)
}
string scalar bitand(string scalar bit1, string scalar bit2) 
{
    if (strlen(bit1) != strlen(bit2)) {
        errprintf("Your two strings must be of the same length\n")
        exit(198)
    }
    b1 = b2 = b3 = J(1, strlen(bit1), "")
    for (i=1; i<=strlen(bit1); i++) {
        b1[i] = substr(bit1, i, 1)
        b2[i] = substr(bit2, i, 1)
        b3[i] = strofreal((b1[i]=="1" & b2[i]=="1"))
    }
    t = invtokens(b3, "")
    return(t)
}
string scalar bitor(string scalar bit1, string scalar bit2) 
{
    if (strlen(bit1) != strlen(bit2)) {
        errprintf("Your two strings must be of the same length\n")
        exit(198)
    }
    b1 = b2 = b3 = J(1, strlen(bit1), "")
    for (i=1; i<=strlen(bit1); i++) {
        b1[i] = substr(bit1, i, 1)
        b2[i] = substr(bit2, i, 1)
        b3[i] = strofreal((b1[i]=="1" | b2[i]=="1"))
    }
    t = invtokens(b3, "")
    return(t)
}
string scalar bitnot(string scalar bit1)
{
    b1 = J(1, strlen(bit1), "")
    for (i=1; i<=strlen(bit1); i++) {
        b1[i] = strofreal((substr(bit1, i, 1) == "0"))
    }
    return (invtokens(b1, ""))
}
string scalar leftrotate(string scalar x, real scalar i)
{
    l_x = strlen(x)
    s_r = mod(i, l_x)

    if (s_r != 0) {
        a1 = substr(x, 1, s_r)
        a2 = substr(x, s_r + 1, .)
        re = a2 + a1
    }
    else {
        re = x
    }
    return (re)
}
string scalar leftshift(string scalar x, real scalar i) 
{
    l_x = strlen(x)
    if (i > 0) {
        a   = substr(x, i + 1, .)
        l_a = strlen(a)
        re  = a + "0" * (l_x-l_a)
    }
    else {
        re = x
    }
    return (re)
}
string scalar padbit(string scalar x, real scalar i, |real scalar d)
{
    l_x = strlen(x)
    if (i <= l_x) return(x)
    
    if (args() == 3) {
        x = x + "0"*(i-l_x)
        return(x)
    }
    else {
        x = "0"*(i-l_x) + x
        return(x)
    }
}
string scalar overflow(string scalar x)
{
    y = mod(frombase(2, x), 2^32)
    y = padbit(inbase(2, y), 32)
    return(y)
}
string scalar base16to64(string scalar x)
{
    y = J(1, strlen(x)/2, .)
    for(i=1; i<=strlen(x)/2; i++) {
        y[i] = frombase(16, substr(x, (i-1)*2 + 1, 2))
    }
    return(char(y))
}
end




mata:
data = st_sdata(., "`tohash'")
sha = J(st_nobs(), 1, "")

for (i=1; i<=rows(data); i++) {
sha[i] = sha1(data[i, 1])
}

st_sstore(., "`newvar'", sha)

end

noisily display "Ended $S_TIME  $S_DATE"
}

//eof

Comment

Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#4

11 May 2018, 01:46

A possibility to keep things both fast and integrated to Stata would be to write a Java plugin, making use of the MessageDigest class found in Java API, or Apache Commons Codec, Jacksum, or any other library having an implementation.

Stata's API for Java is reasonnably simple, and Kevin Crow has posted two blog entries recently about Java. See also one post in the nice series about estimation commands, by David Drukker. While their purpose in not to compute hashes, they show how the Stata/Java interaction works.

Jean-Claude Arbaut

Last edited by Jean-Claude Arbaut; 11 May 2018, 01:53.
Comment
Troy Payne

Join Date: Aug 2014

Posts: 34
#5

11 May 2018, 12:31

Jean-Claude Arbaut that's something I had not considered — I somehow missed Stata's Java integration. That's probably the "right" way to solve this sort of problem! Now the only task is to learn Java (nearly all of my programming experience is in Stata do-files)...
Comment
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#6

11 May 2018, 13:02

I could give it a try (wrote a lot of Java in university). Could you describe more precisely your problem ? It should be easy to write something that computes hashes of a string variable and puts the result in another string variable. Is that ok?

Note: in the french health system, there is a similar kind of hashing (even based on SHA, I believe - it's called FOIN, for fonction d'occultation d'identifiant nominatif), but if I remember correctly there is an additional "salt" to prevent rainbow attacks, and a number other features. I don't know the details very well, but in short, anonymization is a rather difficult process that's easy to fail. You must also take care that often, a few variables may allow someone to identify people in the database - in health care, typically: gender, age and the dates of several hospital stays, for instance. Make sure you know what you are doing!

That being said, I'll have a look at this Java program this week-end.

Jean-Claude Arbaut

Last edited by Jean-Claude Arbaut; 11 May 2018, 13:38.
Comment
Troy Payne

Join Date: Aug 2014

Posts: 34
#7

11 May 2018, 13:35

Wow. Thanks! Inputs will always be string, output is string. In the code I posted, there's a line that adds a small salt (line 21 or so) to the original string.

Very similar problem to what you're describing. I'm leading a project to create an integrated data platform that links data from across the criminal justice system for the purpose of research and analysis (i.e., not for day-to-day criminal justice operations). I'll have various sources for criminal justice data (e.g., arrests; prosecutions; convictions; prison rule violations; probation violations). Some of that information is public. Some is not public, with its acquisition and use governed by data use agreements with the relevant agencies. Each data source is linked to all the others through a state-issued identification number.

Our analysts will always be cleared to see the data they're looking at, and our systems are architected to be compliant with the FBI's CJIS security policies. But as an additional security measure I'd prefer to have the minimally viable data for the analysis at hand in memory. Hashing the ID numbers and storing names, DOB, etc separately from other fields (and merging them in only when absolutely necessary) allows me to limit access to the more sensitive fields while maintaining the ability to link tables provided by multiple data owners.

Eventually, we'd like to create extracts suitable for public dissemination and external researcher use — but we're a looooooong way away from that. We'd have a different process for that, including extensive checking to ensure sensitive information cannot be reverse engineered.
Comment
Mauricio Caceres

Join Date: Sep 2015

Posts: 130
#8

11 May 2018, 14:25

I recently wrote wrappers for a number of hash functions (SHA1, SHA256, and so on) using the OpenSSL library and Stata's C API. If it sounds useful, see https://github.com/mcaceresb/stata-shasum
Comment
Troy Payne

Join Date: Aug 2014

Posts: 34
#9

11 May 2018, 15:35

Mauricio Caceres That is exactly what I've been looking for, and it's very, very fast! That absolutely solves my problem. Thank you!
Comment
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#10

12 May 2018, 09:32

You already have a solution, but as promised, here is a Java package, using Jacksum: net from http://jean-claude.arbaut.pagesperso-orange.fr/stata/

Jean-Claude Arbaut
Comment
Troy Payne

Join Date: Aug 2014

Posts: 34
#11

15 May 2018, 12:49

Jean-Claude Arbaut This user community is just great. Thanks for all the help!
Comment

Announcement