
  • Computing hash of string data or files

    Hi,

    For those who may be interested, here is a package using Stata's Java API to compute hash functions, using the Jacksum library:

    net from http://jean-claude.arbaut.pagesperso-orange.fr/stata/

    The idea comes from this question on Statalist. One can compute the hash of string data, encoded in UTF-8, or of files (with their names passed in a string variable). All hash functions available in Jacksum can be used, among others MD5, SHA-1, SHA-512, Whirlpool, and non-cryptographic ones like Adler32 and CRC.
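As a language-neutral point of comparison, here is a minimal Python sketch of the same idea: text is encoded to UTF-8 and the resulting bytes are hashed. It uses Python's standard `hashlib` and `zlib` modules, not Jacksum or the Stata package, and the helper names `hash_text` and `checksum_text` are hypothetical. Whirlpool is left out because `hashlib` does not guarantee it.

```python
import hashlib
import zlib


def hash_text(text: str, algorithm: str = "sha1") -> str:
    """Return the hex digest of `text` after encoding it as UTF-8."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def checksum_text(text: str, kind: str = "crc32") -> int:
    """Return a non-cryptographic checksum (CRC-32 or Adler-32) of the UTF-8 bytes."""
    data = text.encode("utf-8")
    return zlib.adler32(data) if kind == "adler32" else zlib.crc32(data)


if __name__ == "__main__":
    for algorithm in ("md5", "sha1", "sha512"):
        print(algorithm, hash_text("abc", algorithm))
    print("crc32", format(checksum_text("abc", "crc32"), "08x"))
    print("adler32", format(checksum_text("abc", "adler32"), "08x"))
```

If the package hashes the UTF-8 bytes of a string as described, its MD5 or SHA-1 output for the same input should match these digests, which gives a simple cross-check.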

    This is my first public package, and I'll be happy to get any advice, bug reports or feature requests.

    Jean-Claude Arbaut

  • #2
    Great stuff!

    By the way, you might want to look at this corner case: Stata allows binary data in strL variables, and for some binary data the plugin fails. For instance, hashing a variable created with "gen y = char(0)" prints a Java traceback. The real use case would be "gen y = fileread(filename)" in order to hash files, not literally hashing null characters. Here's an example:

    Code:
    sysuse auto, clear
    save auto.dta
    clear
    set obs 1
    gen y = fileread("auto.dta")
    hash y, gen(hashy) kind(sha1)
    If it can't be solved internally, perhaps a more informative printout? (I mention it because I've been dealing with this use case recently in my own plugins.) Cheers!



    • #3
      Thanks for the report. I suspect a problem when encoding to UTF-8, or maybe Java sees a null-terminated string where it shouldn't. Anyway, I'll investigate this.
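The general hazard suspected here can be illustrated outside Stata. The Python sketch below is an illustration only, not a diagnosis of the plugin, and the function names are hypothetical. It contrasts hashing raw bytes, which works for any content, with first interpreting the bytes as UTF-8 text, which fails for byte sequences that are not valid UTF-8. A lone null byte is valid UTF-8 in Python, so this does not reproduce the `char(0)` case itself.

```python
import hashlib


def sha1_of_bytes(data: bytes) -> str:
    """Hash the bytes exactly as given; any byte value is acceptable."""
    return hashlib.sha1(data).hexdigest()


def sha1_via_utf8_text(data: bytes) -> str:
    """Decode the bytes as UTF-8 text, then re-encode and hash.

    Raises UnicodeDecodeError when `data` is not valid UTF-8.
    """
    text = data.decode("utf-8")
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    binary = bytes([0x00, 0xFF, 0xFE, 0x41])  # arbitrary binary content
    print("raw bytes:", sha1_of_bytes(binary))
    try:
        sha1_via_utf8_text(binary)
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)
```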

      An alternative for files is to do the following:

      Code:
      findfile auto.dta
      clear
      set obs 1
      gen name=r(fn)
      hash name, gen(key) kind(sha1) file
      list key
      I also intend to add a command to directly compute the hash of a file (without needing to store the name in a variable). That's functionality I use very often to check large files (to detect a possible disk or transfer failure), but until now I have used Python programs for this instead.
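For reference, a Python program of the kind mentioned might look like the sketch below. This is an assumed, minimal version, not the poster's actual script; `hash_file` and its parameters are hypothetical names. It reads the file in fixed-size chunks so that large files never have to fit in memory.

```python
import hashlib


def hash_file(path: str, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of the file at `path`, reading it chunk by chunk."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the digest computed before and after a copy or transfer is enough to detect corruption; the choice of algorithm matters little for that purpose as long as both sides use the same one.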
      Last edited by Jean-Claude Arbaut; 13 May 2018, 09:41.



      • #4
        Thanks for a nice contribution. One thought: What about letting users hash long or int numeric variables, without first converting to string? I can imagine reasons not to accept floats or doubles, but I'm thinking here of the original poster's context, in which an ID number might be a (say) nine-digit integer, as is used in the U.S. for social security numbers, which might be stored as a long type in Stata. The simple solution, I suppose, would be to detect any long or int variable in the user's command, and use -strofreal()- to put it into a temporary string variable.
        Last edited by Mike Lacy; 13 May 2018, 10:42. Reason: (Part of my post crossed in the ether with and duplicated the previous comment.)



        • #5
          Usually I don't much like storing ID data as numbers, because leading zeros are lost (and sometimes they are meaningful). Also, it can happen that most identifiers are numeric, but a few contain letters. For instance, in France, cities have numeric codes except in one region (Corsica), for historical reasons. A program that works in one area would fail when applied to another region or to the whole country.

          These are mistakes I have already made or seen, hence I'm a bit reluctant to store anything as numbers except "true" numbers. But I'll consider it anyway, as it seems to be useful in some circumstances. The risk is that the hash of an ID stored in a numeric variable will differ from the hash computed on the same ID stored as a string, if there is a leading zero. Quite dangerous. I'll probably print a warning when numeric variables are passed, then.
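The leading-zero risk is easy to demonstrate. The Python sketch below uses hypothetical helper names and an invented nine-digit ID. It shows that an ID converted to a number and back hashes differently from the original string, unless the original width is known and the value is zero-padded again.

```python
import hashlib


def sha1_text(text: str) -> str:
    """SHA-1 hex digest of the UTF-8 encoding of `text`."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def id_to_text(number: int, width: int = 0) -> str:
    """Convert a numeric ID to text, left-padding with zeros up to `width` digits."""
    return str(number).zfill(width)


if __name__ == "__main__":
    original = "012345678"      # ID stored as a string, leading zero intact
    as_number = int(original)   # the same ID stored numerically: 12345678

    print(sha1_text(original) == sha1_text(id_to_text(as_number)))     # False
    print(sha1_text(original) == sha1_text(id_to_text(as_number, 9)))  # True
```

Padding only helps when every ID has the same known width, which is one more reason a warning on numeric input seems prudent.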



          • #6
            Update: the web site must now be accessed over HTTPS. The command is now:

            Code:
            net from https://jean-claude-arbaut.pagesperso-orange.fr/stata/
