Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Very large Integers of variable length

    Hi all,

    I was wondering if anyone has found a good soloution to working with very large integers as ID variables of variable length.

    Background,
    Working with a national medical database and they have a key (EPIKEY) which is 19 Numeric characters long.

    It is stored in a SQL database as numeric((20,0),null)

    This variable should be unique, when you read it directly using odbc load it reads as "double".

    Unfortunately, this makes the variable non-unique, i am assuming precision issues.

    So i recast this to "BIGINT" and alias it and then it reads as string.

    Code:
     
    odbc load, exec("SELECT * , CAST(EPIKEY AS BIGINT) AS EPIKEY_STR FROM .....
    This seems like an ugly and inefficient workaround.

    20bytes is quite an expensive storage.

    An equally ugly workaround would be

    Code:
    gen long epikey_a = real(substr(epikey,1,9))
    gen long epikey_b = real(substr(epikey,10,9))
    gen byte epikey_c = real(substr(epikey,19,1))
    But this 10 bytes would be a lot more efficent, but who wants to work with 3 variables when you can have 1.

    However, splitting the keys like this unfortunately doesn't work as the ID's are not always the same length.
    and if the split occurs on "001" this is converted to "1" when stored as a long.

    Stata have this helpful comment on there help file
    If you are storing identification numbers, the rounding could matter. If the identification numbers are integers and take 9 digits or less, store them as
    longs; otherwise, store them as doubles. doubles have 16 digits of accuracy.
    But now I think we need some help on holding more than 16 digits of accuracy.

    It seems like i am stuck with a string for the moment.

    bw
    Adrian







  • #2
    I have found myself in this position from time to time. If the count of distinct values of the identifiers is small enough to fit into a long, I just create a new identifier:
    Code:
    egen long new_id = group(original_id)
    Actually, I do this by creating a file that crosswalks the original_id with the new_id. I keep that safe in case it is needed. Then I merge the new_id into the working data set(s) and eliminate the original_id variable.

    If the number of distinct values is so large that even 16 digits accuracy cannot work, then I think you are stuck with the workarounds you are currently doing.

    While I'm inclined to agree that some kind of big integer storage type would probably be useful, I have been impressed at the lengths that some institutions will go to to mask their identifier variable. I have seen several data sets from hospitals where the pseudoidentifier is a string of 64 characters (hexadecimal digits) and even seen one that used a string of 128 characters. Clearly the number of distinct patients cannot remotely require that many bytes. But it also suggests that no big integer storage type is going to be adequate for this task as there will always be system administrators out there who will break it in the name of confidentiality.

    Comment


    • #3
      Thanks Clyde,

      Sometimes it makes me wonder what they are trying to do with these super long ID.

      I had thought i was missing something, but perhaps sadly i am not.
      A

      Comment


      • #4
        Long I'd numbers can come about if they are generated in steps: first digit encodes the continent, second and third the country, fourth and fifth the site, sixth and seventh the village, eighth, nineth and tenth the household, eleventh and twelveth the individual. Now you have an id that won't fit in a float. This is of course the very opposite of annonymising the data.
        ​​​
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------

        Comment

        Working...
        X