Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Missing data type: 64 bit integer variable

    I suppose this is simultaneously a feature request for Stata and a request for comment on the best solution that works in current versions of Stata.

    Context:
    I'm working with a very large panel dataset with many individuals. I receive the dataset from a data provider with individuals identified using a variable containing a 64-bit integer.

    Issue:
    • Stata's data types (help data_types) support 32 bit integers (long). But Stata does not support variables containing 64 bit integers (which are often called bigint in other software).
    • Given the size of the dataset, I would like to store this 64 bit integer in a manner that is both space efficient and computationally efficient.
    NOTE: There are many solutions that are conceptually simple but computationally inefficient. These will add hundreds of dollars to my research computing expenses for each analysis, given the size of the dataset and number of merges involved in my analysis.

    Long-term request:
    • It would be fantastic if Stata could add a new 64 bit integer variable type.
    • It would also be nice to have a binary display format for string variables, akin to %16H / %16L / %8H / %8L.
    Potential short-term solutions:
    1. A bad solution. Since Stata supports 64-bit floating point variables, I considered converting the 64-bit integer to a double and displaying it using the %16H format.
      • However, since there is a large space of values that map to missing, I will lose a significant portion of the 64-bit domain.
    2. Split the 64-bit integer into two separate 32-bit integers, and store each one as a long variable.
      • This solution generates two 4 byte variables, storing the result in 8 bytes.
    3. Convert the 64-bit integer into its base256 encoding, which can be stored as an 8-character ASCII string in a str8 variable.
      • This solution generates a single 8 byte variable, although it's not human readable when displayed as an ASCII string. Unfortunately I don't believe Stata supports a binary format like %16H for strings.
    Right now I'm planning to implement option 3, since it generates a single id variable which efficiently stores the information. But I would welcome any feedback regarding superior solutions!
    Last edited by Michael Stepner; 15 Jun 2022, 09:00.

  • #2
    I don't have any solutions for you, but I generally support the feature request for a "big integer" format (or even a bigger integer that will accommodate 128-bit integer UUIDs).

    Re #1: Stata seems to reserve a lot of precision for the decimal component of a -double-, so the largest integer is around 2^52, a far cry from the 64-bits required.

    Re #2: This is workable but feels a bit clumsy having to worry about manipulating a 2-variable id.

    Re #3: Encoding the value as a string seems the most intuitive to me. By human-readable, I assume that you mean containing simple alphanumeric characters and that no other "intelligible" information is encoded in this 64-bit number. If this is the case, then you may consider encoding the number in hexadecimal, which would only increase your storage needs for the id by an additional 8 bytes (str16), while achieving a human-readable encoding. When you eventually get to data analysis, you will need to assign a numeric id to these, but this is straightforward at the expense of one sort operation (which you'll do anyway since you have panel data).

    Comment


    • #3
      Originally posted by Leonardo Guizzetti View Post
      Re #3: Encoding the value as a string seems the most intuitive to me. By human-readable, I assume that you mean containing simple alphanumeric characters and that no other "intelligible" information is encoded in this 64-bit number. If this is the case, then you may consider encoding the number in hexadecimal, which would only increase your storage needs for the id by an additional 8 bytes (str16), while achieving a human-readable encoding. When you eventually get to data analysis, you will need to assign a numeric id to these, but this is straightforward at the expense of one sort operation (which you'll do anyway since you have panel data).
      Each byte in a string encodes a value between 0 and 255, which can be displayed as an ASCII-encoded string (ref). By noting that it's not human-readable, I just mean that many ASCII strings are not visually distinct when printed (e.g. whitespace characters, characters whose value varies depending on encoding are displayed as ⍰ by Stata). For example:

      Code:
      . di char(1)
      
      
      . di char(2)
      
      
      . di char(128)
      �
      
      . di char(129)
      �
      I could use a base64 encoding and store the 64-bit integer as a str11. The tradeoff between base64 and base256 encoding is between human-readability and computational efficiency, and for my purposes in this scenario I'll likely choose the base256 encoding and forego human-readable IDs.

      Comment


      • #4
        Originally posted by Michael Stepner View Post
        Each byte in a string encodes a value between 0 and 255, which can be displayed as an ASCII-encoded string (ref). By noting that it's not human-readable, I just mean that many ASCII strings are not visually distinct when printed (e.g. whitespace characters, characters whose value varies depending on encoding are displayed as ⍰ by Stata).

        I could use a base64 encoding and store the 64-bit integer as a str11. The tradeoff between base64 and base256 encoding is between human-readability and computational efficiency, and for my purposes in this scenario I'll likely choose the base256 encoding and forego human-readable IDs.
        I understand this, and my hunch about your strings not being visually distinct was correct. Hence, encoding with base16 which only requires 0-9 and A-F, but has the distinct advantage of being visually distinct.

        Yes, you could use base64 as another valid encoding (say, A-Z, a-z, 0-9 and 2 special characters), and you will save a few bytes per id and meet the notion of being visually distinct in a listing. However, it's not clear what kind of computational efficiency you think you will obtain. Certainly you will save some bytes, so storage on disk and storage in RAM can be improved using base64 or base256 compared to base16, and offer substantial footprint reductions if you have millions or more records. But for actual calculations, I don't believe you will see appreciable time savings.

        Comment


        • #5
          For anyone following along, base256 is not a valid solution for storing binary data in a string, because Stata str8 (or any other fixed-length string) uses the null byte (0) as an end-of-string delimiter. From the Stata manual: "str# variables cannot contain binary 0; strL variables can." However, nonempty strLs have an 80 byte overhead per string.

          So, my latest plan is to convert the 64-bit integer into its base64 encoding, which can be stored as an 11-character ASCII string in a str11 variable.

          Comment


          • #6
            Michael Stepner I noticed your post and thought the BigInteger class in Java might be useful to you. The do-file below encodes a big integer which stored as a string into a base36 string representation. It appears that 36 is the limit for the radix (a-z, 0-9). Note that a NumberFormatException will be thrown if the data in the string variable can't be encoded into a BigInteger.

            Code:
            clear
            
            local var "id"
            local newvar "encoded"
            
            // simulate some data
            set obs 100000
            gen str `var' = strofreal(ceil(100000000 * runiform()), "%20.0g") * 3
            
            gen `newvar' = ""
            
            java:
            import java.math.BigInteger;
            import java.util.stream.LongStream;
            
            // demonstrate storing base10 string to base36
            final int radix = 36;
            final int iSrcVar = Data.getVarIndex(Macro.getLocal("var"));
            final int iNewVar = Data.getVarIndex(Macro.getLocal("newvar"));
            final long obsCnt = Data.getObsTotal();
            
            if (iSrcVar <= 0 || iNewVar <= 0) SFIToolkit.setRC(111);
            
            // for (long n = 1;  n <= obsCnt; n++) {
            // using parallel loop instead of sequential
            LongStream.rangeClosed(1, obsCnt).parallel().forEach(n -> {
                    String srcVal = Data.getStr​(iSrcVar, n);
                    BigInteger big = new BigInteger(srcVal);
                    Data.storeStrf(iNewVar, n, big.toString(radix).toUpperCase());
                    // .toUpperCase() is optional and only used to improve readability
                }
            );
            end
            
            
            local newvar2 "roundTrip"
            gen `newvar2' = ""
            java:
            // demonstrate converting base36 back to base10
            final int radix = 36;
            final int iSrcVar = Data.getVarIndex(Macro.getLocal("newvar"));
            final int iNewVar = Data.getVarIndex(Macro.getLocal("newvar2"));
            final long obsCnt = Data.getObsTotal();
            
            if (iSrcVar <= 0 || iNewVar <= 0) SFIToolkit.setRC(111);
            
            // for (long n = 1;  n <= obsCnt; n++) {
            // using parallel loop instead of sequential
            LongStream.rangeClosed(1, obsCnt).parallel().forEach(n -> {
                    String encoded = Data.getStr​(iSrcVar, n);
                    BigInteger big = new BigInteger(encoded, radix);
                    Data.storeStrf(iNewVar, n, big.toString(10)); // back to base 10
                }
            );
            end
            
            list in 1/10
            list in -10/L
            assert id == roundTrip
            Last edited by James Hassell (StataCorp); 17 Jun 2022, 16:30.

            Comment


            • #7
              Originally posted by James Hassell (StataCorp) View Post
              Michael Stepner I noticed your post and thought the BigInteger class in Java might be useful to you. The do-file below encodes a big integer which stored as a string into a base36 string representation. It appears that 36 is the limit for the radix (a-z, 0-9). Note that a NumberFormatException will be thrown if the data in the string variable can't be encoded into a BigInteger.
              Thanks for the tip, James! Indeed, it seems that encoding my 64-bit integers as a string is the best solution right now. Your Java solution using base36 is clever. I've been working on a Python solution using base64.

              And I hope StataCorp might consider adding a new 64 bit integer data type in the future!

              Comment


              • #8
                Michael Stepner If this is about storage and display, you can definitely store a 64-bit integer as a 64-bit float (double) and display the same value in hexadecimal; the caveat is that you need to do some conversion. For example,

                Code:
                set seed 42
                clear
                set obs `=1024 * 1024'
                gen str var    = strofreal(ceil(1e6 * runiform()), "%20.0g") * 3
                gen double enc = .
                replace var    = "18446744073709551615" in 1
                replace var    = "18446744073709551614" in 2
                replace var    = "18446744073709551613" in 3
                replace var    = "9223372036854775807"  in 4
                replace var    = "9223372036854775806"  in 5
                replace var    = "9223372036854775805"  in 6
                tempfile storebin
                tempfile readbin
                You can convert integers to 64-bit float and display the hex value,

                Code:
                python:
                from sfi import Data, Macro
                import struct
                
                idstr = Data.get("var")
                with open(Macro.getLocal("storebin"), "wb") as bin:
                    for x in idstr:
                        _ = bin.write(struct.pack(">Q", int(x)))
                end
                
                mata
                fh = fopen(st_local("storebin"), "r")
                C  = bufio()
                st_store(., "enc", fbufget(C, fh, "%8z", st_nobs(), 1))
                fclose(fh)
                end
                
                format %16H enc
                l in 1/10
                l in `=_N-10'/l
                And you can convert the float back to an integer:

                Code:
                gen str check = ""
                mata
                fh = fopen(st_local("readbin"), "w")
                C  = bufio()
                fbufput(C, fh, "%8z", st_data(., "enc"))
                fclose(fh)
                end
                
                python:
                from sfi import Data, Macro
                import struct
                
                idstr = Data.getObsTotal() * [None]
                with open(Macro.getLocal("readbin"), "rb") as bin:
                    for i in range(Data.getObsTotal()):
                        idstr[i] = str(struct.unpack('>Q', bin.read(8))[0])
                
                Data.store('check', None, idstr)
                end
                
                l in 1/10
                l in `=_N-10'/l
                assert var == check
                You can see the conversion was without loss of data (even at the unsigned integer limits). Since enc is just a double, it only takes that much space:

                Code:
                drop var check
                save test.dta, replace
                python:
                import os
                (os.path.getsize("test.dta") - Data.getObsTotal() * 8) / 1024.0
                end

                Comment


                • #9
                  Originally posted by Mauricio Caceres View Post
                  Michael Stepner If this is about storage and display, you can definitely store a 64-bit integer as a 64-bit float (double) and display the same value in hexadecimal; the caveat is that you need to do some conversion. For example,
                  Mauricio, thank you for writing out this helpful sample code! We did a bit of testing and we had issues with the int64 -> binary -> double conversion due to the large space mapped to NaN in IEEE-754. In particular, here is the report my coworker shared with me:

                  unfortunately after doing some tests and spending some quality time reading up on the details of floating-point, an int64->binary->double conversion has one big downside as an identifier, since there are a huge number of bit patterns (2^51) that evaluate to NaN (plus excluding +/- infinity and signed zeros). A quick test in Athena shows that we'll probably lose about 6,400 tag rows due to this (2022-02-11). While that's certainly better than 41,000, it's not a slam-dunk refactoring decision, given I worry there might be more floating-point pitfalls I haven't thought of.
                  In short, although the binary domain of int64 and doubles are identical, we found that the domain interpreted as valid for a double was smaller than the valid domain for an int64. That said, we were trying to convert from int64 to double in AWS Athena, not in Stata. So our failed test doesn't rule out that your clever mix of mata and python code would allow us to store any int64 as a double without loss of data.

                  Even so, given the prevalence of BigInt as a datatype today and the difficulty in managing them efficiently (your solution is sophisticated!), I hope StataCorp will consider adding a new 64 bit integer data type

                  Comment


                  • #10
                    Aside: the issue we faced with the double type is analogous to the issue we faced with a base-256 str8 encoding: the domain of valid values did not span the entire domain of binary values.

                    Originally posted by Michael Stepner View Post
                    For anyone following along, base256 is not a valid solution for storing binary data in a string, because Stata str8 (or any other fixed-length string) uses the null byte (0) as an end-of-string delimiter. From the Stata manual: "str# variables cannot contain binary 0; strL variables can." However, nonempty strLs have an 80 byte overhead per string.

                    Comment


                    • #11
                      Michael Stepner

                      unfortunately after doing some tests and spending some quality time reading up on the details of floating-point, an int64->binary->double conversion has one big downside as an identifier, since there are a huge number of bit patterns (2^51) that evaluate to NaN (plus excluding +/- infinity and signed zeros). A quick test in Athena shows that we'll probably lose about 6,400 tag rows due to this (2022-02-11). While that's certainly better than 41,000, it's not a slam-dunk refactoring decision, given I worry there might be more floating-point pitfalls I haven't thought of.
                      This is true, which is why I do the mata workaround and save to a binary file (in my example I checked it against the upper unsigned integer limits, which normally evaluate to nan/missing, but they don't result in loss of data in this case because the code stores the binary information, not the double representation; the example would fail in the same way you quote above if I had stored the data into Stata from Python directly). For example, take this Python code

                      Code:
                      import struct
                      
                      xint64   = 18446744073709551615
                      yint64   = 18446744073709551614
                      xfloat64 = struct.unpack('>d', struct.pack('>Q', xint64))[0]
                      yfloat64 = struct.unpack('>d', struct.pack('>Q', yint64))[0]
                      
                      print(hex(xint64))
                      print(hex(yint64))
                      print(xfloat64)
                      print(yfloat64)
                      You can see that the hex strings are different, but both floats are nan. However, python is actually storing the data correctly:
                      Code:
                      assert xfloat64 != yfloat64
                      
                      xcheck64 = struct.unpack('>Q', struct.pack('>d', xfloat64))[0]
                      ycheck64 = struct.unpack('>Q', struct.pack('>d', yfloat64))[0]
                      
                      assert xint64 == xcheck64
                      assert yint64 == ycheck64
                      
                      print(hex(xcheck64))
                      print(hex(ycheck64))
                      So even though you are inducing nans, in Python you can go back and forth since the underlying data is being kept as is. This also true to a certain extent in Stata, which is why the mata code works. In my example, if you look at the first few values of enc without formatting it as hex, you will see some of them are displayed as "-" because Stata doesn't know what the number is, but it's storing the data anyway (and if you ask Stata whether this value is missing, it will tell you no).

                      All this said, I don't know the system/program you're using, so you'd have to be particularly careful to make sure the nans don't get accidentally interpreted as such (it's possible this type of lossless conversion isn't feasible). Mainly, though, I stumbled across this and thought it would be interesting to comment on the 64-bit integer->binary->64-bit float (double) possibility since it's not obvious how to do it (as you pointed out). PS: I also hope Stata someday has 64-bit integers (I'd get to close this bug).

                      Comment


                      • #12
                        Originally posted by Mauricio Caceres View Post
                        This is true, which is why I do the mata workaround and save to a binary file (in my example I checked it against the upper unsigned integer limits, which normally evaluate to nan/missing, but they don't result in loss of data in this case because the code stores the binary information, not the double representation; the example would fail in the same way you quote above if I had stored the data into Stata from Python directly).

                        [...]

                        Mainly, though, I stumbled across this and thought it would be interesting to comment on the 64-bit integer->binary->64-bit float (double) possibility since it's not obvious how to do it (as you pointed out). PS: I also hope Stata someday has 64-bit integers (I'd get to close this bug).
                        It's a clever workaround Mauricio, thanks for sharing it with us and explaining the methods behind it!

                        Comment

                        Working...
                        X