
  • Testing whether recasting as a float from a double will result in lost precision

    I have a lot of datasets that are the result of an export from SAS. As far as I can tell, SAS always defaults to double-precision floats when exporting to Stata. The floating-point variables in these datasets nearly always appear to be "stored" as single-precision floats, in the sense that their values are obviously rounded to the roughly 6.92 decimal digits of precision that a single-precision float carries.

    Data items like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double var1
    9.642
    8.165
    8.482
    9.351
    end
    will then be stored as:
    Code:
    var1
    9.641999999999999
    8.164999999999999
    8.481999999999999
    9.351000000000001
    (I know you can change the display format and Stata will silently adapt how it displays the values, but this is more a question about type conversion.)
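
    For instance, the display format changes only what is shown, never the stored bits (illustrative only):
    Code:
    * a wide format exposes the trailing junk digits; a narrow one hides them
    format var1 %18.0g
    list var1    // wide format: the trailing 999... digits appear
    format var1 %9.0g
    list var1    // narrow format: back to 9.642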

    If I try to test whether the variables are identical when rounded to 5 decimal places versus 7 decimal places,
    Code:
    assert round(var1,.00001)== round(var1,.0000001)
    it will fail. I'm guessing this is because of some underlying binary precision issue that I don't quite understand. On top of that, I'm not sure how to round to the exact precision of a single-precision float. Because of this, I'm at a loss as to why
    Code:
    assert float(var1)==var1
    fails. I know it has something to do with binary representations of base 10 numbers, but I never had formal training in computer science on this.


    Does anyone have any advice on how to pseudo-test the implied decimal precision of a double, so that it can be recast as a float when the extra precision is unwarranted?

  • #2
    What you report as the number stored is just a better decimal approximation to what is stored in binary.

    round() is a treacherous witness, as it can't do what people sometimes think it does, namely round exactly to a multiple of a negative power of 10. All it can do is produce another binary approximation to a different decimal.
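
    For instance, inspecting what round() actually returns on your example data makes the point (a minimal sketch, reusing var1 from above):
    Code:
    * each round() call yields a double that is itself only a binary
    * approximation of its decimal target, so the two "rounded" versions
    * need not agree in their trailing bits
    gen double r5 = round(var1, .00001)
    gen double r7 = round(var1, .0000001)
    format r5 r7 %21x
    list r5 r7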

    I think this is a more direct approach to your question. A hexadecimal display format shows more clearly how many bits are in use.



    Code:
    . gen float fvar1 = float(var1)
    
    . gen double diff = var1 - float(var1)
    
    .
    . format * %21x
    
    .
    . list
    
         +-----------------------------------------------------------------------+
         |                  var1                   fvar1                    diff |
         |-----------------------------------------------------------------------|
      1. | +1.348b439581062X+003   +1.348b440000000X+003   -1.a9fbe78000000X-017 |
      2. | +1.0547ae147ae14X+003   +1.0547ae0000000X+003   +1.47ae140000000X-019 |
      3. | +1.0f6c8b4395810X+003   +1.0f6c8c0000000X+003   -1.78d4fe0000000X-016 |
      4. | +1.2b3b645a1cac1X+003   +1.2b3b640000000X+003   +1.6872b04000000X-017 |
         +-----------------------------------------------------------------------+
    
    .
    . format * %21.18f
    
    .
    . list
    
         +---------------------------------------------------------------------+
         |                 var1                  fvar1                    diff |
         |---------------------------------------------------------------------|
      1. | 9.641999999999999460   9.642000198364257813   -0.000000198364258353 |
      2. | 8.164999999999999147   8.164999961853027344    0.000000038146971804 |
      3. | 8.481999999999999318   8.482000350952148438   -0.000000350952149120 |
      4. | 9.351000000000000867   9.350999832153320313    0.000000167846680554 |
         +---------------------------------------------------------------------+
    Last edited by Nick Cox; 23 Apr 2020, 13:25.



    • #3
      Is there any reference on how Stata coaxes floating point numbers into their exact decimal representations, and can this be replicated as a user program?

      Stata clearly understands, on some level, that the two binary representations of 9.642 are identical in a decimal sense: outsheet-ing their values to ASCII characters produces bitwise-identical output, which then reads back in as bitwise-identical double-precision floats.

      Code:
      gen float fvar1 = float(var1)
      outsheet using temp.csv, replace
      insheet using temp.csv, clear double
      assert var1==fvar1
      This is obviously an insane thing to do from a memory and CPU perspective, but something must be happening under the hood that ought to be replicable as a variable operation?



      • #4
        To Nick's advice let me add as recommended reading Bill Gould's extensive blog post on numeric precision in Stata, found at

        https://blog.stata.com/2012/04/02/th...-to-precision/

        and in particular to section 5.5, a discussion of "false precision". He writes, "Little in this world is measured to a relative accuracy of ±2^-24, the accuracy provided by float precision," and backs that up with examples. But he's not dogmatic; he does recognize that
        Nonetheless, a few things have been measured with more than float accuracy, and they stand out as crowning accomplishments of mankind. Use double as required.
        It seems likely that little is to be lost by using recast float ... , force on your double-precision numbers. The one exception I might make, and Bill discusses this as well, is exact data, such as that for currency. And even then, he points out
        The U.S. deficit in 2011 was $1.5 trillion. Stored as a float, this amount has a (maximum) error of ±2^-24 * 1.5e+12 = ±$89,406.97. It would be difficult to imagine that ±$89,406.97 would affect any government decision maker dealing with the full $1.5 trillion.
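
        That arithmetic is easy to verify from within Stata itself (a one-line sanity check; not from Bill's post):
        Code:
        * maximum relative error of a float is 2^-24; scale it by $1.5 trillion
        display %12.2f 2^-24 * 1.5e+12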



        • #5
          I guess, to your point, it's probably not worth individually testing their precision if they are true floating-point numbers, and I can just crush all of them, maybe with a test to see whether any of the doubles are actually large integers that are potentially ID numbers (something like the sketch below). It feels unsatisfying, but maybe I should learn to live with that rather than feeding my OCD.
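
          A minimal sketch of that integer check (my own guess at a rule: a single-precision float stores integers exactly only up to 2^24, so whole numbers beyond that should stay doubles):
          Code:
          * flag whole-number doubles too large for exact float storage;
          * floats represent integers exactly only up to 2^24 = 16,777,216
          gen byte keep_double = (var1 == int(var1)) & (abs(var1) > 2^24)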



          • #6
            In response to post #3, which arrived while I was writing post #4, the outsheet command simply transforms both numbers into character strings using their display formats.
            Code:
            . input double var1
            
                       var1
              1. 9.642
              2. 8.165
              3. 8.482
              4. 9.351
              5. end
            
            . generate float fvar1 = float(var1)
            
            . describe var1 fvar1
            
                          storage   display    value
            variable name   type    format     label      variable label
            ------------------------------------------------------------------------------------------------
            var1            double  %10.0g                
            fvar1           float   %9.0g                 
            
            . outsheet using "~/Downloads/temp.csv", replace
            
            . format var1 fvar1 %21x
            
            . list
            
                 +-----------------------------------------------+
                 |                  var1                   fvar1 |
                 |-----------------------------------------------|
              1. | +1.348b439581062X+003   +1.348b440000000X+003 |
              2. | +1.0547ae147ae14X+003   +1.0547ae0000000X+003 |
              3. | +1.0f6c8b4395810X+003   +1.0f6c8c0000000X+003 |
              4. | +1.2b3b645a1cac1X+003   +1.2b3b640000000X+003 |
                 +-----------------------------------------------+
            
            . type "~/Downloads/temp.csv"
            var1    fvar1
            9.642   9.642
            8.165   8.165
            8.482   8.482
            9.351   9.351
            So you could perhaps compare
            Code:
            strofreal(var1,"%10.0g") == strofreal(float(var1),"%10.0g")
            to achieve some measure of what you seek.
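
            For example, to recast only when every value's decimal image survives the round trip (a sketch; it assumes the %10.0g display format reflects the precision you care about):
            Code:
            * recast var1 to float only if its decimal image, as rendered by
            * %10.0g, is unchanged by the double -> float conversion
            capture assert strofreal(var1,"%10.0g") == strofreal(float(var1),"%10.0g")
            if _rc == 0 {
                recast float var1, force
            }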



            • #7
              With regard to post #5: yeah, I too hate to throw away hard-earned bits; you never know when you're going to need them. I'm sure we're going to see shortages of digits any day now, and I'll regret a lifetime spent thinking they would never be significant. :-)
