
  • Testing whether recasting as a float from a double will result in lost precision

    I have a lot of datasets that are the result of an export from SAS. As far as I can tell, SAS always defaults to double-precision floats when exporting to Stata. The floating-point variables in these datasets nearly always appear to be "stored" as single-precision floats, in the sense that their values are obviously rounded to the roughly 6.92 decimal digits of precision that a single-precision float carries.

    Data items like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double var1
    9.642
    8.165
    8.482
    9.351
    end
    will then be stored as:
    Code:
    var1
    9.641999999999999
    8.164999999999999
    8.481999999999999
    9.351000000000001
    (I know you can change the display format and Stata will silently adapt how it displays the values, but this is more a question about type conversion.)
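
    For instance, the display format changes only what is shown, never the stored bits (illustrative only):
    Code:
    * a wide format exposes the trailing junk digits; a narrow one hides them
    format var1 %18.0g
    list var1    // wide format: the trailing 999... digits appear
    format var1 %9.0g
    list var1    // narrow format: back to 9.642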

    If I try to test whether the variables are identical when rounded to 5 decimal places versus 7 decimal places,
    Code:
    assert round(var1,.00001)== round(var1,.0000001)
    it will fail. I'm guessing this is because of some underlying binary precision issue that I don't quite understand. On top of that, I'm not sure how to round to the exact precision of a single-precision float. Because of this, I'm at a loss as to why
    Code:
    assert float(var1)==var1
    fails. I know it has something to do with binary representations of base 10 numbers, but I never had formal training in computer science on this.


    Does anyone have any advice on how to pseudo-test the implied decimal precision of a double, so that it can be recast as a float when the extra precision is unwarranted?

  • #2
    What you report as the number stored is just a better decimal approximation to what is stored in binary.

    round() is a treacherous witness, as it can't do what people sometimes think it does, namely round exactly to a multiple of a negative power of 10. All it can do is produce another binary approximation to a different decimal.
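
    For instance, inspecting what round() actually returns on your example data makes the point (a minimal sketch, reusing var1 from above):
    Code:
    * each round() call yields a double that is itself only a binary
    * approximation of its decimal target, so the two "rounded" versions
    * need not agree in their trailing bits
    gen double r5 = round(var1, .00001)
    gen double r7 = round(var1, .0000001)
    format r5 r7 %21x
    list r5 r7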

    I think this is a more direct approach to your question. A hexadecimal display format shows more clearly how many bits are in use.



    Code:
    . gen float fvar1 = float(var1)
    
    . gen double diff = var1 - float(var1)
    
    .
    . format * %21x
    
    .
    . list
    
         +-----------------------------------------------------------------------+
         |                  var1                   fvar1                    diff |
         |-----------------------------------------------------------------------|
      1. | +1.348b439581062X+003   +1.348b440000000X+003   -1.a9fbe78000000X-017 |
      2. | +1.0547ae147ae14X+003   +1.0547ae0000000X+003   +1.47ae140000000X-019 |
      3. | +1.0f6c8b4395810X+003   +1.0f6c8c0000000X+003   -1.78d4fe0000000X-016 |
      4. | +1.2b3b645a1cac1X+003   +1.2b3b640000000X+003   +1.6872b04000000X-017 |
         +-----------------------------------------------------------------------+
    
    .
    . format * %21.18f
    
    .
    . list
    
         +---------------------------------------------------------------------+
         |                 var1                  fvar1                    diff |
         |---------------------------------------------------------------------|
      1. | 9.641999999999999460   9.642000198364257813   -0.000000198364258353 |
      2. | 8.164999999999999147   8.164999961853027344    0.000000038146971804 |
      3. | 8.481999999999999318   8.482000350952148438   -0.000000350952149120 |
      4. | 9.351000000000000867   9.350999832153320313    0.000000167846680554 |
         +---------------------------------------------------------------------+
    Last edited by Nick Cox; 23 Apr 2020, 13:25.



    • #3
      Is there any reference on how Stata coaxes floating point numbers into their exact decimal representations, and can this be replicated as a user program?

      Stata clearly understands, on some level, that the two binary representations of 9.642 are identical in a decimal sense: outsheet-ing their values to ASCII characters produces bitwise-identical output, which then reads back in as bitwise-identical double-precision floats.

      Code:
      gen float fvar1 = float(var1)
      outsheet using temp.csv, replace
      insheet using temp.csv, clear double
      assert var1==fvar1
      This is obviously an insane thing to do from a memory and CPU perspective, but something must be happening under the hood that ought to be replicable as a variable operation?



      • #4
        To Nick's advice let me add as recommended reading Bill Gould's extensive blog post on numeric precision in Stata, found at

        https://blog.stata.com/2012/04/02/th...-to-precision/

        and in particular to section 5.5, a discussion of "false precision". He writes, "Little in this world is measured to a relative accuracy of ±2^-24, the accuracy provided by float precision," and backs that up with examples. But he's not dogmatic; he does recognize that
        Nonetheless, a few things have been measured with more than float accuracy, and they stand out as crowning accomplishments of mankind. Use double as required.
        It seems likely that little is to be lost by using recast float ... , force on your double-precision numbers. The one exception I might make, and Bill discusses this as well, is exact data, such as that for currency. And even then, he points out
        The U.S. deficit in 2011 was $1.5 trillion. Stored as a float, this amount has a (maximum) error of ±2^-24 * 1.5e+12 = ±$89,406.97. It would be difficult to imagine that ±$89,406.97 would affect any government decision maker dealing with the full $1.5 trillion.
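
        That arithmetic is easy to verify from within Stata itself (a one-line sanity check; not from Bill's post):
        Code:
        * maximum relative error of a float is 2^-24; scale it by $1.5 trillion
        display %12.2f 2^-24 * 1.5e+12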



        • #5
          I guess, to your point, it's probably not worth individually testing their precision if they are true floating-point numbers, and I can just crush all of them, maybe with a test to see whether any of the doubles are actually large integers that are potentially ID numbers (something like the sketch below). It feels unsatisfying, but maybe I should learn to live with that rather than feeding my OCD.
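
          A minimal sketch of that integer check (my own guess at a rule: a single-precision float stores integers exactly only up to 2^24, so whole numbers beyond that should stay doubles):
          Code:
          * flag whole-number doubles too large for exact float storage;
          * floats represent integers exactly only up to 2^24 = 16,777,216
          gen byte keep_double = (var1 == int(var1)) & (abs(var1) > 2^24)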



          • #6
            In response to post #3, which arrived while I was writing post #4, the outsheet command simply transforms both numbers into character strings using their display formats.
            Code:
            . input double var1
            
                       var1
              1. 9.642
              2. 8.165
              3. 8.482
              4. 9.351
              5. end
            
            . generate float fvar1 = float(var1)
            
            . describe var1 fvar1
            
                          storage   display    value
            variable name   type    format     label      variable label
            ------------------------------------------------------------------------------------------------
            var1            double  %10.0g                
            fvar1           float   %9.0g                 
            
            . outsheet using "~/Downloads/temp.csv", replace
            
            . format var1 fvar1 %21x
            
            . list
            
                 +-----------------------------------------------+
                 |                  var1                   fvar1 |
                 |-----------------------------------------------|
              1. | +1.348b439581062X+003   +1.348b440000000X+003 |
              2. | +1.0547ae147ae14X+003   +1.0547ae0000000X+003 |
              3. | +1.0f6c8b4395810X+003   +1.0f6c8c0000000X+003 |
              4. | +1.2b3b645a1cac1X+003   +1.2b3b640000000X+003 |
                 +-----------------------------------------------+
            
            . type "~/Downloads/temp.csv"
            var1    fvar1
            9.642   9.642
            8.165   8.165
            8.482   8.482
            9.351   9.351
            So you could perhaps compare
            Code:
            strofreal(var1,"%10.0g") == strofreal(float(var1),"%10.0g")
            to achieve some measure of what you seek.
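
            For example, to recast only when every value's decimal image survives the round trip (a sketch; it assumes the %10.0g display format reflects the precision you care about):
            Code:
            * recast var1 to float only if its decimal image, as rendered by
            * %10.0g, is unchanged by the double -> float conversion
            capture assert strofreal(var1,"%10.0g") == strofreal(float(var1),"%10.0g")
            if _rc == 0 {
                recast float var1, force
            }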



            • #7
              With regard to post #5: yeah, I too hate to throw away hard-earned bits; you never know when you're going to need them. I'm sure we're going to see shortages of digits any day now, and I'll regret a lifetime spent thinking they would never be significant. :-)
