Egen Mean Function Is Not Generating a Double Var? (Precision Problem or Egen Mean?)

Emiliano Carlevaro

Join Date: May 2017

Posts: 4
#1

Egen Mean Function Is Not Generating a Double Var? (Precision Problem or Egen Mean?)

06 Dec 2017, 16:59

When I use mean() function from egen it seems that egen does not store the result as double even when I indicated that. The result is problematic since I am trying to identify values that depart from the mean (of course, I can use round() and it actually works perfect though I'm just curious about why is this going on with egen).

I have a x variable with a "binary-problematic" number. I define it with

Code:

gen double x = 0

I calculate xMean as the mean of observations per individual

Code:

by ID: egen double xMean = mean(x)

Then

Code:

gen double xDiff = x-xMean

I expect to obtain 0 however, I obtain a non-zero value. It works as expected if I use

Code:

gen double xDiff = x-round(xMean, .01)

id x xMean xDiff

1 114.77 114.77 -1.421e-14

1 114.77 114.77 -1.421e-14

It does work as expected when I manually input values (114.77) instead of using egen mean().

I read Precision (yet again), Part I and The Penultimate Guide to Precision. What am I missing?

Last edited by Emiliano Carlevaro; 06 Dec 2017, 17:05.
Tags: egen, precision, storing
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

06 Dec 2017, 18:03

You do not show your data in a form that is usable to answer this question: an HTML table does not tell us the storage types of the data. You should be using -dataex- to show example data in general, and especially in a situation like this.

That said, I'm quite sure that -egen- is storing its results as a double with that command. The thing is that double precision is still only finite precision. When you calculate the mean here, you are adding up the values and then dividing by the count. The result will, in general, not be exact even in double precision as the numbers you are working with require infinitely long binary representations. So there is some truncation error in the calculations themselves that you pick up when you try to calculate the difference.

That said, I cannot reproduce your finding for these particular numbers in my Stata:

Code:

clear input int id double x 1 114.77 1 114.77 end by id, sort: egen double xMean = mean(x) gen double diff = x - xMean list, noobs clean format x xMean %21x list, noobs clean

Note: The format %21x shows the actual hexadecimal representation of the numbers, so we can see that we really are getting an exact 0 result, it's not just -list- assuming we want to see zero if it's very close to zero.
1 like
Comment

Emiliano Carlevaro

Join Date: May 2017
Posts: 4

07 Dec 2017, 05:43

Yes, you cannot reproduce it because it was my fault. I entered example data using browse mode and it defined my variable as float. My bad, I'm sorry.

However, I still don't understand the issue with my real data which I attach using dataex. My variables are accounting ratios of banks.
Probably, this is just an example of a precision issue and Stata being a highly precise software.The hexadecimal representation of the mean variable is not the same as the input values.

HTML Code:

<strike>I probably misunderstood that if I make the difference between two double variables whose values are equal (in decimal system) but have no exact representation in the binary system I would (magically) obtain a zero value</strike>

.

The real data is grouped by date and id (i.e, panel-data), so I define xMean with

Code:

by date id, sort: egen double xMean = mean(x)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float date long id double(x xMean diff)
437   7  61.54              61.54                       0
455  11     56              54.47      1.5300000000000011
431  11  58.12              58.12                       0
455  11  52.94              54.47     -1.5300000000000011
431  11  58.12              58.12                       0
455  11     56              54.47      1.5300000000000011
455  11  52.94              54.47     -1.5300000000000011
479  14  49.57              49.57                       0
443  14  70.26              70.26                       0
455  14  61.73  61.43333333333334      .29666666666665975
431  14  73.23              73.23                       0
431 249 114.77 114.77000000000001 -1.4210854715202004e-14
431 249 114.77 114.77000000000001 -1.4210854715202004e-14
431 280  82.04              82.04                       0
443 280  88.34  89.99333333333334      -1.653333333333336
443 280  90.82  89.99333333333334       .8266666666666538
431 280  82.04              82.04                       0
443 280  90.82  89.99333333333334       .8266666666666538
431 280  82.04              82.04                       0
443 280  90.82  89.99333333333334       .8266666666666538
443 280  90.82  89.99333333333334       .8266666666666538
443 280  90.82  89.99333333333334       .8266666666666538
443 280  88.34  89.99333333333334      -1.653333333333336
443 280  90.82  89.99333333333334       .8266666666666538
443 280  88.34  89.99333333333334      -1.653333333333336
431 280  82.04              82.04                       0
end
format %tm date

I haven't mentioned before because of my ill-example data, but I get these data (variable x) from Excel (actually from an old dBase database file). I use "import excel". It generates variable x as double.I also tried importing all data as string and then using destring which seems to convert "114.77" to double 114.77. May it be a problem with importing data?

[UPDATE]
It is an importing issue. I will look at it but it is clear that at some point (between Excel, import excel and/or destring) I'm losing data even when x var is defined as double.

Last edited by Emiliano Carlevaro; 07 Dec 2017, 05:52. Reason: update of the problem

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

07 Dec 2017, 08:07

It's, indirectly, an importing issue. But primarily it's a precision issue. Yes, you are losing data--it's inevitable. There is no exact finite binary representation of numbers like 114.77. (A rational number has a finite binary representation if and only if its denominator is a power of 2.) double precision will give you smaller truncation/rounding errors than float precision. But no amount of precision will avoid such errors entirely.

Also, let's put this in perspective. Your -diff- variable is showing as -1.421e-14. That's 10^-14 instead of 0. That can't possibly make any practical difference to what you're doing. It's almost certainly negligible compared to other sources of error in the data itself.

The bottom line is that it is not possible to do exact floating point arithmetic, and, in particular, trying to get exact equality comparisons is a fool's errand. The degree of precision offered by double precision is enough for nearly all applications. If you really are in a situation where that level of precision then you need to either write custom code for quadruple or higher precision calculations, or switch to other software that incorporates quad precision.
1 like
Comment
Emiliano Carlevaro

Join Date: May 2017

Posts: 4
#5

07 Dec 2017, 08:30

I have been playing around and it is not an importing issue either. It is a precision issue.
In my real data (where the problem appeared) I have 5 equal values 114.77 that I need to average. In the example I gave in the first post, there were only 2 observations 114.77 that is why the problem cannot be reproduced. But instead, with 5 observations:

Code:

clear input int id double x 1 114.77 1 114.77 1 114.77 1 114.77 1 114.77 end by id, sort: egen double xMean = mean(x) gen double diff = x - xMean list, noobs clean format x xMean %21x list, noobs clean

Now, x-xMean is not zero.

I only obtain the expected result 0 (HEX +0.0000000000000X-3ff) if I use

Code:

gen double diff2 = round(x, .01)-round(xMean, .01)

That works fine for me in this case.

I feel the lesson for me is that precision cannot be neglected by the sole fact of using double variables.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3872
#6

07 Dec 2017, 09:22

It is interesting that the results can be reproduced for 5 (or 9, or 10, ...), but not 2 (or 3, or 4, or 6, or 7) observations. Precision still is a fascinating issue for me.

Best
Daniel
Comment

Announcement