Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Egen Mean Function Is Not Generating a Double Var? (Precision Problem or Egen Mean?)

    When I use mean() function from egen it seems that egen does not store the result as double even when I indicated that. The result is problematic since I am trying to identify values that depart from the mean (of course, I can use round() and it actually works perfect though I'm just curious about why is this going on with egen).

    I have a x variable with a "binary-problematic" number. I define it with
    Code:
    gen double x = 0
    I calculate xMean as the mean of observations per individual
    Code:
    by ID: egen double xMean = mean(x)
    Then

    Code:
    gen double xDiff = x-xMean
    I expect to obtain 0 however, I obtain a non-zero value. It works as expected if I use
    Code:
    gen double xDiff = x-round(xMean, .01)
    id x xMean xDiff
    1 114.77 114.77 -1.421e-14
    1 114.77 114.77 -1.421e-14
    It does work as expected when I manually input values (114.77) instead of using egen mean().

    I read Precision (yet again), Part I and The Penultimate Guide to Precision. What am I missing?
    Last edited by Emiliano Carlevaro; 06 Dec 2017, 17:05.

  • #2
    You do not show your data in a form that is usable to answer this question: an HTML table does not tell us the storage types of the data. You should be using -dataex- to show example data in general, and especially in a situation like this.

    That said, I'm quite sure that -egen- is storing its results as a double with that command. The thing is that double precision is still only finite precision. When you calculate the mean here, you are adding up the values and then dividing by the count. The result will, in general, not be exact even in double precision as the numbers you are working with require infinitely long binary representations. So there is some truncation error in the calculations themselves that you pick up when you try to calculate the difference.

    That said, I cannot reproduce your finding for these particular numbers in my Stata:

    Code:
    clear
    input int id double x
    1 114.77
    1 114.77
    end
    
    by id, sort: egen double xMean = mean(x)
    
    gen double diff = x - xMean
    list, noobs clean
    format x xMean %21x
    list, noobs clean
    Note: The format %21x shows the actual hexadecimal representation of the numbers, so we can see that we really are getting an exact 0 result, it's not just -list- assuming we want to see zero if it's very close to zero.

    Comment


    • #3
      Yes, you cannot reproduce it because it was my fault. I entered example data using browse mode and it defined my variable as float. My bad, I'm sorry.

      However, I still don't understand the issue with my real data which I attach using dataex. My variables are accounting ratios of banks.
      Probably, this is just an example of a precision issue and Stata being a highly precise software.The hexadecimal representation of the mean variable is not the same as the input values.
      HTML Code:
      <strike>I probably misunderstood that if I make the difference between two double variables whose values are equal (in decimal system) but have no exact representation in the binary system I would (magically) obtain a zero value</strike>
      .

      The real data is grouped by date and id (i.e, panel-data), so I define xMean with
      Code:
      by date id, sort: egen double xMean = mean(x)
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float date long id double(x xMean diff)
      437   7  61.54              61.54                       0
      455  11     56              54.47      1.5300000000000011
      431  11  58.12              58.12                       0
      455  11  52.94              54.47     -1.5300000000000011
      431  11  58.12              58.12                       0
      455  11     56              54.47      1.5300000000000011
      455  11  52.94              54.47     -1.5300000000000011
      479  14  49.57              49.57                       0
      443  14  70.26              70.26                       0
      455  14  61.73  61.43333333333334      .29666666666665975
      431  14  73.23              73.23                       0
      431 249 114.77 114.77000000000001 -1.4210854715202004e-14
      431 249 114.77 114.77000000000001 -1.4210854715202004e-14
      431 280  82.04              82.04                       0
      443 280  88.34  89.99333333333334      -1.653333333333336
      443 280  90.82  89.99333333333334       .8266666666666538
      431 280  82.04              82.04                       0
      443 280  90.82  89.99333333333334       .8266666666666538
      431 280  82.04              82.04                       0
      443 280  90.82  89.99333333333334       .8266666666666538
      443 280  90.82  89.99333333333334       .8266666666666538
      443 280  90.82  89.99333333333334       .8266666666666538
      443 280  88.34  89.99333333333334      -1.653333333333336
      443 280  90.82  89.99333333333334       .8266666666666538
      443 280  88.34  89.99333333333334      -1.653333333333336
      431 280  82.04              82.04                       0
      end
      format %tm date
      I haven't mentioned before because of my ill-example data, but I get these data (variable x) from Excel (actually from an old dBase database file). I use "import excel". It generates variable x as double.I also tried importing all data as string and then using destring which seems to convert "114.77" to double 114.77. May it be a problem with importing data?

      [UPDATE]
      It is an importing issue. I will look at it but it is clear that at some point (between Excel, import excel and/or destring) I'm losing data even when x var is defined as double.
      Last edited by Emiliano Carlevaro; 07 Dec 2017, 05:52. Reason: update of the problem

      Comment


      • #4
        It's, indirectly, an importing issue. But primarily it's a precision issue. Yes, you are losing data--it's inevitable. There is no exact finite binary representation of numbers like 114.77. (A rational number has a finite binary representation if and only if its denominator is a power of 2.) double precision will give you smaller truncation/rounding errors than float precision. But no amount of precision will avoid such errors entirely.

        Also, let's put this in perspective. Your -diff- variable is showing as -1.421e-14. That's 10-14 instead of 0. That can't possibly make any practical difference to what you're doing. It's almost certainly negligible compared to other sources of error in the data itself.

        The bottom line is that it is not possible to do exact floating point arithmetic, and, in particular, trying to get exact equality comparisons is a fool's errand. The degree of precision offered by double precision is enough for nearly all applications. If you really are in a situation where that level of precision then you need to either write custom code for quadruple or higher precision calculations, or switch to other software that incorporates quad precision.

        Comment


        • #5
          I have been playing around and it is not an importing issue either. It is a precision issue.
          In my real data (where the problem appeared) I have 5 equal values 114.77 that I need to average. In the example I gave in the first post, there were only 2 observations 114.77 that is why the problem cannot be reproduced. But instead, with 5 observations:

          Code:
          clear
          input int id double x
          1 114.77
          1 114.77
          1 114.77
          1 114.77
          1 114.77
          end
          
          by id, sort: egen double xMean = mean(x)
          
          gen double diff = x - xMean
          list, noobs clean
          format x xMean %21x
          list, noobs clean
          Now, x-xMean is not zero.

          I only obtain the expected result 0 (HEX +0.0000000000000X-3ff) if I use
          Code:
          gen double diff2 = round(x, .01)-round(xMean, .01)
          That works fine for me in this case.

          I feel the lesson for me is that precision cannot be neglected by the sole fact of using double variables.

          Comment


          • #6
            It is interesting that the results can be reproduced for 5 (or 9, or 10, ...), but not 2 (or 3, or 4, or 6, or 7) observations. Precision still is a fascinating issue for me.

            Best
            Daniel

            Comment

            Working...
            X