  • Rounding error using format?

    Hello,

    I am experiencing rounding errors when using the format function. I am using a personal dataset with red blood cell counts: an example of the values I have would be 1.284E+13 (cells/mL). I perform summary statistics using the tabstat function with the format option to display 3 decimals to the right of the decimal point. Here is the function call:

    tabstat D1, statistics( mean sd var median ) by(Group) columns(statistics) longstub

    And the result (two digits to the right of the decimal point):

    Group        variable |      mean        sd  variance       p50
    ----------------------+----------------------------------------
    1                  D1 |  1.26e+13  1.08e+12  1.16e+24  1.26e+13
    2                  D1 |  1.15e+13  2.64e+12  6.98e+24  1.20e+13
    3                  D1 |  1.21e+13  6.20e+11  3.84e+23  1.21e+13
    4                  D1 |  1.22e+13  6.56e+11  4.30e+23  1.21e+13
    ----------------------+----------------------------------------
    Total              D1 |  1.21e+13  1.48e+12  2.20e+24  1.22e+13
    ---------------------------------------------------------------

    Now, because I wanted 3 decimals, I modified the command line to:

    tabstat D1, statistics( mean sd var median ) by(Group) columns(statistics) longstub format(%10.3f)

    And get the following results:


    Group        variable |      mean        sd  variance       p50
    ----------------------+----------------------------------------
    1                  D1 | 1.263e+13 1.079e+12 1.165e+24 1.260e+13
    2                  D1 | 1.146e+13 2.642e+12 6.983e+24 1.200e+13
    3                  D1 | 1.209e+13 6.198e+11 3.841e+23 1.210e+13
    4                  D1 | 1.221e+13 6.556e+11 4.298e+23 1.205e+13
    ----------------------+----------------------------------------
    Total              D1 | 1.210e+13 1.484e+12 2.203e+24 1.215e+13
    ---------------------------------------------------------------

    Variance for group 1 is 1.165E+24, rounded to 1.16E+24 in the first table. It should be rounded to 1.17E+24. I've tried different format options but nothing changes. This happens on other statistics of the same kind when Stata rounds a number ending in 5. Is there a way to force Stata to round as it should be? I am using Stata 14 but had the same problem with Stata 13.1.

    Has anybody had the same problem or does anybody know if this is merely because my command is not correct or if this is a bug?

    Thank you,

    Mathias

  • #2
    A display format is in effect an expressed preference, not an absolute instruction. If you have numbers that require many more than 20 characters to display in full, and you have several of them to display, tabstat can't do what you ask in the space available. That is not a Stata error unless it's a Stata error to be unable to fulfil impossible requests.

    Note that the .3f detail in formats like %10.3f controls the number of decimal places after the dot or period. I think you are expecting that it does something else, say control the number of significant figures.
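
    To make that distinction concrete, here is a minimal sketch contrasting a fixed format with a scientific one on the same number. The scalar name x is just an illustration. In %w.df the d counts digits after the decimal point of the number itself; in %w.de it counts digits after the decimal point of the mantissa, so %w.de shows d + 1 significant figures.

    ```stata
    * Same value, two format families (x is a hypothetical scalar name)
    scalar x = 1e10 + 1/7

    * Fixed format: 3 digits after the decimal point of the number itself.
    * A width of 16 is enough room for the 11 integer digits.
    display %16.3f scalar(x)

    * Scientific format: 3 digits after the point of the mantissa,
    * that is, 4 significant figures in total.
    display %16.3e scalar(x)

    * Asking for more significant figures means raising d in %w.de.
    display %16.9e scalar(x)
    ```

    If significant figures are what matter, a %e format states that intent directly, whereas a %f format falls back to scientific notation only when the number does not fit in the requested width.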

    Consider the following examples: Here we first underline the previous point about 3 decimal places. Then we show that trying to show a number that is just a little more than 10 billion, but asking for 3 decimal places too, requires %15.3f as the minimal format. All your numbers are much bigger than that!

    Code:
     
    . di %6.3f 1/7
     0.143
    
    . di %10.3f 1e10 +  1/7
     1.000e+10
    
    . di %9.3f 1e10 +  1/7
     1.00e+10
    
    . di %11.3f 1e10 +  1/7
      1.000e+10
    
    . di %12.3f 1e10 +  1/7
       1.000e+10
    
    . di %13.3f 1e10 +  1/7
        1.000e+10
    
    . di %14.3f 1e10 +  1/7
         1.000e+10
    
    . di %15.3f 1e10 +  1/7
    10000000000.143

    Note also that tabstat is a command, not a function.

    What can you do? I don't work with this kind of data, but I'd suggest

    1. Not showing the variance as well as the SD. I can't believe that it adds any scientific or statistical merit to the table.

    2. Changing the units to, say, billion cells per mL.

    3. Experimenting with format commands and examples using the display command, as above. It's the easiest way to get practical experience.
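
    Suggestions 1 and 2 could look something like the sketch below. The data are a toy stand-in: only 1.284e13 comes from the original post, the other values and the variable name D1_bn are made up for illustration.

    ```stata
    * Toy data standing in for the real dataset (values mostly invented)
    clear
    input byte Group double D1
    1 1.284e13
    1 1.150e13
    2 1.200e13
    2 1.310e13
    end

    * Rescale to billions of cells per mL so the numbers fit a %f format
    generate double D1_bn = D1 / 1e9
    label variable D1_bn "RBC count (billion cells/mL)"

    * Variance dropped, SD kept; %10.3f now has room for 3 decimal places
    tabstat D1_bn, statistics(mean sd median) by(Group) ///
        columns(statistics) longstub format(%10.3f)
    ```

    With values around 12,000 rather than 1.2e13, tabstat has no reason to override the requested format.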


    • #3
      "Variance for group 1 is 1.165E+24, rounded to 1.16E+24 in the first table. It should be rounded to 1.17E+24."
      Not necessarily. The variance for group 1, rounded to four significant digits, is 1.165E+24. Applying the techniques Nick demonstrated,
      Code:
      . display %10.4f 1.16490000
          1.1649
      
      . display %10.3f 1.16490000
           1.165
      
      . display %10.2f 1.16490000
            1.16
      Last edited by William Lisowski; 22 Jun 2015, 08:16.


      • #4
        Thank you. I understand your points. But if I run tabstat with more significant digits, I get:

        . tabstat D1, statistics( mean sd var median ) by(Group) columns(statistics) longstub format (%14.8f)

        Group        variable |          mean            sd      variance           p50
        ----------------------+--------------------------------------------------------
        1                  D1 | 1.2625000e+13 1.0793517e+12 1.1650000e+24 1.2600000e+13
        2                  D1 | 1.1462500e+13 2.6424758e+12 6.9826786e+24 1.2000000e+13
        3                  D1 | 1.2087500e+13 6.1976378e+11 3.8410714e+23 1.2100000e+13
        4                  D1 | 1.2212500e+13 6.5560768e+11 4.2982143e+23 1.2050000e+13
        ----------------------+--------------------------------------------------------
        Total              D1 | 1.2096875e+13 1.4842147e+12 2.2028931e+24 1.2150000e+13
        -------------------------------------------------------------------------------

        The value is 1.1650000e+24. If I display that value with two significant digits, I get:
        . display %10.2f 1.1650000e+24
        1.17e+24

        Why is it that with tabstat, it doesn't round this way?

        I've got other examples of this, and it's always when the last significant digit is 5 followed by 0s (and the number is rounded downwards and not upwards).

        Would it be possible that the result is actually something like 1.164999999999e+24 in which case I agree this should be rounded to 1.16e+24:

        . display %10.2f 1.164999999e+24
        1.16e+24

        . display %10.3f 1.164999999e+24
        1.165e+24

        Is there a way of retrieving the actual number as stored by Stata? If I increase the number of significant digits, I still get 0s after the 5.

        I understand this seems minor, but I need to verify the different functions of Stata that I will be using by comparing their results with those from the statistical software I used previously (StatView), Excel, and R. All three give me 1.17e+24. I work in the pharmaceutical industry and need to validate all the software I use, including Stata (even though it is commercial software and comes with a tool for verifying its correct installation). This may seem a bit stupid or a waste of time, but quality assurance can get very picky on this one.


        • #5
          This seems to repeat much of your previous question. As before, "significant figures" and "decimal places" are not one and the same. You didn't respond to my suggestions that changing the units and not showing variance would help mightily.

          Restating part of my answer another way, tabstat resorts to force, or, to put it more politely, has its own ideas of which format is best when space constraints and format requests collide. Given that, you have various choices, which include

          1. cloning tabstat and modifying it so that it does what you want

          2. writing your own program

          3. trying something else.

          Here I focus on #3. We can fire up the auto data and change the units of mpg to miles per billionth of a gallon to get results round about a billion or ten billion.

          Then group summary statistics can be calculated using egen and shown with tabdisp.

          Code:
          . sysuse auto, clear 
          (1978 Automobile Data)
          
          . 
          . foreach s in mean median sd {
            2.         egen double s`s' = `s'(1e9 * mpg), by(rep78)
            3. }
          
          . 
          . tabdisp rep78, c(smean smedian ssd)
          
          ----------------------------------------------
          Repair    |
          Record    |
          1978      |      smean     smedian         ssd
          ----------+-----------------------------------
                  1 |  2.100e+10    2.10e+10   4.243e+09
                  2 |  1.913e+10    1.80e+10   3.758e+09
                  3 |  1.943e+10    1.90e+10   4.141e+09
                  4 |  2.167e+10    2.25e+10   4.935e+09
                  5 |  2.736e+10    3.00e+10   8.732e+09
                  . |  2.140e+10    2.20e+10   5.079e+09
          ----------------------------------------------
          
          . 
          . tabdisp rep78, c(smean smedian ssd) format(%15.3f) 
          
          -------------------------------------------------------------
          Repair    |
          Record    |
          1978      |           smean          smedian              ssd
          ----------+--------------------------------------------------
                  1 | 21000000000.000  20999999488.000   4242640687.119
                  2 | 19125000000.000  17999998976.000   3758324094.593
                  3 | 19433333333.333  19000000512.000   4141325236.279
                  4 | 21666666666.667  22499999744.000   4934869924.980
                  5 | 27363636363.636  30000001024.000   8732384866.378
                  . | 21400000000.000  22000001024.000   5079370039.680
          -------------------------------------------------------------

          For even more flexibility, assign each summary variable its own format and then use list.
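
          A minimal sketch of that last idea, continuing with the auto data. The particular formats and the tag variable name first are only illustrative choices, not recommendations.

          ```stata
          sysuse auto, clear

          * Group summaries as variables, as in the tabdisp example
          foreach s in mean median sd {
              egen double s`s' = `s'(1e9 * mpg), by(rep78)
          }

          * Each summary variable gets its own display format
          format smean   %15.3f
          format smedian %15.0f
          format ssd     %12.3e

          * Tag one observation per group (including missing rep78) and list it
          egen byte first = tag(rep78), missing
          sort rep78
          list rep78 smean smedian ssd if first, noobs abbreviate(12)
          ```

          Because list honours each variable's own format, one column can be shown in fixed notation and another in scientific notation within the same table.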


          • #6
            If you have not done so already, you should review the output of Stata's help precision command as well as the various entries in the Stata Blog returned by the output of Stata's search precision command.

            Rather than concern yourself with rounding, I'd recommend displaying the results from Stata and the package you wish to compare it to with enough significant digits that you can confirm that they agree suitably well for your purposes, regardless of the rounding. For example, if you have 1.164 500 and 1.164 499 as the displayed values from two packages, then you know that the first value must be at most 1.164 500 5 and the second at least 1.164 498 5 so the two values differ by no more than .000 002.
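
            One way to see the value tabstat actually computed, rather than its formatted display, is to ask tabstat to save its results and then display them yourself. The sketch below uses the auto data as a stand-in for your dataset (you would substitute D1 and Group), and the comparison value is a placeholder, not a real result from any other package.

            ```stata
            sysuse auto, clear

            * save leaves the statistics behind in r(Stat1), r(Stat2), ..., r(StatTotal)
            quietly tabstat mpg, statistics(mean sd variance median) by(foreign) save
            matrix S1 = r(Stat1)          // first group: one row per statistic
            matrix list S1, format(%24.16e)

            * Third statistic requested above is the variance
            scalar v1 = S1[3,1]
            display %24.16e scalar(v1)    // 17 significant digits
            display %21x    scalar(v1)    // exact stored value in hexadecimal

            * Placeholder: paste the value reported by the other package here
            scalar other = 22.5
            display "relative difference = " reldif(scalar(v1), scalar(other))
            ```

            A value that prints as 1.1650000e+24 with eight decimals may still turn out to be slightly below 1.165e+24 when shown with 17 significant digits or in %21x, which would be enough to explain rounding to 1.16e+24.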
            Last edited by William Lisowski; 23 Jun 2015, 21:02.


            • #7
              Thank you very much. I will try your suggestions.
