
  • Stata randomly adds/subtracts 0.0000001 to random numeric values

    Dear Stata users,

    I have a problem where Stata either adds or subtracts 0.000001/0.0000001 to a seemingly random set of my numeric values.
    I have read several forum posts about Stata adding spurious decimals, but I have been unable to find a solution after several hours.
    The data are entirely numeric in the format x.xx (blood sample concentration).

    I have tried the following:
    - Loading the variables as string using import delimited with stringcols
    - Converting to numeric by either
    gen hyp_cal_blood_0_num = real(blood_0_ca_uncorrected)
    OR
    destring blood_0_ca_uncorrected cacl_peak, replace force float
    This yields the same problem: 19 out of 108 observations had 0.000001/0.0000001 added or subtracted.
    E.g. one value of 0.95 is converted to 0.94999999 - and I cannot figure out why exactly this value got the addition/subtraction while others did not.
    The only pattern is that values observed multiple times are all shifted by the same amount, e.g. 1.18 --> 1.1799999 and 1.19 --> 1.1900001.

    The original data has exactly two decimal places, so it is not possible that e.g. 1.1900001 is an actual observed value. It is added/subtracted by Stata for some reason.

    I need the variables to be numeric in order to subgroup the values into four groups (coded 0-3). I do this by:
    Code:
    * indicator for each concentration range (missing outside the range)
    generate hyp_cal_blood_0_0 = 0 if blood_0_ca_uncorrected<=1.32 & blood_0_ca_uncorrected<.
    generate hyp_cal_blood_0_1 = 1 if blood_0_ca_uncorrected>=1.33 & blood_0_ca_uncorrected<=1.46 & blood_0_ca_uncorrected<.
    generate hyp_cal_blood_0_2 = 2 if blood_0_ca_uncorrected>=1.47 & blood_0_ca_uncorrected<=2.00 & blood_0_ca_uncorrected<.
    generate hyp_cal_blood_0_3 = 3 if blood_0_ca_uncorrected>=2.01 & blood_0_ca_uncorrected<.
    * combine into a single 0-3 group variable (rowtotal treats missing as 0)
    egen float hyp_cal_blood_0 = rowtotal(hyp_cal_blood_0_0 hyp_cal_blood_0_1 hyp_cal_blood_0_2 hyp_cal_blood_0_3) if blood_0_ca_uncorrected<.
    drop hyp_cal_blood_0_0 hyp_cal_blood_0_1 hyp_cal_blood_0_2 hyp_cal_blood_0_3

    Consequently, an addition/subtraction can lead to an observation ending up in the wrong group - hence why this is important.

    Question:
    Is there a way to avoid the addition/subtraction?
    Alternatively - can I force Stata to round the value back to two decimal places?

    Version
    I use Stata 16.1

    Kind regards,
    Mikael

  • #2
    Try defining your numeric variables as "double" type.

    Either import them as doubles when you import,

    or when you destring, force not float but double.
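
    For example (a sketch, not run on your file; if I recall correctly, asdouble is the import delimited option that imports floating-point data as doubles, destring stores converted variables as double unless you specify the float option, and "mydata.csv" is a placeholder filename):

    Code:
    * import, keeping floating-point columns as doubles
    import delimited using "mydata.csv", asdouble clear

    * or, after importing with stringcols(), convert without forcing float
    destring blood_0_ca_uncorrected cacl_peak, replace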



    • #3
      Just as there is no way to exactly represent 1/3 as a finite decimal, there is no way to represent numbers like 1.32 exactly in binary. In your -gen- statements, you have not specified a storage type, so you get Stata's default, which is float. A float has 4 bytes, which is good for approximately 7 digits of accuracy. After that, Stata has to round (in binary). When you then display those numbers, you will see those effects as very small rounding errors. They also crop up in programming as problems because for example -if x == 1.32- will not work as you expect! It is, in principle, impossible to get exact representation of non-integer numbers in binary with finite storage, unless the denominator of a fraction representation of the number is a power of 2.
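
      To see this with the example value from #1 (a quick sketch; the output in the comment is approximately what the float nearest to 0.95 looks like at full precision):

      Code:
      display %20.18f float(0.95)
      * shows something like 0.949999988079071045 rather than 0.95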

      You can't really force Stata to round back to two decimal places, because, again, Stata does not work in decimal, and two decimal places (i.e., accuracy of 0.01) has no exact equivalent in binary. You can apply the -round()- function, but you will still encounter discrepancies. What you can do exactly is round to multiples of 0.0078125 (= 1/128), which might work for your purposes.
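
      For instance (a sketch; round(x, y) rounds x to the nearest multiple of y):

      Code:
      gen double ca_01  = round(blood_0_ca_uncorrected, 0.01)      // still inexact: 0.01 has no exact binary form
      gen double ca_128 = round(blood_0_ca_uncorrected, 0.0078125) // exact: 1/128 is a power of 2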

      A more common approach is to get a little bit closer precision by using -double- instead of -float-. That requires you to explicitly request -double- storage in your -gen- commands. That is, -gen double x = whatever-. You will still get rounding errors, but they will be at 16 digits rather than 7.
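
      Applied to the command from #1, that would be:

      Code:
      * request double storage explicitly instead of the float default
      gen double hyp_cal_blood_0_num = real(blood_0_ca_uncorrected)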

      Another thing that should be done routinely when comparing variables that were stored as float to other such variables or to constants is:

      Code:
      some command if float(var1) == float(var2)
      some command if float(var1) <= float(some_constant)
      etc.
      Finally, I will just add that there are very, very few situations in real life where you actually need your numbers to be accurate to more than 7 digits. In fact, real world data almost always includes much larger errors in ascertainment, measurement, etc.

      Added: Crossed with #2, which I seem to be doing a lot today!

      Also added: I should also add that if you had not used the -force- option in your -destring- command, Stata would have notified you that you were asking for a conversion that cannot be performed without loss of precision. By using the -force- option you shut Stata up, but you swept the problem under the rug only to have it trip you up later. -force- options are always dangerous and should only be used in very rare, carefully thought out, circumstances. Usually if you need -force- to get something done, it's something that you will later regret doing.

      Further addition: Another approach you can use is to store 100 times the value as an integer variable. That is, instead of 1.32, use 132, etc. Then you can do exact comparisons and they will work as you expect. If you need the actual unmultiplied value itself, you can always divide it by 100 (and live with rounding in the 7th decimal place) in the course of calculation.
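
      A sketch of that approach with the cutoffs from #1 (round() protects against the stored value being fractionally off before conversion; inrange() is Stata's built-in range test):

      Code:
      gen int ca100 = round(100*blood_0_ca_uncorrected)   // 1.32 becomes exactly 132
      gen hyp_cal_blood_0 = 0 if ca100 <= 132
      replace hyp_cal_blood_0 = 1 if inrange(ca100, 133, 146)
      replace hyp_cal_blood_0 = 2 if inrange(ca100, 147, 200)
      replace hyp_cal_blood_0 = 3 if ca100 >= 201 & ca100 < .
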
      Last edited by Clyde Schechter; 06 Apr 2021, 15:08.



      • #4
        Thank you Joro and Clyde for your thorough answers.
        Generating them as double worked.
        I will not say that I understand why it doesn't just display 1.3199999999999999 if there is still a rounding error at 16 digits with double - but I will have to go into depth with that some other day.



        • #5
          Originally posted by Mikael Vallentin View Post
          I will have to go into depth with that some other day.
          Maybe this will help?

          https://blog.stata.com/2012/04/02/th...-to-precision/
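
          (As a quick illustration of the point in that blog post: the double closest to 1.32 differs from 1.32 only around the 16th or 17th significant digit, which the default display formats never show. The output in the comment is approximate:)

          Code:
          display %21.18f 1.32
          * shows something like 1.320000000000000062
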
          David Radwin
          Senior Researcher, California Competes
          californiacompetes.org
          Pronouns: He/Him



          • #6
            For one to understand properly what Clyde is saying, one has to have a proper understanding of binary numbers, because that is all computers know. The decimals computers show us are just a display; behind this decimal display there still lies a binary number that is stored and operated on.

            I myself am not an expert on binary numbers, which is why I did not even venture to construct some layman's version of the thorough explanation that Clyde provided.

            Practically I know that setting the storage type as double resolves the problem you encounter. I just keep this fact as a black box in my head, and I do not look too deeply into it :P.


            Originally posted by Mikael Vallentin View Post
            Thank you Joro and Clyde for your thorough answers.
            Generating them as double worked.
            I will not say that I understand why it doesn't just display 1.3199999999999999 if there is still a rounding error at 16 digits with double - but I will have to go into depth with that some other day.

