  • STATA handling of missing values is in direct violation of IEEE 754

    I typed `drop if var>50`, and STATA dropped, without me realizing it, rows with missing `var`.

    In fact, "Stata codes missing values (., .a, .b, .c, ..., .z) larger than any nonmissing values, so, literally, x >1000 is true" (source).

    This behaviour is in direct violation of IEEE 754.

    The standard provides that comparing NaN with anything else ALWAYS RETURNS FALSE (and not true, as STATA is doing).

    As specified in chapter 5.11 of the standard, a NaN always compares unordered with everything:

    "unordered arises when at least one operand is a NaN. Every NaN shall compare unordered with everything, including itself"

    In fact:

    "When NaNs are present, operands have the unordered relation, so trichotomy does not apply."

    Moreover, if the operator is signaling, an error must be thrown when one of the operands is a NaN:

    "The Signaling predicates in Table 5.1 signal the invalid operation exception on quiet NaN operands to warn of potential incorrect behavior of programs written assuming trichotomy"
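
    The behaviour described in this post can be reproduced with a minimal sketch. The variable name `var` and the four toy observations are assumptions made for illustration only. It contrasts the bare comparison with one that also requires a nonmissing value:

    ```stata
    * Toy data (hypothetical): two ordinary values, one system missing, one extended missing
    clear
    input var
    10
    60
    .
    .a
    end

    * Missing values rank above every nonmissing number, so both of them satisfy var > 50
    count if var > 50
    assert r(N) == 3

    * Requiring var < . (system missing is the smallest missing value) keeps only real numbers
    count if var > 50 & var < .
    assert r(N) == 1
    ```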

  • #2
    Overall, I'd prefer to have my statistical analysis tools designed by statisticians and my hardware designed by engineers. As the linked Wikipedia article tells us, IEEE 754 is used by hardware floating point implementations - in your computer's CPU for example. How a specific application uses those hardware features is the decision of the application.

    • #3
      Also, the Stata documentation, which comes installed as PDFs with Stata, makes it quite clear that it handles missing values as greater than any real number (and gives the ordering among the different missing values themselves). With any software package, it is the responsibility of the user to familiarize him/herself with the basic functionality of the software before using it.

      While I happen to agree with the original poster that it would have been better to have missing value comparisons to everything return false (or to use a 3-valued logic), that is simply not what Stata does, and Stata users quickly grow accustomed to this and adapt to it. I also note that there is one situation where Stata does treat missing values the way O.P. wants: the -inrange()- function. So if he feels very strongly about this, he can use that instead of the usual inequality operators.
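
      As a rough sketch of the -inrange()- alternative, assuming a numeric variable named `var` (hypothetical) and the documented rule that -inrange(z,a,b)- returns 0 whenever z is missing and treats a missing upper bound as "no upper limit". Note that the lower bound is inclusive, so this corresponds to var>=50 rather than var>50:

      ```stata
      clear
      input var
      10
      50
      60
      .
      end

      * inrange() is 0 for missing var; the missing upper bound means "no upper limit"
      count if inrange(var, 50, .)
      assert r(N) == 2

      * The plain inequality also counts the missing observation
      count if var >= 50
      assert r(N) == 3
      ```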

      • #4
        This behaviour is in direct violation of IEEE 754.
        So what? Why is that bad? This is just an axiom Stata abides by.

        I oftentimes wondered why the default date for everything was "x relative to January 1, 1960". But eventually, I didn't care. Why not? Well, computer software, I presume, needs to define these things in some relative time period. So they could've chosen Jan 1 1900, or Jan 6 2021, but either way, there needs to be a convention around these things, and the one StataCorp chose likely for practical reasons is treating missing values as more than real numbers. Anyways, why not just do
        Code:
        drop if x>50 & !mi(x)

        • #5
          Another way of thinking about Stata's treatment of missing values is by comparison with the original techniques for representing missing values in survey data - coding them as 9 or 99/98/97 or -9/-8/-7, for example, to represent different reasons the data was missing.

          Stata, for any data storage type, codes missing values as the 27 largest positive integer values that can be stored exactly in that storage type, and - because these values vary by storage type - assigns "names" to those 27 values so that your code for missing values does not depend on the storage type. Nor does it depend on the range of your data, as the convention of storing missing values as 999 for three-digit numeric values or 9999 for four-digit numeric values, etc. does. And then Stata takes care of a large part of the effort of handling missing values in numerical calculation.
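
          A quick way to see this ordering, offered only as an illustrative sketch: each line below should display 1 (true) under the documented sort order, in which every nonmissing number is below `.`, and `.` is below `.a` through `.z`.

          ```stata
          * Any nonmissing number is smaller than system missing
          display (1e300 < .)
          * System missing is the smallest missing value; .z is the largest
          display (. < .a)
          display (.a < .z)
          * missing() is true for system and extended missing values alike
          display (missing(.) & missing(.q))
          ```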

          I note in passing that IEEE 754 is a standard for floating point arithmetic, which does not provide guidance for the fixed point values that Stata is also capable of working with, nor for missing values in character values.

          • #6
            So what? Why is that bad? This is just an axiom Stata abides by.
            Suppose that you buy a car, and the manufacturer decides that the car is powered by a thousand hamsters that spin a thousand wheels inside of it.

            Why is that bad?

            Because there are no hamster stations along the way.

            There are gas stations, but not hamster stations.

            And that's because we, as a society, have decided that cars are powered by gas. Because that's the energy source that makes more sense.

            I oftentimes wondered why the default date for everything was "x relative to January 1, 1960"
            If you saw that as an end-user, you can report it as a bug.
            Under ISO 8601, datetimes are represented this way: YYYY-MM-DDTHH:MM:SS.
            Unix epoch is just a way to represent this inside the memory of a PC.
            You don't need to care about that.

            • #7
              Apparently, both Stata and IEEE 754 got started in 1985. I assume that at first the programmers were not aware of this guideline and now it's just too inconvenient to change it (as it would basically break all ados).
              Best wishes

              (Stata 16.1 MP)

              • #8
                Originally posted by Giuseppe Polito
                I typed `drop if var>50`, and STATA dropped, without me realizing it, rows with missing `var`.

                In fact, "Stata codes missing values (., .a, .b, .c, ..., .z) larger than any nonmissing values, so, literally, x >1000 is true".

                This behaviour is in direct violation of IEEE 754.

                The standard provides that comparing NaN with anything else ALWAYS RETURNS FALSE (and not true, as STATA is doing).

                As specified in chapter 5.11 of the standard, a NaN always compares unordered with everything:
                I recall an exchange years ago on the predecessor to this list that the behavior of Stata with respect to missing values was a conscious decision in the beginning when software design choices were being made for Stata. Apparently, it was hotly debated even within the fledgling company, with many on the staff arguing for three-value (triple-state) logic, but the advantage of straightforward determinacy (every Boolean outcome was either true or false) won out. So, yes, maybe quantitative comparisons with Stata's missing values ought to have been declared always-false or unordered, but that design choice, too, can give rise to programming mistakes in the inexperienced (see SQL).

                Also, although invalid floating-point operations in Stata can give rise to system missing as the result, I think that Stata's missing values were really never intended to be identical to IEEE's NaN. As William mentions, distinct missing values can be assigned to circumstances where the response is unavailable for different reasons (refused, not in universe etc.); I'm guessing that the only reasonably facile way for these to be distinguished (and made usable) in a statistical dataset is to incorporate them as numerical (ordered) values. SAS orders them at the most negative; Stata does the opposite, again as a deliberate design decision at the software's inception.
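
                To make the idea of distinct missing codes concrete, here is a minimal sketch. The variable `income`, the survey codes -9 (refused) and -8 (not in universe), and the label name `inc_miss` are all assumptions invented for the example, not taken from any real dataset:

                ```stata
                clear
                input income
                42000
                -9
                -8
                61000
                end

                * Recode the numeric survey codes to extended missing values
                mvdecode income, mv(-9=.a \ -8=.b)

                * Attach labels so the reason for missingness stays visible
                label define inc_miss .a "refused" .b "not in universe"
                label values income inc_miss
                tabulate income, missing

                * Both codes are ordered above every real income, so guard inequalities
                count if income > 50000 & !missing(income)
                assert r(N) == 1
                ```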

                Originally posted by Giuseppe Polito
                I oftentimes wondered why the default date for everything was "x relative to January 1, 1960"

                If you saw that as an end-user, you can report it as a bug.
                Under ISO 8601, datetimes are represented this way: YYYY-MM-DDTHH:MM:SS.
                Good luck with that; maybe you'll have better results than I have had so far.

                • #9
                  I should note that special missing values (.a through .z) were added later, but the same consideration applies to the system missing in that it had to be parked somewhere, and Stata chose it to be the most positive value of each numerical datatype. (Stata's byte numerical datatype is unsigned and so it would have been a real pain to adopt SAS's convention.)

                  • #10
                    I agree with OP and Clyde that, from a convenience point of view, this was an unfortunate choice.

                    And I would dispute the claim by Clyde that "Stata users quickly grow accustomed to this and adapt to it."

                    I have been using Stata for 22 years now, and I still get occasionally ambushed by this Stata feature. It is something that is not natural, and you always have to think about it when you write logical expressions.

                    • #11
                      Let me try to ask a more fundamental question: Why should I care what a random group says about how Stata handles missing values?
                      You don't need to care about that.
                      You write about IEEE 754 as though their word should be Holy Writ, and that not following their consecrated encyclicals is an act of sacrilege that no self-respecting researcher would dare commit, lest their research be defiled. I'm partly joking here of course, but my original question for you was basically "StataCorp disagrees with IEEE on a matter." Why is that bad?

                      • #12
                        I still get occasionally ambushed by this Stata feature. It is something that is not natural, ...
                        Missing values themselves are not "natural", in the sense that the bulk of mathematical instruction is based on two-valued logic, and the bulk of statistical instruction is based on unrealistic data designed to teach statistical procedures but not statistical practice.

                        Why should I care what a random group says about how Stata handles missing values?
                        I believe that the analogy in post #6 was meant to address this question, and the response - in the setting of the analogy - was

                        because we, as a society, have decided that cars are powered by gas [rather than squirrels in cages]. Because that's the energy source that makes more sense.
                        Now, two responses.
                        • IEEE is a small society, and my guess is that most of the "we" reading this are not members, so the analogy breaks down there.
                        • Apparently cars powered by electricity are not a thing.
                        Last edited by William Lisowski; 06 Jul 2022, 13:29.

                        • #13
                          Setting aside the OP's unfortunate tone, the IEEE standard is often sensible and convenient but not ideal for statistical software. I also don't care for the way Stata handles this, but the IEEE standard is no better.

                          The standard provides that comparing NaN with anything else ALWAYS RETURNS FALSE (and not true, as STATA is doing).
                          This is not actually better behavior. Say you compare V1 to some arbitrary constant X, and you want to generate another variable V2 that equals 0 if V1 is less than X and 1 otherwise (that is, V2 = V1 >= X). For an observation where V1 is missing, Stata would set V2 to 1 and IEEE comparison rules would set it to 0. So what should the software do if a value of V1 is missing? Ideally the corresponding value for the same observation in V2 should be missing as well. While perhaps not natural, there are other languages that implement this three (or more) valued behavior for logical operations. Some do it well (like R) and some do it poorly (like JavaScript).
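
                          One common way to get the missing-in, missing-out result in Stata is to restrict the assignment with an -if- qualifier. A minimal sketch, with the toy data and a cutoff of 50 standing in for X purely as assumptions:

                          ```stata
                          clear
                          input V1
                          10
                          60
                          .
                          end

                          * Unguarded: the missing V1 compares as larger than 50, so V2 becomes 1
                          generate V2_plain = (V1 >= 50)

                          * Guarded: observations with missing V1 are skipped and V2 stays missing
                          generate V2_guard = (V1 >= 50) if !missing(V1)

                          list V1 V2_plain V2_guard
                          assert V2_plain == 1 & missing(V2_guard) if missing(V1)
                          ```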

                          • #14
                            Those yearning for a trinary logic can still achieve it using the cond() function, but that is neither better nor worse, and only stylistically different. What should matter more is that whatever design choices are made, their behaviour is consistent across the software/language.
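
                            A minimal sketch of what that might look like, assuming a toy variable V1 and a cutoff of 50 (both hypothetical). The nested -cond()- returns missing when V1 is missing and the 0/1 comparison otherwise:

                            ```stata
                            clear
                            input V1
                            10
                            60
                            .
                            end

                            * Outer cond(): missing in, missing out; inner cond(): ordinary two-way split
                            generate V2 = cond(missing(V1), ., cond(V1 < 50, 0, 1))

                            assert V2 == 0 in 1
                            assert V2 == 1 in 2
                            assert missing(V2) in 3
                            ```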

                            • #15
                              I respectfully disagree. Of course in general it does matter that a design choice is consistent across a piece of software or a programming language. But consistency isn't the only thing that matters; some design choices are meaningfully better than others.
                              Last edited by Daniel Schaefer; 06 Jul 2022, 19:41.
