STATA handling of missing values is in direct violation of IEEE 754

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#16

06 Jul 2022, 20:01

Yes, but I was referring to the scope of this discussion. We may of course still disagree on this point. I have no interest in debating the merits of what is a foregone conclusion regarding the implementation of missing values. The choice has been made and we are left to deal with those consequences.
1 like
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#17

06 Jul 2022, 21:01

Isn't there some value in an honest accounting of the consequences of the implementation of missing values in Stata, regardless of whether or not we can change them? Stata has a great deal else to offer outside of whether or not it handles missing values in the best possible way.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#18

06 Jul 2022, 21:03

Leonardo Guizzetti maybe, just maybe, if enough of us post on "Wishlist for Stata 18" or write a petition of sorts, we could beseech Alan Riley and the other directors of StataCorp to implement trinary logic for this purpose.

Of course I'm kidding, I sincerely hope they don't do this.

As I sort of said above, I just look at it as convention. I'm sure if I sat down (likely rereading this thread) and read through the arguments for or against, I'd likely have a stronger view. To me though, as you said, it doesn't really matter. It would be a little like asking why we can't have

Code:

reg y x, robust, notab

be a legal command. I mean I guess, maybe, if we REALLY wanted to, we could write commands to allow for double commas, but Stata commands don't work like that, and it's just sorta a convention we learn to live with.

I don't know if you know any Python, but when I started learning it after Stata implemented it, I was dumbfounded at how you have to tab inside Python loops. Stata doesn't care at all, you can write loops however you'd like so long as it's all within two braces, but Python does care about stuff like this.

Eventually, I told myself "I may think it's dumb, but it's Python's language, and just as we speak French while in France, we speak Latin whilst in Rome."
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#19

06 Jul 2022, 21:17

Isn't there some value in an honest accounting of the consequences of the implementation of missing values in Stata, regardless of whether or not we can change them?

I certainly think so.

Stata has a great deal else to offer outside of whether or not it handles missing values in the best possible way.

I also agree. But I guess that's the point: the missing data issue to me is pretty academic. And we're all academics here, so I guess that's what we're meant to do in some sense, but in my opinion anyway, OP sort of took this tone of "I wanna speak to your manager" about it. I've seen others hold similar views on things like Stata vs R's graphics, for example. Daniel Schaefer
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#20

06 Jul 2022, 21:19

As an aside, Jared Greathouse, regarding #4: why do you suppose it isn't January 1, 1970, the Unix epoch? If all you have to do is pick an arbitrary number, why not take advantage of a standard?
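
For anyone who wants the arithmetic behind that question, here is a rough Python sketch. It assumes the origin discussed in #4 is Stata's daily-date origin of 01jan1960 (that post is not shown here), and the names STATA_ORIGIN and days_since are made up for illustration.

```python
from datetime import date

# Assumption: the origin discussed in #4 is Stata's daily-date origin, 01jan1960.
STATA_ORIGIN = date(1960, 1, 1)
UNIX_EPOCH = date(1970, 1, 1)

def days_since(origin, day):
    """Number of days from origin to day (negative if day is earlier)."""
    return (day - origin).days

# The Unix epoch expressed as a day count from 01jan1960.
offset = days_since(STATA_ORIGIN, UNIX_EPOCH)
print(offset)  # 3653

# Converting between the two conventions is a constant shift.
some_day = date(2022, 7, 6)
assert days_since(STATA_ORIGIN, some_day) - days_since(UNIX_EPOCH, some_day) == offset
```

Either origin works as long as everyone agrees on it, which is the sense in which the choice is arbitrary.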

I don't know if you know any Python, but when I started learning it after Stata implemented it, I was dumbfounded at how you have to tab inside Python loops. Stata doesn't care at all, you can write loops however you'd like so long as it's all within two braces, but Python does care about stuff like this.

The writers of the Python interpreter consider it advantageous to use white space characters to break up code. I'm not sure I agree. Of course, that doesn't make it a bad language. The Stata devs do the same, for what it's worth: the newline character ends a line of code. I've been using Python much longer than Stata, and I have a lot of respect for the language. But I also think class encapsulation and interface design are good things, and Python doesn't do those things well.

Last edited by Daniel Schaefer; 06 Jul 2022, 21:28.
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#21

06 Jul 2022, 21:34

There are plenty of ways to patch up design choices, so I disagree with the pessimistic conclusion of Leonardo in #16; we should have interest in debating foregone conclusions because foregone conclusions can be patched up. For this discussion Leonardo in #14 himself gives an example, the cond() function. Clyde in #3 gives an example with the inrange() function.

I myself use the missing() function a lot. In fact the missing() function is how I dodge the bullet of the (unnatural to me at least) choice in Stata that missings are plus infinity. I get ambushed by this choice only when I am lazy and I do not employ the missing() function enough in my logical statements.
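
To make the ambush concrete, here is a minimal Python sketch of the behaviour described above. It is only a model, not Stata itself: MISSING, as_ordered, gt and gt_guarded are hypothetical names, and the one assumption baked in is that a missing value compares as larger than every number.

```python
import math

MISSING = None  # hypothetical stand-in for Stata's system missing value (.)

def missing(x):
    """Mimic the role of missing(): true when the value is missing."""
    return x is None

def as_ordered(x):
    """Map a value to a float that sorts the way the post describes:
    missing is treated as larger than every number."""
    return math.inf if missing(x) else float(x)

def gt(x, threshold):
    """Unguarded comparison, in the spirit of `foo > 42`."""
    return as_ordered(x) > threshold

def gt_guarded(x, threshold):
    """Guarded comparison, in the spirit of `foo > 42 & !missing(foo)`."""
    return not missing(x) and as_ordered(x) > threshold

foo = [10, 50, MISSING, 42, 99]

unguarded = [x for x in foo if gt(x, 42)]
guarded = [x for x in foo if gt_guarded(x, 42)]

print(unguarded)  # [50, None, 99]  (the missing value slips through)
print(guarded)    # [50, 99]
```

The guard costs one extra term in the condition, which is the habit described above.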

So the constructive outcome of such whiny discussions could very well be the introduction of a new logical() function by StataCorp, which does all that whiny users want it to do.

And of course William makes the awesome point that life in general, and statistical/econometric practice in particular, are a mess. So despite all my whining, I do appreciate that StataCorp could have done much worse with missing values; e.g., they could have set missing values equal to 666, or -666... Imagine then the horror!
1 like
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#22

06 Jul 2022, 21:38

the Unix Epoch

I must be honest, I don't know what this term means. I'll have to look it up!

I do agree about Python though, I do sort of like the structure it forces users to abide by. Unlike Stata, you can't go nuts with tabs and spaces and brackets (not that these are relevant in Python), so the Pythonic nature of things does impose a certain structure that I do like.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#23

06 Jul 2022, 21:52

The Python syntax was inspired by the way professionals write code in other languages. Tabs are often used to denote sections or "blocks" of code in any language. In many languages (including C) blocks of code are surrounded by curly brackets. This allows the compiler to "see" the structure of the code and implement that structure in assembly. The Python interpreter uses the tab structure programmers were already using to make their code more readable, and of course to denote code blocks. Consider that the interpreter is essentially reading your code character by character, as close to linearly (I mean it avoids jumping backward) as possible. When it sees a tab character on a new line it knows the previous code block is continuing. The absence of a tab on the next line (after scanning the newline character) means the previous block has ended.

You should write all your code blocks like Python code. I prefer 4 spaces to a tab character in most languages, but it's arbitrary.
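
If you want to see this for yourself, the standard library's tokenize module will show where the tokenizer decides a block starts and ends. This is only a small sketch (the helper block_markers is a made-up name); it prints the INDENT and DEDENT tokens, which play the role that curly brackets play in C.

```python
import io
import tokenize

# A tiny program: one loop with an indented body, then a line back at the left margin.
SOURCE = (
    "for i in range(2):\n"
    "    print(i)\n"
    "print('done')\n"
)

def block_markers(source):
    """Return (token name, line number) for each INDENT/DEDENT the tokenizer emits."""
    markers = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.INDENT, tokenize.DEDENT):
            markers.append((tokenize.tok_name[tok.type], tok.start[0]))
    return markers

print(block_markers(SOURCE))  # [('INDENT', 2), ('DEDENT', 3)]
```

The block opens where the indentation increases (line 2) and closes at the first line that returns to the outer level (line 3). Strictly speaking the tokenizer compares indentation levels rather than looking for a tab as such, so spaces work the same way.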

Last edited by Daniel Schaefer; 06 Jul 2022, 21:55.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#24

06 Jul 2022, 22:03

Thanks Joro Kolev. As a compulsive whiner myself, I really do appreciate how bad it would be if they set the value of missing to 666.
2 likes
Comment
daniel klein

Join Date: Mar 2014

Posts: 3862
#25

07 Jul 2022, 01:37

Suppose StataCorp had implemented the IEEE 754 standard. Would

Code:

keep if foo > 42
drop if foo < 42

produce results that were more convenient or more consistent (except for conforming to IEEE standards, which StataCorp never claimed to do)?

Similarly, what would three-valued logic do here?
2 likes
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 785
#26

07 Jul 2022, 07:44

Changing to a three-valued logic might make some comparisons more predictable but will introduce inconsistencies elsewhere; that is, you would have to remember several rules for how missing values were handled in different situations instead of just one rule.

https://www.stata.com/support/faqs/d...issing-values/
2 likes
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#27

07 Jul 2022, 11:27

daniel klein

Would [the above] produce results that were more convenient or more consistent (except for conforming to IEEE standards, which StataCorp never claimed to do)?

Based on OP's description of the standard it looks like IEEE would drop missing values in the first case and in the second case. If I'm not mistaken, Stata will keep missing values in the first case and in the second. I'm not convinced either is more consistent, but I'm open to the idea that one might be.

Similarly, what would three-valued logic do here?

If foo equals missing, the logical expression will evaluate to missing. If I were implementing the language, I would want to allow the writer of -keep- and -drop- to determine what happens when an expression evaluates to missing. If I were the author of -keep- and -drop-, I would probably keep or drop values only when the expression evaluates to True - so I would drop missing values in the first line and keep them in the second.

As to which is best, in this case it really does feel like a stylistic choice. I'll admit my bias here and say that I still prefer the ternary logic. If I say "keep if foo > 42" I mean only keep values of foo greater than 42. Since missing isn't a number I don't keep it. If I say "drop if foo < 42" I am now talking explicitly about numbers less than 42, but missing isn't a number less than 42; it's not a number at all, so keep it.
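
Here is a rough Python sketch for comparing the three rules on the two commands, treating each command on its own rather than running them in sequence. Everything in it is a model under stated assumptions: NaN stands in for a missing observation, "Stata-like" just means missing sorts above every number, and the ternary rule uses the policy described above (act only when the expression is True). The function names are made up.

```python
import math

MISSING = float("nan")  # stand-in for a missing observation

def is_missing(x):
    return math.isnan(x)

# Each rule returns True, False, or (ternary only) None for "missing".
def ieee(op, x, t):
    # IEEE 754: ordered comparisons involving NaN are False.
    return x > t if op == ">" else x < t

def stata_like(op, x, t):
    # Assumption from this thread: missing sorts above every number.
    v = math.inf if is_missing(x) else x
    return v > t if op == ">" else v < t

def ternary(op, x, t):
    # Three-valued logic: a comparison with missing evaluates to missing.
    if is_missing(x):
        return None
    return x > t if op == ">" else x < t

def keep_if(rule, op, data, t):
    """Keep rows only where the condition is exactly True."""
    return [x for x in data if rule(op, x, t) is True]

def drop_if(rule, op, data, t):
    """Drop rows only where the condition is exactly True."""
    return [x for x in data if rule(op, x, t) is not True]

def has_missing(rows):
    return any(is_missing(x) for x in rows)

foo = [10.0, 42.0, 99.0, MISSING]

results = {}
for name, rule in [("IEEE 754", ieee), ("Stata-like", stata_like), ("ternary", ternary)]:
    kept = keep_if(rule, ">", foo, 42)        # keep if foo > 42
    remaining = drop_if(rule, "<", foo, 42)   # drop if foo < 42
    results[name] = (has_missing(kept), has_missing(remaining))
    print(name, "| missing survives keep:", results[name][0],
          "| missing survives drop:", results[name][1])
```

In this model the only difference between the rules shows up in the keep line; all three leave the missing row in place for the drop line when each command is run alone.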

Last edited by Daniel Schaefer; 07 Jul 2022, 12:05.
1 like
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4423
#28

07 Jul 2022, 18:18

Originally posted by Daniel Schaefer

If I were implementing the language, I would want to allow the writer of -keep- and -drop- to determine what happens when an expression evaluates to missing.

I think that that potential is what clinched the decision at StataCorp to favor determinacy.
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 814
#29

07 Jul 2022, 19:36

Originally posted by Joseph Coveney

I think that that potential is what clinched the decision at StataCorp to favor determinacy.

You honestly don't think the author of a command should determine how that command behaves? I have to admit, I did not think this point would be controversial. An author of a command can already decide what happens when a logical expression evaluates to True or False, so why is it a problem when the author of a command decides what happens when a logical expression evaluates to True, False, or Missing?

Why do you suppose this kind of "nondeterminacy" is a problem?
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4423
#30

07 Jul 2022, 20:59

Originally posted by Daniel Schaefer

Why do you suppose this kind of "nondeterminacy" is a problem?

I think that the consideration was as pointed out in #26, that is, the mnemonics are more in hand.
0 is False. Everything else (everything else) is True.
You might not like that, but at least it's easy to know where things stand, and that's an important consideration in control flow.

There are rare exceptions (e.g., inrange()) as others have mentioned above, but I think that the thinking was that it's easier to learn to append if !missing(varname) to a line of code than to learn ad hoc rules for when missing evaluates to True and when missing evaluates to False for each author and each command.
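
As a quick illustration of that single rule, here is a small Python sketch. It is a model only: MISSING, is_true and missing are hypothetical names, and the assumption is simply that a missing value is "not 0" and therefore counts as True.

```python
MISSING = None  # hypothetical stand-in for a Stata missing value

def is_true(value):
    """The rule above: 0 is False, everything else is True.
    Missing is not 0, so it counts as True here."""
    return value != 0

def missing(value):
    return value is None

flag = [0, 1, -3, 2.5, MISSING]

# Plain condition: every nonzero value passes, including missing.
selected = [v for v in flag if is_true(v)]

# Same condition with the guard appended, in the spirit of `if !missing(varname)`.
selected_guarded = [v for v in flag if is_true(v) and not missing(v)]

print(selected)          # [1, -3, 2.5, None]
print(selected_guarded)  # [1, -3, 2.5]
```

One rule plus one guard covers every case, which is the mnemonic advantage being argued for.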
Comment
