Bug in -mvencode-

daniel klein

Join Date: Mar 2014

Posts: 3885
#1

06 Jul 2023, 05:49

The help file for mvencode states that

Without this option [ override ], mvencode refuses to make the requested change if any of the numeric values are already used in the data.

There are two problems with this.

First, the meaning of "data" might be misleading here. mvencode only checks whether the values are observed in the specified varlist -- not in the (entire) data(set). This should be clearly documented.

Second, mvencode fails to detect (some) non-integers stored in float precision. Here is a reproducible example that should result in an error but does not:

Code:

sysuse auto
generate float x = 1.2
replace x = . in 1
mvencode x , mv(.=1.2)
assert x == float(1.2)
Tags: bug, mvencode, precision

Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

06 Jul 2023, 09:19

Well, yes and no. This is really a precision issue. You are storing x as a float, so it has only float precision. But in the -mvencode- command, the 1.2 you specify is actually part of an expression and is, in effect, double precision. And:

Code:

. assert 1.2 == float(1.2)
assertion is false
r(9);

So, the value you are specifying, namely the double-precision binary approximation to 1.2, is not actually in the variable. I think this is just another of the many confusions that can arise from attempting to rely on exact equality of floating-point numbers.

As I'm sure you've noticed, if you generate x as a -double-, you don't get this paradoxical result.
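The float/double contrast described above can be checked directly. This is a minimal sketch, not taken from the thread: the variable names xf and xd are made up for illustration, and it only demonstrates the precision mismatch, not mvencode itself.

```stata
* Sketch only: compare float and double storage of the literal 1.2
clear
set obs 2
generate float xf = 1.2
generate double xd = 1.2

* The literal 1.2 is evaluated in double precision,
* so it does not match the float-stored value
count if xf == 1.2

* Rounding the literal to float precision makes both sides agree
count if xf == float(1.2)

* A double variable holds exactly the same double as the literal
count if xd == 1.2

* Hexadecimal display shows that the two bit patterns differ
display %21x 1.2
display %21x float(1.2)
```

The first count should come back as 0 and the other two as 2, which is the same mismatch that the assert above reports.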

All of that said, it might be sensible for StataCorp to update the documentation to clarify the meaning of "if any of the numeric values are already used."
daniel klein

Join Date: Mar 2014

Posts: 3885
#3

06 Jul 2023, 12:54

To me, this is still a clear bug.

Precision issues explain the behavior but the whole purpose of mvencode is to

[...] be certain that your coding is unique ([D] mvencode, p. 3)

In my example, I cannot be certain that the coding is unique.

There are two ways to resolve the bug. The manual explains that

mvencode will automatically recast variables upward, if necessary ([D] mvencode, p. 3)

Thus, mvencode might recast float variables to double if the replacement values cannot be stored with sufficient precision as a float (technically: if # != float(#)). This would be somewhat consistent with the documentation. It would also be consistent with how bytes are changed to int or long (or whatever is necessary to hold the replacement value). However, most users would be irritated by, say,

Code:

. tabulate x

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
        1.2 |          1       50.00       50.00
        1.2 |          1       50.00      100.00
------------+-----------------------------------
      Total |          2      100.00

The second way to fix the bug is to compare values in float precision. But this would be somewhat inconsistent with the documentation and would raise further questions: What if some variables are stored in float precision and others in double precision? Should the comparison in float precision be applied only to variables stored as float? This is how recode handles the issue. But then the already imprecise statement that I complained about in #1 would be even more confusing.

At the very least, there should be a warning message -- although, by the time I read that, my values are already replaced and I will no longer be able to distinguish (formerly) missing values from valid ones.
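Until the behavior changes, one defensive habit is to record which observations were missing before recoding, so that they stay distinguishable afterwards. This is only a sketch of that idea: x_was_missing is a hypothetical helper variable and not something mvencode creates, and capture noisily is used because whether mvencode goes through or refuses here may depend on the Stata version.

```stata
* Sketch only: keep a record of the missing values before recoding
sysuse auto, clear
generate float x = 1.2
replace x = . in 1

* x_was_missing is a hypothetical helper variable (an assumption)
generate byte x_was_missing = missing(x)

* capture noisily lets the do-file continue whether or not mvencode refuses
capture noisily mvencode x, mv(.=1.2)

* The formerly missing observation is still identifiable either way
count if x_was_missing
list make x x_was_missing if x_was_missing
```

The indicator costs one byte per observation and can be dropped once the coding has been verified.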

Edit:

I have created the tabulate example as

Code:

clear
set ob 2
generate float x = 1.2
recast double x
replace x = 1.2 in 2

Last edited by daniel klein; 06 Jul 2023, 13:05. Reason: rephrased; added Edit
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#4

06 Jul 2023, 13:57

In light of the quote from the documentation "be certain that your coding is unique," you are right: this is a bug. The command does not perform as called for by its documentation.

FWIW, I don't think the warning after the fact would be a good solution: your data is already irreversibly mangled. I think -recast-ing upwards is a better idea. The -tabulate- example wouldn't bother me: I would immediately see it as a precision issue. But admittedly, many users would be completely flummoxed by this.

I think perhaps the best solution would be more complicated than either of the options you show. I would have the code check whether the proposed numeric code for the missing value is equal to any existing value of the data when both are reduced to float precision. It would also have an option, -floatmatchok-, that would cause Stata to disregard this near-match and proceed with the mvencoding. If that option is not specified, then the data would not be changed, and an informative error message would be given.
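The float-precision check proposed here can be approximated by hand in a do-file. What follows is a rough sketch under stated assumptions, not StataCorp's implementation: the local macros code and clash are made up for illustration, and the -floatmatchok- option is not modeled.

```stata
* Sketch only: refuse to recode if the code already occurs at float precision
sysuse auto, clear
generate float x = 1.2
replace x = . in 1

* code and clash are hypothetical local macros (assumptions)
local code 1.2
local clash 0

foreach v of varlist x {
    * Reduce both sides to float precision before comparing
    quietly count if float(`v') == float(`code')
    if r(N) > 0 {
        display as error "`v': already `code' at float precision in " r(N) " observations"
        local clash 1
    }
}

* Only recode when no variable clashes with the requested code
if !`clash' {
    mvencode x, mv(.=`code')
}
else {
    display as error "no action taken"
}
```

Because float() is applied to both sides, the same comparison should work for variables stored as float or as double. Run it as a do-file, since the local macros do not survive line-by-line execution.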
Bill Sribney (StataCorp)

Join Date: Jun 2017

Posts: 8
#5

06 Jul 2023, 16:46

I think it is a bug. When the variable is float, it should be checking if x == float(1.2). We will fix it.

Bill Sribney (StataCorp)
daniel klein

Join Date: Mar 2014

Posts: 3885
#6

11 Sep 2023, 15:00

Happy to see that this has already been fixed by StataCorp in the 30aug2023 update to Stata 18. Thanks.