gen var1=var2, generates some wrong values

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2458
#16

10 Mar 2023, 14:09

Originally posted by Daniel Shin

I could name one thing that StataCorp could address as I see it come up so often: just get rid of merge m:m. Even the manual says that it "is a bad idea." Inexperienced as well as intermediate users almost always think it should do what joinby does.

Yes, that issue has been raised many times on this forum. Presumably, StataCorp has a reason for keeping it around (perhaps for backwards compatibility), since it has yet to be removed or at least made undocumented. However, this is a completely different issue.

My perspective on the original issue of precision is this. Every language is bound by the limitations of performing decimal arithmetic on a binary computer with finite precision. When numbers grow too big or too small for the available precision, the program can do one of a few things: throw some kind of error or exception, silently truncate or round as necessary, or return a missing value. The onus is on the programmer to understand these limitations, and at some point we all come to learn them, even if only because they cause a problem later. We've already discussed three mitigation strategies here (use a bigger storage type, use clonevar, or don't use that piece of code). Two of those don't even really require an understanding of the precision issue. At some point, the programmer must take responsibility for what the code does and for understanding it.

On a related note, there are some on the forum, including me, who wish to see Stata support a new datatype similar to the Java BigInteger which could be useful for identification variables. There is also the idea that identification variables could better be stored as string variables when the numbers are too large, but that causes issues for some estimation routines which expect numerical data only.
Tiago Pereira

Join Date: Jan 2016

Posts: 433
#17

10 Mar 2023, 14:23

Nick, I have updated my message above to be less personal. I've noticed that in today's climate, some individuals are getting "cancelled" for no reason lately. However, you are still my (our) Stata god.

Importantly, I gave up long ago on stating that other people's opinions or beliefs are wrong.

My issue is that, for example, any beginner Stata user will get wrong results if any value is large. In my real datasets, any value exceeding about 17 million runs the risk of becoming imprecise after basic manipulations, unless I meticulously check the new variables and convert them to the long storage type.

Issues like this are inadmissible in Stata:

Code:

clear
set obs 1
gene x = 17003091
format %9.0f x
list x

     +----------+
     |        x |
     |----------|
  1. | 17003092 |
     +----------+
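For readers wondering where 17003092 comes from: the result is consistent with ordinary IEEE 754 single-precision rounding. Above 2^24 = 16,777,216 only every second integer is representable, and an odd integer sits exactly halfway between two representable neighbours, so it is rounded to the one with the even significand. The following is a minimal sketch in Python (not Stata), assuming that Stata's float is the standard 4-byte IEEE 754 type; `to_float32` is a hypothetical helper name used only for illustration.

```python
import struct

def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    # Packing as a 4-byte float and unpacking again performs the rounding.
    return struct.unpack("f", struct.pack("f", float(value)))[0]

# Integers around the 2**24 threshold and around the value from the post.
for n in (16777216, 16777217, 17003090, 17003091, 17003092):
    print(n, "->", int(to_float32(n)))
```

The even integers come back unchanged, 16777217 comes back as 16777216, and 17003091 comes back as 17003092, matching the listing above.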

I am expressing my opinion as a user of Stata.

So, what is it precisely that Stata users are likely to want?

Code:

clear
set obs 1
gene x = 17003091
*! Stata understands automatically that the number is large and this variable requires long/double format
format %9.0f x
list x

     +----------+
     |        x |
     |----------|
  1. | 17003091 |
     +----------+
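To make the wish above concrete, here is a rough sketch of what "choose the storage type from the magnitude" could look like for integer values only. It is written in Python, not Stata, and it is not how Stata behaves. The function name `smallest_integer_type` and the table are hypothetical; the ranges are the integer limits as I understand them from the Stata data types documentation and should be checked against the manual.

```python
# Integer ranges per storage type (assumed from the Stata manual; verify there).
# For double, the limit is the largest span of exactly representable integers.
STORAGE_TYPES = [
    ("byte", -127, 100),
    ("int", -32767, 32740),
    ("long", -2147483647, 2147483620),
    ("double", -2**53, 2**53),
]

def smallest_integer_type(value):
    """Return the first listed storage type that holds the integer exactly."""
    for name, low, high in STORAGE_TYPES:
        if low <= value <= high:
            return name
    # Too large for any numeric type to hold exactly: keep it as a string.
    return "str"

print(smallest_integer_type(17003091))  # long
```

The sketch only works because the input is an integer whose magnitude settles the question; it says nothing about results with fractional parts.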
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#18

10 Mar 2023, 16:09

*! Stata understands automatically that the number is large and this variable requires long/double format

This is not so simple as it might sound.

Suppose I ask Stata to calculate -gen x = 5.4321 * 9.8765-. The exact answer would be 53.65013565, which has more digits than can be kept in a float without loss of precision. So, should Stata now make x a double automatically? If so, we are in the position of double becoming the default storage type in almost all situations. But in most real world data situations, the 5.4321 and 9.8765 are themselves probably only really accurate to, if I'm lucky, 5.43 and 9.87 (or round it to 9.88). And, even if all four decimal places in the starting numbers are actually precise, their product still hardly ever needs all those additional decimal places. So will I have to nearly double the size of most of my data sets so that I can carry around meaningless low-order digits that are usually just noise? And if so, why stop there? What if I want to multiply 53.65013565 * 77.88895126? That will blow out the precision of a double. Must Stata now implement quad precision floating point numbers? And since a product of those will require octuple precision numbers, and so on, where does this end?
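The arithmetic in this argument can be checked with exact decimal arithmetic. The sketch below is Python rather than Stata and assumes that float and double correspond to IEEE 754 single and double precision; `to_float32` is a hypothetical helper. It shows that the first product is exactly 53.65013565, that a float cannot hold it exactly, and that the second product cannot be held exactly even by a double.

```python
import struct
from decimal import Decimal

def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

# Exact product of the two four-decimal inputs.
exact = Decimal("5.4321") * Decimal("9.8765")
print(exact)  # 53.65013565

# Decimal(float) gives the exact value the binary number really holds.
as_float = Decimal(to_float32(5.4321 * 9.8765))
print(as_float == exact)  # False: single precision drops low-order digits

# The next product needs about 20 significant digits.
big_exact = Decimal("53.65013565") * Decimal("77.88895126")
big_double = Decimal(53.65013565 * 77.88895126)
print(big_exact)
print(big_double == big_exact)  # False: double precision is not enough either
```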
Tiago Pereira

Join Date: Jan 2016

Posts: 433
#19

11 Mar 2023, 08:29

Hi, Clyde.

As always, you have a good point.

But, it's important to note that the concern regarding data storage type in Stata is mainly in relation to large numbers rather than fractional numbers. Novice Stata users (including myself) may not realize that handling larger integers can lead to incorrect calculations unless they carefully assess the magnitude of each variable and choose the appropriate data storage type.

This concern is valid.

Imagine that 17003091 is a unique identifier, a function of other variables. Incorrectly replacing 17003091 with 17003092 can be disastrous when merging datasets. Imagine a simple budget impact analysis with multiple such errors. What a disaster.
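A small illustration of the merge risk, again as a Python sketch rather than Stata code, assuming the identifiers end up in an IEEE 754 single-precision variable (`to_float32` is a hypothetical helper): four consecutive identifiers collapse to two distinct stored values, so they no longer uniquely identify records.

```python
import struct

def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

ids = [17003090, 17003091, 17003092, 17003093]

# What each identifier becomes once stored in single precision.
stored = [int(to_float32(i)) for i in ids]
print(stored)            # [17003090, 17003092, 17003092, 17003092]
print(len(set(stored)))  # 2 distinct keys left out of 4
```

Holding the identifier in a long, a double, or a string keeps all four values distinct, and checking for duplicate keys before merging (in Stata, for example, with isid) would flag the collision.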

Now, as you pointed out, I hardly see any impact on other applications when replacing 0.17003091 with 0.17003092. If the precision issue is critical in such cases, it's highly probable that the Stata user would have prior knowledge and awareness of it.

I think the issue with big numbers is way more easily solved.