gen var1=var2, generates some wrong values

Shahida Pervin

Join Date: Apr 2020
Posts: 21

09 Mar 2023, 21:20

Code:

gen geo3= geo3_bd2001

gives the following values, where the last five rows of geo3 are off by 1. I wonder what might cause this. Any ideas, please? Thank you.

geo3_bd2001	geo3
10079076	10079076
10079080	10079080
10079087	10079087
20003004	20003004
20003014	20003014
20003051	20003052
20003073	20003072
20003089	20003088
20003091	20003092
20003095	20003096

Tiago Pereira

Join Date: Jan 2016

Posts: 433
#2

09 Mar 2023, 22:27

I have replicated the same issue. Eager to see expert opinions. Bug?

Tiago Pereira

Join Date: Jan 2016
Posts: 433

09 Mar 2023, 22:31

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long geo3_bd2001
10079076
10079080
10079087
20003004
20003014
20003051
20003073
20003089
20003091
20003095
end

* gen without a storage type creates a float (the default)
gen geo3 = geo3_bd2001
format %9.0f geo3*
gen diff = geo3 - geo3_bd2001
summarize diff

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        diff |         10          .1    .7378648         -1          1


Daniel Shin

Join Date: Mar 2020
Posts: 146

09 Mar 2023, 22:42

Code:

* declaring the storage type as long keeps every digit
gen long geo3 = geo3_bd2001
format %9.0f geo3*
gen diff = geo3 - geo3_bd2001
summarize diff

Code:

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        diff |         10           0           0          0          0


Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2458
#5

09 Mar 2023, 23:30

It’s not a bug. The default storage type for numeric variables is float, which holds integers exactly only up to 16,777,216 (2^24), so it cannot store every number of this magnitude. Daniel demonstrates that the long type rectifies the issue, and so does double. Another alternative is -clonevar-, if all one wants is a simple copy of a variable.
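A minimal sketch of the four options side by side, using three of the values from the first post. The variable names as_float, as_long, as_double, and as_clone are made up for illustration, and the sketch assumes the default type has not been changed with -set type-.

```stata
clear
input long geo3_bd2001
10079076
20003051
20003073
end

* default storage type (float, unless -set type- has been changed)
generate as_float = geo3_bd2001
* explicit storage types
generate long as_long = geo3_bd2001
generate double as_double = geo3_bd2001
* clonevar copies the storage type, display format, and labels
clonevar as_clone = geo3_bd2001

format %12.0f as_float as_double
list, noobs

* count the observations in which each copy differs from the original
count if as_float  != geo3_bd2001
count if as_long   != geo3_bd2001
count if as_double != geo3_bd2001
count if as_clone  != geo3_bd2001
```

With the default type left at float, the first count should be 2 and the other three 0: only the float copy alters the two odd values above 16,777,216.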
Tiago Pereira

Join Date: Jan 2016

Posts: 433
#6

09 Mar 2023, 23:42

Although some may argue that users should be aware of their ID or other variables that have large numbers, I believe that this is a flaw in Stata that should be addressed. Stata is explicitly requiring users to invest extra effort in checking variable size or changing the format to long, which can pose difficulties for beginners. In R this does not happen. The Stata community would benefit from a Stata version that doesn't require such manual checks and is more user-friendly for everyone.
Nick Cox

Join Date: Mar 2014

Posts: 36058
#7

10 Mar 2023, 01:24

Tiago Pereira The user-friendliness here is that Stata

1 makes the default default (*) variable or storage type (not a format) float because making it double would bloat dataset sizes, a big issue for many users

2 documents different variable types

3 has a community that explains this to each other.

You're a long-term user evidently not previously aware of this issue. That's great, and no irony or condescension intended, as it's a strong signal that it bites only occasionally as a problem.

People want Stata to do what is in their best interests when that is not what they imply by their commands. OTOH, people also want Stata to do exactly what they say. That can be a tough call for Stata to get right as far as the user is concerned and it's one programmers (company and community) wrestle with in writing commands.

(*) Repetition of default intended. Users can set type double if they prefer that.
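For readers who have not met that setting, a small sketch of changing the default for the current session; the variable x is hypothetical, and the last line assumes you want the shipped default back afterwards.

```stata
* show the current default storage type (float unless it has been changed)
display c(type)

* make double the default for new variables in this session
set type double

clear
set obs 1
generate x = 20003051
* x is now stored as double, so the value is held exactly
describe x
assert x == 20003051

* restore the shipped default (skip this if you prefer double)
set type float
```

Adding the permanently option to -set type- keeps the choice across sessions.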
daniel klein

Join Date: Mar 2014

Posts: 3912
#8

10 Mar 2023, 02:13

I believe that StataCorp has long been and still seems to be concerned with memory and storage, and rightfully so.

However, I think that double would be the better default. Users who care about dataset sizes and memory can still set type float and/or compress their data. In many fields, a byte would suffice for most variables, and I am willing to bet that many users do not generate (or compress) those variables as byte. Therefore, I doubt memory and storage are among the top concerns of many users, especially beginners. Also, too large a dataset size is an easily detected (and resolved) problem once it bites. On the other hand, loss of precision might often go unnoticed (e.g., when working with date time variables) and/or cause problems that are more difficult to diagnose, especially for beginners. Choosing a default (default), I prefer precision over small datasets; but that is obviously not my call to make.
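A short sketch of the date-time case, with hypothetical variable names t_double and t_float. The clock() function returns milliseconds since 01jan1960, a number far too large for a float to hold exactly.

```stata
clear
set obs 1

* same timestamp stored in two storage types
generate double t_double = clock("09mar2023 21:20:00", "DMYhms")
generate float  t_float  = clock("09mar2023 21:20:00", "DMYhms")
format %tc t_double t_float
list, noobs

* difference between the two copies, in milliseconds
display %15.0f t_float[1] - t_double[1]
```

The float copy silently lands on a nearby representable value, so the displayed time no longer matches the one that was typed, and nothing warns you about it.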

Last edited by daniel klein; 10 Mar 2023, 02:16.
Nick Cox

Join Date: Mar 2014

Posts: 36058
#9

10 Mar 2023, 02:53

clonevar is an example of how the company listens to the community. Wanting a command that replicates everything in an existing variable exactly, if only as a starting point for some calculation, was a common user need and the company saw the point and folded the community-contributed command into official Stata.
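A brief illustration with the auto dataset that ships with Stata; foreign2 and foreign3 are hypothetical names.

```stata
sysuse auto, clear

* clonevar copies values, storage type, display format, and labels
clonevar foreign2 = foreign

* generate copies the values only and uses the default storage type
generate foreign3 = foreign

describe foreign foreign2 foreign3
```

In the -describe- output, foreign2 should match foreign (byte, with the value label origin attached), while foreign3 is a plain float without labels, assuming the default type is float.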
Daniel Shin

Join Date: Mar 2020

Posts: 146
#10

10 Mar 2023, 07:17

I didn't know about clonevar. Good to know.

It would be nice to "lock" a variable type so that it doesn't get changed by mistake (like habitually running compress). I've run into issues appending and merging datasets where the IDs changed types (earlier datasets had lower values). Perhaps I should just make it a habit of making all ID variables long/double and kick the habit of compressing.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2458
#11

10 Mar 2023, 07:20

Originally posted by Daniel Shin:

I didn't know about clonevar. Good to know.

It would be nice to "lock" a variable type so that it doesn't get changed by mistake (like habitually running compress). I've run into issues appending and merging datasets where the IDs changed types (earlier datasets had lower values). Perhaps I should just make it a habit of making all ID variables long/double and kick the habit of compressing.

Stata will never allow you to -compress- a datatype in such a way that it would lose precision. (Yes, you can -recast-, but that's a different prospect.) If additional precision is required, the storage type will be expanded to accommodate as much as possible.
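A sketch of the difference, using a hypothetical id variable that holds three of the values from the first post:

```stata
clear
input double id
10079076
20003051
20003095
end

* compress demotes double to long here: every value is an integer that fits
compress
describe id

* recast declines a conversion that would change values ...
capture noisily recast float id
describe id

* ... unless force is specified, in which case the values are rounded
recast float id, force
format %12.0f id
list, noobs
```

After -compress- the variable is a long with its values intact; only the forced -recast- to float alters them.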
Daniel Shin

Join Date: Mar 2020

Posts: 146
#12

10 Mar 2023, 07:43

You're absolutely right: compress does not cause loss of precision. It's just a matter of keeping float out of the equation.
Tiago Pereira

Join Date: Jan 2016

Posts: 433
#13

10 Mar 2023, 10:58

The problem with large numbers, and the need for users to be mindful of the long storage type, is quite concerning. Since most Stata users are likely beginners or intermediate users, it poses a significant challenge. Therefore, StataCorp could address this issue promptly.

Last edited by Tiago Pereira; 10 Mar 2023, 11:29.
Nick Cox

Join Date: Mar 2014

Posts: 36058
#14

10 Mar 2023, 11:53

Thanks for the flattering comments, but your last few sentences are in my view quite wrong. There is nothing at all urgent or immediate about this issue, which has been present for most if not all of the lifetime of Stata. The only thing that is immediate is when people suddenly discover the issue because it bites them, or even become upset and angry about it.

There are two types of program learning:

1. wanting to use a program to get results now and having no interest (or at least no inclination to spend much time) in acquiring a deep understanding of that program

2. wanting to understand a program quite deeply because one likes it, uses it a lot and sees long-term gain or even pleasure in getting better.

I say types of learning -- not learners -- because I suspect many, indeed most, of us follow both types at different times for different software.

At present I follow Type 1 for almost everything I use but Type 2 only for Stata, so no snobbery from me about Type 1.

The point of this distinction is that following Type 1 comes with a strong disinclination to read any documentation unless you have to, but Type 2 carries a strong inclination to read whatever has been written.

Thus one only has to read some chapters of the User's Guide to learn something about variable or storage types -- and above all one needs to experiment a bit to find out some details.

People who don't or won't do this can't throw all the blame on StataCorp for their not studying documentation which explains what is going on.

Users can easily waste time misunderstanding almost anything -- what happens with missing values, how to deal with dates, how to deal with strings, the difference between the if command and the if qualifier, and so on and so forth.

When I buy something, I often don't read the instructions. Sometimes that doesn't matter and sometimes I miss something I need to know. But the blame is then on myself, for not being zealous enough to find out what I needed to know.

All that said, you urge that

StataCorp should address this issue promptly

so what is it precisely that you want StataCorp to do?

Note: Perhaps I should add Type 3, the idea that one just asks some bot what the code should be.
Daniel Shin

Join Date: Mar 2020

Posts: 146
#15

10 Mar 2023, 12:35

I could name one thing that StataCorp could address as I see it come up so often: just get rid of merge m:m. Even the manual says that it "is a bad idea." Inexperienced as well as intermediate users almost always think it should do what joinby does.
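A toy comparison of the two commands, with made-up datasets and variable names (a, b, left, right), for anyone who has not seen the difference:

```stata
clear
input id str1 a
1 "p"
1 "q"
end
tempfile left
save `left'

clear
input id str1 b
1 "x"
1 "y"
1 "z"
end
tempfile right
save `right'

* joinby forms every pairwise combination within id: 2 x 3 = 6 observations
use `left', clear
joinby id using `right'
list, noobs

* merge m:m pairs observations in order within id and gives 3 observations
use `left', clear
merge m:m id using `right'
list, noobs
```

The -merge m:m- result depends on the current sort order within id, which is rarely what anyone intends. Run this as a do-file so that the tempfile macros stay defined throughout.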
