  • gen var1=var2, generates some wrong values

    Code:
    gen geo3= geo3_bd2001
    gives the following values. I wonder what might cause this. Any idea please? Thank you.

    geo3_bd2001 geo3
    10079076 10079076
    10079080 10079080
    10079087 10079087
    20003004 20003004
    20003014 20003014
    20003051 20003052
    20003073 20003072
    20003089 20003088
    20003091 20003092
    20003095 20003096






  • #2
    I have replicated the same issue. Eager to see expert opinions. Bug?



    • #3
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long geo3_bd2001
      10079076
      10079080
      10079087
      20003004
      20003014
      20003051
      20003073
      20003089
      20003091
      20003095
      end
      
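      * No storage type is given, so geo3 is created as float (the default)
      generate geo3 = geo3_bd2001
      format %9.0f geo3*
      * A nonzero diff marks a value that float could not store exactly
      generate diff = geo3 - geo3_bd2001
      summarize diff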

      Code:
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
               diff |         10          .1    .7378648         -1          1



      • #4
        Code:
        * Requesting long storage keeps every digit of the ID
        generate long geo3 = geo3_bd2001
        format %9.0f geo3*
        generate diff = geo3 - geo3_bd2001
        summarize diff
        Code:
            Variable |        Obs        Mean    Std. dev.       Min        Max
        -------------+---------------------------------------------------------
                diff |         10           0           0          0          0



        • #5
          It’s not a bug. The default storage type for numeric variables is float, which holds integers exactly only up to 16,777,216 (2^24) and so is insufficient for numbers of this magnitude. Daniel demonstrates that long type will rectify the issue, and so will double. Another alternative is -clonevar- if one is interested in a simple copy of a variable.
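
A minimal sketch of the point above, assuming a fresh session in which it is safe to clear the data. The variable names geo3_d and geo3_c are made up for illustration; the two values are taken from post #1.

```stata
* float() rounds its argument to float precision
display %12.0f float(16777216)
* 16777217 has no exact float representation and is stored as a neighbour
display %12.0f float(16777217)
* One of the IDs from post #1
display %12.0f float(20003051)

clear
input long geo3_bd2001
20003051
20003073
end

* double keeps all digits; clonevar copies storage type, format and labels
generate double geo3_d = geo3_bd2001
clonevar geo3_c = geo3_bd2001
assert geo3_d == geo3_bd2001
assert geo3_c == geo3_bd2001
describe
```

Both assert lines should pass silently, and describe should list geo3_c with the same storage type as geo3_bd2001.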



          • #6
            Although some may argue that users should be aware of their ID or other variables that have large numbers, I believe that this is a flaw in Stata that should be addressed. Stata is explicitly requiring users to invest extra effort in checking variable size or changing the format to long, which can pose difficulties for beginners. In R this does not happen. The Stata community would benefit from a Stata version that doesn't require such manual checks and is more user-friendly for everyone.



            • #7
              Tiago Pereira: The user-friendliness here is that Stata

              1. makes the default default (*) variable or storage type (not a format) float, because making it double would bloat dataset sizes, a big issue for many users

              2. documents different variable types

              3. has a community that explains this to each other.

              You're a long-term user evidently not previously aware of this issue. That's great, and no irony or condescension intended, as it's a strong signal that it bites only occasionally as a problem.

              People want Stata to do what is in their best interests when that is not what they imply by their commands. OTOH, people also want Stata to do exactly what they say. That can be a tough call for Stata to get right as far as the user is concerned and it's one programmers (company and community) wrestle with in writing commands.

              (*) Repetition of default intended. Users can set type double if they prefer that.
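
As a rough illustration of that footnote, assuming a scratch session: set type changes the storage type that generate uses when none is specified. The variable x is hypothetical, and the last line restores the shipped default.

```stata
* New variables are created as double from here on
set type double

clear
set obs 1
generate x = 20003051
* Passes under double; the same check would fail if x were a float
assert x == 20003051
describe x

* Restore the shipped default for the rest of the session
set type float
```

Adding the permanently option to set type would keep the setting across sessions.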



              • #8
                I believe that StataCorp has long been and still seems to be concerned with memory and storage, and rightfully so.

                However, I think that double would be the better default. Users who care about dataset sizes and memory can still set type float and/or compress their data. In many fields, a byte would suffice for most variables, and I am willing to bet that many users do not generate (or compress) those variables as byte. Therefore, I doubt memory and storage are among the top concerns of many users, especially beginners. Also, too large a dataset size is an easily detected (and resolved) problem once it bites. On the other hand, loss of precision might often go unnoticed (e.g., when working with date time variables) and/or cause problems that are more difficult to diagnose, especially for beginners. Choosing a default (default), I prefer precision over small datasets; but that is obviously not my call to make.
                Last edited by daniel klein; 10 Mar 2023, 02:16.
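
The date-time case mentioned above can be sketched as follows. The timestamp and the names t_ok and t_bad are assumptions chosen for illustration. clock() returns milliseconds since 01jan1960 00:00:00, which is far more than a float can hold exactly.

```stata
clear
set obs 1

* Same timestamp stored twice: once as double, once as float
generate double t_ok  = clock("10mar2023 02:16:00", "DMYhms")
generate float  t_bad = clock("10mar2023 02:16:00", "DMYhms")
format %tc t_ok t_bad
list

* The float copy no longer equals the double copy
assert t_ok != t_bad
```

The list output should show t_bad displaying a different time of day from t_ok, with no warning issued when it was created.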



                • #9
                  clonevar is an example of how the company listens to the community. Wanting a command that replicates everything in an existing variable exactly, if only as a starting point for some calculation, was a common user need and the company saw the point and folded the community-contributed command into official Stata.





                  • #10
                    I didn't know about clonevar. Good to know.

                    It would be nice to "lock" a variable type so that it doesn't get changed by mistake (like habitually running compress). I've run into issues appending and merging datasets where the IDs changed types (earlier datasets had lower values). Perhaps I should just make it a habit of making all ID variables long/double and kick the habit of compressing.



                    • #11
                      Originally posted by Daniel Shin:
                      I didn't know about clonevar. Good to know.

                      It would be nice to "lock" a variable type so that it doesn't get changed by mistake (like habitually running compress). I've run into issues appending and merging datasets where the IDs changed types (earlier datasets had lower values). Perhaps I should just make it a habit of making all ID variables long/double and kick the habit of compressing.

                      Stata will never allow you to -compress- a datatype in such a way that it would lose precision. (Yes, you can -recast-, but that's a different prospect.) If additional precision is required, the storage type will be expanded to accommodate as much as possible.
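
A small sketch of that behaviour, using a made-up id variable holding two of the values from post #1. compress should demote the double to long because every value is an integer that fits. The recast line is wrapped in capture noisily so the example continues whatever recast reports.

```stata
clear
input double id
10079076
20003051
end

* compress only picks a smaller type that holds every value exactly
compress
describe id
assert id[2] == 20003051

* recast declines a lossy change unless -force- is specified
capture noisily recast float id
assert id[2] == 20003051
```

Only recast with the force option would actually round the values.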



                      • #12
                        You're absolutely right, compress does not cause loss of precision. It's just a matter of keeping float out of the equation.



                        • #13
                          The problem concerning large numbers and the importance of users being mindful of the long format issue is quite concerning. Since most Stata users are likely beginners or intermediate users, it poses a significant challenge. Therefore, StataCorp could address this issue promptly.
                          Last edited by Tiago Pereira; 10 Mar 2023, 11:29.



                          • #14
                            Thanks for the flattering comments, but your last few sentences are in my view quite wrong. There is nothing at all urgent or immediate about this issue, which has been present for most if not all of the lifetime of Stata. The only thing that is immediate is when people suddenly discover the issue because it bites them, or even become upset and angry about it.

                            There are two types of program learning:

                            1. wanting to use a program to get results now and having no interest (or at least no inclination to spend much time) in acquiring a deep understanding of that program

                            2. wanting to understand a program quite deeply because one likes it, uses it a lot and sees long-term gain or even pleasure in getting better.

                            I say types of learning -- not learners -- because I suspect many, indeed most, of us follow both types at different times for different software.

                            At present I follow Type 1 for almost everything I use but Type 2 only for Stata, so no snobbery from me about Type 1.

                            The point of this distinction is that following Type 1 comes with a strong disinclination to read any documentation unless you have to, but Type 2 carries a strong inclination to read whatever has been written.

                            Thus one only has to read some chapters of the User's Guide to learn something about variable or storage types -- and above all one needs to experiment a bit to find out some details.

                            People who don't or won't do this can't throw all the blame on StataCorp for their not studying documentation which explains what is going on.

                            Users can easily waste time misunderstanding almost anything -- what happens with missing values, how to deal with dates, how to deal with strings, the difference between the if command and the if qualifier, and so on and so forth.

                            Whenever I buy something I often don't read the instructions. Sometimes that doesn't matter and sometimes I miss something I need to know. But the blame is then on myself, for not being zealous enough to find out what I needed to know.

                            All that said, you urge that "StataCorp should address this issue promptly", so what is it precisely that you want StataCorp to do?

                            Note: Perhaps I should add Type 3, the idea that one just asks some bot what the code should be.




                            • #15
                              I could name one thing that StataCorp could address as I see it come up so often: just get rid of merge m:m. Even the manual says that it "is a bad idea." Inexperienced as well as intermediate users almost always think it should do what joinby does.
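
To make the contrast concrete, here is a toy sketch with invented variables (id, a, b) and two rows per id on each side. joinby forms every pairwise combination within id, whereas merge m:m pairs rows by their order within id.

```stata
clear
input id a
1 10
1 11
end
tempfile left
save `left'

clear
input id b
1 20
1 21
end
tempfile right
save `right'

* joinby: all combinations within id, 2 x 2 = 4 rows
use `left', clear
joinby id using `right'
local n_joinby = _N
assert `n_joinby' == 4

* merge m:m: first with first, second with second, so only 2 rows
use `left', clear
merge m:m id using `right'
assert _N == 2
```

With unequal group sizes, merge m:m pairs the leftover rows with the last row of the shorter group, which is rarely what anyone intends.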

