
  • tostring does not return correct values

    Dear Statalisters,

    I've imported a large .txt file into Stata using import delimited. The file has a few numeric variables whose values are very long (18 digits), so Stata displays them in scientific notation (e.g. 1.23e+17). I wanted these values to display in full, so I converted them to string using
    tostring var, gen(newvar) format(%20.0f)
    However, some of the returned values were inaccurate. For instance, where the correct value (i.e. what is in the original .txt file) should have been 123456, in Stata it becomes 123457. It seems that they were somehow rounded up during the conversion. What confuses me is that there are no decimal places, so nothing should be rounded up or down.

    Could anyone shed any light on what might have happened and how I can fix this issue (i.e. convert the numeric variables to string while keeping the values the same as they were in the original .txt file)?

    Many thanks

  • #2
    A double holds at most about 16 significant decimal digits, so integer values with more digits than that lose precision as soon as they are stored as numeric. I would therefore recommend using the -stringcols()- option of import delimited to import the affected columns as string variables in the first place, rather than importing them as numeric and converting within Stata.
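
    As a minimal sketch of that approach (the file name and the column positions here are hypothetical; substitute your own), -stringcols()- takes a list of column numbers, or _all to read every column as string:

    Code:
    * hypothetical file name and column positions: read columns 2 and 5 as strings
    import delimited using "mydata.txt", stringcols(2 5) clear

    * alternative: read every column as string
    import delimited using "mydata.txt", stringcols(_all) clear

    Columns read this way never pass through a numeric storage type, so their digits should arrive exactly as they appear in the file.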
    Last edited by Ali Atia; 24 Jan 2022, 20:36.


    • #3
      Except in their use as identifiers, values this large don't arise often in what Stata is typically used for, and for their use as identifiers, you could get away with keeping them as strings, or use -encode- for estimation commands that don't accept string identifiers.

      But might it be worth a request in the Wishlist for Stata 18 thread for a long-long integer datatype?

      I think that at least a couple of other domain-specific languages or statistical / data-management software packages have either signed or unsigned long-longs (or both) in order to help maintain compatibility with modern relational database management systems and other data sources.


      • #4
        Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values. Note that full accuracy for doubles is limited to 15 digits, or 16 digits if the leftmost digit is 8 or less, or pedantically, 16 digits not to exceed 9,007,199,254,740,992.

        type    bits                   minimum                  maximum
        byte       7                      -127                      100
        int       15                   -32,767                   32,740
        long      31            -2,147,483,647            2,147,483,620
        float     24               -16,777,216               16,777,216
        double    53    -9,007,199,254,740,992    9,007,199,254,740,992
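
        These limits can be checked from within Stata. The sketch below is only illustrative: it uses the c() system values for the integer types and -display- with a wide format to show where float and double stop holding consecutive integers exactly.

        Code:
        * largest nonmissing values of the integer storage types
        display c(maxbyte)
        display c(maxint)
        display c(maxlong)

        * 2^53 is held exactly by a double; 2^53 + 1 is not,
        * so both lines should display the same value
        display %20.0f 9007199254740992
        display %20.0f 9007199254740993

        * the same thing should happen for float at 2^24
        display %12.0f float(16777216)
        display %12.0f float(16777217)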


        • #5
          Thank you so much for your swift responses and advice, Ali and Joseph! The -stringcols- option did the trick! It is definitely worth a request in the wishlist for Stata 18 to help read in long identifiers correctly.


          • #6
            I can't reproduce the 123456 to 123457 report. As William Lisowski comments, that shouldn't be a problem. In any case if there is rounding it would be towards some even number, or more generally towards a multiple of a power of 2.

            Code:
            . clear 
            
            . set obs 1
            Number of observations (_N) was 0, now 1.
            
            . gen long num = 123456
            
            . tostring num, gen(num2) format(%20.0f)
            num2 generated as str6
            
            . l
            
                 +-----------------+
                 |    num     num2 |
                 |-----------------|
              1. | 123456   123456 |
                 +-----------------+
            
            . gen float NUM = 123456
            
            . tostring NUM, gen(NUM2) format(%20.0f)
            NUM2 generated as str6
            
            . l
            
                 +-----------------------------------+
                 |    num     num2      NUM     NUM2 |
                 |-----------------------------------|
              1. | 123456   123456   123456   123456 |
                 +-----------------------------------+
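
            For values of the size described in the original post, a sketch along the same lines (the 18-digit value is made up for illustration) should show the kind of change reported. Between 2^56 and 2^57 doubles are spaced 16 apart, so the stored value should be the nearest multiple of 16 rather than the value typed.

            Code:
            clear
            set obs 1
            * made-up 18-digit value ending in an odd digit
            gen double big = 123456789012345679
            tostring big, gen(big2) format(%20.0f)
            * holding the value as a string from the start keeps every digit
            gen str18 big3 = "123456789012345679"
            list

            Comparing big2 with big3 in the listing shows whether the numeric route altered the final digits.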


            • #7
              Thanks Nick and William! 123456 was just an example. My actual values are much longer, like 123456789101112131415, and yes, odd numbers were rounded to even numbers. The -stringcols- option seems to have solved the problem, though. Many thanks again.


              • #8
                OK, but in short 123456 is not an example.
