  • Creating unique number using _n and formatting correctly in massive data that defaulted to exponential format

    Hello,

    I am working with 28 million observations, and I would like to create a unique identifier from 1 to 28 million for each one. I used:

    gen id = _n

    This worked well, except that after the first 10 million observations the values were shown in exponential format ("1.00e+07"), and hundreds of observations ended up with exactly the same id.

    When I use the command "format %9.0 id", the values are displayed as plain numbers again, but there are still many duplicate IDs. I think they were assigned a number in the middle.

    Any help would be appreciated

    Thanks!












  • #2
    You can use the following to generate a unique ID counting from 1 to N. The system value -c(obs_t)- expands to the name of a storage type that is large enough to hold the integer value of -_N- exactly. That is, it picks the correct data type for you.

    Code:
    gen `c(obs_t)' id = _n
    The issue you have likely run into is that -float- is the default storage type for new numeric variables. A float holds consecutive integers exactly only up to 16,777,216 (2^24), so it cannot uniquely identify this many observations. The following code demonstrates this failure.

    Code:
    clear
    set obs 30000000
    gen float floatid = _n
    * isid exits with an error here because floatid contains duplicate values
    isid floatid
    Using -format- only changes how the numbers are displayed to a human; it does nothing to how the underlying data are stored. I wouldn't change it from the default of %12.0g.
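To make the storage-versus-display point concrete, here is a minimal sketch that puts a float copy next to a `c(obs_t)' copy of _n and counts where they disagree. The variable names floatid and id follow the posts above; the 20 million observation count is an arbitrary choice that is just large enough to pass 2^24.

```stata
clear
set obs 20000000

* float copy of the observation number (loses precision above 2^24)
gen float floatid = _n

* copy stored in the type Stata recommends for holding _n
gen `c(obs_t)' id = _n

* how many observations hold a float value that differs from the true _n
count if floatid != id

* a wider display format changes what is shown, not what is stored
format floatid id %12.0g
list id floatid in 16777215/16777220

* id is unique; running isid on floatid instead would exit with an error
isid id
```

In the listing, floatid repeats values right after 16,777,216 even though both variables share the same display format, which is why changing the format alone could not remove the duplicates.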



    • #3
      That indeed worked. Thank you so much for the help and explanation!



      • #4
        Leonardo Guizzetti This old dog wishes to learn new tricks -- could you please direct me to help on c(obs_t)? I don't see it mentioned, e.g., in the list produced with -query-.



        • #5
          Apparently, it is not in:

          Code:
          help creturn##other
          but you can find it in the PDF manual entry for creturn (last page): https://www.stata.com/manuals/pcreturn.pdf



          • #6
            Code:
            help creturn
            https://www.youtube.com/watch?v=iuy-oOJCOoM (old dog noises)



            • #7
              Originally posted by Stephen Jenkins
              Leonardo Guizzetti This old dog wishes to learn new tricks -- could you please direct me to help on c(obs_t)? I don't see it mentioned, e.g., in the list produced with -query-.
              No problem. It is at the bottom of -help creturn-, under "Other system variables":

              Code:
                  c(obs_t) returns a string equal to the optimal data type for storing _n.  This allows you to code
              
                          generate `c(obs_t)' index = _n
              
                      and know that index will go from 1 to _N without roundoff errors and without wasting any space.
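As a quick way to see what that help entry describes, the sketch below displays `c(obs_t)' at two dataset sizes and then checks the type actually given to a new variable. The sizes 100 and 100000 and the variable name index are illustrative choices, not anything prescribed by the manual.

```stata
clear

* small dataset: display the type Stata suggests for storing _n
set obs 100
display "`c(obs_t)'"

* larger dataset: the suggested type changes as _N grows
set obs 100000
display "`c(obs_t)'"

* create the index with the suggested type and inspect its storage type
generate `c(obs_t)' index = _n
describe index
isid index
```

Because `c(obs_t)' is evaluated when the line runs, it should come after the final -set obs- (or after the data are loaded), so that the chosen type reflects the current _N.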



              • #8
                Thank you! I don't know how I missed that.
