  • Problem creating unique identifier variable using system variable _n

    Dear Statalisters,

    My dataset comprises approximately 21 million individuals, and each individual is represented by one row in the dataset. I would like to create a unique identifier for each individual, numbered from 1 to n.

    I used the following code:

    gen id = _n

    This created a variable with only approximately 19 million unique values; the remaining values, approximately 2 million, were duplicates. Having examined the duplicates closely, I cannot see what distinguishes them or why Stata treated them as identical observations.

    Can anyone explain why this is happening and how I can otherwise create a unique identifier for each individual?

    Thank you.

    Omar


  • #2
    Try using

    Code:
    gen double id = _n


    • #3
      The issue is precision. I recommend that you read the whole entry under help precision and follow the links therein. By default, generate creates new variables as floats, and a float can hold consecutive integers exactly only up to 16,777,216 (2^24); beyond that, neighbouring row numbers are rounded to the same value, which is where your duplicates come from. Once you have also read help generate, you will understand why generate double id = _n may solve your problem. Be aware too that IDs can also be held as string variables. This has many advantages and few disadvantages that I can think of. [There are other related postings on this topic on Statalist.]
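
      A minimal sketch of the precision issue, assuming an empty session with enough memory for 21 million observations. The variable names id_float and id_long are illustrative and do not come from the original post. A long holds these integers exactly in the same 4 bytes a float uses; a double would also work, at 8 bytes per value.

      ```stata
      clear
      set obs 21000000

      * default storage type is float: row numbers above 16,777,216 get rounded
      generate id_float = _n

      * long stores these integers exactly (double would too, at 8 bytes)
      generate long id_long = _n

      * 2^24 + 1 is not representable as a float; this displays 16777216
      display %12.0f float(16777217)

      * count observations whose float ID differs from the true row number
      count if id_float != id_long

      * isid exits with an error if the variable does not uniquely identify rows
      isid id_long
      capture noisily isid id_float
      ```

      The last line is expected to report that id_float does not uniquely identify the observations, while the isid check on id_long should pass silently.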


      • #4
        That worked! Thank you Jesse.


        • #5
          Originally posted by Stephen Jenkins
          The issue is precision: I recommend that you read all of the entry under help precision and follow the links therein. generate creates new variables as floats. Having also read help generate, understand why generate double id = _n may solve your problem. Be aware too that IDs can also be held as string variables. This has many advantages, and few disadvantages that I can think of. [There are other related postings about this topic on Statalist.]
          Is there a reason Stata generates floats by default? With modern memory capacities, the space savings seem irrelevant in most cases. And in those cases where the savings are essential, it seems reasonable to require the user either to generate explicitly as float/long/byte or simply to use compress.

          The upshot of having doubles by default would be that newer (and, I dare say, even experienced) users are much less likely to run into issues such as the one posted here. In this case the problem was spotted, but I, for example, had been using a dataset with incorrect VAT numbers (company identifiers) for a long time before I noticed that an outside merge was giving me strange results.


          • #6
            The argument for the default is that very few variables justify being held as doubles. Also, memory is bigger than it used to be but so are datasets. And the downside of Stata holding datasets in memory is, inevitably, a frequent need to be careful about memory.

            The main exception is indeed identifiers for datasets large enough that this problem bites. But most categories, counts and even measurements just aren't that problematic.

            I'm afraid that Stata still requires people to think about what they are doing, at least some of the time. If the default were double, then we would be seeing a lot of posts here to which we would be answering, "Use compress".

            You can set the default type to double yourself. It would be interesting to know how many people who know about that have ever thought it worthwhile. See the help for generate for the details on set type double.
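
            As a rough illustration of the two approaches mentioned here, assuming a dataset is already in memory and has no variable named id (the name is only an example):

            ```stata
            * make double the default storage type for new variables in this session
            set type double

            * or make the setting persist across sessions
            * set type double, permanently

            * id is now created as a double, so every row number is held exactly
            generate id = _n

            * afterwards, demote variables to the smallest type that holds them without loss
            compress
            ```

            Because compress only changes a storage type when no information is lost, an integer-valued double such as id should end up as a long rather than a float.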


            • #7
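              A disadvantage of holding identifiers as strings is that tsset and xtset require numeric identifiers. In that situation, don't throw out the string identifier; instead, use egen's group() function with the label option to create a numeric identifier alongside it, labelled with the original strings.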
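
              A short sketch of that suggestion, assuming a string identifier called id_str and a time variable called year; both names are hypothetical, as is the new variable id_num. With a very large number of groups, attaching labels may be impractical, in which case the label option can be dropped.

              ```stata
              * id_str is a hypothetical string identifier; year is a hypothetical time variable
              * long is requested explicitly so that large group counts are stored exactly
              egen long id_num = group(id_str), label

              * declare the panel using the numeric identifier; id_str stays in the dataset
              xtset id_num year
              ```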
