Problem creating unique identifier variable using system variable _n

Omar McDoom

Join Date: Sep 2014

Posts: 9
#1

17 Aug 2016, 01:29

Dear Statalisters,

My dataset comprises approximately 21 million individuals, and each individual is represented by one row in the dataset. I would like to create a unique identifier for each individual, numbered from 1 to n.

I used the following code:

gen id = _n

This created a variable with only approximately 19 million unique values; the remaining 2 million or so were duplicates. Having examined the duplicates closely, I cannot see what distinguishes them or why Stata treated them as identical observations.

Can anyone explain why this is happening and how I can otherwise create a unique identifier for each individual?

Thank you.

Omar
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

17 Aug 2016, 01:49

Try using

Code:

gen double id = _n
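
One way to confirm that the new variable really is unique is isid, which exits with an error if the variable does not uniquely identify the observations. The lines below are a minimal sketch, assuming the float version of id from the first post is still in the dataset. Because _n is always an integer, long is shown here as an alternative storage type that also holds these row numbers exactly while using 4 bytes per observation rather than the 8 of a double:

```stata
* Drop the float id and recreate it in a type that holds every row number exactly
drop id
generate long id = _n

* isid exits with an error if id does not uniquely identify the observations
isid id
```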
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#3

17 Aug 2016, 01:51

The issue is precision: I recommend that you read all of the entry under help precision and follow the links therein. generate creates new variables as floats. Having also read help generate, understand why generate double id = _n may solve your problem. Be aware too that IDs can also be held as string variables. This has many advantages, and few disadvantages that I can think of. [There are other related postings about this topic on Statalist.]
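
To see the limit directly: a float has a 24-bit significand, so it represents every integer exactly only up to 2^24 = 16,777,216. Above that, only every second integer can be stored, and the odd row numbers are rounded onto an even neighbour. For a dataset of exactly 21,000,000 rows, that arithmetic implies roughly 18.9 million distinct values and about 2.1 million duplicates, which is in line with the numbers in the original post. The sketch below uses the float() function, which rounds its argument to float precision; id_str is a hypothetical variable name used only to illustrate the string alternative:

```stata
* Exactly representable as a float: displays 16777216
display %12.0f float(16777216)

* Not representable as a float: also displays 16777216
display %12.0f float(16777217)

* The next value a float can hold: displays 16777218
display %12.0f float(16777218)

* Alternative: hold the identifier as a zero-padded 8-character string
generate str8 id_str = string(_n, "%08.0f")
```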
Omar McDoom

Join Date: Sep 2014

Posts: 9
#4

17 Aug 2016, 01:56

That worked! Thank you Jesse.
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#5

17 Aug 2016, 01:59

Originally posted by Stephen Jenkins:

The issue is precision: I recommend that you read all of the entry under help precision and follow the links therein. generate creates new variables as floats. Having also read help generate, understand why generate double id = _n may solve your problem. Be aware too that IDs can also be held as string variables. This has many advantages, and few disadvantages that I can think of. [There are other related postings about this topic on Statalist.]

Is there a reason Stata generates floats by default? With modern memory capacities, the space savings seem irrelevant for most cases. And in those cases where the savings are essential, it seems reasonable to require the user to either generate as float/long/byte or to just have the person use compress...

The upshot of having doubles by default all the time is that newer (and, I dare say, even experienced) users would be much less likely to run into issues such as the one posted here. In this case the problem was spotted, but I, for example, used my dataset with incorrect VAT numbers (company identifiers) for a long time before I noticed that an outside merge was giving me strange results.
Nick Cox

Join Date: Mar 2014

Posts: 35697
#6

17 Aug 2016, 02:48

The argument for the default is that very few variables justify being held as doubles. Also, memory is bigger than it used to be but so are datasets. And the downside of Stata holding datasets in memory is, inevitably, a frequent need to be careful about memory.

The main exception is indeed identifiers for datasets large enough that this problem bites. But most categories, counts and even measurements just aren't that problematic.

I'm afraid that Stata still requires people to think about what they are doing, at least some of the time. If the default were double, then we would be seeing a lot of posts here to which we would be answering, "Use compress".

You can set the default type to double yourself. It would be interesting to know how many people knowing about that have ever thought it worthwhile. See the help for generate for the details on set type double.
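
As a brief sketch of that setting: set type changes the storage type that generate uses when none is specified, and the permanently option keeps the setting across sessions. Running compress afterwards is one way to recover the memory, since it demotes each variable to the smallest type that loses no information:

```stata
* Make double the default storage type for new variables in this session
set type double

* Or make the change persist across sessions
set type double, permanently

* Shrink variables to the smallest storage type that loses no information
compress
```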
Nick Cox

Join Date: Mar 2014

Posts: 35697
#7

17 Aug 2016, 03:26

A disadvantage of holding identifiers as strings is that tsset and xtset require numeric identifiers. In that situation, don't throw out the string identifier; instead, create a numeric counterpart with egen's group() function and its label option.
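
A minimal sketch of that approach, assuming a panel with a string identifier and a time variable. The names firm_str, firm_id and year are hypothetical, and the sketch assumes each firm appears at most once per year:

```stata
* Map each distinct string identifier to an integer 1, 2, ...
* and label the integers with the original strings
egen long firm_id = group(firm_str), label

* Declare the panel using the numeric identifier; firm_str is kept in the data
xtset firm_id year
```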