Problems with using _n to create id variable

Joe Ward



17 Jan 2020, 03:20

Dear Statalist

I have a dataset with 32 million observations and around 30 variables. I want to perform an operation on one variable (diag) to label ICD-10 codes. To make this run more quickly, I created an id variable (gen id = _n) and saved the dataset as "original_file".

I then dropped all variables except id and diag and ran the operation on diag_01, which took several hours.

I then wanted to merge the original file back in using

merge 1:1 id using original_file

But my id variable does not uniquely identify observations in either file. When I look at the data in the Data Editor, I see that at large values the id variable repeats itself.

Does anyone know why this happens, and how to get round it? Should I be specifying the format of the id variable, to make sure it's long enough not to round the large numbers?

Any help would be much appreciated.

Best Wishes

Joe

Nick Cox


17 Jan 2020, 03:39

You need

Code:

gen long id = _n

Specifying or changing the (display) format will neither avoid the problem nor fix it retrospectively.

You need a variable or storage type fit for your purpose with enough bits to ensure that each identifier really is unique.

Code:

. clear

. set obs 4
number of observations (_N) was 0, now 4

. gen long id = 32e6 + _n

. gen bad = 32e6 + _n

. format bad %11.0f

. list

     +---------------------+
     |       id        bad |
     |---------------------|
  1. | 32000001   32000000 |
  2. | 32000002   32000002 |
  3. | 32000003   32000004 |
  4. | 32000004   32000004 |
     +---------------------+
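The workflow from the original question can be sketched end to end with the fix applied. This is only an illustration under stated assumptions: the shipped auto dataset stands in for the 32-million-row file, make plays the role of diag, the variable first is a hypothetical placeholder for the slow labelling step, and a tempfile replaces the saved "original_file".

```stata
* Sketch only: auto is a stand-in for the real data (assumption)
clear all
sysuse auto, clear

* long stores integers exactly up to 2,147,483,620
generate long id = _n

* isid stops with an error if id does not uniquely identify observations
isid id

tempfile original_file
save `original_file'

* make stands in for diag; first is a placeholder for the slow step
keep id make
generate str1 first = substr(make, 1, 1)

* Bring the other variables back; every observation should match
merge 1:1 id using `original_file'
assert _merge == 3
```

Running isid straight after creating the identifier is a cheap way to catch a precision problem before the hours-long step rather than after it.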


William Lisowski


17 Jan 2020, 09:29

Following up on Nick's advice, for future reference here are the limits on storing decimal integers with full accuracy in the various numeric storage types. The integer types (byte, int, long) lose their 27 largest positive values to missing value codes; the similar loss for floating-point variables occurs only at the largest exponent, so it doesn't affect the much smaller integer values.

Type     Precision   Minimum                       Maximum
byte     7 bits      -127                          100
int      15 bits     -32,767                       32,740
long     31 bits     -2,147,483,647                2,147,483,620
float    24 bits     -16,777,216                   16,777,216
double   53 bits     -9,007,199,254,740,992        9,007,199,254,740,992
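Some of these limits can be checked directly in Stata. The following is a small sketch, not part of the original post: it reads the largest nonmissing integer values from c() and uses the float() function to show what happens one step past the float limit.

```stata
* Largest nonmissing values for the integer storage types
display c(maxbyte)
display c(maxint)
display %12.0f c(maxlong)

* 2^24 = 16,777,216 is stored exactly as a float
display %12.0f float(16777216)

* 2^24 + 1 is not representable as a float and rounds back to 2^24
display %12.0f float(16777217)
```

Because gen without a storage type creates a float by default, identifiers above 16,777,216 start to collide, which is what happened with 32 million observations.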


Nick Cox


17 Jan 2020, 09:47

Furthermore, merge lets you merge 1:1 by observation number (merge 1:1 _n). I can't tell whether that would have been a good answer here.
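A merge by observation number might look like the sketch below. It assumes that neither file has been sorted, had rows dropped, or otherwise been reordered between the save and the merge; if that cannot be guaranteed, an explicit long identifier is the safer route. As an illustration only, the auto dataset stands in for the real data, make for diag, and first is a hypothetical placeholder for the labelling step.

```stata
* Sketch only: auto is a stand-in for the real data (assumption)
clear all
sysuse auto, clear

tempfile original_file
save `original_file'

* Keep the one variable to work on; do not sort or drop observations
keep make
generate str1 first = substr(make, 1, 1)

* Match observation 1 with observation 1, 2 with 2, and so on
merge 1:1 _n using `original_file'
assert _merge == 3
```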