Buggy behaviour when creating unique identifier based on row number for variables

Roberto Santos

Join Date: Jun 2017
Posts: 11

16 Aug 2023, 14:53

Hi everyone, I have been experiencing a very weird issue when I try to create a unique numerical ID based on two variables. Basically, by running the command:

Code:

gen unique_identifier = _n

I am not getting a different number for each row for the variable unique_identifier. I am not sure if this is an issue with my machine or a bug. Hence, I have included code below to reproduce the issue (the dataset in which I encountered it is available here).

I am running Stata 17.0 SE on a notebook with Linux Ubuntu 22.04 LTS.

Code:

sort var1 var2
by var1 var2: gen position = _n
tab position
* There are multiple observations sharing the same var1 and var2. I will keep
* only one of each
keep if position==1
* Indeed, I have kept only one row per combination of var1-var2
tab position
* I use the command below to generate a unique ID for each combination of var1
* and var2. unique_identifier should have been the number of the row.
gen unique_identifier = _n
* By inspecting the dataset, one can see something is off: the value for
* unique_identifier in the last row does not match the number of the last row.
* Hence, I create another variable (aux) to keep only cases in which
* unique_identifier is the same for more than one pair of var1-var2.
gen aux = 0
* If unique_identifier is the same in two lines in a row, aux=1 for the second
* line
replace aux=1 if unique_identifier[_n]==unique_identifier[_n-1]
* If unique_identifier is the same in two lines in a row, aux=1 for the first
* line
replace aux=1 if aux[_n+1]==1
keep if aux==1
* By inspecting the dataset, one can see many cases in which different pairs of
* var1-var2 have the same value for unique_identifier

Thanks!

Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

16 Aug 2023, 15:19

I don't download datasets from people I don't know. But let me guess that you have a large data set. When you use -gen- without specifying a data storage type, you get float by default. And float is only good for about 7 digits: it stores integers exactly only up to 16,777,216 (2^24). So if you have more observations than that in your data set, you have blown through the capacity of a float to hold your id numbers. You need -gen long id = _n- or -gen double id = _n-.

Actually, better still is to use, in all circumstances

Code:

gen `c(obs_t)' id = _n

c(obs_t) always contains the most economical storage type that will distinctly serve the purpose of uniquely identifying observations in your data. That way you don't have to know how big your data set will even be when you write the code. Stata will figure it out at run time.
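To make the float limit concrete, here is a minimal sketch on simulated data. The observation count of 17,000,000 is an arbitrary assumption, chosen only because it exceeds 2^24 = 16,777,216. The variable names id_float and id_safe are made up for the illustration.

```stata
* Sketch: reproduce the float precision limit on simulated data.
* 17,000,000 is an arbitrary size above 2^24 = 16,777,216 (assumption).
clear
set obs 17000000

* Explicit float, which is what -gen- uses by default: values collide
gen float id_float = _n

* Storage type picked by Stata at run time: values stay exact
gen `c(obs_t)' id_safe = _n

* Count observations whose stored value differs from the row number
count if id_float != _n
count if id_safe != _n

* -isid- exits with an error if the variable does not identify rows uniquely;
* -capture noisily- shows that error for id_float without stopping the run
capture noisily isid id_float
isid id_safe
```

The first -count- should report a nonzero number, because above 2^24 a float can no longer hold every integer. The second should report zero.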
Roberto Santos

Join Date: Jun 2017

Posts: 11
#3

16 Aug 2023, 16:05

Thanks Clyde Schechter, this solves the issue. Indeed, my dataset is large: it has 18,207,812 observations.
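
For later readers, the aux-based check in post #1 can probably be replaced by -isid- and -duplicates report-. The following is only a sketch on toy data: var1 and var2 mirror the names used in post #1, but the values are made up for illustration.

```stata
* Sketch on toy data; the values below are invented for illustration.
clear
input var1 var2
1 1
1 1
1 2
2 1
2 1
2 2
end

* Keep the first observation of each var1-var2 pair
bysort var1 var2: keep if _n == 1

* Let Stata choose the storage type for the identifier
gen `c(obs_t)' unique_identifier = _n

* Both commands exit with an error if uniqueness fails
isid var1 var2
isid unique_identifier

* Summary of any repeated values (none expected here)
duplicates report unique_identifier
```

On a large data set the same commands apply unchanged, since `c(obs_t)' is evaluated when the -gen- line runs.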