Duplicates in very large dataset

Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#1

04 Nov 2022, 04:38

Hi all,

I am trying to merge two huge datasets.
To do so, I am generating a unique identifier as

Code:

gen id_mas = _n

However, when I check for duplicates, Stata reports that there are some, even though the underlying numbers are different.
For instance, one number is displayed as 2.33e+07 but is precisely 23309572. The number below it is also displayed as 2.33e+07 but is 23309514. So they are uniquely defined, but Stata seems to care only about the rounded value.
How can I solve this issue and tell Stata that these are two separate numbers?

Thank you
Tags: data, panel data, Suggestion
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1548
#2

04 Nov 2022, 04:51

Your issue is one of precision. By default, Stata creates numeric variables of type float, which are accurate only to about seven digits; a float stores every integer exactly only up to 16,777,216. Since your ids have eight digits, float falls short. If your id variables are no more than 9 digits long, you should store them as long; if they can be larger than that, you should use double. So your code would look like:

Code:

gen long id_mas = _n

Another trick that works is

Code:

gen `c(obs_t)' id_mas = _n

whereby Stata automatically chooses the best storage type given the number of observations you have.

You might want to look at

Code:

help data types

to understand the precision issue, and

Code:

help creturn##currentdta

for the latter trick.
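
For anyone who wants to see the precision issue in isolation, here is a minimal sketch, not taken from the original data: the variable names id_float and id_long and the five-observation dataset are made up for illustration. It starts the ids just below 16,777,216 (2^24), the point past which a float can no longer hold every integer, and it assumes you run it from a do-file so that the // comments are accepted.

```stata
* Illustrative only: id_float and id_long are hypothetical names
clear
set obs 5
gen float id_float = 16777215 + _n   // float is also what a plain -gen- uses by default
gen long  id_long  = 16777215 + _n   // long stores these integers exactly
format id_float id_long %12.0f       // show all digits rather than scientific notation
list
duplicates report id_float           // some stored values occur more than once
isid id_long                         // silent: id_long uniquely identifies the observations
```

The format line only changes how the values are displayed. The duplicates come from the storage type, which is why declaring long or double at creation is what fixes them.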

Last edited by Hemanshu Kumar; 04 Nov 2022, 04:59.
FernandoRios

Join Date: Apr 2014

Posts: 2532
#3

04 Nov 2022, 04:55

Hi Federico,
Two thoughts:
1) If you are going to do a 1-to-1 merge on just the observation index, as you indicate, you can simply type
merge 1:1 _n using file.dta
This way you do not need to create the ID you mention.
2) If you still want to create the ID, you need to give it a different storage type:
gen double id=_n
or
gen long id=_n
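
As a rough sketch of the first suggestion, the snippet below builds two tiny datasets and joins them by observation number. Everything in it (the variables x and y, the tempfile name other, the three observations) is hypothetical and only stands in for the real files. Run it in one go from a do-file so the tempfile stays defined.

```stata
* Illustrative only: x, y and the tempfile name are made up
clear
set obs 3
gen x = _n
tempfile other
save `other'

clear
set obs 3
gen y = 10 * _n
merge 1:1 _n using `other'   // pairs observation 1 with 1, 2 with 2, and so on
list
```

Because the match is purely positional, this is only appropriate when both files are already in the same order.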

HTH
F
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#4

04 Nov 2022, 08:26

Thank you all for the kind replies. I see the problem. As for the merge: yes, I would like to merge, but it is not a 1:1 merge. I have posted a question in another thread, as the DDBs are huge and I run into computational issues with traditional tools.