  • Duplicates in very large dataset

    Hi all,

    I am trying to merge two huge datasets.
    To do so, I am generating a unique identifier as
    Code:
    gen id_mas = _n
    However, when I check for duplicates, Stata reports that there are some, even though the numbers displayed are different.
    For instance:

    [Image: Screenshot 2022-11-04 at 11.26.35 AM.png]

    As you can see, the number is displayed as 2.33e+07, but it is precisely 23309572. The number below is also displayed as 2.33e+07, but it is 23309514. So they are uniquely defined, but Stata seems to care only about the rounded value.
    How can I solve this issue and tell Stata that these are two separate numbers?

    Thank you

  • #2
    Your issue is one of precision. By default, Stata creates numeric variables of type float, which are accurate only to about seven digits. Since you have eight digits, this falls short for you. If your id variables are no more than 9 digits long, you should store them as long; if they can be larger than that, you should use double. So your code would look like:
    Code:
    gen long id_mas = _n
    Another trick that works is
    Code:
    gen `c(obs_t)' id_mas = _n
    whereby Stata automatically chooses the best storage type given the number of observations you have.

    You might want to look at
    Code:
    help data types
    to understand the precision issue, and
    Code:
    help creturn##currentdta
    for the latter trick.
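
    As a rough illustration of the precision limit (a sketch only: it assumes a dataset is already in memory and that any earlier float version of id_mas can be dropped), float cannot represent every integer above 2^24 = 16,777,216, so distinct values of _n can collapse onto the same stored number:
    Code:
    * 16777217 is not representable as a float; this should display 16777216
    display %12.0f float(16777217)
    * discard the float version, if any, and rebuild the counter as long
    capture drop id_mas
    gen long id_mas = _n
    * fixed display format so that all digits are shown
    format id_mas %12.0f
    * exits with an error if id_mas does not uniquely identify observations
    isid id_mas
    Note that recasting an existing float variable to long would not recover the lost digits; the variable has to be generated again.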
    Last edited by Hemanshu Kumar; 04 Nov 2022, 04:59.


    • #3
      Hi Federico,
      Two thoughts:
      1) If you are going to do a 1:1 merge on just an index, as you indicate, you can simply type
      Code:
      merge 1:1 _n using file.dta
      This way you do not need to create the ID you mention.
      2) If you still want to create the ID, you need to use a different storage type:
      Code:
      gen double id=_n
      or
      Code:
      gen long id=_n

      HTH
      F


      • #4
        Thank you all for the kind replies. I see the problem. As for the merge: yes, I would like to merge, but it is not a 1:1 merge. I have posted a question in another thread, as the DDBs are huge and I have computational issues with traditional tools.
