  • Problems with using _n to create id variable

    Dear Statalist

    I have a dataset with 32 million observations and around 30 variables. I want to perform an operation on one variable (diag) to label ICD10 codes. To make this run quicker, I created an id variable (gen id = _n) and saved the dataset as "original_file".

    I then dropped all variables except id and diag, then ran the operation on diag_01 (which took several hours)

    I then wanted to merge the original file using

    merge 1:1 id using original_file

    But my id variable does not uniquely identify observations in either file. When I look at the data in the Data Editor, I see that at large numbers the id variable repeats itself.

    Does anyone know why this happens, and how to get round it? Should I be specifying the format of the id variable, to make sure it's long enough not to round the large numbers?

    Any help would be much appreciated

    Best Wishes

    Joe

  • #2
    You need

    Code:
    gen long id = _n
    Specifying or changing the (display) format will neither avoid the problem nor fix it retrospectively.

    You need a variable or storage type fit for your purpose with enough bits to ensure that each identifier really is unique.

    Code:
    . clear
    
    . set obs 4
    number of observations (_N) was 0, now 4
    
    . gen long id = 32e6 + _n
    
    . gen bad = 32e6 + _n
    
    . format bad %11.0f
    
    . list
    
         +---------------------+
         |       id        bad |
         |---------------------|
      1. | 32000001   32000000 |
      2. | 32000002   32000002 |
      3. | 32000003   32000004 |
      4. | 32000004   32000004 |
         +---------------------+
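
    Applied to the files in the question, the fix might look like the sketch below. It assumes that original_file.dta is still in the order it had when the float id was created, and that it is run from a do-file; isid is used only as a check that the new identifier really is unique.

    ```stata
    * Sketch only; assumes original_file.dta has not been re-sorted
    * since the float id was created.
    use original_file, clear
    capture drop id              // discard the float id, which contains duplicates
    gen long id = _n             // long stores integers exactly up to 2,147,483,620
    isid id                      // stops with an error if id is not unique
    save original_file, replace
    ```

    The reduced file would need its id regenerated in the same way, and the two ids only line up if that file's order is unchanged too.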




    • #3
      Following up on Nick's advice, for future reference here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.
      type     integer bits   minimum                      maximum
      byte     7              -127                         100
      int      15             -32,767                      32,740
      long     31             -2,147,483,647               2,147,483,620
      float    24             -16,777,216                  16,777,216
      double   53             -9,007,199,254,740,992       9,007,199,254,740,992
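
      These limits can be checked from within Stata. The lines below are a small sketch using the c() system values and the float() function; the %12.0f format is only there so that the values display in full rather than in scientific notation.

      ```stata
      * Largest nonmissing values for the integer types, as reported by Stata itself
      display %12.0f c(maxbyte)
      display %12.0f c(maxint)
      display %12.0f c(maxlong)

      * float() rounds its argument to float precision:
      * 2^24 + 1 is not representable and comes back as 16777216
      display %12.0f float(16777217)
      ```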



      • #4
        Furthermore, merge lets you merge 1:1 by observation number. I can't tell whether that was a good answer here.
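
        A minimal sketch of that approach, in case it is useful. The file name diag_labelled is hypothetical and stands for the reduced file after the labelling step. It assumes both files have the same number of observations in the same order, with no sorting or dropping in between.

        ```stata
        * Sketch only: diag_labelled is a hypothetical name for the processed file
        use diag_labelled, clear
        merge 1:1 _n using original_file   // match observation 1 to 1, 2 to 2, ...
        assert _merge == 3                 // every observation should be matched
        ```

        For variables present in both files, merge keeps the values from the file in memory by default, so the processed diag would be retained.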

