
  • Mysterious Duplicates Emerging when Dataset is Closed and Immediately Reopened

    Hello Stata Forum,
    I am currently experiencing a mysterious issue where Stata seems to be "hallucinating" duplicates when saving, closing, and reopening large datasets.

    For context, I am running Stata 18 on an Ubuntu server instance with 16 vCPUs and 124 GB of RAM. The disk is nowhere near full. I have been working on this project for a long time, and I manually set version 14 to stay consistent with older .do files. The dataset I will be discussing is about 5.2 GB.

    I discovered this problem when running older .do files involved in the final dataset creation process; they worked perfectly in 2024. One .do file ("importData.do") imports and compiles a number of datasets together. There should be absolutely no duplicates by id_s and year, because I take steps throughout to ensure this. Here is what happens next:

    Code:
    * Export data
    .         sort id_s year, stable
    
    .        
    .         duplicates report
    
    Duplicates in terms of all variables
    
    --------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            1 |      5693769             0
    --------------------------------------
    
    .        
    .         count if id_s ==0
      0
    
    .
    .         save Fdata.dta, replace
    file
        XXXXXXfilepathXXXXXX.dta saved
    
    .        
    .         stop
    command stop is unrecognized
    r(199);
    
    end of do-file
    
    r(199);
    
    . do "/tmp/XXXXXXXXXX"
    
    . * Reopen and check duplicates  
    .         use Fdata.dta, clear
    
    .        
    .         duplicates report
    
    Duplicates in terms of all variables
    
    --------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            1 |      5520883             0
            8 |            8             7
       172878 |       172878        172877
    --------------------------------------
    
    .        
    .         count if id_s ==0
      172,937

    I've redacted some file names and paths for privacy. Essentially, I observe the following:
    1. I generate the final dataset
    2. I check the number of duplicates in the data pre-save and verify that there are no observations where the id variable is equal to 0.
    3. I save this dataset on the server.
    4. After a break, I open the EXACT same dataset I have just saved.
    5. I check the duplicates count again and find that the same dataset now contains somewhere between roughly 4,500 and 200,000 duplicates (the number is UNSTABLE across runs!), all of them with id_s equal to 0. Almost all of these duplicates are rows in which every variable equals 0.
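    One additional check I am considering (sketched below, not yet run) is to fingerprint the dataset just before saving and then confirm that fingerprint after reopening, to pin down whether the file on disk or the read-back is what changes:
    Code:
    * Illustrative sketch, not yet run: checksum the data right before saving,
    * then confirm the checksum after reopening the saved file.
    sort id_s year, stable
    datasignature set, reset      // store a data signature (checksum) with the dataset
    save Fdata.dta, replace
    
    use Fdata.dta, clear
    datasignature confirm         // error if the reopened data no longer match the signature
    isid id_s year                // error if id_s and year no longer uniquely identify rows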
    In an attempt to debug, I also tried saving the data as .csv and importing it back into Stata from the .csv whenever I needed it. This successfully circumvented the strange duplicates issue, but unfortunately the CSV round trip demoted some variables from double to float, which appeared to affect my estimates. I cannot compress these variables further without losing information.
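    For reference, the CSV round trip I tried looked roughly like the lines below (the file name is a placeholder). I have not yet tested whether the asdouble option of import delimited would avoid the demotion from double to float on re-import:
    Code:
    * Rough sketch of the CSV workaround (file name is a placeholder).
    export delimited using Fdata.csv, replace
    * asdouble stores floating-point columns as doubles rather than floats,
    * though I am not sure the exported text itself keeps full double precision.
    import delimited using Fdata.csv, clear asdouble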

    Has anyone encountered an issue like this before? I suspect it may be a problem with the server instance, since the exact same code previously ran without issues and did not produce duplicates on reopening. At this point, I am looking for any and all suggestions!

    This is my first time posting on the Stata forum, so thank you in advance for your patience!

    Best,
    Lucas

  • #2
    This appears to be a precision issue. Do you face the same problem if you instead do
    Code:
    count if id_s == float(0)
    Do you get the same problem if you set all (numeric) variables to be of type double?
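    For example, one way to recast every float variable to double before saving would be along these lines (extend to other numeric types if needed):
    Code:
    * Illustrative: recast all float variables to double before saving.
    ds, has(type float)
    foreach v in `r(varlist)' {
        recast double `v'
    }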

    • #3
      Thank you for your help. I've found a solution using frames (Stata's data frames), and everything seems to be OK now. Thanks!
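      For anyone who finds this thread later: the idea is to keep the compiled data in its own frame in memory rather than round-tripping through disk. Very roughly (the frame name below is a placeholder, not my exact code):
      Code:
      * Rough sketch of the frames approach (frame name is a placeholder).
      frame copy default results, replace
      frame results {
          duplicates report
          count if id_s == 0
      }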
