
  • Mysterious Duplicates Emerging when Dataset is Closed and Immediately Reopened

    Hello Stata Forum,
    I am currently experiencing a mysterious issue where Stata seems to be "hallucinating" duplicates when saving, closing, and reopening large datasets.

    For context, I am running Stata 18 on an Ubuntu server instance with 16 vCPUs and 124 GB of RAM. The disk is nowhere near full. I have been working on this project for a long time, and I manually set version 14 to stay consistent with older .do files. The dataset I will be discussing is about 5.2 GB.

    I discovered this problem when running older .do files involved in the final dataset creation process; they worked perfectly in 2024. One .do file ("importData.do") imports and compiles a number of datasets together. There should be absolutely no duplicates by id_s and year, because I take steps throughout to ensure this. Here is what happens next:

    Code:
    * Export data
    .         sort id_s year, stable
    
    .        
    .         duplicates report
    
    Duplicates in terms of all variables
    
    --------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            1 |      5693769             0
    --------------------------------------
    
    .        
    .         count if id_s ==0
      0
    
    .
    .         save Fdata.dta, replace
    file
        XXXXXXfilepathXXXXXX.dta saved
    
    .        
    .         stop
    command stop is unrecognized
    r(199);
    
    end of do-file
    
    r(199);
    
    . do "/tmp/XXXXXXXXXX"
    
    . * Reopen and check duplicates  
    .         use Fdata.dta, clear
    
    .        
    .         duplicates report
    
    Duplicates in terms of all variables
    
    --------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            1 |      5520883             0
            8 |            8             7
       172878 |       172878        172877
    --------------------------------------
    
    .        
    .         count if id_s ==0
      172,937

    I've redacted some file names and paths for privacy. Essentially, I observe the following:
    1. I generate the final dataset
    2. I check the number of duplicates in the data pre-save and verify that there are no observations where the id variable is equal to 0.
    3. I save this dataset on the server.
    4. After a break, I open the EXACT same dataset I have just saved.
    5. I check the duplicates count again and find that the same dataset now contains somewhere between roughly 4,500 and 200,000 duplicates (the number is UNSTABLE across runs!), all of them with id_s equal to 0. Almost all of these duplicates are rows in which every variable equals 0.
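    One additional check I am considering (sketched below, not yet run) is to fingerprint the dataset just before saving and then confirm that fingerprint after reopening, to pin down whether the file on disk or the read-back is what changes:
    Code:
    * Illustrative sketch, not yet run: checksum the data right before saving,
    * then confirm the checksum after reopening the saved file.
    sort id_s year, stable
    datasignature set, reset      // store a data signature (checksum) with the dataset
    save Fdata.dta, replace
    
    use Fdata.dta, clear
    datasignature confirm         // error if the reopened data no longer match the signature
    isid id_s year                // error if id_s and year no longer uniquely identify rows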
    In an attempt to debug, I also tried saving the data as .csv and importing it back into Stata from the .csv whenever I needed it. This successfully circumvented the strange duplicates issue, but unfortunately the CSV round trip demoted some variables from double to float, which appeared to affect my estimates. I cannot compress these variables further without losing information.
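    For reference, the CSV round trip I tried looked roughly like the lines below (the file name is a placeholder). I have not yet tested whether the asdouble option of import delimited would avoid the demotion from double to float on re-import:
    Code:
    * Rough sketch of the CSV workaround (file name is a placeholder).
    export delimited using Fdata.csv, replace
    * asdouble stores floating-point columns as doubles rather than floats,
    * though I am not sure the exported text itself keeps full double precision.
    import delimited using Fdata.csv, clear asdouble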

    Has anyone encountered an issue like this before? I suspect it may be a problem with the server instance, since the exact same code previously ran without issues and did not produce duplicates on reopening. At this point, I am looking for any and all suggestions!

    This is my first time posting on the Stata forum, so thank you in advance for your patience!

    Best,
    Lucas

  • #2
    This appears to be a precision issue. Do you face the same problem if you instead do
    Code:
    count if id_s == float(0)
    Do you get the same problem if you set all (numeric) variables to be of type double?
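    For example, one way to recast every float variable to double before saving would be along these lines (extend to other numeric types if needed):
    Code:
    * Illustrative: recast all float variables to double before saving.
    ds, has(type float)
    foreach v in `r(varlist)' {
        recast double `v'
    }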

    • #3
      Thank you for your help. I've found a solution using frames (Stata's data frames), and everything seems to be OK now. Thanks!
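      For anyone who finds this thread later: the idea is to keep the compiled data in its own frame in memory rather than round-tripping through disk. Very roughly (the frame name below is a placeholder, not my exact code):
      Code:
      * Rough sketch of the frames approach (frame name is a placeholder).
      frame copy default results, replace
      frame results {
          duplicates report
          count if id_s == 0
      }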
