Hello Stata Forum,
I am currently experiencing a mysterious issue where Stata seems to be "hallucinating" duplicates when saving, closing, and reopening large datasets.
For context, I am running Stata 18 on an Ubuntu server instance with 16 vCPUs and 124 GB of RAM. The disk is nowhere near full. I have been working on this project for a long time, and I manually set the version to 14 to stay consistent with older .do files. The dataset I will be discussing is about 5.2 GB.
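Concretely, by "setting the version" I mean a line like this near the top of each .do file (a sketch of my setup; the surrounding contents are omitted):

Code:
* top of each project .do file
version 14    // interpret commands under Stata 14 rules, for consistency with the older scripts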
To begin, I discovered this problem when running older .do files involved in the final dataset creation process, which worked perfectly in 2024. One .do file ("importData.do") imports and compiles a number of datasets together. There should be absolutely no duplicates by id_s and year, because I take steps throughout to ensure this. Here is what happens next:
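The checks I use throughout importData.do look roughly like this (a minimal sketch; id_s and year are from my data, the rest is illustrative):

Code:
* after each merge/append step in importData.do
isid id_s year      // abort with an error if id_s-year is not a unique key
assert id_s != 0    // no observation should have a zero id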
Code:
* Export data
. sort id_s year, stable
. 
. duplicates report

Duplicates in terms of all variables

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |      5693769             0
--------------------------------------

. 
. count if id_s ==0
  0

. 
. save Fdata.dta, replace
file XXXXXXfilepathXXXXXX.dta saved

. 
. stop
command stop is unrecognized
r(199);

end of do-file

r(199);

. do "/tmp/XXXXXXXXXX"

. * Reopen and check duplicates
. use Fdata.dta, clear

. 
. duplicates report

Duplicates in terms of all variables

--------------------------------------
   Copies | Observations       Surplus
----------+---------------------------
        1 |      5520883             0
        8 |            8             7
   172878 |       172878        172877
--------------------------------------

. 
. count if id_s ==0
  172,937
I've commented out some file names and such for privacy. Essentially, I observe the following:
- I generate the final dataset
- I check the number of duplicates in the data pre-save and verify that there are no observations where the id variable is equal to 0.
- I save this dataset on the server.
- After a break, I open the EXACT same dataset I have just saved.
- I check the duplicates count again and find that the very same dataset now contains roughly 4,500 to 200,000 duplicates (the number is UNSTABLE across runs!), all of them with id equal to 0. Almost all of these duplicates are rows in which every variable equals 0.
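In Stata terms, the whole sequence boils down to the following (a minimal sketch; Fdata.dta stands in for the redacted file path):

Code:
* pre-save checks (all pass)
sort id_s year, stable
duplicates report        // 0 surplus copies
count if id_s == 0       // 0 observations
save Fdata.dta, replace

* ... exit Stata, reopen later ...
use Fdata.dta, clear
duplicates report        // thousands of surplus copies now appear
count if id_s == 0       // large, unstable count of all-zero rows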
Has anyone encountered an issue like this before? I suspect it might be a problem with the server instance, as the exact same code previously ran without issues and did not produce duplicates on reopening. At this point, I am looking for any and all suggestions!
This is my first time posting on the Stata forum, so thank you in advance for your patience!
Best,
Lucas