Data being destroyed while saving

Jack Reimer

Join Date: Sep 2018

Posts: 52
#1

11 Jul 2022, 15:27

I'm having a very weird problem with data being destroyed while writing/reading into Stata-MP on a slurm cluster.

After data cleaning, I run isid and everything is fine; then I save the file as a dta. But when I immediately read the exact same file back in, isid fails. One third of my 66M records have their ID values erased in the process. One of the ID variables is numeric and the other is a float (formatted as a date). Has anyone experienced this problem? Is this likely a server problem, or is there something about the ID variables themselves that I could be unclear about?

Thank you. Explanatory code below. Unfortunately, my data are private so I can't share a sample.

Code:

    . isid hhd_id date_DT
    .
    . * write
    . save $use_data/household_days, replace
    file /projects/project_name/data/use_data/household_days.dta saved
    .
    . use $use_data/household_days, clear
    .
    . isid hhd_id date_DT
    variables hhd_id and date_DT do not uniquely identify the observations
    r(459);

Tags: float, isid, numeric, save, use
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#2

11 Jul 2022, 15:37

No, you must share a sample. I don't care if the data are private; the FAQ explicitly tells you either to change the numbers or otherwise mask the variables so that they aren't their real values, or to get your problem to reproduce on a toy dataset.

Thus far, I can't tell what the issue is aside from that duplicates exist. Show me what

Code:

    duplicates report hhd_id date_DT

returns

Edit: the likely solution lies in all the code you wrote before you got to this point. So we'll need more code, and ideally data, to answer this.

Last edited by Jared Greathouse; 11 Jul 2022, 15:41.
Ali Atia

Join Date: May 2020

Posts: 737
#3

11 Jul 2022, 18:01

My assumption is that this has something to do with the server to which you are writing/reading, rather than any problem with code or data.

I cannot replicate your problem when saving to a local device -- the code below returns no errors when run on my machine:

Code:

    clear
    set obs 50000000
    egen id = seq(), f(1) b(100000)
    bys id: gen date = _n
    isid id date
    global dir "C:\Users\username\Desktop"
    save $dir/test, replace
    use $dir/test, clear
    isid id date

William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

11 Jul 2022, 18:37

You might try changing your code to write to and read from a temporary file, which (barring a truly peculiar Stata configuration) should be guaranteed to exist on a local filesystem for performance reasons, or at least, on a different filesystem than your permanent dataset. In post #3 Ali Atia makes the important point that his example is run on a filesystem attached to his computer, rather than on a network drive, or on some cloud device.

Code:

    isid hhd_id date_DT
    tempfile foo
    save "`foo'"
    use "`foo'", clear
    isid hhd_id date_DT

Last edited by William Lisowski; 11 Jul 2022, 18:52.
Jack Reimer

Join Date: Sep 2018

Posts: 52
#5

11 Jul 2022, 21:18

Hi - Thank you very much for these helpful responses.

I have not been able to replicate the problem with William's suggestion of using tempfiles, which gives me greater credence that it has to do with how it is being written to disk storage. Moreover, I learned that it is actually deleting all variables for this subset of observations, not just the identifying variables or variables of a particular type. I've also replicated the problem in both MP and SE, in case it was some type of data race scenario.

Unfortunately, my codebase is such that I can't use `tempfiles' at this juncture so I will update here if I can come up with a better answer.
Joseph Coveney

Join Date: Apr 2014

Posts: 4540
#6

11 Jul 2022, 22:03

Originally posted by Jack Reimer

Code:

    * write
    . save $use_data/household_days, replace

Try saving the cleaned-up dataset under a new name.

That is, for the sake of I/O efficiency, maybe your SLURM cluster is feeding you back a more accessible cached instance of the original uncleaned dataset that has all of the problems that you thought that you had just taken care of.
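A minimal sketch of that suggestion, reusing the variable names and the $use_data global from post #1. The file name household_days_clean is only a placeholder I made up; any name that has never been written to that directory would do.

```stata
* household_days_clean is a hypothetical, never-before-used file name
save $use_data/household_days_clean
use $use_data/household_days_clean, clear
isid hhd_id date_DT
```

Leaving off the replace option means save stops with an error if a file of that name already exists, so you would know right away that you are not starting from a fresh file.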

You can also check the datasignature to see whether the one you're using is the same one that you just saved. Inserting a few lines into what you post above:

Code:

    isid hhd_id date_DT
    * write
    datasignature
    local new_datasignature `r(datasignature)'
    save $use_data/household_days, replace
    use $use_data/household_days, clear
    datasignature
    assert "`r(datasignature)'" == "`new_datasignature'"
    isid hhd_id date_DT

Joseph Coveney

Join Date: Apr 2014

Posts: 4540
#7

11 Jul 2022, 22:13

Originally posted by Jack Reimer

a greater credence that it has to do with how it is being written to disk storage.

You can use the dataset signature features to look into that, too, for example, with datasignature set and datasignature confirm.
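A rough sketch of how that might look, assuming the same variable names and $use_data global as in post #1; treat it as an illustration rather than a tested recipe for your setup.

```stata
isid hhd_id date_DT

* Store the signature inside the dataset; reset overwrites any earlier one
datasignature set, reset
save $use_data/household_days, replace

use $use_data/household_days, clear
* Stops with an error if the data no longer match the stored signature
datasignature confirm
isid hhd_id date_DT
```

Because the signature travels with the saved file, datasignature confirm can also be run in a later session to check that what is on disk is what you saved.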
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

12 Jul 2022, 09:07

my codebase is such that I can't use `tempfiles' at this juncture

It was not my suggestion that you actually use a tempfile for more than what it accomplished - verifying that the problem is somehow related to the filesystem to which the data is being written.

If it is being written to a network filesystem or a filesystem linked to cloud storage, then reports of problems like yours are common on Statalist. If you need to store your data in that particular location, I think you would be well advised to write the data to a standard filesystem (not networked or linked to cloud storage) and, at the end of your job, use the Linux mv command, rather than Stata processes, to relocate the dataset to its ultimate destination.
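One way this could be sketched from inside a do-file is below. The scratch global and the /tmp location are my assumptions, not something from your posts; you would need to confirm with your cluster's documentation which directory is actually local to the node. The mv could just as well go in the job script after Stata exits.

```stata
* Hypothetical node-local scratch directory; adjust for your cluster
global scratch "/tmp"

* Write the dataset to the local filesystem first
save "$scratch/household_days.dta", replace

* Relocate it with the operating system's mv rather than Stata's file commands
shell mv "$scratch/household_days.dta" "$use_data/household_days.dta"
```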

If you think the filesystems reported in the output of the save commands from the code in posts #1 and #5 are both standard filesystems on the slurm cluster, then you might contact the cluster support team and ask them what the difference is between those two filesystems that accounts for the problems you are seeing.
Jack Reimer

Join Date: Sep 2018

Posts: 52
#9

13 Jul 2022, 09:46

Hello,

Ultimately, saving under a new name, as Joseph suggested, remediated the issue. I was able to confirm that it is something about the save process, not the `use` process, because the problem was already apparent when I instead tried to read the dta file into R.

Quite odd, but these suggestions were all quite helpful and helped resolve this issue.
