how do you fix a corrupt data file?

László Sándor

Join Date: Apr 2014

Posts: 120
#1

11 Aug 2015, 06:20

I took over a project in which one of the old raw data files seems to be corrupted; at least, that was the answer I got here to my previous question (where I also shared the large file, if you are interested):
http://www.statalist.org/forums/foru...too-long-r-688

If so, are there tools that might fix a file from September 2008 (presumably made with version 10.1 or 10, if not 9.2) for use in version 14?

I understand this must depend heavily on the nature of the corruption, but maybe there are some general tools to consider.

Thanks,

Laszlo
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#2

11 Aug 2015, 08:40

Do you have access to Stata 11 or 12? Older versions of Stata may be able to open the file. See this message for an explanation.
László Sándor

Join Date: Apr 2014

Posts: 120
#3

11 Aug 2015, 08:56

This is a great post, though a deeply disturbing practice. I don't have access to such old versions, but I can ask around. Thanks.
daniel klein

Join Date: Mar 2014

Posts: 3859
#4

11 Aug 2015, 09:40

FYI

The file László uploaded causes Stata 12.1 (fully updated, Win 7) to crash when I try to use it. By crash I mean that Stata freezes, then a new window tells me that it has stopped working and asks whether I want to search for a solution online. However, describe using the file works without problems. With Stata 11.2 (fully updated, same machine) I can use the file (after adjusting the memory setting) and save it. I can then use it in Stata 12.

I currently have no access to Stata 13 or 14.

The file format is 113 (i.e., 0x71), meaning that it was created with Stata 8, according to (this).
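
As a rough illustration of how the format number can be checked without Stata: in the older .dta formats the first byte of the file is the format number, so 113 shows up as 0x71. The sketch below is Python rather than Stata, the function names are made up for this example, and the release labels are an assumption based on Stata's file format documentation. It is a quick check, not a validator.

```python
import io

# Assumption: release labels taken from Stata's .dta format documentation.
KNOWN_FORMATS = {113: "Stata 8 or 9", 114: "Stata 10 or 11", 115: "Stata 12"}


def read_format_byte(stream):
    """Return the first byte of a binary stream as an integer."""
    first = stream.read(1)
    if not first:
        raise ValueError("empty file")
    return first[0]


def describe_format(fmt):
    """Turn a format number into a short human-readable label."""
    if fmt == ord("<"):
        # Formats 117 and later start with the text "<stata_dta>" instead
        # of a version byte, so a leading "<" is only a hint, not proof.
        return "format 117 or later (Stata 13+), probably"
    label = KNOWN_FORMATS.get(fmt, "unknown release")
    return f"format {fmt} ({label})"


# Demo on four in-memory bytes shaped like the start of a format-113 file.
sample = io.BytesIO(bytes([0x71, 0x02, 0x01, 0x00]))
print(describe_format(read_format_byte(sample)))  # format 113 (Stata 8 or 9)
```

To check a real file, pass a handle opened in binary mode, for example open(path, "rb"), instead of the in-memory sample.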

Best
Daniel

Last edited by daniel klein; 11 Aug 2015, 09:51.
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#5

11 Aug 2015, 10:21

Using the information in Daniel's message, I opened the file in Stata 11, saved it, then opened it in Stata 14 and saved it again. The resulting file (21 MB compressed) is available at this link: https://www.dropbox.com/s/uxojnwxvoc...w5_14.zip?dl=1
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

11 Aug 2015, 10:32

László Sándor has posted a Stata file which he can't open in Stata. daniel klein has correctly diagnosed that the dataset was produced by Stata 8 (or Stata 9).
The dataset is indeed corrupt.
Standard Stata will not be able to open this file.
Standard Stata does not contain recovery options to allow partial content extraction in this case. It either opens the whole file, or stops with an error message.
The file declares 1028 variables, but contains valid descriptors for about 640 of them:
http://www.radyakin.org/statalist/20...11_1305683.txt

The corruption occurs in the header section, which means that any tool that obeys the header instructions and parses data according to them will fail to recover the data correctly. However, the data itself may be intact (if the corruption was only in one spot). With some computations one can probably establish the way to recover it (though at this time it is not clear whether it is going to be 640 or 1028 variables actually in the file).
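
To show where a declared variable count comes from, here is a minimal Python sketch that reads only the fixed-size fields at the very start of an old-format file. It assumes the format 113 layout given in Stata's file format documentation: a 109-byte block holding the format byte, a byte-order flag, two further bytes, a 2-byte variable count, a 4-byte observation count, an 81-byte label, and an 18-byte time stamp. The function names are hypothetical, the counts are treated as unsigned for simplicity, and the sketch does not touch the variable descriptors that follow, so it cannot repair anything.

```python
import io
import struct

HEADER_SIZE = 109  # 1 + 1 + 1 + 1 + 2 + 4 + 81 + 18 bytes


def _text(raw):
    """Decode a NUL-terminated field; latin-1 never fails on arbitrary bytes."""
    return raw.split(b"\0", 1)[0].decode("latin-1")


def read_old_header(stream):
    """Parse the fixed-size leading fields of a format-113-style .dta file."""
    raw = stream.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("file is shorter than the fixed header")
    if raw[1] == 1:
        endian = ">"  # HILO: most significant byte first
    elif raw[1] == 2:
        endian = "<"  # LOHI: least significant byte first
    else:
        raise ValueError("unexpected byte-order flag: %d" % raw[1])
    # Assumption: counts are read as unsigned 2-byte and 4-byte integers.
    nvar, nobs = struct.unpack(endian + "HI", raw[4:10])
    return {
        "format": raw[0],
        "nvar": nvar,
        "nobs": nobs,
        "label": _text(raw[10:91]),
        "time_stamp": _text(raw[91:109]),
    }


# Demo on a synthetic header (not the real file); the numbers echo the thread.
fake = (bytes([113, 2, 1, 0]) + struct.pack("<HI", 1028, 382778)
        + b"demo label".ljust(81, b"\0")
        + b"19 Sep 2008 12:52".ljust(18, b"\0"))
info = read_old_header(io.BytesIO(fake))
print(info["nvar"], info["nobs"], info["time_stamp"])
```

A declared count read this way says nothing about how many descriptors are actually valid; establishing that would require walking the sections that follow, which this sketch deliberately leaves out.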

Friedrich Huebler quoted my earlier message, which was about a corrupt file. StataCorp reacted swiftly to investigate the cause of the issue. The following page contains a more detailed description of the situation:
http://www.radyakin.org/statalist/st...plete_file.htm

I believe that, as a result of this report, the data loading facility of Stata 13 was significantly improved and peppered with helpful error messages that give a better understanding of what is going on when something goes wrong. In particular, if one tries to open the file posted by Laszlo in Stata 13, the following error message is displayed:
characteristics too long
r(688);

And while the error message does not correspond to the real problem in the file, we have the following two improvements:
1) Stata detected that there is "something" wrong with the file;
2) Stata doesn't crash now.

Since the file is probably SIPP data for 2004, requesting the original dataset from the data provider would probably be a faster and cheaper option for Laszlo than commissioning a data recovery operation.

Finally, I should note that Laszlo has shared the file via Dropbox. The corruption observed in the file is consistent with the file having been saved to a resource like Dropbox, Google Drive, or a similar technology. If you are reading this message and synchronizing your data over such a technology, stop now! Disconnect, finish your work, upload your results. Do not sync in the background or this will happen again. Laszlo was lucky that the corruption of his file occurred in the header section. If a similar corruption had occurred in the middle of the data section, somewhere within the 382,778 observations, he might never have had a hint and might have produced incorrect analyses/results/conclusions.
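
One generic way to notice silent changes of this kind is to record a checksum of the file before it is moved or synced and compare it afterwards. The following is a minimal Python sketch of that idea using SHA-256. The function names are made up for this example, and the approach only detects a change; it cannot tell where in the file the change happened.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary stream in chunks so large files need little memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path):
    """Convenience wrapper: hash the file stored at `path`."""
    with open(path, "rb") as handle:
        return sha256_of_stream(handle)


# Demo on in-memory bytes: a single changed byte gives a different digest.
original = io.BytesIO(b"\x71\x02\x01\x00" + b"\x00" * 16)
damaged = io.BytesIO(b"\x71\x02\x01\x00" + b"\x00" * 15 + b"\x01")
print(sha256_of_stream(original) == sha256_of_stream(damaged))  # False
```

Storing the digest next to the archived data makes it possible to confirm later that a copy is still bit-for-bit the same file.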

Finally finally, Dropbox has had problems (at least for me) downloading folders at some point (around 2013). I can only guess it had something to do with zipping. Downloading individual files helped, but left a bad feeling.

Hope this helps.
Sergiy Radyakin

Attached Files

sipp04w5_vars.txt (8.2 KB, 1 view)
László Sándor

Join Date: Apr 2014

Posts: 120
#7

12 Aug 2015, 16:59

Thanks, Sergiy.

This is not the gist of this post, but I feel obliged to comment on Dropbox: I acquired the corrupted files over a network with rsync, not by storing this data on Dropbox. I'm at a loss as to how this corruption occurred, as the dta files had been zipped into a single (humungous) archive, and I would have thought that corruption of a zip would manifest itself differently than as smooth unarchiving with data loss. It is even more confusing that SIPP 2004 waves 5, 6, 7, 9, and 10 were affected by the same (?) problem, but not the rest.

Yes, I can redownload the data, but it has been revised since. For replication and code verification, I need to determine whether the code we have on record differs from what produced the analysis data file, or whether it reproduces that file bit by bit. That is only possible with the original data at hand, not with more recent versions.

Thanks again!
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#8

13 Aug 2015, 14:38

Laszlo, check this source:
http://www.nber.org/sipp/2004/
it may be updated at a different pace than
http://www.census.gov/sipp/

The data provider will probably have archived copies of earlier released data. Just let them know which revision you need. If this is of any help, then the date saved in your file is:
19 Sep 2008 12:52

Since this is a fairly popular dataset, your colleagues might also have copies. If you stored anything on a network drive, your network administrator might have made backups of it as well, depending on the policies in your organization. Just guessing here.

Best, Sergiy