.dta file corrupting

Natalie Theys

Join Date: Jan 2018

Posts: 3
#1

04 Jan 2018, 13:54

Hi everyone,
Thanks in advance for your advice! I am using Stata 15 on Windows, although I also tried everything below with Stata 14 and with saveold to Stata 13 format.
I have a bunch of files that I clean with a program and then merge into a master dataset. When I get to a certain file (in this case the 05 June one, output_05Jun2015), the merge fails because the variables I am merging on "do not uniquely identify observations in the using dataset." This is only a problem for this one scrape, so I wrote some code to drop the duplicates (see below). However, although the code appears to run, it reports zero duplicates and nothing is dropped. So when it tries to merge this file in, I get the same error.
The more complex problem is this: if I pick up from where the code breaks (the merge line, see below) and run the drop-duplicates code as written, it works. The duplicates are dropped and I am able to merge the file into the master file. But this causes the master file to become corrupt, with the error message ".dta file is corrupt. The file unexpectedly ended before it should have." Per other advice, I ran dtaverify on the corrupt file, and it reports "SERIOUS ERROR: unexpected end of file" and "SERIOUS ERROR: map[1] invalid." Effectively, I can't fix the duplicates error this way and then run the rest of the code to complete the dataset, because resuming from the break produces a corrupt file, and I have no idea why. Any help would be very much appreciated. Thank you!
I am pasting the code below and attaching the ado file.

global identifiers mr_no work_code job_card_number worker_name work_start_date days_worked total_cash_payments

local scrape_list output_28Nov2014 output_06Dec2014 output_19Dec2014 full_output_19Dec2014 output_26Dec2014 ///
output_02Jan2015 output_09Jan2015 full_output_10Jan2015 output_16Jan2015 output_23Jan2015 output_30Jan2015 ///
output_06Feb2015 output_13Feb2015 output_20Feb2015 output_27Feb2015 ///
full_output_16Mar2015 output_20Mar2015 output_10Apr2015 output_17Apr2015 output_24Apr2015 ///
output_01May2015 output_08May2015 output_15May2015 output_22May2015 output_29May2015 ///
full_output_01Jun2015 output_05Jun2015 output_12Jun2015 output_19Jun2015 ///
output_03Jul2015 output_10Jul2015 output_13Jul2015 ///
full_output_10Sep2015 full_output_15Sep2015 full_output_20Nov2015 full_output_15Sep2016 full_output_18Nov2017

local n : word count `scrape_list'
forvalues i = 1/`n' {

    local scrape : word `i' of `scrape_list'

    import delimited using "MIS_scrapes/Data/Raw/unzipped/`scrape'/muster.csv", varnames(1) clear
    cap gen aadhar_no=""
    cap gen account_no=""

    clean_muster_scrape `scrape' `i'

    if "`scrape'" == "output_05Jun2015" {
        duplicates drop $identifiers, force
    }

    compress
    save "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_using_v2.dta", replace

    use "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_master_v2.dta", replace

    merge 1:1 $identifiers using "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_using_v2.dta", nogenerate
    label var muster_merge_`i' "Merge Indicator, `scrape'"
    note: scrape_`i'=`scrape'

    compress
    save "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_master_v2.dta", replace
}
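One way to check whether a set of merge keys really identifies observations, before attempting a 1:1 merge, is sketched below. This is only an illustration on made-up toy data: the three variables borrow names from the identifier list above, but the values and the local macro `keys' are assumptions, not taken from the actual scrapes. Because it uses a local macro, it should be run in one go rather than line by line.

```stata
* Toy data standing in for one cleaned scrape (values are made up)
clear
input str8 job_card_number str10 worker_name days_worked
"JC001" "A" 6
"JC001" "A" 6
"JC002" "B" 4
end

local keys job_card_number worker_name days_worked

duplicates report `keys'          // tabulates how many surplus copies exist
capture isid `keys'               // isid fails if the keys are not unique
display "isid return code before dropping: " _rc

duplicates drop `keys', force     // keeps the first observation of each group
isid `keys'                       // passes silently once the keys are unique
display "observations left: " _N
```

Running isid immediately before the save of the using file would show, inside the loop, whether the duplicates were actually removed at that point.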
Attached Files

clean_muster_scrape_v2.ado (6.0 KB, 1 view)
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#2

04 Jan 2018, 15:22

1. What does

Code:

use "foobar.dta", replace

mean? I don't think this is valid syntax.

2. Can you share the file which Stata says is corrupt?
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#3

04 Jan 2018, 15:35

Sergiy Radyakin The -replace- option is old syntax for the -clear- option in -use-. It's no longer documented, but it still works.
Natalie Theys

Join Date: Jan 2018

Posts: 3
#4

04 Jan 2018, 17:49

The file is large (10 gigabits), so I am not sure I can send it. When I rerun the code without the problem scrape, the file is still being corrupted.
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#5

04 Jan 2018, 18:00

10 gigabits? Are you sure you don't mean 10 gigabytes? File sizes are seldom reported in bits. I ask because you are running Windows, and Windows supports (at least) two different kinds of file system. The FAT32 file system has an upper limit on file size of 4 GB. If you try to write a larger file, something will go wrong at the operating system level, and I can't predict how that would look from inside Stata. It may be that the operating system stops writing and leaves a file that is cut off in the middle, which could lead Stata to treat it as corrupt. If, however, your drive uses NTFS (the NT file system), a file this size poses no difficulty. To see which file system you have, open Windows Explorer, right-click the icon for your C: drive, and select Properties from the menu that pops up.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

04 Jan 2018, 22:05

Clyde,
1. Thank you for the clarification on the replace option.
2. With FAT-32 the save command should abort with an error, and Stata should not proceed to the next command.

Natalie,

1. Type

Code:

hexdump "filename.dta"

with the name of the file for which you are getting the "SERIOUS ERROR" message, then post the output here (the first two or three screens should be sufficient). Include the exact file size in bytes.
2. Separate the problem into simpler ones. Convert the files you have from CSV to DTA. Verify that the conversion succeeded. Only after that start doing anything with merges, duplicates and identifiers.
3. Look through your code and identify the commands that save or write. A problem like yours is often the result of, e.g., saving a graph to a file that is later used as a dataset, which happens a lot with tempfiles.
4. There was a bug in Stata 13 with incorrect composition of the file map. StataCorp repaired it very quickly, so only a handful of datasets with that defect were in circulation, and it is repairable because the incorrect value can be inferred from other information saved in the file map, which is what -use13- was doing. However, nothing like that bug has appeared since then, afaik, so the problem is likely with your code, not with Stata.
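The checks in points 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not a diagnosis: it uses the auto dataset shipped with Stata as a stand-in, and the file name check_me.dta is hypothetical. The from() and to() options of hexdump limit the dump to a byte range, and the analyze option prints a summary that includes the file size.

```stata
* Stand-in dataset; replace with the file that triggers the error
sysuse auto, clear
save "check_me.dta", replace             // hypothetical file name

dtaverify "check_me.dta"                 // reports format problems, if any
describe using "check_me.dta", short     // summary of the saved file without loading it

hexdump "check_me.dta", analyze          // summary only, includes the file size
display "size in bytes: " r(filesize)

hexdump "check_me.dta", from(0) to(255)  // first 256 bytes, short enough to post
```

Applied to each converted scrape right after it is saved, the same checks would show at which step a file first stops verifying.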

Best, Sergiy
