Hi everyone,
Thanks in advance for your advice! I am using Stata 15 on Windows, although I also tried the steps below with Stata 14 and with saveold for Stata 13.
I have a set of files that I clean with a program and then merge, one at a time, into a master dataset. When the loop reaches one particular file (the scrape from 05Jun2015), the merge fails because the variables I am merging on "do not uniquely identify observations in the using dataset." This is a problem only for that one scrape, so I wrote some code to drop the duplicates (see below). However, although the command appears to run, it reports zero duplicates and nothing is dropped, so the merge then fails with the same error.
The more complex problem is that if I pick up from where the code breaks (the merge line, see below) and run the duplicates drop code exactly as written, it works: the duplicates are dropped and I am able to merge the file into the master file. But this leaves the master file corrupt, with the error message ".dta file is corrupt. The file unexpectedly ended before it should have." Per other advice, I ran dtaverify on the corrupt file, and it reports "SERIOUS ERROR: unexpected end of file" and "SERIOUS ERROR: map[1] invalid." Effectively, I can't fix the duplicates problem by hand and then run the rest of the code to complete the dataset, because resuming this way produces a corrupt file, and I have no idea why. Any help would be very much appreciated. Thank you!
I am pasting the code below and attaching the ado file.
global identifiers mr_no work_code job_card_number worker_name work_start_date days_worked total_cash_payments
local scrape_list output_28Nov2014 output_06Dec2014 output_19Dec2014 full_output_19Dec2014 output_26Dec2014 ///
    output_02Jan2015 output_09Jan2015 full_output_10Jan2015 output_16Jan2015 output_23Jan2015 output_30Jan2015 ///
    output_06Feb2015 output_13Feb2015 output_20Feb2015 output_27Feb2015 ///
    full_output_16Mar2015 output_20Mar2015 output_10Apr2015 output_17Apr2015 output_24Apr2015 ///
    output_01May2015 output_08May2015 output_15May2015 output_22May2015 output_29May2015 ///
    full_output_01Jun2015 output_05Jun2015 output_12Jun2015 output_19Jun2015 ///
    output_03Jul2015 output_10Jul2015 output_13Jul2015 ///
    full_output_10Sep2015 full_output_15Sep2015 full_output_20Nov2015 full_output_15Sep2016 full_output_18Nov2017
local n : word count `scrape_list'
forvalues i = 1/`n' {
    local scrape : word `i' of `scrape_list'
    import delimited using "MIS_scrapes/Data/Raw/unzipped/`scrape'/muster.csv", varnames(1) clear
    cap gen aadhar_no=""
    cap gen account_no=""
    clean_muster_scrape `scrape' `i'
    if "`scrape'" == "output_05Jun2015" {
        duplicates drop $identifiers, force
    }
    compress
    save "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_using_v2.dta", replace
    use "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_master_v2.dta", replace
    merge 1:1 $identifiers using "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_using_v2.dta", nogenerate
    label var muster_merge_`i' "Merge Indicator, `scrape'"
    note: scrape_`i'=`scrape'
    compress
    save "MIS Merge New/MIS_scrapes/Data/temp/union_muster_all_master_v2.dta", replace
}
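In case it helps with diagnosing the duplicates part, here is a minimal sketch of a check that could go inside the loop right after the clean_muster_scrape line, to see whether Stata considers the identifiers unique at that point. It assumes the cleaned scrape is in memory and that $identifiers and the local `scrape' are defined as above. The variable name dup_flag is made up for illustration and is not in my actual code.

```stata
* Diagnostic sketch only: report on duplicates among the merge keys
duplicates report $identifiers

* isid exits with an error if the keys are not unique; capture traps it
capture isid $identifiers
if _rc {
    display as error "identifiers are not unique in `scrape'"
    * dup_flag (hypothetical name) counts the surplus copies of each key
    duplicates tag $identifiers, generate(dup_flag)
    list $identifiers if dup_flag > 0
    drop dup_flag
}
```

If duplicates report shows no surplus observations here but the merge still complains, the data in memory at this point presumably differ from what is saved to the using file.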