Hello everyone,
I have a large dataset, and after some cleaning I tried to save it to the working directory. The save appears to complete normally, but when I try to use the file again, the error below shows:
. use employee_national
.dta file corrupt
The file unexpectedly ended before it should have.
I tried adding a space and a comment to the do-file, as I have seen others suggest, but with no success. Yesterday the same code ran perfectly and I was able to use the dataset. I do not know what changed today. I also tried restarting my computer. Note that I am working on Harvard's Research Computing environment.
I do not know if this helps but please find my full code below:
*Equal Opportunity Index
*GOSI analysis
*Employment by Industry- By gender and nationality
*February 19th, 2018
*Chaza Abou Daher
. clear
. cd "/nfs/home/C/cha022/GOSI-Index"
. use "/nfs/home/C/cha022/shared_space/ci3_nali/GOSI_2016.dta"
*opening the log
. log using "Employee.txt", text replace
*remove missing employers and industries
. gen missemployer= (owner_id_700==.)
. tabulate missemployer
. drop if missemployer==1
*95500 observations were dropped, for having no company name
. gen missindustry= (activitysubgroup==.)
. tabulate missindustry
. drop if missindustry==1
*212 more observations were dropped, for having no industry name
. duplicates drop
*4,693,171 observations deleted, for being duplicates across all variables
. decode activitysubgroup, gen (subgroupname)
. tostring owner_id_700, gen (owner_id_str) format (%17.0g)
. gen emp_dura=end_date-start_date if !missing(end_date)
. gen ongoing= (emp_dura==.)
*generate dummy variable for saudi non saudi nationality
. gen saudinational=0
. replace saudinational=1 if nationality==1
*salary growth by nationality by employer
. gen saudisalarygrowth= (saudis_salary_2016 - saudis_salary_2009) / saudis_salary_2009
. gen nonsaudisalarygrowth= (nonsaudis_salary_2016 - nonsaudis_salary_2009) / nonsaudis_salary_2009
*salary growth by employee
. gen salarygrowth= (salary_2016 - salary_2009) / salary_2009
*employee growth by nationality by employer
. gen saudigrowth= (saudis_2016 - saudis_2009)/ saudis_2009
. gen nonsaudigrowth= (nonsaudis_2016 - nonsaudis_2009) / nonsaudis_2009
*identify one-time employees, employees that changed employment within the same company, and employees that changed employment across companies
. duplicates tag id, generate (dup_employee)
. duplicates tag owner_id_700 id, generate (dup_employee_employer)
. duplicates tag owner_id_700 id occupation, generate (dup_employee_employer_posit)
*delete all observations with employment duration less than a month
. drop if emp_dura < 30
*delete all observations with salary_2016 or salary_2012 less than or equal to 1000 SAR
. drop if salary_2016 <=1000
. drop if salary_2012 <=1000
*add current employees by nationality
. gen ongoing_saudi= ongoing * saudinational
* specify salary by gender in 2016
. gen salary_fem_2016= salary_2016 if gender==2
. gen salary_mal_2016= salary_2016 if gender==1
. save employee_national, replace
*closing the log
. log close
*end of dofile
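A round-trip check that might help narrow this down, offered only as a minimal sketch: it assumes the file is being cut short while it is written to the network share (for example because of a storage quota), which is a guess and not something the error message confirms. The local macro names `n_saved` and `sig_saved` are made up for illustration. The lines below would replace the single `save` command near the end of the do-file.

```stata
* shrink storage types so the file written to the share is as small as possible
compress

* record what is in memory just before saving
local n_saved = _N
datasignature
local sig_saved "`r(datasignature)'"

save employee_national, replace

* reload the file in the same session and compare it with what was saved
use employee_national, clear
assert _N == `n_saved'
datasignature
assert "`r(datasignature)'" == "`sig_saved'"
```

If the `use` or either `assert` fails right after the `save`, the file is already damaged at write time, which would point to the storage location and not to the cleaning code.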