Large data-cleaning and -appending exercise needs more efficiency

Fabio Martinenghi

Join Date: Dec 2018
Posts: 17

Large data-cleaning and -appending exercise needs more efficiency

19 Aug 2019, 04:01

Hello everyone,

I am at my first attempt at importing, cleaning and appending a large amount of datasets. I am working with yearly (2004-2014) balance sheet data for 8000 Italian municipalities. Each Comune is divided in year and each year in "quadri". I need to create one file per comune with all years (and will eventually append them all in a single panel dataset). I have started with a very inefficient method: the only one I am able to use. This implies four big loops (importing, saving yearly files per each municipality, cleaning, appending) and saving lots of datasets. Could you help me making the process more efficient? I will probably need to run the code more than once and my approach is taking forever to work. Also, I tried to work with -tempfile- but had issues using it in combination with locals (in the filenames).

This is how my code is structured:

Code:

clear all
set more off

cd "C:\bilanci_unzip"
qui folders

local com_list `"`r(folders)'"'

*IMPORTING LOOP
set more off
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_import", text replace
local years 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
local quadro 01 02 03 13
foreach c of local com_list{
foreach y of local years{
foreach q of local quadro{
capture noisily cd "C:/bilanci_unzip/`c'/csv/`c'/`y'/Consuntivo/`q'"
capture noisily fs *
 foreach f in `r(files)'{
    local F : subinstr local f ".csv" ""
    capture noisily import delimited using "C:/bilanci_unzip/`c'/csv/`c'/`y'/Consuntivo/`q'/`f'", delimiter(";") varnames(1) clear

[...]

    capture noisily cd "C:/bilanci_unzip/`c'/csv/`c'/`y'"
    capture noisily save `"`a'_`y'_`q'_`F'"', replace
    }
}
}
}
log close


*CREATE YEARLY DATASETS PER EACH COMUNE
clear
set more off
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_yearly", text replace

local years 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
foreach c of local com_list{
foreach y of local years{
        clear
        cd "C:/bilanci_unzip/`c'/csv/`c'/`y'"
        fs *
foreach f in `r(files)'{
            append using `"`f'"'
            *this is for having all the right info in labels3
            replace labels3 = labels2 if labels3==""
            }   //close 'files' loop
 [...]
    *save one file for each comune
    cd "C:/bilanci_unzip/`c'/csv/`c'/"
    save "`c'_`y'", replace
        }  //close 'year' loop
}  //close 'comune' loop
log close

*CLEANING LOOP
clear
//local com_list `"abano-terme--2050540010"'
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_clean", text replace
foreach c of local com_list{
cd "C:/bilanci_unzip/`c'/csv/`c'"
fs *
foreach f in `r(files)'{
    use `"`f'"', clear
[...]
    keep voci dati
[...]
    sxpose, clear
    renvars, map(strtoname(@[1]))
    drop in 1
[...]
    save, replace
}
}    
log close

*APPENDING LOOP
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_append", text replace
foreach c of local com_list{
clear
cd "C:/bilanci_unzip/`c'/csv/`c'"
fs *
foreach f in `r(files)'{
            append using `"`f'"'
            }
[...]
cd "C:/bilanci_unzip/`c'/csv/"
save `"`c'"', replace            
}
log close

Thank you for your attention,

Fabio

PS:

This is the complete code, for reference:

Code:

********************************************************************************
*************************dati bilanci cleaning**********************************
********************************************************************************
clear all
set more off

cd "C:\bilanci_unzip"
qui folders

local com_list `"`r(folders)'"'

*IMPORTING LOOP
set more off
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_import", text replace
//local com_list `"abano-terme--2050540010"'
local years 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
local quadro 01 02 03 13
foreach c of local com_list{
foreach y of local years{
foreach q of local quadro{
capture noisily cd "C:/bilanci_unzip/`c'/csv/`c'/`y'/Consuntivo/`q'"
capture noisily fs *
 foreach f in `r(files)'{
    local F : subinstr local f ".csv" ""
    capture noisily import delimited using "C:/bilanci_unzip/`c'/csv/`c'/`y'/Consuntivo/`q'/`f'", delimiter(";") varnames(1) clear
    
    *useful for knowing what data refer to
    capture noisily gen labels = "`F'"
    capture noisily split labels, parse(dicembre -i- -ii- -iii- -iv- -v- -vi- -1- -2- -3- -4- -5- -6- -7- -8- -9- -10- -11- -12- -13- -14- -15- -101- -102- -103-)
    
    *rename variables that change name
    cap rename descrizionevoci voci
    cap rename importooquantitã parametrodiefficaciaefficienzafo
    
    capture noisily gen comune = "`c'`y'`q'"
    capture noisily replace comune = subinstr(comune,"-"," ", 99999)
    capture noisily split comune, parse("  ")
    capture noisily drop comune comune2
    capture noisily rename comune1 comune
    
    *destring conditioning on varibale being of type 'string'
    capture confirm string var dati
        if _rc==0 {
        replace dati="1" if dati=="S"
        replace dati="0" if dati=="N"
        replace dati="." if dati=="N.C."
        replace dati = subinstr(dati,".","", 99999)
        replace dati = subinstr(dati,",",".", 99999)
        destring dati, replace    
        }

    *tostring parametrodiefficaciaefficienzafo if it is numeric [_rc==7]
    cap confirm string var parametrodiefficaciaefficienzafo
        if _rc==7 {
            gen param = strofreal(parametrodiefficaciaefficienzafo, "%10.5g")
            drop parametrodiefficaciaefficienzafo
            rename param parametrodiefficaciaefficienzafo
            }

    *replace "N.C" with missing value "."
    capture confirm string var parametrodiefficaciaefficienzafo
        if _rc==0 {
        replace parametrodiefficaciaefficienzafo="." if parametrodiefficaciaefficienzafo=="N.C."    
        }
        
    //tempfile t_`"`a'_`y'_`q'"'
    capture noisily cd "C:/bilanci_unzip/`c'/csv/`c'/`y'"
    capture noisily save `"`a'_`y'_`q'_`F'"', replace
    //save `"`a'`y'`q'"', replace
    
    //tempfile `c'`y'`q'
    //save "C:/bilanci_unzip/`c'/``c'`y'`q''.dta"
    }
}
}
}
log close

//browse

*CREATE YEARLY DATASETS PER EACH COMUNE
clear
set more off
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_yearly", text replace
//local com_list `"abano-terme--2050540010"'
local years 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
foreach c of local com_list{
foreach y of local years{
        clear
        cd "C:/bilanci_unzip/`c'/csv/`c'/`y'"
        fs *
foreach f in `r(files)'{
            append using `"`f'"'
            *this is for having all the right info in labels3
            replace labels3 = labels2 if labels3==""
            }   //close 'files' loop
    replace voci = servizioeattivitarilevanti if voci=="" &amp; servizioeattivitarilevanti!=""

    set obs `=_N+1'
    replace voci="year" if voci==""
    replace dati=`y' if voci=="year"

    *save one file for each comune
    cd "C:/bilanci_unzip/`c'/csv/`c'/"
    save "`c'_`y'", replace
        }  //close 'year' loop
}  //close 'comune' loop
log close

*CLEANING LOOP
clear
//local com_list `"abano-terme--2050540010"'
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_clean", text replace
foreach c of local com_list{
cd "C:/bilanci_unzip/`c'/csv/`c'"
fs *
foreach f in `r(files)'{
    use `"`f'"', clear
    *destring
    local numeric accertamenti impegni parametrodiefficaciaefficienzafo
    foreach n of local numeric{
        capture confirm string var `n'
        if _rc==0 {    
        replace `n' = subinstr(`n',".","", 99999)
        replace `n' = subinstr(`n',",",".", 99999)
        destring `n', replace
        }
        }
    local useful accertamenti impegni parametrodiefficaciaefficienzafo
    foreach u of local useful{
        replace dati = `u' if dati ==. &amp; `u'!=.
        }
    replace dati = parametrodiefficaciaefficienzafo if dati==. &amp; parametrodiefficaciaefficienzafo!=.
    *keep only useful variables
    //keep comune voci funzionieserviziinterventicorren dati accertamenti impegni impegni_accert
    drop riscoss* pagam* servizioe*
    //drop if impegni_accert==.
    rename parametrodiefficaciaefficienzafo efficaciaefficienza
    
    *drop useless data
    gen useless = strpos("servizi-indispensabili-per-comuni-ed-unioni-di-comuni",labels2) &amp; (labels3!="acquedotto" &amp; labels3!="fognatura-e-depurazione" &amp; labels3!="nettezza-urbana" &amp; labels3!="viabilita-ed-illuminazione-pubblica")
    drop if useless==1 &amp; voci!="year"
    drop if voci=="Forma di gestione (codice)" | voci=="UnitÃ  immobiliari servite" | voci=="Totale unitÃ  immobiliari" | voci=="N.unitÃ  immobiliari servite" ///
    | voci=="Terreni Agricoli" | voci=="Aree Fabbricabili" | voci=="N. personale docente" | voci=="Costo totale diretto"  | voci=="N. personale non docente"  ///
    | voci=="N. personale amministrativo"  | voci=="N.studenti iscritti al 30.09" | voci=="N. aule disponibili al 30.09" | voci=="Forma di gestione" ///
    | voci=="Contributo compensativo minor gettito imu derivante da modifiche normative (art. 1 e 3 D.L. 102/13)"


    *rename voci which are identical
    replace voci="costo acquedotto" if voci=="Costo totale diretto" &amp; labels3=="acquedotto"
    replace voci="costo fogna" if voci=="Costo totale diretto" &amp; labels3=="fognatura-e-depurazione"
    replace voci="costo nettezza" if voci=="Costo totale diretto" &amp; labels3=="nettezza-urbana"
    replace voci="costo viabilita" if voci=="Costo totale diretto" &amp; labels3=="viabilita-ed-illuminazione-pubblica"

    *drop all useless duplicate voci
    drop if labels=="quadro-1-si-certifica" //to avoid 'year' to be  cancelled as duplicate in years 2013, 2014
    duplicates tag voci, gen(dup)
    drop if dup!=0
    drop dup
    
    keep voci dati
    gen lower = lower(voci)
    drop voci
    rename lower voci
    
    drop if voci=="di cui : abitazione principale" | voci=="altri fabbricati"

    *make everything string, transpose and destring everything again
    //tostring dati, gen(dat) format("%9.3f")
    //drop dat
    gen dat = strofreal(dati, "%10.5g")
    drop dati
    sxpose, clear
    renvars, map(strtoname(@[1]))
    drop in 1

    *rename all variables with lowercase
    *rename *, lower
    
    save, replace
}
}    
log close

*APPENDING LOOP
//local com_list `"abano-terme--2050540010"'
log using "C:/Users/Fabio M/Documents/Research/Data/log_BF_append", text replace
foreach c of local com_list{
clear
cd "C:/bilanci_unzip/`c'/csv/`c'"
fs *
foreach f in `r(files)'{
            append using `"`f'"'
            }
            
*comune variable
gen comune = "`c'"
replace comune = subinstr(comune,"-"," ", 99999)
split comune, parse("  ")
drop comune
rename comune2 cod_elettor
rename comune1 comune    

*tidy var's order    
order comune, first    
order year, after(comune)
order cod_elettor, after(comune)

*destring all
destring *, replace

*fix cod_elettor's format
format %12.0g cod_elettor

cd "C:/bilanci_unzip/`c'/csv/"
save `"`c'"', replace            
}
log close

Last edited by Fabio Martinenghi; 19 Aug 2019, 04:17.

Tags: None

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

19 Aug 2019, 08:26

If there is something in each file that identifies its commune/year/quadro label (it appears not), perhaps you could append the files with some operating system command prior to importing them with Stata, which would be quicker (how much??) than -importing-. By my understanding, you have 8000 communes *11 years * 4 quadro =352e3, which is going to be slow almost regardless of method.

I'd recommend you first try using several -timer- commands to get a sense of where the slow parts of your process are, and concentrate your (and our) efforts on them. My a priori guess is that the -import- is the key process here as regards speed, but that's a guess. To do this, you'd need to temporarily alter your code to operate on just a few files, and then loop over that a number of times to get reasonable timings.

A related comment is that you're probably acting against your interests here by presenting a big chunk of code for us to digest. If we could focus on a stripped down version, with the unproblematic commands or processes set aside, you'd be more likely to get help. (I do understand reasons that make that difficult, but the advice still stands.)

One immediate comment:
It does seem that you have certain commands that could be run once on the whole file rather than on pieces (i.e., things that don't involve your loop variables.)

Finally, regarding temporary files: Using them wouldn't speed things up, but it would make things less messy by not leaving intermediate files around to be deleted later.
1 like
Comment
Fabio Martinenghi

Join Date: Dec 2018

Posts: 17
#3

20 Aug 2019, 02:58

Thank you Mike, I will try to follow all of your advice. Ironically, time constraints are forcing me to follow my current raw approach. The import part has been executed successfully, now it is just about working with all the outcome files. Thank you again.
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

20 Aug 2019, 10:21

Perhaps too late to do you any good, but a quick thing to try is to manually parallelize the process: Start separate instances of Stata according to the number of processors you have, and do the import for some of the files in each one. I would not think that would help with the append, which I'd presume to be purely I/O bound, but it *might* have helped with the import. (Of course, this might already be fully parallelized in Stata and so not do any good, but it's easy to try.)
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

20 Aug 2019, 15:06

You can do this more efficiently (simpler/faster code) by using filelist (from SSC) to recursively scan the directory and create a list of all targeted files and then use runby (from SSC) to import and append every file in the list. You have to create a little program (called do1file in the following example) that contains the commands needed to process a single file and then call runby to run this program for every file in the list. With runby, what's left in memory when the user-specified program terminates is considered results and accumulates (therefore no append loop needed). Here's a quick example that mimics some of the data structure you describe:

Code:

version 15
clear all

cd "/Users/robert/Documents/temp/bilanci_unzip"

filelist , pattern("*.csv")
list, sepby(dirname)

program do1file
    local fdir  = dirname[1]
    local fname = filename[1]
    local fpath = "`fdir'/`fname'"
    
    import delimited using "`fpath'", delimiter(";") varnames(1) stringcols(_all) clear
    
    gen long obs = _n
    gen labels = "`fpath'"
end

runby do1file, by(dirname filename)

list, sepby(labels)

Here's the list results just after the filelist command:

Code:

. list, sepby(dirname)

     +----------------------------------------------------------------------+
     | dirname                                            filename    fsize |
     |----------------------------------------------------------------------|
  1. | ./abano-terme--2050540010/csv/2004/Consuntivo/01   file1.csv      41 |
  2. | ./abano-terme--2050540010/csv/2004/Consuntivo/01   file2.csv      31 |
     |----------------------------------------------------------------------|
  3. | ./abano-terme--2050540010/csv/2004/Consuntivo/02   file1.csv      41 |
  4. | ./abano-terme--2050540010/csv/2004/Consuntivo/02   file2.csv      59 |
     |----------------------------------------------------------------------|
  5. | ./abano-terme--2050540010/csv/2005/Consuntivo/01   file1.csv      44 |
  6. | ./abano-terme--2050540010/csv/2005/Consuntivo/01   file2.csv      45 |
  7. | ./abano-terme--2050540010/csv/2005/Consuntivo/01   file3.csv      60 |
     |----------------------------------------------------------------------|
  8. | ./abano-terme--2050540010/csv/2005/Consuntivo/02   file1.csv      45 |
  9. | ./abano-terme--2050540010/csv/2005/Consuntivo/02   file2.csv      45 |
     +----------------------------------------------------------------------+

.

and here is the final list that shows the appended content from all files in the list:

Code:

. list, sepby(labels)

     +-----------------------------------------------------------------------------------------+
     | vname1   val1   val2   obs                                                       labels |
     |-----------------------------------------------------------------------------------------|
  1. |  test1      1      2     1   ./abano-terme--2050540010/csv/2004/Consuntivo/01/file1.csv |
  2. |  test2      3      4     2   ./abano-terme--2050540010/csv/2004/Consuntivo/01/file1.csv |
     |-----------------------------------------------------------------------------------------|
  3. |  test1     17     18     1   ./abano-terme--2050540010/csv/2004/Consuntivo/01/file2.csv |
     |-----------------------------------------------------------------------------------------|
  4. |  test1      5      6     1   ./abano-terme--2050540010/csv/2004/Consuntivo/02/file1.csv |
  5. |  test2      7      8     2   ./abano-terme--2050540010/csv/2004/Consuntivo/02/file1.csv |
     |-----------------------------------------------------------------------------------------|
  6. |  test1     19     20     1   ./abano-terme--2050540010/csv/2004/Consuntivo/02/file2.csv |
  7. |  test2     21     22     2   ./abano-terme--2050540010/csv/2004/Consuntivo/02/file2.csv |
  8. |  test3     23     24     3   ./abano-terme--2050540010/csv/2004/Consuntivo/02/file2.csv |
     |-----------------------------------------------------------------------------------------|
  9. |  test1      9     10     1   ./abano-terme--2050540010/csv/2005/Consuntivo/01/file1.csv |
 10. |  test2     11     12     2   ./abano-terme--2050540010/csv/2005/Consuntivo/01/file1.csv |
     |-----------------------------------------------------------------------------------------|
 11. |  test1     25     26     1   ./abano-terme--2050540010/csv/2005/Consuntivo/01/file2.csv |
 12. |  test2     27     28     2   ./abano-terme--2050540010/csv/2005/Consuntivo/01/file2.csv |
     |-----------------------------------------------------------------------------------------|
 13. |  test1     55     56     1   ./abano-terme--2050540010/csv/2005/Consuntivo/01/file3.csv |
 14. |  test2     57     58     2   ./abano-terme--2050540010/csv/2005/Consuntivo/01/file3.csv |
 15. |  test3     59     60     3   ./abano-terme--2050540010/csv/2005/Consuntivo/01/file3.csv |
     |-----------------------------------------------------------------------------------------|
 16. |  test1     13     15     1   ./abano-terme--2050540010/csv/2005/Consuntivo/02/file1.csv |
 17. |  test2     14     16     2   ./abano-terme--2050540010/csv/2005/Consuntivo/02/file1.csv |
     |-----------------------------------------------------------------------------------------|
 18. |  test1     29     30     1   ./abano-terme--2050540010/csv/2005/Consuntivo/02/file2.csv |
 19. |  test2     31     32     2   ./abano-terme--2050540010/csv/2005/Consuntivo/02/file2.csv |
     +-----------------------------------------------------------------------------------------+

.

All this assumes that the set of files is amenable to be appended. I strongly recommend doing as little as possible in the do1file program and do additional data cleaning on the final combined dataset. To avoid append errors if the variable type changes (string to numeric or vice versa), you should always import all variables as strings and destring once all the data is combined.

You should start with a small number of files to debug your program (for example just keep the first 10 files after the filelist call) and use the verbose option for the runby call:

Code:

runby do1file, by(dirname filename) verbose

When everything works as expected, you can remove the verbose option and add the status option. This will give you pretty good feedback on the time remaining on the job:

Code:

runby do1file, by(dirname filename) status

Comment

Fabio Martinenghi

Join Date: Dec 2018

Posts: 17
#6

22 Aug 2019, 02:55

Thank you Robert, this is approach is much superior to mine. I will try to follow it. As a side note, I am forced to manipulate the datasets before appending in order to append them effectively.
Comment

Announcement

Large data-cleaning and -appending exercise needs more efficiency

Comment

Comment

Comment

Comment

Comment