
  • Looping over files in a folder

    Hi all,

    I'm trying to loop through files from the Quarterly Census of Employment and Wages -- all the files for a year (all metro areas in the US) are downloaded into one of my folders. Unfortunately, when I try to loop, Stata tells me that there are too many files!

    My code is below -- anyone know what could be going wrong, or alternatives?

    local files : dir "$data\qcew\2014" files "*.csv"
    foreach file in `files' {
        quietly insheet using "`file'", clear
        quietly keep area_fips industry_code agglvl_code year qtr area_title industry_title agglvl_title ///
            qtrly_estabs_count month* total_qtrly_wages oty_qtrly_estabs_pct_chg ///
            oty_month1_emplvl_pct_chg oty_month2_emplvl_pct_chg oty_month3_emplvl_pct_chg
        quietly keep if agglvl_code==74 // keep at naics sector level
        quietly bysort area_fips qtr year: egen demp=rowmean(oty_month1_emplvl_pct_chg oty_month2_emplvl_pct_chg oty_month3_emplvl_pct_chg)
        quietly save "`file'.dta", replace
    }

  • #2
    Please run the following commands and then show exactly what Stata did in response (see section 12 in the FAQ).

    Code:
    local files : dir "$data\qcew\2014" files "*.csv"
    foreach file in `files' {
      dir `file'
    }



    • #3
      I downloaded the Quarterly CSVs for 2014 from this page and unzipped the archive. The archive is very big (4,427 csv files, 4.23GB) and the file names are quite long. Depending on your version of Stata, you may hit the macro length limit. I suggest that you use filelist (from SSC) for this type of task. Here's an example that converts all the files in the zip archive into Stata datasets.
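
      As a quick check, you can display the macro length limit for your Stata flavor directly; thousands of long file names stored in one local easily exceed it:

      Code:
      display c(macrolen)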

      Code:
      * -filelist- is from SSC
      filelist , dir(2014.q1-q4.by_area) pattern(*.csv) norecur
      
      * extract the identifier code
      gen shortname = regexs(1) if regexm(filename,"q1-q4 ([^ ]+) ")
      isid shortname
      
      * save a copy so that we can load each observation in the loop below
      tempfile files
      save "`files'"
      
      local obs = _N
      
      * create a directory to put the Stata dta if it does not already exist
      cap mkdir QCEW_stata
      
      * loop over each file and do whatever's needed
      forvalues i=1/`obs' {
      
          use "`files'" in `i', clear
          local source = shortname
          local f = dirname + "/" + filename
          
          insheet using "`f'", clear
          gen source = "`source'"
          
          save "QCEW_stata/`source'.dta", replace
          
      }
      This takes about 4 minutes to process all 4,426 files. If you want to append the files, follow the example in filelist's help file.



      • #4
        Thank you both for the replies, and fantastic advice, Robert! That's just about perfect, but I think I may have misunderstood something -- is there a line that needs to be amended for my context? When I run it, Stata goes through the loop as if it finished, but nothing actually happens; in other words, it gets to the "create directory" step and runs through the loop without doing anything. I will keep looking at it -- I must have missed something, as this looks spot-on.

        Friedrich: just FYI, the same error comes up when I only ask it to list the directory: "too many filenames".



        • #5
          The code should run as is but the following must be true:
          • you execute the code from a do-file (e.g. do "qcwe.do")
          • the do-file is located in Stata's current directory (help cd)
          • the zip archive was uncompressed and saved in a directory called "2014.q1-q4.by_area", located in the current directory
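
          For example, a session along these lines (the path here is only illustrative):

          Code:
          * the current directory contains both the do-file and the
          * unzipped "2014.q1-q4.by_area" directory
          cd "C:/data/qcew"
          do "qcwe.do"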



          • #6
            I should also note that I do not like long file names with spaces like "2014.q1-q4 01001 Autauga County, Alabama.csv", so I included code to extract the identifier based on the pattern found in the list of files. If the file names do not all follow the same pattern, then the following line should be adjusted

            Code:
            gen shortname = regexs(1) if regexm(filename,"q1-q4 ([^ ]+) ")
            I doubt this is your problem since the

            Code:
            isid shortname
            command would throw an error if the short identifiers are missing or not unique.
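
            For instance, here is a quick interactive check of what the pattern captures, using one of the file names mentioned in this thread:

            Code:
            display regexm("2014.q1-q4 01001 Autauga County, Alabama.csv", "q1-q4 ([^ ]+) ")
            display regexs(1)

            The first display shows 1 (the pattern matched) and the second shows 01001, the identifier that becomes shortname.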

            I suggest that you run the first line

            Code:
            filelist , dir(2014.q1-q4.by_area) pattern(*.csv) norecur
            by itself and browse the data in memory afterwards to verify that it contains the list of files in the "2014.q1-q4.by_area" directory.

            You can also try to process just a few files by changing the number of observations line to

            Code:
            local obs = 5



            • #7
              Sorry for being slow on this, and part of it is because I don't understand all of the steps in the process, but there must be something else that I'm not doing right.

              1) I'm working from the current directory: cd "$data\qcew\2014.q1-q4.by_area". (The directory was previously just "$data\qcew\2014", but I changed it to what you had in case that was the culprit behind the loop not working.)

              2) The .do file is now also in the current directory "$data\qcew\2014.q1-q4.by_area". (Before it was in a separate folder where I keep .do files for analysis involved on a particular paper; I copied it to the current directory in case that was the culprit.)

              3) All the QCEW files are located in the current directory "$data\qcew\2014.q1-q4.by_area".

              This seems to all line up with what you said?

              Thank you for your patience!



              • #8
                No, if the do-file is in the same directory as the extracted csv files, you need to scan for files in the current directory. This would be

                Code:
                filelist , dir(.) pattern(*.csv) norecur
                Just do that interactively and check that the data in memory contains the list of files.



                • #9
                  Thanks again, it turned out that the current directory was mis-specified -- precisely as you mentioned in the most recent post above.

                  Now just trying to figure out how to append the files. It's a very valuable command you created!



                  • #10
                    If you are going to append all these files, I suggest that you reprocess them in Stata using import delimited instead of insheet. The problem is that at least one variable (area_fips) will be imported as numeric by insheet for some files, because all of its observations happen to be numeric. In other files, area_fips will contain letters, so it will be imported as a string. When you then try to append these files, you will not be able to resolve the type mismatch easily.

                    The following example shows the basic process and should work if your current directory is correctly aligned with your files. Basically, you import everything into string variables and only destring once the data has been combined. Before running the whole thing (import delimited is slower than insheet), try it on a small subset of files by changing the observation line to local obs = 10

                    Code:
                    * -filelist- is from SSC
                    * make sure that the current directory is the one that
                    * contains the "2014.q1-q4.by_area" directory. 
                    filelist , dir(2014.q1-q4.by_area) pattern(*.csv) norecur
                    
                    * extract the identifier code
                    gen shortname = regexs(1) if regexm(filename,"q1-q4 ([^ ]+) ")
                    isid shortname
                    
                    * save a copy so that we can load each observation in the loop below
                    tempfile files
                    save "`files'"
                    
                    * to try this out on the first 10 files, change to local obs = 10
                    local obs = _N
                    
                    * create a directory to put the Stata dta if it does not already exist
                    cap mkdir QCEW_stata
                    
                    * loop over each file and input each csv file into temporary datasets.
                    forvalues i=1/`obs' {
                    
                        use "`files'" in `i', clear
                        local source = shortname
                        local f = dirname + "/" + filename
                        
                        clear
                        import delimited using "`f'", varnames(1) stringcols(_all)
                        gen source = "`source'"
                        
                        tempfile save`i'
                        qui save "`save`i''"
                        
                    }
                    
                    clear
                    forvalues i=1/`obs' {
                        append using "`save`i''"
                    }
                    
                    * Now that all the data is combined, time to destring numeric vars
                    destring, replace
                    Note that when processing files like this, it is important not to rely on old Statalist posts (or other resources on the internets) that advocate solutions like

                    Code:
                    cap erase mybigfile.dta
                    local files : dir . files "*.csv"
                    
                    foreach f of local files {
                        insheet using "`f'", clear
                        append using mybigfile
                        save mybigfile, replace
                    }
                    The extra cycle of saving "mybigfile" each time a new file is input will take forever because the size increases at each save. By my calculation, this particular exercise would require writing over 10 terabytes of data by the time the last 4GB mybigfile is saved.
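
                    A rough sketch of that arithmetic, assuming the 4,426 files total about 4.23GB as noted above: iteration i rewrites a file that is roughly i/4426 of the final size, so the write volume alone is on the order of 9 terabytes, before counting the re-reads that each append performs.

                    Code:
                    * total written ~ sum over i of (i/N) * 4.23GB ~ (N+1)/2 * 4.23GB
                    display 4427/2 * 4.23 / 1024 " TB"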



                    • #11
                      That's great. While I got it working earlier, your solution looks more efficient and I'll try it out. That'll be important for using additional years of the QCEW! Thank you again for taking the time to help me figure it out!
