  • Loop over a huge number of files

    I am trying to loop over a large number of files (360,000) to convert them from .csv to .dta files (the idea is then to append the data files later).

    When I try to process all the files in one loop, I get an error (too many file names), so I wrote the following outer loop to convert them in groups of 10,000 files.

    The problem is that after 10,000 files are converted and saved, I get the error message "observation numbers out of range".

    Here is my code:

    Code:
    clear all
    
    global maindir "C:/Users/soha_/Desktop/MorningStar/1-Inputs"
    set trace on
    *------------------------------------------------------------------------------*
    // File list
    filelist , dir("$maindir/Historical Assets/Assets_csv") pattern(*.csv) norecur
    
    // save a copy so that we can load each observation in the loop below
    tempfile files
    save "`files'"
    
    local rangeinitial = 1
    forvalues r = 0(10000)360000 {
    
        local range1 = `rangeinitial' + `r'
        local range2 = `range1' + 9999
    
        cd "$maindir/Historical Assets"
    
        forvalues i = `range1'/`range2' {
    
            use "`files'" in `i', clear
            local f = dirname + "/" + filename
            local filename = filename
    
            insheet using "`f'", clear
            local name : subinstr local filename ".csv" ""
    
            save "Assets_dta/`name'.dta", replace
        }
    }
    When I set trace on, I get the following:


    Code:
    - local range1 = `rangeinitial' + `r'
    = local range1 = 1 + 10000
    - local range2 = `range1' + 9999
    = local range2 = 10001 + 9999
    - cd "$maindir/Historical Assets"
    = cd "C:/Users/soha_/Desktop/MorningStar/1-Inputs/Historical Assets"
    C:\Users\soha_\Desktop\MorningStar\1-Inputs\Historical Assets
    - forvalues i=`range1'/`range2' {
    = forvalues i=10001/20000 {
    - use "`files'" in `i', clear
    = use "C:\Users\soha_\AppData\Local\Temp\ST_00000001.tmp" in 10001, clear
    observation numbers out of range
      local f = dirname + "/" + filename
      local filename = filename
      insheet using "`f'", clear
      local name: subinstr local filename ".csv" ""
      save "Assets_dta/`name'.dta", replace
      }
      }
    Thanks,
    Soha
    Last edited by S. IAAIA; 17 May 2019, 10:02.

  • #2
    Each observation of your file called `files' is supposed to contain the name of a csv file in your directory. You believe (presumably correctly) that there are more than 10,000 files in that directory, so the observation numbers in that file should exceed 10,000; the error message appears to indicate that is false. The -filelist- command (a user-written command from SSC, as you are supposed to indicate per the FAQ) by default reports in the Results window how many files it found. Was that number greater than 10,000? If not, I'd browse the file produced by -filelist- and see what's in it. My version of -filelist- (which -which filelist- reports as 2.0.7) indicates that there is no limit on the number of filenames, and that this was a change from an earlier version. Perhaps your problem can be solved by simply updating -filelist- with -adoupdate-.

    I suspect you *are* getting the "too many file names" report from -filelist-, and the "limited loop" you have can't solve that problem, as -filelist- is outside that loop. You will need to get the updated version of -filelist-.
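
    A minimal sketch of that check and update (assuming -filelist- was installed from SSC):

    Code:
    which filelist                // report the installed version
    adoupdate filelist, update    // update it if SSC has a newer version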



    • #3
      -filelist- only finds 10,000 files, which is strange given that I am using Stata 14 and the most recent version of -filelist-, whose help also says there is no limit.


      Code:
      . filelist , dir("$maindir/Historical Assets/Assets_csv") pattern(*.csv) norecur
      Number of files found = 10000



      • #4
        You might well be running into a limit related to the size of a local macro in your flavor of Stata and in v. 14; that is a guess I'm making based on checking the help file for -filelist-.
        Is there some systematic structure to the names of the files? If so, it should be possible to pick out batches of them to work on at a time, though not necessarily with -filelist-; see the sketch below.
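
        As a sketch of that batching idea (hypothetical: it assumes the file names begin with a known set of characters, that each batch of names fits within the macro length limit, and the folder layout from #1), Stata's extended macro function -dir- could pull one batch at a time:

        Code:
        cd "$maindir/Historical Assets/Assets_csv"
        foreach p in a b c d e f g h i j k l m n o p q r s t u v w x y z {
            // grab only the csv files whose names start with `p'
            local batch : dir . files "`p'*.csv"
            foreach f of local batch {
                insheet using "`f'", clear
                local name : subinstr local f ".csv" ""
                save "../Assets_dta/`name'.dta", replace
            }
        }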

        Also, there is at least one other user-written package (the name escapes me) for importing multiple csv files. You might look for that.



        • #5
          I note that filelist depends on the Mata built-in function dir() to retrieve filenames, and although that function's documentation does not mention any limitation on the number of files that will be returned, an undocumented limit there seems likely to be the source of the current problem. Perhaps Robert Picard, the author of filelist, will be dismayed to learn of this unexpected issue.
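
          A quick way to confirm that suspicion (a diagnostic sketch, using the directory from #1) is to ask dir() directly how many names it returns:

          Code:
          mata: rows(dir("$maindir/Historical Assets/Assets_csv", "files", "*.csv"))

          If this prints 10000 while the operating system reports 360,000 files, the limit is confirmed.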



          • #6
            I first wrote filelist in pure Stata, but then a user hit the macro-length limit, so I rewrote filelist in Mata. The current version of filelist can recursively scan a directory and return an unlimited number of files (it will happily scan a whole hard disk if you ask it to). Unfortunately, Mata's dir() function still has a hard-coded limit of 10,000 files returned from a single directory.
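
            Since the limit applies per directory, one workaround (a hedged sketch, assuming the files could be moved into subdirectories of fewer than 10,000 files each) is to let filelist scan recursively, i.e. drop the norecur option:

            Code:
            * hypothetical layout: Assets_csv/batch01, Assets_csv/batch02, ...
            filelist , dir("$maindir/Historical Assets/Assets_csv") pattern(*.csv)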

            This was noticed back in 2015 (with great chagrin), and Hua Peng (StataCorp) acknowledged here that -unicode translate- is also affected.

            I have posted a request for this limit to be lifted here.

            In retrospect, I should have revised the help file to mention this limitation. I'll do that ASAP.



            • #7
              The 10,000-file limitation is mentioned in several places. One might do something like the following, then append the dta files using a similar strategy (see the sketch after the code).
              Code:
              tempfile files
              
              // capture the directory listing in a log file
              log using "`files'" , text replace
              dir "*.csv"
              log close
              
              // read the log back in and keep only the csv file names
              infile str100 files using "`files'" , clear
              keep if strpos(files,".csv")
              compress
              
              putmata files = files , replace
              
              local resdir "D:\dta\"
              
              mata:
              
                  for (i=1; i<=rows(files); i++) {
                      // import each csv file and save it as dta
                      fn = subinstr(files[i],".csv","",1)
                      readcsv = "import delimited using " + files[i] + " , clear"
                      savedta = "save " + "`resdir'" + fn + ", replace"
                      stata(readcsv)
                      stata(savedta)
                  }
              
              end
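
              For the appending step mentioned above, a sketch in the same style (reusing the Mata files vector and the `resdir' local from the code above; the three-argument subinstr() replaces every occurrence, which is fine when ".csv" appears only as the extension):

              Code:
              clear
              
              mata:
              
                  for (i=1; i<=rows(files); i++) {
                      // append each saved dta file in turn
                      stata("append using " + "`resdir'" + subinstr(files[i],".csv",""))
                  }
              
              end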



              • #8
                (crossed in the ether with Bjarte's essentially similar approach.)

                Noting Bill's point about the limitations of Mata's dir(), here's a sketch of a hack that should work, given that you're on a Windows system. I don't have a good way to test it, but it should be close.

                Code:
                clear
                // Shell out to the Windows command line and get a list of CSV files
                // as observations in the Stata data editor.
                // Stata's -ls- can't send the file names to any saved location.
                tempfile temp
                !dir /b $maindir\*.csv > "`temp'"
                import delimited using "`temp'", delim("|") // assume no "|" in filenames
                //
                // Put this list of names into a Mata matrix, quick and dirty.
                putmata name = v1
                local nfiles = _N
                // Process each file whose name is in the Mata matrix
                clear
                forval i = 1/`nfiles' {
                    mata: st_local("fname", name[`i',1])
                    di "Working on `fname'"
                    insheet using "`fname'", clear
                    // do whatever.... with the current CSV file
                }
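
                To mirror the goal in #1, the "do whatever" step could simply be the convert-and-save from the original code (a sketch, assuming the Assets_dta target folder and working directory from #1):

                Code:
                local name : subinstr local fname ".csv" ""
                save "Assets_dta/`name'.dta", replace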



                • #9
                  An alternative, avoiding 300,000+ imports and saves of data, could be to concatenate the csv files before importing; after import, any extra rows of field headers can be deleted. Using the same example data (300,001 csv files, each holding the first record of the auto data set), the code below runs in 60 seconds, compared with the timings shown below for the strategy of posts #7/#8.
                  Code:
                  ********************************************************************************
                  tempfile files
                  
                  // concatenate all csv files into one temporary file
                  shell 2>NUL type *.csv > "`files'"
                  
                  import delim "`files'" , ///
                          clear ///
                          varnames(1) ///
                          stringcols(1 12) ///
                          numericcols(2(1)11)
                  
                  drop if mi(price) /* delete extra rows of field headers */
                  compress
                  ********************************************************************************
                  The code above ran in 60 seconds; the timings for the alternative were:
                  Code:
                     1:      2.05 /        1 =       2.0540  making list of files
                     2:   7769.01 /        1 =    7769.0140  import csv, saving dta
                     3:    477.37 /        1 =     477.3690  appending dta files
                  Code:
                  ********************************************************************************
                  set maxvar 120000
                  
                  local csvdir "E:\csv\"
                  local dtadir "E:\dta\"
                  
                  cd "`csvdir'"
                  
                  // list the csv files via the Windows shell (not subject to the Mata dir() limit)
                  tempfile files
                  shell dir /b *.csv > "`files'"
                  mata: files = subinstr(cat("`files'"),".csv","",.)
                  
                  cd "`dtadir'"
                  
                  mata: st_numscalar("nfiles", rows(files))
                  
                  // import each csv file and save it as dta
                  qui forvalues i = 1/`=nfiles' {
                  
                      mata: st_local("fn", files[`i'])
                      import delim "`csvdir'`fn'.csv"
                      save "`dtadir'`fn'", replace
                      clear
                  }
                  
                  // append all dta files; getmata works here because each file holds one record
                  mata: st_strscalar("fn", invtokens(files'))
                  append using `=fn' , gen(indexfilename)
                  getmata filename = files
                  scalar drop fn
                  ********************************************************************************
                  Last edited by Bjarte Aagnes; 18 May 2019, 10:21.

