handling 10.000 files named with 11 digit numbers

Robert Picard

Join Date: Mar 2014

Posts: 1536
#16

18 Nov 2015, 10:39

Try

Code:

filelist, pat("*.TXT")
Comment
Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#17

18 Nov 2015, 10:48

Wow, Robert, thank you so much! It seems that there is a limit of 10,000 files though, because I have 10,903 txt files in this directory.

Code:

. filelist, pat("*.TXT")
Number of files found = 10000
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#18

18 Nov 2015, 10:49

If you have more than 10,000 files that match the pattern in the same directory, filelist will only return 10,000. This is apparently a hard coded limit (not documented anywhere) of the Mata function dir(). See this post that mentions the limit. A workaround is to do it in parts using a more specific pattern, e.g.

Code:

filelist, pattern("1*.TXT")
filelist, pattern("2*.TXT")
Comment
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#19

18 Nov 2015, 11:20

Robert, thank you for filelist. Stata commands like use and dir are not case sensitive. Would it be possible to add an option to filelist that allows users to find files regardless of whether the filenames are uppercase or lowercase?
Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

#20

18 Nov 2015, 11:41

Hello all. Because of this 10,000-file limit, I had to do as below (all my file names start with 1, 2, 3, 4 or 5):

Code:

foreach k in 1 2 3 4 5 {

filelist, pat("`k'*.TXT") save("name_files_`k'.dta") replace
use "name_files_`k'.dta", clear
local obs = _N
forvalues i=1/`obs' {
    use "name_files_`k'.dta" in `i', clear
    local f = filename
    clear
    infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
    tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
    gen source = "`f'"
    save "mydata`k'_`i'.dta", replace
}
}

But now the append section of the code must be changed, because the observation count restarts for every `k' in the outer foreach. How can I replace `obs' so that the loop covers all 10,903 files, which are not evenly divided among the five values of k? Below is the old code:

Code:

clear
forvalues i=1/`obs' {
    append using "mydata`i'.dta"
}
save "mydatacombo.dta", replace

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#21

18 Nov 2015, 12:06

I would think that the following would work:

Code:

clear
save "my_txt_files.dta", emptyok replace
foreach k in 1 2 3 4 5 {
    filelist, pat("`k'*.TXT")
    append using "my_txt_files.dta"
    save "my_txt_files.dta", replace
}

use "my_txt_files.dta", clear
local obs = _N
forvalues i=1/`obs' {
    use "my_txt_files.dta" in `i', clear
    local f = filename
    clear
    infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 ///
        str lati 322-336 str longe 337-351 ///
        tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 ///
        using "`f'"
    gen source = "`f'"
    save "mydata_`i'.dta", replace
}

clear
forvalues i=1/`obs' {
    append using "mydata_`i'.dta"
}
save "mydatacombo.dta", replace

Comment

Friedrich Huebler

Join Date: Apr 2014
Posts: 1053

#22

18 Nov 2015, 12:11

Add k to the obs counter.

Code:

foreach k in 1 2 3 4 5 {
  filelist, pat("`k'*.TXT") save("name_files_`k'.dta") replace
  use "name_files_`k'.dta", clear
  local obs`k' = _N
  forvalues i=1/`obs`k'' {
    use "name_files_`k'.dta" in `i', clear
    local f = filename
    clear
    infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
    tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
    gen source = "`f'"
    save "mydata`k'_`i'.dta", replace
  }
}

Then add k to the append loop.

Code:

clear
foreach k in 1 2 3 4 5 {
  forvalues i=1/`obs`k'' {
    append using "mydata`k'_`i'.dta"
  }
}
save "mydatacombo.dta", replace

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#23

18 Nov 2015, 12:21

Friedrich Huebler,

The pattern() option in filelist is the equivalent of using the strmatch() function and of course text matching is case sensitive. The great thing about filelist is that you can leverage all the power of Stata's data management features since it creates a dataset of files. Just use it without any pattern and do the matching yourself. Something like:

Code:

filelist
keep if regexm(lower(filename),"\.txt$")

The only exception is the limit encountered above when there are more than 10,000 files in a single directory. I show in #21 one way to circumvent the limitation.
Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

#24

18 Nov 2015, 13:40

Thank you again!!

Finally I ran this shorter code below and it worked!

Code:

foreach k in 1 2 3 4 5 {

clear
tempfile temp
save `temp', emptyok

filelist, pat("`k'*.TXT") save("name_files_`k'.dta") replace
use "name_files_`k'.dta", clear
local obs = _N
forvalues i=1/`obs' {
    use "name_files_`k'.dta" in `i', clear
    local f = filename
    clear
    infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
    tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using "`f'"
    gen source = "`f'"
    append using `temp'
    display "`k' & `i'"
    save `"`temp'"', replace
}
use `temp', clear
save "mydata_`k'.dta", replace
}

Comment

wbuchanan

Join Date: Mar 2014
Posts: 1362

#25

18 Nov 2015, 14:18

Paula de Souza Leao Spinola you don't need to have the files listed anywhere a priori (http://www.ats.ucla.edu/stat/stata/f...many_files.htm). This assumes all the files have the exact same specifications.

Code:

// Move to the directory where the files are located
cd C:\Users\Paula\Pesquisa\DADOS\dados_georreferenciados\CEPs_Censo_2010\dados_bruto

// List the files in the directory and pipe them into a text file
! DIR *.txt /a-d /b > C:\Users\Paula\Pesquisa\DADOS\dados_georreferenciados\CEPs_Censo_2010\dados_bruto\files.txt

// Open a connection to the file
file open flist using C:\Users\Paula\Pesquisa\DADOS\dados_georreferenciados\CEPs_Censo_2010\dados_bruto\files.txt, r

// Read the first line of the file
file read flist fname

// Load the data and clear any data currently in memory
infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using `"`fname'"', clear

// Reserve namespace for a temp file
tempfile mytempfile

// Save the data temporarily
qui: save `mytempfile', replace

// Read the second line; -file read- sets r(eof) to 1 at the end of the file
file read flist fname

// Loop until the end of the file list is reached
while r(eof) == 0 {

    // Load the data and clear any data currently in memory
    infix uf 1-2 mun 3-7 dist 8-9 subdist 10-11 set 12-15 st_set 16-16 str lati 322-336 str longe 337-351 ///
    tp 472-473 str subtp 474-513 quadra 545-547 face 548-550 cep 551-558 using `"`fname'"', clear

    // Append the data accumulated so far
    append using `mytempfile'

    // Save over the temp file
    qui: save `mytempfile', replace

    // Read the next line
    file read flist fname

} // End WHILE Block

// Close the open file connection
file close flist

// save the data permanently
save finalfile.dta, replace

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#26

18 Nov 2015, 14:51

wbuchanan

The UCLA FAQ is a terrible example of how to combine a large number of files. Aside from the fact that direct file I/O (using the file command) is way above the skill set of many Stata users and that it requires shelling out, it continues to sell the input/append/save method which is terribly inefficient. Consider the following scenario where you are trying to append 10,000 files, each at 500K:

Code:

. clear

. set obs 10000
number of observations (_N) was 0, now 10,000

. gen fsize = 500000

. * the size of the append dataset each time it is saved
. gen double appendsize = sum(fsize)

. * the total number of bytes written
. gen double totalbytes = sum(appendsize)

. dis %21.0fc totalbytes[_N]
   25,002,500,000,000

That's 25 terabytes of I/O to create an appended dataset of 5GB!
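
As a cross-check on that figure, here is the same sum in closed form. It is only a sketch under simplifying assumptions: every file is exactly s = 500,000 bytes, the accumulated dataset is rewritten in full after each of the n = 10,000 appends, and read I/O and .dta overhead are ignored. The symbols s and n are introduced here for illustration.

```latex
\[
\sum_{k=1}^{n} k\,s \;=\; s\,\frac{n(n+1)}{2}
\;=\; 500{,}000 \times \frac{10{,}000 \times 10{,}001}{2}
\;=\; 25{,}002{,}500{,}000{,}000 \ \text{bytes} \;\approx\; 25\ \text{TB},
\qquad
n\,s \;=\; 5 \times 10^{9} \ \text{bytes} \;=\; 5\ \text{GB}.
\]
```

The bytes written grow quadratically with the number of files, while the final dataset grows only linearly.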
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#27

18 Nov 2015, 15:17

Because of the back and forth on the 10,000 file limit and the confusion about case sensitive matching, it may appear that the task of appending a lot of files is more complicated than it is. Here's a summary of how I would handle the task:

First, make a list of files to append. This is easy to do with filelist (from SSC).

Code:

filelist, pat("*.TXT")
local obs = _N
save "myfiles.dta", replace

Next, use a loop to input each text file into Stata datasets. It may appear clumsy to reload the dataset of files ("myfiles.dta") at each pass but the data in memory needs to be cleared anyways to input the next text file so this adds very little overhead:

Code:

forvalues i=1/`obs' {
    use "myfiles.dta" in `i', clear
    local f = filename
    insheet using "`f'", clear
    gen source = "`f'"
    save "mydata_`i'.dta", replace
}

Finally, use another loop to append the Stata datasets

Code:

clear
forvalues i=1/`obs' {
    append using "mydata_`i'.dta"
}
Comment
Friedrich Huebler

Join Date: Apr 2014

Posts: 1053
#28

18 Nov 2015, 15:48

How does the code in post #27 work with the 10,903 files that Paula has? The command

Code:

filelist, pat("*.TXT")

reads at most 10,000 files.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#29

18 Nov 2015, 16:06

It doesn't; I was just summarizing the general steps of appending lots of files. Another approach to deal with the 10,000 file limit per directory is to split the files into sub-directories with fewer than 10,000 files each. Since filelist will search for files recursively throughout sub-directories of the current directory, all that is needed is something like:

Code:

* make a list of files, include all sub-directories
filelist, pat("*.TXT")
local obs = _N
save "myfiles.dta", replace

* input each file and save Stata datasets
forvalues i=1/`obs' {
    use "myfiles.dta" in `i', clear
    local f = dirname + "/" + filename
    insheet using "`f'", clear
    gen source = "`f'"
    save "mydata_`i'.dta", replace
}

* Append!
clear
forvalues i=1/`obs' {
    append using "mydata_`i'.dta"
}
save "mydatacombo.dta", replace

Again, it's best to use Stata's data management features to first build the list of files to input (via creating a dataset of files to target).
Comment
