does Stata store the filename in memory?

Robert Picard

Join Date: Mar 2014
Posts: 1536

#16

13 Dec 2017, 19:36

That approach would not work because once you keep observations with form1, you will not find any with form3 as they have already been dropped. Your error was that the pattern is missing a right single quote when you refer to j:

Code:

keep if strmatch(filename, "*`j'*.txt")

If you are going to prune files upfront, you might as well break up the filename into the parts you wanted from the start. That way you can make sure that all the filenames you want to process match your expectations. You could do something like:

Code:

clear all
filelist, dir("text_files")
 
* reduce to files with a ".txt" file extension
keep if strmatch(filename, "*.txt")

* split the file name into parts
gen s = subinstr(filename,".txt", "", 1)
split s, parse("_")
rename (`r(varlist)') (id date form)
assert !mi(id, date, form)

* reduce to form1 and form3
keep if inlist(form, "form1", "form3")

Note that inlist() is limited to 10 arguments (I think) when used with strings, which leaves room for 9 match strings after the variable. If you have more, you can make a separate dataset with the list of values and use merge to reduce the observations to those that match the list.

Here's an expanded version of the program that handles the extra part variables:

Code:

* code to import one text file
program import_txt
  // move values of interest from variables to locals
  local dsource = dirname
  local fsource = filename
  local id1 = id
  local date1 = date
  local form1 = form
  
  import delimited using `"`dsource'/`fsource'"', clear stringcols(_all) varnames(nonames)
  
  // get the desired info
  keep if strpos(v1,"name:")
  gen name = subinstr(v1,"name:","",1)

  // copy over the file's information
  gen sourcefile = `"`fsource'"'
  gen sourcedir  = `"`dsource'"'
  gen id = "`id1'"
  gen date = "`date1'"
  gen form = "`form1'"
end

runby import_txt, by(dirname filename) verbose

Comment

Kyle Smith

Join Date: Mar 2017

Posts: 93
#17

13 Dec 2017, 20:49

I got the above code to work on specific file types, but that does not get around the 10,000 file limit. For example if I

Code:

clear all
filelist, dir("text_files")
keep if strmatch(filename, "*form1*.txt")

And there are 100 "form1" files and 20,000 total files in the folder, it only collects ~ half of the data in the files with type "form1" (it will not consider the last 10,000 files).

I have 100,000 files per year and 20 years. This means I need to break the data into 200 different subdirectories.

I tried your method in post #29 here, but got the error "no observations r(2000)" while going through the first loop. It seems the missing values (or whatever is causing the "no observations" error) cause problems within loops, so it would be best to avoid loops and use the runby command instead (which works wonderfully). The only problem is that filelist can only handle 10,000 files at a time.

Is there a way to have filelist consider more than 10,000 files in the second line of the above code? I don't need to keep more than 10,000 files in the 3rd line, but need to consider more than 10,000 in the 2nd line.

Thanks again.
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 93
#18

13 Dec 2017, 20:51

I did not see your post #16 until after I posted my #17. Apologies. It wasn't there when I started writing mine.
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 93
#19

13 Dec 2017, 21:03

Robert,

Thanks for all the time you spent on replying to my posts. Also, thanks for authoring 'filelist' and 'runby'.

Your new code in #16 takes care of the form1 and form3 issue, but that was in an effort to avoid having to break up the 2M files into directories of fewer than 10,000. The problem is in the first line of code:

Code:

filelist, dir("text_files")

I can't use loops because they stop when they find "no observations" and I can't use filelist because I have 100,000 files in each of 20 folders. Any ideas? Can you force filelist to consider more than 10,000 files?

Thanks again,
Kyle
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 93
#20

14 Dec 2017, 06:06

Alternatively, could I delete all files not of type=form1 | type=form3 using the erase command? The files are backed up in zip format elsewhere on my HD. When I need access to other types I can just erase all existing files and unzip again and start over.

Something like:

Code:

cd "text_files"
local list : dir . files "*form1*.txt" "*form3*.txt"
foreach f of local list {
    erase "`f'"
}

Once that is done, then carry out the above code (post #16). Would that work, assuming that after the erase command ran there were less than 10,000 files of form1 and form3 remaining?
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#21

14 Dec 2017, 08:07

To clarify, filelist has no limit on the number of files it can handle and will happily scan your whole hard disk. The issue here is that the Mata function dir() has a hard-coded limit of 10,000 files it can return. See this post from 2015 that mentions the limit. What this means is that, in any given directory (ignoring its subdirectories), filelist will only return the first 10,000 files.

You can still use the filelist approach as long as you can identify patterns that will reduce the number of files returned in a given directory to below 10,000. The following example will collect all form1 and form3 separately:

Code:

clear all

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 form
"form1"
"form3"
end

program get_in_parts
  local f = form
  filelist, dir("mega_files") pattern("*`f'*")
end
runby get_in_parts, by(form)

Let me know if this works for you. If not, you can still get there via filelist by removing/renaming/copying files once they have been captured. It shouldn't be too hard to put together code for that but there's no point in trying if you can manage with the above technique.
Comment

Kyle Smith

Join Date: Mar 2017
Posts: 93

#22

14 Dec 2017, 08:25

Robert,

You da man! That works.

So now the entire thing would look like (I think the below steps are in the correct order):

Code:

clear
input str5 form
"form1"
"form3"
end

program get_in_parts
  local f = form
  filelist, dir("mega_files") pattern("*`f'*")
end
runby get_in_parts, by(form)

program import_txt
  // move values of interest from variables to locals
  local dsource = dirname
  local fsource = filename
  local id1 = id
  local date1 = date
  local form1 = form
  
  import delimited using `"`dsource'/`fsource'"', clear stringcols(_all) varnames(nonames)
  
  // get the desired info
  keep if strpos(v1,"name:")
  gen name = subinstr(v1,"name:","",1)

  // copy over the file's information
  gen sourcefile = `"`fsource'"'
  gen sourcedir  = `"`dsource'"'
  gen id = "`id1'"
  gen date = "`date1'"
  gen form = "`form1'"
end

runby import_txt, by(dirname filename) verbose

save import_txt, replace

That's awesome. Very grateful for the help.

Comment

Kyle Smith

Join Date: Mar 2017

Posts: 93
#23

28 Apr 2018, 21:22

I am trying to make the above code work over many sub-directories, some of which have more than 10K files in them. To do this, I am trying to incorporate Robert's code here: https://www.statalist.org/forums/for...57#post1317257

The tricky part for me is:

1) I don't know how saving files works within runby.
2) I don't know how to pass the runby output file to the follow-on import_txt program.
3) I don't know how to build or store/save the runby output file.

I think I only need to manipulate program get_in_parts (lines 7-11).

I think the code should be something like:

Code:

program get_in_parts
  local g = form
  filelist, dir("mega_files") pattern("*`g'*")
  local obs = _N
  save "myfiles.dta", replace

  forvalues i=1/`obs' {
    use "myfiles.dta" in `i', clear
    local f = dirname + "/" + filename
    insheet using "`f'", clear
    gen source = "`f'"
    save "mydata_`i'.dta", replace
  }

  clear
  forvalues i=1/`obs' {
    append using "mydata_`i'.dta"
  }
  save "mydatacombo.dta", replace
end
runby get_in_parts, by(form)

The code runs with no problems, but results in an empty data set when complete.

If I open "mydatacombo.dta", it does NOT have dirname or filename variables like it does when I run it within a specific folder with less than 10K files. It only has a source variable.

Any help is very much appreciated.

Thanks in advance. The filelist and runby commands are wonderful and have made my life better, easier, and faster. Am grateful for them.
Comment
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#24

29 Apr 2018, 03:12

Here is what -get_in_parts- is doing:

* get a pattern in variable -form-
* put all filenames matching this pattern in myfiles.dta
* for each filename, import and save in a new dta file
* append all the resulting dta files and save as mydatacombo.dta

The variables -dirname- and -filename-, which are found in myfiles.dta, should not appear in mydatacombo.dta (unless they also appear in the files you import). It does, however, have a -source- variable, which comes from the line "gen source = "`f'"" just before "save mydata_`i'.dta".

There is, however, a problem: -runby- will run -get_in_parts- once for each pattern and save each time to the same file, mydatacombo.dta, replacing the previous contents. In the end, you will have only the data from the last pattern. To make it work, you would have to save to mydatacombo_`g'.dta, then write another loop over the patterns to append all these combo files.

Hope this helps

Jean-Claude Arbaut

PS: if all you want is appending all the files in a specific folder into a single Stata datafile, it could be simpler to write a Python program that writes a Stata do file that does everything (import and append). Let Python retrieve the filenames (as it's very good at that and has no 10K limit), and Stata do the import. That's what I would do, anyway.
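A minimal sketch of the idea in the PS, assuming the files are plain ".txt" files that can be read with the same import delimited options used earlier in this thread. The function names, the tempfile name combo, and the output name mydatacombo_all.dta are illustrative choices, and the generated do-file has not been run against the real data:

```python
import os
import sys

def list_txt_files(root):
    """Return every .txt file under root, subdirectories included."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                # Stata accepts forward slashes on every platform
                found.append(os.path.join(dirpath, name).replace(os.sep, "/"))
    return sorted(found)

def build_do_file(paths, combined="mydatacombo_all.dta"):
    """Return the text of a do-file that imports and appends each path."""
    lines = ["clear", "tempfile combo", "save `combo', emptyok"]
    for p in paths:
        # assumes no path contains a double quote
        lines.append(f'import delimited using "{p}", clear stringcols(_all) varnames(nonames)')
        lines.append(f'gen source = "{p}"')
        lines.append("append using `combo'")
        lines.append("save `combo', replace")
    lines.append(f'save "{combined}", replace')
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # hypothetical default folder name; pass the real one as an argument
    root = sys.argv[1] if len(sys.argv) > 1 else "mega_files"
    sys.stdout.write(build_do_file(list_txt_files(root)))
```

Redirect the output to a file such as import_all.do and run that file in Stata. Appending to the running total after every import is simple but slow for very large numbers of files; it is meant only to show the shape of the approach.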

Last edited by Jean-Claude Arbaut; 29 Apr 2018, 03:20.
Comment

Kyle Smith

Join Date: Mar 2017
Posts: 93

#25

29 Apr 2018, 06:01

Jean-Claude - thanks for the reply and clarification. I do not have a great understanding of runby. It is clearer now.

I did not know Python could write Stata do files. I am very much a novice at Python, so I would prefer to keep this in Stata if possible.

Is this closer?

Code:

program get_in_parts
  local g = form
  filelist, dir("mega_files") pattern("*`g'*")
  local obs = _N
  save "myfiles.dta", replace

forvalues i=1/`obs' {
    use "myfiles.dta" in `i', clear
    local f = dirname + "/" + filename
    insheet using "`f'", clear
    gen source = "`f'"
    save "mydata_`f'.dta", replace
}

clear
forvalues i=1/`obs' {
    append using "mydata_`f'.dta"
}
save "mydatacombo_`g'.dta", replace

clear
forvalues i=1/`g' {
    append using "mydatacombo_`g'.dta"
}
save "mydatacombo_all.dta", replace

end
runby get_in_parts, by(form)

Thanks in advance.

Comment

Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#26

29 Apr 2018, 07:34

You should probably put the last loop outside of -get_in_parts-.

Code:

program get_in_parts
    local g = form
    filelist, dir("mega_files") pattern("*`g'*")
    local obs = _N
    save "myfiles.dta", replace

    forvalues i=1/`obs' {
        use "myfiles.dta" in `i', clear
        local f = dirname + "/" + filename
        insheet using "`f'", clear
        gen source = "`f'"
        save "mydata_`i'.dta", replace
    }

    clear
    forvalues i=1/`obs' {
        append using "mydata_`i'.dta"
    }
    save "mydatacombo_`g'.dta", replace
    glo patterns $patterns `g'
end
runby get_in_parts, by(form)

clear
foreach g in $patterns {
    append using "mydatacombo_`g'.dta"
}
save "mydatacombo_all.dta", replace

Side note: be aware that patterns are not necessarily exclusive. For instance, "file12.csv" is matched by *1*, *2* and *12*, among others.
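To make the side note concrete, here is a small sketch that checks which wildcard patterns match a given filename. It uses Python's fnmatch module as a stand-in for Stata's matching, on the assumption that a plain * wildcard behaves the same way in both; the function name is made up for the illustration:

```python
from fnmatch import fnmatchcase

def matching_patterns(filename, patterns):
    """Return the patterns that match filename (case-sensitive)."""
    return [p for p in patterns if fnmatchcase(filename, p)]

# The example from the side note: three of the four patterns match.
print(matching_patterns("file12.csv", ["*1*", "*2*", "*12*", "*3*"]))

# A hypothetical filename: "*form1*" also picks up a form10 file.
print(matching_patterns("id_date_form10.txt", ["*form1*", "*form10*"]))
```

Any file matched by more than one pattern would be listed, and later appended, once per matching pattern.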

Another bug I overlooked: save "mydata_`f'.dta", replace is not correct, because f is a pathname, with one or more slashes: Stata would try to save a file "`filename'.dta" in a directory called "mydata_`dirname'", and this directory will likely not exist. However, the following forvalues tells me you really wanted to use `i' instead of `f' in the name. I've made the correction in my answer above.

About Python: it can easily write a text file, and you are free to put Stata commands in what you write. Hence, even if it's not mentioned in the Python documentation, of course it can "write a do file" (I have also used Python to write SAS programs in the past, to prepare SAS formats and Stata labels from raw csv data, and for many other similar tasks). Even if you don't write the full do file with Python, you could still write a list of filenames (one name per row) that you could read in Stata and then use to import the data. But I'll leave this, as it's off-topic here and you prefer to use Stata. Another possibility, still within Stata, would be to write a plugin (either Java or C/C++) that does the job of finding filenames, but that would be a more "advanced" project. However, it would be much more robust (see the risk with patterns above).

For the record, here is a Python program that prints a list of filenames. You can redirect the output to a text file and import this in Stata. That's the basis of several programs I use (here I removed all error checking to make it as simple as possible).

Code:

import sys, os

def readdir(path):
    for name in os.listdir(path):
        c = os.path.join(path, name)
        if os.path.isfile(c):
            print(c)
        elif os.path.isdir(c):
            readdir(c)

readdir(sys.argv[1])

Last edited by Jean-Claude Arbaut; 29 Apr 2018, 07:45.
Comment
Kyle Smith

Join Date: Mar 2017

Posts: 93
#27

29 Apr 2018, 08:05

Jean-Claude, thanks again for the help.

A couple more questions.

You mention in your side note that patterns are not exclusive. Will this cause a problem in the last loop? What is the problem? And is there a simple way to resolve it (besides writing a plugin)?

Also, the last loop fails to run. It gives an error "invalid syntax r(198)". I thought maybe it should be `g' instead of g in the last loop, but that didn't resolve it. Is it because g is defined as a local variable in get_in_parts and as a global later?

Last edited by Kyle Smith; 29 Apr 2018, 08:12.
Comment

Jean-Claude Arbaut

Join Date: Jul 2017
Posts: 209

#28

29 Apr 2018, 08:25

If your patterns are not exclusive, then the same file can appear several times in the listings, thus be appended several times. This can probably be a problem for you.

The syntax error may come from the form of patterns. I should have asked first: what does the form variable look like? In case Stata is not very happy with it (in foreach), you can use numbers instead:

Code:

program get_in_parts
    local g = form
    filelist, dir("mega_files") pattern("*`g'*")
    local obs = _N
    save "myfiles.dta", replace

    forvalues i=1/`obs' {
        use "myfiles.dta" in `i', clear
        local f = dirname + "/" + filename
        insheet using "`f'", clear
        gen source = "`f'"
        save "mydata_`i'.dta", replace
    }

    clear
    forvalues i=1/`obs' {
        append using "mydata_`i'.dta"
    }
    glo last=$last+1
    save "mydatacombo_$last.dta", replace
end

glo last=0
runby get_in_parts, by(form)

clear
forv i=1/$last {
    append using "mydatacombo_`i'.dta"
}
save "mydatacombo_all.dta", replace

Comment

Kyle Smith

Join Date: Mar 2017

Posts: 93
#29

29 Apr 2018, 09:08

Jean-Claude, Thanks again. I think we are getting closer.

When I run the last loop now I get the error "no variables defined r(111)".

If I open one of the mydata_`i' files, the variable source is there. Somehow we are losing that variable in the last loop? Any idea why?

****************

If I try your python code, but add a line at the beginning (after import command) like:

Code:

path="C:\\mega_files"

I get the following error.

Traceback (most recent call last):
  File "test1.py", line 24, in <module>
    readdir(sys.argv[1])
IndexError: list index out of range
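
For reference, the IndexError above is what Python raises when the script is started without a command-line argument: sys.argv then holds only the script name, so sys.argv[1] does not exist. Assigning a path variable has no effect because the last line still reads sys.argv[1]. A minimal sketch of a variant that falls back to a default folder follows; the function names and the default "mega_files" are assumptions made for the illustration:

```python
import os
import sys

def list_files(path):
    """Recursively collect the paths of all files below path."""
    found = []
    for name in sorted(os.listdir(path)):
        c = os.path.join(path, name)
        if os.path.isfile(c):
            found.append(c)
        elif os.path.isdir(c):
            found.extend(list_files(c))
    return found

def pick_root(argv, default="mega_files"):
    """Use the first command-line argument if present, else the default."""
    return argv[1] if len(argv) > 1 else default

if __name__ == "__main__":
    root = pick_root(sys.argv)
    if os.path.isdir(root):
        for f in list_files(root):
            print(f)
    else:
        print("not a directory: " + root, file=sys.stderr)
```

The original script should also work unchanged if the folder is passed on the command line, for example python test1.py "C:\mega_files" > filenames.txt.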

Last edited by Kyle Smith; 29 Apr 2018, 09:24.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#30

29 Apr 2018, 10:39

Kyle Smith, you are reviving a thread that's several months old and you seem lost compared to where it was left with #21 (and #22). The code in #21 will process as many subdirectories as you have in one pass and will create a dataset of millions and millions of file names if that's what you have. The only limitation is that Stata will not return more than 10,000 file names from a single directory (ignoring those in its subdirectories). This is a hard-coded limitation of Stata that still exists in the most up-to-date version of Stata (I just checked).

To illustrate the issue, here's code that will create a little over 50K files in a "mega_files" directory within Stata's current directory. The files are split into 3 subdirectories called "batch1", "batch2", and "batch3".

Code:

clear all
set seed 3213
set obs 11
gen form = _n
expand 3
bysort form: gen batch = _n
expand runiformint(1000,2000)
bysort form batch: gen path = "mega_files/batch" + string(batch) + ///
    "/file" + string(_n) + "_form" + string(form) + ".txt"

cap mkdir mega_files
cap mkdir mega_files/batch1
cap mkdir mega_files/batch2
cap mkdir mega_files/batch3

program doit
    local fpath = path
    save "`fpath'"
end
runby doit, by(path)

It's easy to check whether the 10K limit is biting you: simply check how many files filelist has returned per directory:

Code:

. filelist, dir("mega_files")
Number of files found = 30000

. contract dirname

. list

     +---------------------------+
     | dirname             _freq |
     |---------------------------|
  1. | mega_files/batch1   10000 |
  2. | mega_files/batch2   10000 |
  3. | mega_files/batch3   10000 |
     +---------------------------+
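
A count of exactly 10000 per directory is the cap itself and says nothing about the true number of files. One way to get the true counts is to count outside Stata. The following is a sketch in Python, the language already used earlier in this thread; the function names and the constant are illustrative:

```python
import os
import sys

STATA_DIR_LIMIT = 10000  # per-directory cap discussed in this thread

def count_files_by_dir(root):
    """Map each directory under root to the number of files directly in it."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        counts[dirpath.replace(os.sep, "/")] = len(filenames)
    return counts

def dirs_over_limit(counts, limit=STATA_DIR_LIMIT):
    """Return the directories holding more files than the limit."""
    return sorted(d for d, n in counts.items() if n > limit)

if __name__ == "__main__":
    # hypothetical default folder name; pass the real one as an argument
    root = sys.argv[1] if len(sys.argv) > 1 else "mega_files"
    counts = count_files_by_dir(root)
    for d in sorted(counts):
        flag = "  <-- over the limit" if counts[d] > STATA_DIR_LIMIT else ""
        print(d, counts[d], flag)
```

Directories flagged here are the ones that need to be scanned in parts.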

The code in #21 offers a workaround for the limitation provided you spell out a list of patterns that will pick-up fewer than 10,000 files in any given directory. In the "mega_files" directory, all files follow a pattern that I can identify. Files end with "_form1.txt", "_form2.txt", ..., "_form11.txt". With this information in hand, I can overcome the 10,000 file limit using:

Code:

clear all
input str11 form
"_form1.txt" 
"_form2.txt" 
"_form3.txt" 
"_form4.txt" 
"_form5.txt" 
"_form6.txt" 
"_form7.txt" 
"_form8.txt" 
"_form9.txt" 
"_form10.txt"
"_form11.txt"
end

program get_in_parts
  local p = form
  filelist, dir("mega_files") pattern("*`p'")
  gen form = "`p'"
end
runby get_in_parts, by(form) verbose

contract dirname form
assert _freq < 10000

list, sepby(dirname)

And here are the results:

Code:

. list, sepby(dirname)

     +-----------------------------------------+
     | dirname                    form   _freq |
     |-----------------------------------------|
  1. | mega_files/batch1    _form1.txt    1570 |
  2. | mega_files/batch1   _form10.txt    1504 |
  3. | mega_files/batch1   _form11.txt    1768 |
  4. | mega_files/batch1    _form2.txt    1445 |
  5. | mega_files/batch1    _form3.txt    1474 |
  6. | mega_files/batch1    _form4.txt    1035 |
  7. | mega_files/batch1    _form5.txt    1781 |
  8. | mega_files/batch1    _form6.txt    1137 |
  9. | mega_files/batch1    _form7.txt    1648 |
 10. | mega_files/batch1    _form8.txt    1484 |
 11. | mega_files/batch1    _form9.txt    1923 |
     |-----------------------------------------|
 12. | mega_files/batch2    _form1.txt    1633 |
 13. | mega_files/batch2   _form10.txt    1417 |
 14. | mega_files/batch2   _form11.txt    1191 |
 15. | mega_files/batch2    _form2.txt    1031 |
 16. | mega_files/batch2    _form3.txt    1903 |
 17. | mega_files/batch2    _form4.txt    1506 |
 18. | mega_files/batch2    _form5.txt    1329 |
 19. | mega_files/batch2    _form6.txt    1942 |
 20. | mega_files/batch2    _form7.txt    1141 |
 21. | mega_files/batch2    _form8.txt    1877 |
 22. | mega_files/batch2    _form9.txt    1645 |
     |-----------------------------------------|
 23. | mega_files/batch3    _form1.txt    1963 |
 24. | mega_files/batch3   _form10.txt    1540 |
 25. | mega_files/batch3   _form11.txt    1408 |
 26. | mega_files/batch3    _form2.txt    1981 |
 27. | mega_files/batch3    _form3.txt    1180 |
 28. | mega_files/batch3    _form4.txt    1330 |
 29. | mega_files/batch3    _form5.txt    1225 |
 30. | mega_files/batch3    _form6.txt    1437 |
 31. | mega_files/batch3    _form7.txt    1188 |
 32. | mega_files/batch3    _form8.txt    1979 |
 33. | mega_files/batch3    _form9.txt    1665 |
     +-----------------------------------------+

.

The code you posted in #23 refers to a post of mine that predates runby. What you posted in #23 and subsequent posts to fix the issue do not make any sense in my mind. Your task is first to make a dataset of all the files you want to process. That follows the format in #21 and the example above. Only once you are satisfied that the list is complete and that there are no remaining issues with the 10K limit, proceed to import content from each file as you do in #22.

If I'm missing something, please clarify what the issue is.
