insheet directory with many files: r(134) using dir . files AND non-ASCII characters

Josh McSorley

Join Date: May 2014

Posts: 22
#1

insheet directory with many files: r(134) using dir . files AND non-ASCII characters

26 Oct 2014, 10:06

Hi:

I am guessing there is a straightforward solution to this but I am having some problems reading in all of the files in a directory with about 6k files.

My set-up:
Windows 7 Enterprise x64
8 G RAM
Stata 12.1 MP

Folders I am reading in:
- N=6142 CSV files
- most files have longish names around 100 characters, e.g.
PingerFILE_OfficePC4_LatestBackup_ExternalDrive_Fu llEncoding_NoCharacterAbbreviation_CheckedCopy_09-10-14_22.11.43.txt
- some files have non-standard characters in them, e.g.
PingerFILE_OfficePC4_gÇôstg_09-10-07_15.46.18.txt

I list below the two approaches I have tried (unsuccessfully).
I am guessing the first one can work with a small tweak, but any suggestions on how I could get either approach to work would be most appreciated!

Strategy 1
I tried using the usual approach in Stata:
. cd "D:\Data\Workdata"
. local tempvar : dir . files "pingerfile*.*"
After the second line I get the error message,
too many filenames
r(134);
If instead I use,
. local tempvar : dir . files "peerlist*.*", nofail
it reads in the first 1768 files.

I tried the fs ado command
. fs pinger*.*
and it again gives the "too many file names" error code.

Is there a setting I can change which will overcome this limit? It seems strange since I do not think I am reading in more than 175k or so characters (1768 files * 100 character/file) before it dtops.

Strategy 2
Generate a list of files and then loop through it.
The issue here are the non-standard characters: the file listing seems to change teh encoding which prevents the later file read-in.

So I run something like
. ! dir "D:\Data\Workdata" > PingerFiles.txt
. insheet using PingerFiles.txt
. gen tmp=index(v1, "PINGERFILE")
. drop if tmp=0
. gen FileName=substr(v1,tmp,.)
. local i=1
. while `i' <=_N {
. local k=FileName[`i']
. preserve
. insheet using "`k'", comma nonames doubl
.
.
.
. restore
. local i=`i'+1
}

The issue is that when the loop gets to a file with non-standard characters like
PingerFILE_OfficePC4_gÇôstg_09-10-07_15.46.18.txt
it crashes and says it cannot find the file. I am guessing the encoding gets changed when it is written to a file in the piped command in the first row of the code.
On the positive side if I skip these files I have no problem looping through all files in my folder.

I am not sure if there is a way to salvage this approach.
Tags: None
Josh McSorley

Join Date: May 2014

Posts: 22
#2

26 Oct 2014, 17:00

Sorry two small additions/notes:

- in Strategy 1, if there is a way to have it just read the first 1000 files say, and then the next 1000, etc into tempvar I would be ok.
I am not sure how to do this though.

- in Strategy 1 there was a typo in the second dir command I listed. It was actually,
. local tempvar : dir . files "pingerfile*.*", nofail
Comment
ben earnhart

Join Date: May 2014

Posts: 1027
#3

26 Oct 2014, 20:50

I've been beating my head against a wall trying to solve your problem. It's interesting to me, and Windows scripting is one thing I do outside of Stata. I can get the command prompt to recognize non-ASCII characters, but all approaches (shell commands and Stata commands that are presumably invoking shell commands) still save the filenames into an ASCII file, thus mangling things. I do have one possible suggestion that *might* work. What if you appended all the files together *before* reading into Stata?

shell copy pinger*.txt>allpingers.txt

If there is enough information *in* each file to know what info came from what file, then you're in business? Also saves you from appending from within Stata.

ps. and no filthy nasssty little loopses getting in the way of your Precious (that's a reference to Gollum in LOTR).

Last edited by ben earnhart; 26 Oct 2014, 20:56. Reason: added ps.
Comment
Josh McSorley

Join Date: May 2014

Posts: 22
#4

26 Oct 2014, 21:26

HI Ben:

Thanks (and sorry do not want you to injure your head or the wall!).

Your suggestion was a great one! I think the ">" will give the directory listing: mine worked fine with,
. ! copy pinger*.txt allpingers.txt
I can then read in allpingers.txt with no problem.

The issue is that I also need the file names associated with each row (I would also like to get the date and file size metadata which I would with my Strategy 2, but I can live without that for now).
So the short end is if I can somehow include the original filename in the allpingers.txt file (even if it broke up the CSV pattern of the main data) I should be ok.
Comment
ben earnhart

Join Date: May 2014

Posts: 1027
#5

26 Oct 2014, 21:37

Hmm. Does each file have the exact same length? I am too burned out (not just from your issue) to write code, but if you can get away with a directory listing (drop the first five lines or so after importing), then -expand-, then -merge-, that *might* do the job. You'd lose all non-ASCII characters, and need to be really, really careful that the copy and dir commands put stuff in the same order. Sketchy, but might work.
Comment
ben earnhart

Join Date: May 2014

Posts: 1027
#6

26 Oct 2014, 21:51

btw -- that would be the "merge 1:1 _n using filename" command, which is seldom recommended. But if you know each file has the same number of cases, this could be one of the few times it's appropriate.
Comment
Josh McSorley

Join Date: May 2014

Posts: 22
#7

26 Oct 2014, 22:46

Alas the files are all of different lengths so I do not think this will work.
Comment

Joseph Coveney

Join Date: Apr 2014
Posts: 4403

26 Oct 2014, 23:26

It looks as if the file names contain a date-time signature. If so, then you can exploit that. You can set up a series of nested loops of pertinent years, months and days and limit the total number of files that local tempvar : dir . files "pingerfile*.*" sees each pass through the loop to a small number that you know it can handle. Something along the lines of the illustration below should work. (I've used commands that are in Release 13, but you can see what I'm getting at--start at the "Begin here" comment--and adapt it to your Release-12 code accordingly.)

Code:

version 13.1

clear *
set more off

*
* Create a couple of toy spreadsheet files for illustration
*
foreach file_name in ///
    "PingerFILE_OfficePC4_LatestBackup_ExternalDrive_Fu llEncoding_NoCharacterAbbreviation_CheckedCopy_09-10-14_22.11.43.txt" ///
    "PingerFILE_OfficePC4_gÇôstg_09-10-07_15.46.18.txt" {
    quietly set obs 2
    generate byte a = 1
    generate byte b = 2
    quietly export delimited "`file_name'", delimiter(",") replace
    drop _all
}

*
* Begin here
*
tempfile tmpfil0
quietly count
assert r(N) == 0
quietly save "`tmpfil0'", emptyok

// cd whatever
forvalues year = 1/99 {
    local Year : display %02.0f `year'
    forvalues month = 1/12 {
        local Month : display %02.0f `month'
        forvalues day = 1/31 {
            local Day : display %02.0f `day'

            local spreadsheet_list : dir . files "pingerfile*_`Month'-`Day'-`Year'*.txt"
            if `"`spreadsheet_list'"' != "" {
                foreach spreadsheet in `spreadsheet_list' {
                    display in smcl as text "`spreadsheet'"
                    import delimited "`spreadsheet'", delimiter(comma) ///
                        varnames(1) asdouble rowrange(2) clear

                    generate str244 spreadsheet = "`spreadsheet'"
                    append using "`tmpfil0'"
                    quietly save "`tmpfil0'", replace
                }
            }
        }
    }
}
exit

Consider adopting ISO 8601 for formatting dates and date-times. I find that it makes coding much easier, and I highly recommend it.

Last edited by Joseph Coveney; 26 Oct 2014, 23:28.

Comment

Josh McSorley

Join Date: May 2014

Posts: 22
#9

27 Oct 2014, 11:00

Hi Joseph:

Thanks great suggestion! I think this should work for most of my files (I have other folders like this one I need to read in...).

My only concern is that some files might get omitted due to idiosyncrasies of my file names: some file names have the date stamp at the end omitted.
For robustness sake I would like to be able to confirm all files are read-in (and preferably to also confirm they are read in exactly once --- I do not think this would be the case but some file might have other time stamps in the early part of the file name and so could get read in on multiple days in the loop).
Is there a way to do this?
Comment
Josh McSorley

Join Date: May 2014

Posts: 22
#10

29 Oct 2014, 10:45

I did manage to find a rather clumsy way to do this using the command prompt leveraging Ben's suggestion. In the code below, I add the file names in before appending the contents of each file (in the omitted code that comes afterwards I reorganize the file to put the file names in a separate variable; I also had to use filelfilter to read in the file due to the non-standard characters and quote marks in some file names)

. local j="D:\Data\Workdata"

. ! for %I in ("`j'pingerfile*.*") do @echo. >>allfiles.txt && @echo =====%~tnxI >>allfiles.txt && type "%I" >>allfiles.txt

Hopefully this may be of some help to anyone stumbling on this thread later.

Still I am hoping someone from Stata can chime in here to indicate if there is some parameter which can be changed to make my Strategy 1 listed in the original post work.
1 like
Comment

Announcement