Hi:
I am guessing there is a straightforward solution to this, but I am having problems reading in all of the files in a directory that contains about 6,000 files.
My set-up:
Windows 7 Enterprise x64
8 GB RAM
Stata 12.1 MP
Folders I am reading in:
- N=6142 CSV files
- most files have longish names around 100 characters, e.g.
PingerFILE_OfficePC4_LatestBackup_ExternalDrive_FullEncoding_NoCharacterAbbreviation_CheckedCopy_09-10-14_22.11.43.txt
- some files have non-standard characters in them, e.g.
PingerFILE_OfficePC4_gÇôstg_09-10-07_15.46.18.txt
I list below the two approaches I have tried (unsuccessfully).
I am guessing the first one can work with a small tweak, but any suggestions on how I could get either approach to work would be most appreciated!
Strategy 1
I tried using the usual approach in Stata:
. cd "D:\Data\Workdata"
. local tempvar : dir . files "pingerfile*.*"
After the second line I get the error message,
too many filenames
r(134);
If instead I use,
. local tempvar : dir . files "pingerfile*.*", nofail
it reads in the first 1768 files.
I tried the fs ado command
. fs pinger*.*
and it again gives the "too many filenames" error code.
Is there a setting I can change to overcome this limit? It seems strange, since I do not think I am reading in more than 175k or so characters (1768 files * 100 characters/file) before it stops.
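In case it helps anyone suggest a fix: one workaround I have seen mentioned (a sketch, untested on my setup) is Mata's dir() function, which returns the file names as a string column vector instead of packing them all into one local macro, so the macro-size limit should not apply:

```stata
* Sketch: Mata's dir() returns a string colvector, so the listing
* is not squeezed into a single local macro as with `: dir'
mata:
    files = dir("D:/Data/Workdata", "files", "pingerfile*.*")
    rows(files)   // number of matching files
end
```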
Strategy 2
Generate a list of files and then loop through it.
The issue here is the non-standard characters: the file listing seems to change the encoding, which prevents the later file read-in.
So I run something like
. ! dir "D:\Data\Workdata" > PingerFiles.txt
. insheet using PingerFiles.txt
. gen tmp=index(v1, "PINGERFILE")
. drop if tmp==0
. gen FileName=substr(v1,tmp,.)
. local i=1
. while `i' <=_N {
. local k=FileName[`i']
. preserve
. insheet using "`k'", comma nonames double
.
.
.
. restore
. local i=`i'+1
}
The issue is that when the loop gets to a file with non-standard characters like
PingerFILE_OfficePC4_gÇôstg_09-10-07_15.46.18.txt
it crashes and says it cannot find the file. I am guessing the encoding gets changed when the listing is redirected to a file by the shell command in the first line of the code.
On the positive side, if I skip these files I have no problem looping through all of the files in my folder.
I am not sure if there is a way to salvage this approach.
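One variant I have considered but not tested: drive the loop from Mata's dir() directly, so the names never round-trip through the DOS console code page at all (the directory path and pattern below are just my own folder and naming scheme). I believe the user-written filelist command on SSC wraps the same idea.

```stata
* Sketch (untested): build the file list in Mata so the names
* are never written through the console code page
local path "D:/Data/Workdata"
mata: files = dir("`path'", "files", "PingerFILE*.txt")
mata: st_local("n", strofreal(rows(files)))
forvalues i = 1/`n' {
    mata: st_local("f", files[`i'])   // pull the i-th name into a local
    preserve
    insheet using "`path'/`f'", comma nonames double
    * ... per-file processing ...
    restore
}
```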