I am using -import delimited- to read several large (about 1 GB) text files in which fields are delimited by "||", i.e., two pipes in a row. For at least one observation, parsing *sometimes* does not occur at the "||". For example, after importing into Stata, I have one variable that contains:
.....||0.00 || || || || || || ||0.00 ||0 ||0 ||3 .....
where "....." represents an ellipsis. (The variable is about 900 characters wide, but should probably be more like 100.)
So, not all the fields are being parsed. The "0.00 " should be a variable, as should the " ", etc.
For each file, it's nice to reduce the size by trimming consecutive blanks, so the code I am using for each file has a bit of filtering before the import:
(Note that in the code below everything is read as string.) My understanding is that my use of the -delimiters()- option with -collapse- should fit this "||" file scheme. I did check one of the problematic fields with -charlist- (SSC) and found only ASCII codes in the printable range (32-124). I also checked one known problematic observation, and the delimiter did occur in pairs ("||"), 181 times as it happened.
Code:
// infile and outfile defined previously
// Read file; itrim() consecutive blanks; write file to temp
tempfile temp
set obs 1
// DIY filter with fileread() filewrite()
gen s = fileread("`infile'")
replace s = itrim(s)
gen b = filewrite("`temp'", s)
clear
// Import
import delimited using "`temp'", delimiters("|", collapse) ///
    varnames(nonames) stringcols(_all)
save "`outfile'", replace
1) Can anyone suggest a reason why this failure to parse would occur, or does this look like a bug?
2) My idea of the next step is to try to just read each line as a strL with -infix-, and parse it myself with -split-. Or, I thought that filtering the delimiters from "||" to "|" might help -import delimited- work better. Other suggestions?
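For concreteness, the delimiter-filtering idea might look roughly like the following. This is an untested sketch, not a confirmed fix: `temp2` is a hypothetical second tempfile, and I am assuming that -filefilter- copes with files of this size and that no field legitimately contains a "|".

```stata
// Sketch only: rewrite each "||" as a single "|" before importing.
// temp2 is a hypothetical name; `temp' and `outfile' are as in the code above.
tempfile temp2
filefilter "`temp'" "`temp2'", from("||") to("|") replace

// With single-character delimiters, -collapse- is dropped so that
// empty fields (originally "||||") are kept as empty strings.
import delimited using "`temp2'", delimiters("|") ///
    varnames(nonames) stringcols(_all)
save "`outfile'", replace
```

Whether this avoids the parsing failure depends on its cause, which I have not identified.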
Regards, Mike