  • Problem (possible bug?) with -import delimited- with collapse option

    I am using -import delimited- to read several large (1G) text files in which fields are delimited by "||", i.e., two "pipes" in a row. For at least one observation, I find parsing is *sometimes* not occurring at the "||". For example, after importing into Stata, I have one variable that contains:
    .....||0.00 || || || || || || ||0.00 ||0 ||0 ||3 .....
    where "....." represents elided text. (The variable is about 900 characters wide, but should probably be more like 100.)
    So, not all the fields are being parsed. The "0.00 " should be a variable, as should the " ", etc.

    For each file, it's nice to reduce the size by trimming consecutive blanks, so the code I am using for each file has a bit of filtering before the import:

    Code:
    // infile and outfile defined previously
    // Read file; itrim() consecutive blanks; write file to temp
    tempfile temp
    set obs 1           // DIY filter with fileread() filewrite()
    gen s = fileread("`infile'")
    replace s = itrim(s)
    gen b = filewrite("`temp'", s)
    clear
    // Import
    import delimited using "`temp'", delimiters("|", collapse) ///
             varnames(nonames) stringcols(_all)
    save "`outfile'", replace
    (Note that everything is to be read as string.) My understanding is that my use of the -delimiters()- option with -collapse- should fit this "||" file scheme. I did check one of the problematic fields with -charlist- (SSC) and found only ASCII codes in the printable range (32-124). Also, I checked one known problematic observation, and the delimiter did occur in pairs ("||"), 181 times as it happened.
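    As an aside on the filtering step above: Stata's itrim() collapses runs of two or more consecutive blanks to a single blank. A minimal Python analogy of that filter (an illustration only, not Stata's code; the sample string is hypothetical):

```python
import re

def itrim(s: str) -> str:
    # Mimic Stata's itrim(): collapse runs of two or more
    # consecutive blanks down to a single blank.
    return re.sub(r" {2,}", " ", s)

print(itrim("a  b   c ||0.00 ||"))  # a b c ||0.00 ||
```

    Note that this keeps single blanks intact, so fields such as "0.00 " survive the filter unchanged.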

    1) Can anyone suggest a reason why this failure to parse would occur, or does this look like a bug?

    2) My idea of the next step is to try to just read each line as a strL with -infix-, and parse it myself with -split-. Or, I thought that filtering the delimiters from "||" to "|" might help -import delimited- work better. Other suggestions?

    Regards, Mike

  • #2
    In looking at help import delimited, I learned that
    Code:
    delimiters("||", asstring)
    might be designed to do what you need. Please let us know if it helps!



    • #3
      Thanks for the help. Just got back to this task. Using -asstring- seemed to result in the same problem, that is, for the problematic variable/observation in question, parsing did not occur, so that there are many "||" contained within the variable, and the length is still about 900 bytes or so.

      This time, I used:
      Code:
      import delimited using "`temp'", delimiters("||", asstring) ///  note "||"
               varnames(nonames) stringcols(_all)
      // Rather than the previous:
      import delimited using "`temp'", delimiters("|", collapse) ///  note "|"
               varnames(nonames) stringcols(_all)
      Further, I'm confused by the documentation for asstring and collapse, which sound like slightly different versions of the same thing to me:

      " delimiters("chars"[, collapse|asstring])
      collapse forces import delimited to treat multiple consecutive delimiters as just one delimiter.
      asstring forces import delimited to treat chars as one delimiter. By default, each character in chars is treated as an individual delimiter."


      What's your take on the difference between these two?

      Regards, Mike

      Last edited by Mike Lacy; 01 Sep 2015, 11:14.



      • #4
        New datum, perhaps revealing, but mystifying to me:

        I filtered the input text file to change all instances of "||" to just "|", and then imported it accordingly with:
        Code:
        import delimited using "`temp'", delimiters("|") ///
                     varnames(nonames) stringcols(_all)
        When I did this, I found that the problematic field in the problematic observation was still not properly parsed at each "|". Parsing went along nicely for 100 characters or so, encountered the first(?) "|", and then failed to split the string at a good many of the remaining delimiters. Again, to my knowledge, there are no untoward character codes in the field. I'm leaning more heavily toward a bug now, but before I contact tech support, any other thoughts? - Mike



        • #5
          Mike,

          My interpretation of the difference between asstring and collapse is that collapse changes multiple consecutive occurrences of the same delimiter to one occurrence (e.g., || becomes |), whereas if you specified more than one delimiter, asstring would treat the group of delimiters as a single delimiter. For example, delimiters(",;|") would treat "a,b;c|d" as four different fields, while delimiters(",;|", asstring) would treat "a,b;c|d" as one field but would treat "a,;|b,;|c,;|d" as four fields.

          If you only have one delimiter (as in your case), delimiters("|",collapse) and delimiters("||",asstring) should do the same thing. asstring seems designed for very unusual cases where multiple different delimiters are present between each field.
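          Joe's three cases can be checked mechanically. As a rough Python analogy of the splitting rules (an illustration, not how Stata implements them):

```python
import re

line = "a,b;c|d"

# Default: each character in the chars list is an independent delimiter.
print(re.split(r"[,;|]", line))      # ['a', 'b', 'c', 'd']

# asstring: the whole string ",;|" is one delimiter, so this line
# contains no delimiter at all and stays a single field.
print(line.split(",;|"))             # ['a,b;c|d']
print("a,;|b,;|c,;|d".split(",;|"))  # ['a', 'b', 'c', 'd']

# collapse: runs of consecutive delimiters count as one.
print(re.split(r"\|+", "a||b|||c"))  # ['a', 'b', 'c']
```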

          Sorry, I don't have any other insights into your problem at the moment.

          Regards,
          Joe



          • #6
            Let me see if I can shed some light on collapse vs asstring...

            The collapse option tells import delimited to treat multiple consecutive delimiters in the file as just one delimiter; in other words, two or more delimiters occurring right together are treated as one. Said another way, collapse changes the default behavior, in which a delimiter immediately followed by another delimiter would normally be read as a missing value.

            The asstring option tells import delimited how to handle a multi-character delimiter specification. By default, import delimited treats each delimiter character specified as an independent delimiter, allowing multiple delimiters to be specified. For example, a comma and a vertical bar may both be specified, allowing either to act as a delimiter; or perhaps "\t ", a tab and a space. The asstring option tells import delimited that the characters should be handled as a single string or sequence.
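            The point about consecutive delimiters being read as missings can be illustrated outside Stata. A Python analogy (not Stata's implementation), using a "||"-delimited fragment like Mike's:

```python
import re

row = "0.00||||3"  # two "||" delimiters written as runs of "|"

# Default single-character "|" delimiter: every "|" splits,
# so consecutive delimiters produce empty (missing) fields.
print(row.split("|"))         # ['0.00', '', '', '', '3']

# collapse: a run of "|" characters acts as one delimiter.
print(re.split(r"\|+", row))  # ['0.00', '3']

# asstring with "||": the two-character sequence is one delimiter.
print(row.split("||"))        # ['0.00', '', '3']
```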

            Regarding the problem you are having importing your data, my advice is to make sure your Stata is up to date. If that does not help, put together a data file that reproduces the problem and send it to [email protected], and we will see if we can find out what is going on.



            • #7
              Thanks for all the advice. I now think the problem may come from the presence of some double-quotes and single-quotes within text strings. (The problem field is a legal description of a property, and has degrees and minutes in it sometimes.) I hope to have an intelligent followup on this tomorrow. - Mike



              • #8
                I have now solved this one, and in the process discovered a difficulty with -import delimited- and double quotes ("), namely that I don't see any way to make it treat a double quote as simply a character without special meaning.

                The problem here was that one of the fields contained double quotes ("), used in this case to indicate seconds of arc in a text description of a property location. This confused -import delimited-: when it encountered a double quote in a field, it ignored the delimiters that followed. Here's an example, using comma-delimited format:

                Code:
                // Create a text file with quotes within a field
                input str50 s
                `"The property is,  30' 60" E of here, 1, 2, 3"'
                end
                tempfile temp
                list
                outfile using `temp'
                clear
                import delimited using `temp', delimiters(",") varnames(nonames) stringcols(_all)
                list
                We see that v1, rather than containing just "The property is", continues beyond the first ",", treating what follows as though it were ordinary text. There is a -stripquote- option on -import delimited-, but removing quotes would not be desirable here, given their meaning as units of measurement.


                My solution was to filter the input text file, substitute something else for the ", and then use -import delimited-.
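                For comparison, the same quote-binding trade-off exists outside Stata. In Python's csv module, default quote handling binds fields (shielding embedded delimiters), while QUOTE_NONE treats quotes as ordinary characters, which is the behavior wanted here (a sketch for illustration; the sample line is hypothetical):

```python
import csv
import io

line = '"30 deg, 15 min",1,2,3\n'

# Default: the quotes bind the first field, shielding its comma.
print(next(csv.reader(io.StringIO(line))))
# ['30 deg, 15 min', '1', '2', '3']

# QUOTE_NONE: quotes are ordinary characters; every comma splits.
print(next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE)))
# ['"30 deg', ' 15 min"', '1', '2', '3']
```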



                • #9
                  I ran your example, adding the following after the outfile command.
                  Code:
                  . type "`temp'"
                  "The property is,  30' 60" E of here, 1, 2, 3"
                  So we see that your sample data is wrapped in quotation marks, a signal that it is actually a single string. Stata parsed this as follows.
                  Code:
                  . list
                  
                       +-----------------------------------------------------+
                       |                                  v1   v2   v3    v4 |
                       |-----------------------------------------------------|
                    1. | The property is,  30' 60" E of here    1    2    3" |
                       +-----------------------------------------------------+
                  Before I learned about the bindquotes option to the import delimited command, I had the following expectations. In the absence of the enclosing pair of quotation marks, I expected Stata to do as you expected, and break the input at each of the 4 commas. With the enclosing pair of quotation marks, but without the interior "minute symbol" quotation mark, I would expect Stata to treat the entire line as a single string. Given the three quotation marks in your data, Stata used the first two quotation marks to shield the intervening comma, and then (without removing the second quotation mark) continued on until it encountered a comma, after which it parsed as expected.

                  If we add the noquote option to the outfile command, and bindquotes(nobind) to the import command, we get what you hoped for.
                  Code:
                  . outfile using `temp', noquote
                  
                  . type `temp'
                  The property is,  30' 60" E of here, 1, 2, 3
                  
                  . clear
                  
                  . import delimited using `temp', delimiters(",") bindquotes(nobind) ///
                  >        varnames(nonames) stringcols(_all)
                  (5 vars, 1 obs)
                  
                  . list
                  
                       +------------------------------------------------------+
                       |              v1                    v2   v3   v4   v5 |
                       |------------------------------------------------------|
                    1. | The property is     30' 60" E of here    1    2    3 |
                       +------------------------------------------------------+
                  From this and my previous comment on delimiters("||", asstring), I take away the lesson that using import delimited benefits from very close reading of the documentation and a close examination of the input data. Certainly if the data does not use quotation marks to shield appearances of a delimiter character (like a comma) within a string, bindquotes(nobind) seems advisable. This would often be the case when the input file, like yours, uses uncommon (or nonprintable) characters as the delimiter. But it all depends on the source of the delimited file. My experience is that Excel uses quotation marks only when necessary, for example, but I have had other programs that put quotation marks around every string, no matter how trivial.

                  Let us know if any of this leads to a more straightforward solution to the problem with your data.
                  Last edited by William Lisowski; 02 Sep 2015, 10:37.



                  • #10
                    If the double quotes are always data and are never used to bind fields, import delimited can handle that situation: use the bindquote(nobind) option.
