import delimited and Unicode data

Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#1

21 Oct 2019, 17:23

While reading a UTF-8 tab-delimited file with

Code:

import delimited using Bankers-current.txt, encoding("utf-8") bindquote(strict)

I got the following message:

Unmatched quote exceeded 20 lines while processing row 5688;
there may be a problem with your data or perhaps you have a quoted string
with too many lines. You may specify maxquotedrows() to override the
default behavior.

There are supposedly some quoted fields with newlines, hence the bindquote option,
but removing it only changes the message to:

Note: Unmatched quote while processing row 5091; this can be due to a
formatting problem in the file or because a quoted data element spans
multiple lines..

Looking at the file, I can see that there are many unmatched double quotes, including
at line 5091, but always in the context of a non-ASCII string, not a multi-line field.

Here is the offending line in octal:

HTML Code:

0000000   R   U   7   1   3   8   5   3   8   6  \t 320 221 320 220 320
0000020 235 320 232   " 320 241 320 220 320 235 320 232 320 242

Is it possible that -import delimited- is thinking individual bytes in a double-byte character
are quote marks? Anyway, I thought to use the -unicode convertfile- command to address the issue,
but the command:

Code:

unicode convertfile Bankers-current.txt bankers.tsv ,dstencoding(latin1) srccallback(skip) replace srcencoding(utf8)

returns the message:

Code:

file "Bankers-current.txt" can not be converted to the same file
r(602);

which I don't understand, since I have given a destination file name (bankers.tsv). Suggestions
welcome.

Daniel Feenberg
James Hassell (StataCorp)

StataCorp Employee

Join Date: Apr 2015

Posts: 74
#2

22 Oct 2019, 12:40

import delimited now detects cases where quotes don't match because that is often an indication of a problem with the data. Daniel, you said that the file does not contain multi-line fields; however, if it does, and they span more than 20 lines, you can override the default limit using the maxquotedrows() option.
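One way to see where the quotes fail to match, outside Stata, is to list the physical lines that contain an odd number of double-quote characters. The following is only a rough sketch in Python; the function name lines_with_odd_quotes and the sample data are hypothetical, and physical line numbers need not agree with the row numbers that import delimited reports when quoted fields span lines.

```python
import os
import tempfile


def lines_with_odd_quotes(path, encoding="utf-8"):
    """Return 1-based numbers of physical lines with an odd count of '"'.

    Hypothetical helper for illustration; it only counts characters and
    does not attempt to parse delimited fields.
    """
    odd = []
    # newline="" keeps the original line endings untranslated.
    with open(path, encoding=encoding, newline="") as fh:
        for lineno, line in enumerate(fh, start=1):
            if line.count('"') % 2 == 1:
                odd.append(lineno)
    return odd


# Made-up sample: line 2 has a single literal quote, line 3 a matched pair.
sample = 'id\tname\n1\tБАНК"САНКТ\n2\t"quoted name"\n'
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False, newline=""
) as tmp:
    tmp.write(sample)

demo_result = lines_with_odd_quotes(tmp.name)
os.remove(tmp.name)
print(demo_result)  # [2]
```

Running the same function on the real file (for example, lines_with_odd_quotes("Bankers-current.txt")) would show whether the odd-quote lines are isolated single lines or the start of genuine multi-line fields.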

Is it possible that -import delimited- is thinking individual bytes in a double-byte character
are quote marks?

That is unlikely since import delimited reads the file based on the encoding before parsing the file. Perhaps you can make the file available so we can have a look.
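The octal dump in post #1 can be checked independently of Stata. The sketch below, in Python, assumes the bytes are exactly as shown in that dump (offset columns dropped) and decodes them as UTF-8. In UTF-8 every byte of a multi-byte character is 0x80 or higher, so none of them can equal the ASCII double quote (0x22); the names DUMP and token_to_byte are illustrative only.

```python
# Tokens copied from the dump in post #1, without the offset columns.
DUMP = (
    r'R U 7 1 3 8 5 3 8 6 \t 320 221 320 220 320 '
    r'235 320 232 " 320 241 320 220 320 235 320 232 320 242'
)


def token_to_byte(tok):
    """Convert one token of the dump to its byte value."""
    if tok == r"\t":
        return 0x09
    if len(tok) == 3 and all(c in "01234567" for c in tok):
        return int(tok, 8)  # three-digit octal value
    return ord(tok)  # printable ASCII character shown literally


raw = bytes(token_to_byte(t) for t in DUMP.split())
text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8

# Byte offsets holding an actual ASCII double quote.
quote_offsets = [i for i, b in enumerate(raw) if b == 0x22]

# Every byte belonging to a multi-byte character must be >= 0x80.
multibyte = [
    b for ch in text if len(ch.encode("utf-8")) > 1 for b in ch.encode("utf-8")
]
assert all(b >= 0x80 for b in multibyte)

print(text)
print(quote_offsets)  # [19]
```

Under that assumption the bytes are valid UTF-8 and decode to the ID, a tab, and Cyrillic text with one double quote in the middle. The quote is a separate 0x22 byte sitting between two complete two-byte characters, which suggests it is a literal quote character in the vendor's data rather than part of a multi-byte character.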
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#3

23 Oct 2019, 08:39

If you think -import delimited- is handling UTF-8 correctly, and not thinking that one of the bytes of a double-byte character is a double quote mark, I will go back to the database vendor, ask them to investigate, and post what I learn.