  • import delimited and Unicode data

    While reading a UTF-8 tab-delimited file with

    Code:
    import delimited using Bankers-current.txt, encoding("utf-8") bindquote(strict)
    I got the following message:

    Unmatched quote exceeded 20 lines while processing row 5688;
    there may be a problem with your data or perhaps you have a quoted string
    with too many lines. You may specify maxquotedrows() to override the
    default behavior.
    There are supposedly some quoted fields with newlines, hence the bindquote option,
    but removing it only changes the message to:

    Note: Unmatched quote while processing row 5091; this can be due to a
    formatting problem in the file or because a quoted data element spans
    multiple lines..

    Looking at the file, I can see that there are many unmatched double quotes, including
    at line 5091, but always in the context of a non-ASCII string, not a multi-line field.

    Here is the offending line in octal:

    HTML Code:
    <xmp>
    0000000   R   U   7   1   3   8   5   3   8   6  \t 320 221 320 220 320
    0000020 235 320 232       " 320 241 320 220 320 235 320 232 320 242 
    </xmp>
    Is it possible that -import delimited- is thinking individual bytes in a double-byte character
    are quote marks? In any case, I tried the -unicode convertfile- command to work around the issue,
    but the command:

    Code:
    unicode convertfile Bankers-current.txt bankers.tsv ,dstencoding(latin1) srccallback(skip) replace srcencoding(utf8)
    returns the message:

    Code:
    file "Bankers-current.txt" can not be converted to the same file
    r(602);
    which I don't understand, since I have given a destination file name (bankers.tsv). Suggestions
    welcome.

    Daniel Feenberg

  • #2
    import delimited now detects cases where quotes don't match, because that is often an indication of a problem with the data. Daniel, you said that the file does not contain multi-line fields; however, if it does, and they span more than 20 lines, you can override the default limit using the maxquotedrows() option.

    Is it possible that -import delimited- is thinking individual bytes in a double-byte character
    are quote marks?
    That is unlikely, since import delimited decodes the file using the specified encoding before parsing it. Perhaps you can make the file available so we can have a look.
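
    The byte-level question can also be checked outside Stata. Below is a minimal Python sketch, assuming the bytes after the tab in the octal dump above are transcribed correctly (the blank and the quote that the dump prints as literal characters are written here as octal 040 and 042). It decodes the bytes as UTF-8 and looks for byte 0x22. In UTF-8, every byte of a multi-byte character is 0x80 or higher, so a 0x22 byte can only be a real double quote.

```python
# Bytes transcribed from the octal dump in post #1, starting after the tab.
# Assumption: the space and the quote, printed as literal characters in the
# dump, are octal 040 and 042 respectively.
octal_dump = (
    "320 221 320 220 320 235 320 232 040 042 "
    "320 241 320 220 320 235 320 232 320 242"
)
raw = bytes(int(token, 8) for token in octal_dump.split())

# Decode as UTF-8; this raises UnicodeDecodeError if the bytes are malformed.
text = raw.decode("utf-8")
print(text)

# Offsets of every byte equal to 0x22 (the ASCII double quote).
quote_positions = [i for i, b in enumerate(raw) if b == 0x22]

# Bytes that belong to multi-byte characters; all are >= 0x80 by construction,
# so none of them can be mistaken for 0x22.
multibyte_bytes = [b for b in raw if b >= 0x80]
print(quote_positions, min(multibyte_bytes) >= 0x80)
```

    For these bytes the decoded text is БАНК "САНКТ, and the only 0x22 byte is at offset 9, right after a space. That suggests the quote is a literal character in the data rather than a fragment of a Cyrillic letter.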


    • #3
      If you think -import delimited- is handling UTF-8 correctly, and is not treating one of the bytes of a double-byte character as a double quote mark, I will go back to the database vendor, ask them to investigate, and post what I learn.
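
      To give the vendor concrete rows to look at, something like the following Python sketch could list the physical lines that contain an odd number of double quotes. The function name unbalanced_quote_lines is hypothetical, and the check is only a heuristic: a quoted field that genuinely spans several lines would also be reported, and physical line numbers need not match the row numbers in Stata's messages.

```python
from pathlib import Path


def unbalanced_quote_lines(path, encoding="utf-8"):
    """Return (line_number, text) for each line with an odd number of double quotes."""
    hits = []
    with open(path, encoding=encoding) as fh:
        for lineno, line in enumerate(fh, start=1):
            # An odd count means at least one quote on this line has no partner.
            if line.count('"') % 2 == 1:
                hits.append((lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # File name taken from the thread; nothing happens if it is not present.
    path = Path("Bankers-current.txt")
    if path.exists():
        for lineno, text in unbalanced_quote_lines(path)[:20]:
            print(lineno, text)
```

      Reading the file with an explicit encoding means the quotes are counted on decoded characters, not on raw bytes.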
