  • binary zeros were ignored in the source file

    Hi.
    I am importing some voter registration databases that have >8 million records. The text files are tab delimited. Column names in the first row. If it matters, the support file says "encoding: UTF-16 LE."

    When importing in Stata/MP 15.1, I use a command like this: import delimited "...`filedate'.txt", stringcols(_all) clear

    For some files, I get this statement from Stata: "Note: 1,171,366,858 binary zeros were ignored in the source file. The first instance occurred on line 1. Binary zeros are not valid in text data. Inspect your data carefully."

    The files seem to import fine, but I'm wondering what this warning means. Does it have to do with the encoding, or with the -stringcols(_all)- option? Should I specify the UTF type?

    Thanks for any advice.

  • #2
    I think this is a symptom of importing with an incorrect encoding: Stata seems to have defaulted to assuming UTF-8 when the data are encoded as UTF-16LE, which explains the enormous number of binary zero bytes reported during import. Conveniently, the support file tells you how the file was encoded, so you can amend your -import delimited- command to add the option -encoding(utf-16le)-.
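
    As a sketch, the amended command from #1 would then read as follows (keeping the poster's elided path and `filedate' macro exactly as shown there):

    Code:
    import delimited "...`filedate'.txt", stringcols(_all) encoding(utf-16le) clear
    For scale: in UTF-16LE every plain ASCII character is stored as its ASCII byte plus one zero byte, so roughly 1.17 billion ignored zeros is consistent with a file holding on the order of a billion ASCII characters of text.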

    • #3
      Just for fun, here's an illustration where I make a UTF-16LE encoded file containing exactly the letters a, b, and c, each on its own line, simulating 1 variable with 3 values. The first 2 bytes are a byte-order mark (BOM) that helps Stata identify the file's encoding; it isn't critical, but it helps Stata guess correctly when it has to work out the encoding on its own. As for zero bytes, exactly 9 are present in my test file.

      Code:
      clear *
      
      * hex bytes of the file: UTF-16LE BOM (fffe), then "a", "b", "c",
      * each followed by a CRLF line ending, as 2-byte little-endian units
      local contents_utf16le fffe 6100 0d00 0a00 6200 0d00 0a00 6300 0d00 0a00
      local compressed = subinstr("`contents_utf16le'", " ", "", .)
      local n_2b = length("`compressed'") / 2
      
      tempname fh
      tempfile myfile
      
      file open `fh' using `myfile', write binary replace
      
      * write the file one byte at a time, converting each hex pair to decimal
      forval i = 1 / `n_2b' {
        local abyte = substr("`compressed'", 2*(`i'-1)+1, 2)
        qui frombase 16 `abyte'
        file write `fh' %1bu (`r(base10)')
      }
      file close `fh'
      
      hexdump `myfile', to(20)
      
      import delimited `myfile', clear encoding(utf-8)
      list
      import delimited `myfile', clear encoding(utf-16le)
      list
      In both cases, the imported dataset is the same, but I get a warning when using the wrong encoding.

      Code:
      . hexdump `myfile', to(20)
                       |                                         |    character
                       |           hex representation            |  representation
               address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
      -----------------+-----------------------------------------+-----------------
               0 | fffe 6100 0d00 0a00 6200 0d00 0a00 6300 | ..a.....b.....c.
                    10 | 0d00 0a00                               | ....             
      
      . import delimited `myfile', clear encoding(utf-8)
      Note:  9 binary zeros were ignored in the source file.  The first instance occurred on line 1.  Binary zeros are not valid in text data.
             Inspect your data carefully.
      (1 var, 3 obs)
      
      . list
           +----+
           | v1 |
           |----|
        1. |  a |
        2. |  b |
        3. |  c |
           +----+
      
      . import delimited `myfile', clear encoding(utf-16le)
      (1 var, 3 obs)
      
      . list
      
           +----+
           | v1 |
           |----|
        1. |  a |
        2. |  b |
        3. |  c |
           +----+
      I believe that in your case you won't lose any information by using UTF-8 when UTF-16 is the real file encoding, because Stata disregards the zero bytes and everything left over is plain ASCII. That only holds while the data contain nothing beyond ASCII characters, though. The reverse is not true, and it's always safer to specify the correct encoding when it is known.
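
      To see where the ASCII caveat bites, here's a sketch extending the same file-building trick (my own addition, not from the thread; the tempfile name is illustrative): the letter é is U+00E9, stored as bytes e900 in UTF-16LE. Read as UTF-8, the lone e9 byte starts an invalid multi-byte sequence, so unlike a, b, and c it cannot survive the wrong decoding.

      Code:
      clear *
      
      * BOM, then é (e900), è (e800), ü (fc00), each followed by CRLF
      local contents_utf16le fffe e900 0d00 0a00 e800 0d00 0a00 fc00 0d00 0a00
      local compressed = subinstr("`contents_utf16le'", " ", "", .)
      local n_2b = length("`compressed'") / 2
      
      tempname fh
      tempfile accfile
      
      file open `fh' using `accfile', write binary replace
      forval i = 1 / `n_2b' {
        local abyte = substr("`compressed'", 2*(`i'-1)+1, 2)
        qui frombase 16 `abyte'
        file write `fh' %1bu (`r(base10)')
      }
      file close `fh'
      
      * the correct encoding recovers the accented letters;
      * repeating the import with encoding(utf-8) would not
      import delimited `accfile', clear encoding(utf-16le) varnames(nonames)
      list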

      • #4
        Originally posted by Leonardo Guizzetti
        I think this is a symptom of importing with an incorrect encoding: Stata seems to have defaulted to assuming UTF-8 when the data are encoded as UTF-16LE, which explains the enormous number of binary zero bytes reported during import. Conveniently, the support file tells you how the file was encoded, so you can amend your -import delimited- command to add the option -encoding(utf-16le)-.
        Thank you! Very helpful.
