  • binary zeros were ignored in the source file

    Hi.
    I am importing some voter registration databases that have >8 million records. The text files are tab delimited. Column names in the first row. If it matters, the support file says "encoding: UTF-16 LE."

    When importing in Stata/MP 15.1, I use a command like this: import delimited "...`filedate'.txt", stringcols(_all) clear

    For some files, I get this statement from Stata: "Note: 1,171,366,858 binary zeros were ignored in the source file. The first instance occurred on line 1. Binary zeros are not valid in text data. Inspect your data carefully."

    The files seem to import fine, but I'm wondering what this warning means. Does it have to do with the encoding, or with the -stringcols(_all)- option? Should I specify the UTF type?

    Thanks for any advice.

  • #2
    I think this is a symptom of importing with an incorrect encoding: Stata seems to have defaulted to assuming UTF-8 when the data are encoded as UTF-16LE, which explains the enormous number of binary zero bytes reported during import. Conveniently, the support file tells you how the file was encoded, so you can amend your -import delimited- command to add the option -encoding(utf-16le)-.
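
    As a sketch, the amended command from #1 would then read as follows (keeping the poster's elided path and `filedate' macro exactly as shown there):

    Code:
    import delimited "...`filedate'.txt", stringcols(_all) encoding(utf-16le) clear
    For scale: in UTF-16LE every plain ASCII character is stored as its ASCII byte plus one zero byte, so roughly 1.17 billion ignored zeros is consistent with a file holding on the order of a billion ASCII characters of text.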

    • #3
      Just for fun, here's an illustration where I make a UTF-16LE encoded file containing exactly the letters a, b, and c, each on its own line, simulating 1 variable with 3 values. The first 2 bytes are a byte-order mark (BOM) that helps Stata identify the file's encoding; it isn't critical, but it helps Stata guess correctly when it has to work out the encoding on its own. As for zero bytes, exactly 9 are present in my test file.

      Code:
      clear *
      
      * hex bytes of the file: UTF-16LE BOM (fffe), then "a", "b", "c",
      * each followed by a CRLF line ending, as 2-byte little-endian units
      local contents_utf16le fffe 6100 0d00 0a00 6200 0d00 0a00 6300 0d00 0a00
      local compressed = subinstr("`contents_utf16le'", " ", "", .)
      local n_2b = length("`compressed'") / 2
      
      tempname fh
      tempfile myfile
      
      file open `fh' using `myfile', write binary replace
      
      * write the file one byte at a time, converting each hex pair to decimal
      forval i = 1 / `n_2b' {
        local abyte = substr("`compressed'", 2*(`i'-1)+1, 2)
        qui frombase 16 `abyte'
        file write `fh' %1bu (`r(base10)')
      }
      file close `fh'
      
      hexdump `myfile', to(20)
      
      import delimited `myfile', clear encoding(utf-8)
      list
      import delimited `myfile', clear encoding(utf-16le)
      list
      In both cases, the imported dataset is the same, but I get a warning when using the wrong encoding.

      Code:
      . hexdump `myfile', to(20)
                       |                                         |    character
                       |           hex representation            |  representation
               address |  0 1  2 3  4 5  6 7  8 9  a b  c d  e f | 0123456789abcdef
      -----------------+-----------------------------------------+-----------------
               0 | fffe 6100 0d00 0a00 6200 0d00 0a00 6300 | ..a.....b.....c.
                    10 | 0d00 0a00                               | ....             
      
      . import delimited `myfile', clear encoding(utf-8)
      Note:  9 binary zeros were ignored in the source file.  The first instance occurred on line 1.  Binary zeros are not valid in text data.
             Inspect your data carefully.
      (1 var, 3 obs)
      
      . list
           +----+
           | v1 |
           |----|
        1. |  a |
        2. |  b |
        3. |  c |
           +----+
      
      . import delimited `myfile', clear encoding(utf-16le)
      (1 var, 3 obs)
      
      . list
      
           +----+
           | v1 |
           |----|
        1. |  a |
        2. |  b |
        3. |  c |
           +----+
      I believe that in your case you won't lose any information by using UTF-8 when UTF-16 is the real file encoding, because Stata disregards the zero bytes and everything left over is plain ASCII. That only holds while the data contain nothing beyond ASCII characters, though. The reverse is not true, and it's always safer to specify the correct encoding when it is known.
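
      To see where the ASCII caveat bites, here's a sketch extending the same file-building trick (my own addition, not from the thread; the tempfile name is illustrative): the letter é is U+00E9, stored as bytes e900 in UTF-16LE. Read as UTF-8, the lone e9 byte starts an invalid multi-byte sequence, so unlike a, b, and c it cannot survive the wrong decoding.

      Code:
      clear *
      
      * BOM, then é (e900), è (e800), ü (fc00), each followed by CRLF
      local contents_utf16le fffe e900 0d00 0a00 e800 0d00 0a00 fc00 0d00 0a00
      local compressed = subinstr("`contents_utf16le'", " ", "", .)
      local n_2b = length("`compressed'") / 2
      
      tempname fh
      tempfile accfile
      
      file open `fh' using `accfile', write binary replace
      forval i = 1 / `n_2b' {
        local abyte = substr("`compressed'", 2*(`i'-1)+1, 2)
        qui frombase 16 `abyte'
        file write `fh' %1bu (`r(base10)')
      }
      file close `fh'
      
      * the correct encoding recovers the accented letters;
      * repeating the import with encoding(utf-8) would not
      import delimited `accfile', clear encoding(utf-16le) varnames(nonames)
      list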

      • #4
        Originally posted by Leonardo Guizzetti
        I think this is a symptom of importing with an incorrect encoding: Stata seems to have defaulted to assuming UTF-8 when the data are encoded as UTF-16LE, which explains the enormous number of binary zero bytes reported during import. Conveniently, the support file tells you how the file was encoded, so you can amend your -import delimited- command to add the option -encoding(utf-16le)-.
        Thank you! Very helpful.
