Text File Import Ignoring Binary Zeros

Kelly Hellman

Join Date: Oct 2018

Posts: 5
#1

01 Nov 2018, 07:07

Hi Statalist,

I am importing a large text file into Stata using the following code:

Code:

import delimited "file.txt", delimiter("|") clear

The file loads successfully but Stata returns the following message:
Note: 19,649,776 binary zeros were ignored in the source file. The first instance occurred on line 1. Binary zeros are not valid in text data. Inspect your data carefully.

Following the advice in a previous post (https://www.statalist.org/forums/for...-into-stata-14), I tried specifying the encoding with the following code (my file is encoded as 1252, ANSI Latin I):

Code:

import delimited "file.txt", delimiter("|") encoding(ISO-8859-1) clear

Running this code still returns the above message regarding binary zeros. It does not appear that any data is being lost in the import; however, I would appreciate any insight on this issue, particularly whether this is something I should be concerned about and if there is a fix.

Best,
Kelly
Nils Enevoldsen

Join Date: Oct 2014

Posts: 296
#2

01 Nov 2018, 09:03

Is it possible to upload your file for inspection, or is it proprietary? If you delete most of your data (say all but the first observation), and modify whatever proprietary data remains, is it possible to upload that?
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

01 Nov 2018, 10:11

I'd suggest a slight variation on Nils's suggestion: run the -hexdump- command shown below on a reasonable chunk of your file, say the first 2,000 bytes, and post the results here within CODE delimiters (the "#" on the toolbar). Perhaps someone else has an insight that would solve your problem a priori, but I'd say that seeing the hexdump is the best bet for a solution. It would also help if you could describe whatever you know about the preparation, contents, and origin of this purported text file (e.g., was it prepared with a word processor and, if so, which one and under what operating system).

Code:

hexdump "YourFilename", from(1) to(2000)

The results of

Code:

hexdump "YourFilename", analyze

would also be useful.

Kelly Hellman

Join Date: Oct 2018

Posts: 5
#4

01 Nov 2018, 17:24

Thank you for your responses, Nils and Mike. Unfortunately, I have a non-disclosure agreement that prohibits me from sharing my data. I also do not have any information on the preparation or origin of the file at the moment, but I have inquired. I was able to run the code you suggested, Mike, which produced the following output:

Code:

 hexdump "file.txt", analyze

  Line-end characters                        Line length (tab=1)
    \r\n         (Windows)     12,801,578      minimum                      119
    \r by itself (Mac)                  0      maximum                      248
    \n by itself (Unix)                 0
  Space/separator characters                 Number of lines         12,801,578
    [blank]                    29,809,615      EOL at EOF?                  yes
    [tab]                               0
    [comma] (,)                 1,903,718    Length of first 5 lines
  Control characters                           Line 1                       193
    binary 0                   19,649,776      Line 2                       188
    CTL excl. \r, \n, \t                0      Line 3                       193
    DEL                                 0      Line 4                       147
    Extended (128-159,255)              0      Line 5                       183
  ASCII printable
    A-Z                       500,615,772
    a-z                                 0    File format                 BINARY
    0-9                       849,111,800
    Special (!@#$ etc.)       670,799,106
    Extended (160-254)                  0
                          ---------------
  Total                     2,097,492,943

  Observed were:
     \0 \n \r blank " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >
     ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \ |

I am sorry that I cannot be more specific but I hope the above information is able to provide some further insight and very much appreciate your help.

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#5

02 Nov 2018, 07:50

The hexdump summary suggests that your file is a relatively ordinary Windows-formatted text file, apart from a large number of presumably extraneous but harmless binary 0s. Knowing where the binary 0s occur in your file could substantiate this view: for example, are they packed together at the ends of lines, or at the end of the file? Without an example file to play with, I'm not immediately coming up with an easy way to do this with Stata tools, but perhaps someone else will. Or perhaps one of your colleagues with permission to view this file can teach you how to read a hex dump.
Nils Enevoldsen

Join Date: Oct 2014

Posts: 296
#6

02 Nov 2018, 11:21

For what it's worth, based on what I see from the hexdump, I concur with Mike that the binary zeroes here are probably harmless.
Kelly Hellman

Join Date: Oct 2018

Posts: 5
#7

02 Nov 2018, 11:55

Thank you so much for all your help, Mike and Nils. It is comforting to have you both confirm that these binary zeros are likely not a major issue. I appreciate you working within my constraints and will try to look more into the hexdump results.

Mike Lacy

Join Date: Apr 2014

Posts: 2416
#8

02 Nov 2018, 12:38

I reminded myself how to use Stata's binary read function, and came up with the following for you. It does a binary read of your file, and creates a Stata data set in which each observation's value is the position of a binary 0 in your file. You can then browse this, and see where the 0s are occurring.

Note that this clears out your Stata data and anything in Mata, so don't have anything in Stata you want to keep before you run this code.

Code:

clear
mata mata clear
file close _all
// Read file and record the positions of the binary zeros in a Mata column vector
local infile =  "c:/temp/serbian.jpg"    // change to fit your file
file open myread using "`infile'", read binary
local charcount = 0
local zerocount = 0
local done = 0
while !`done' {
   file read myread %1bu c
   local ++charcount
   local done = r(eof)
   if (c == 0)    {
      local ++zerocount
      if (`zerocount' == 1) {
         mata: zeros = J(1,1,`charcount') // first time: store the position of the first zero
      }
      else {
         mata: zeros = zeros \ `charcount'
      }
   }
   // This will be slow, so here's a little echo every 100,000 characters so you know the program is still running.
   if (mod(`charcount',1e5) == 0) di "`charcount' char read"
}
file close myread
//  Get the results into a Stata data file so you can browse them
getmata zeros = zeros
quiet count
di r(N) " binary zeros detected."
label var zeros "Positions in file at which binary zeros occurred"
describe
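
Because the loop above reads one byte at a time, it can take a long time on a file of this size. If a look at just the start of the file is enough, something along the following lines might be quicker. This is only a rough sketch, not tested on the file in question: the function name nul_positions, its nbytes argument, and the 1,000,000-byte limit are made up for illustration, and "file.txt" stands in for the real filename. It reads the first nbytes bytes in one go with Mata's fread() and returns the positions (counting the first byte as 1, as in the loop above) at which the byte value is 0.

Code:

mata:
// Hypothetical helper: positions of binary zeros among the first nbytes bytes of fname
real colvector nul_positions(string scalar fname, real scalar nbytes)
{
    real scalar   fh
    string matrix buf      // declared matrix because fread() returns J(0,0,"") at end of file

    fh  = fopen(fname, "r")
    buf = fread(fh, nbytes)
    fclose(fh)

    // Empty file: nothing to report
    if (buf == J(0, 0, "")) return(J(0, 1, .))

    // ascii() gives one byte code per character; keep the indices where the code is 0
    return(selectindex(ascii(buf) :== 0)')
}
end

clear
mata: zeros = nul_positions("file.txt", 1000000)
getmata zeros = zeros
label var zeros "Positions in file at which binary zeros occurred"
describe

As with the code above, this clears the data in memory. If there are no binary zeros within the bytes read, the returned vector is empty and -getmata- will have nothing to create.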


Kelly Hellman

Join Date: Oct 2018

Posts: 5
#9

06 Nov 2018, 07:08

Thank you so much, Mike. Your code was very helpful. I apologize for the delayed response; it did take a while to run but it was very informative. To find the character positions identified in the Stata file, I opened my text file in Notepad++. The file in Notepad++ displays "NUL" in each position for which Stata indicates a binary zero exists. For my data, it appears that this is only an issue with two specific variables and that Stata is reading in any 'NUL' (which in C I believe is '\0') as a blank for that observation. I feel more confident that this issue is not resulting in any changes to the data when reading it into Stata. Thank you again for all your help.
Nils Enevoldsen

Join Date: Oct 2014

Posts: 296
#10

06 Nov 2018, 07:32

Code:

(which in C I believe is '\0')

(You are correct. 😊)
Kelly Hellman

Join Date: Oct 2018

Posts: 5
#11

06 Nov 2018, 12:02

Good to be sure! Thank you, Nils! I appreciate your help.