Text File Import Ignoring Binary Zeros

Kelly Hellman started a topic Text File Import Ignoring Binary Zeros

01 Nov 2018, 07:07
Hi Statalist,

I am importing a large text file into Stata using the following code:

Code:

import delimited "file.txt", delimiter("|") clear

The file loads successfully but Stata returns the following message:
Note: 19,649,776 binary zeros were ignored in the source file. The first instance occurred on line 1. Binary zeros are not valid in text data. Inspect your data carefully.

Following the advice in a previous post (https://www.statalist.org/forums/for...-into-stata-14), I tried specifying the encoding with the following code; my file is encoded as 1252 (ANSI Latin I):

Code:

import delimited "file.txt", delimiter("|") encoding(ISO-8859-1) clear

Running this code still returns the above message regarding binary zeros. It does not appear that any data is being lost in the import; however, I would appreciate any insight on this issue, particularly whether this is something I should be concerned about and if there is a fix.

Best,
Kelly
Kelly Hellman replied

06 Nov 2018, 12:02
Good to be sure! Thank you, Nils! I appreciate your help.
Nils Enevoldsen replied

06 Nov 2018, 07:32
Code:

(which in C I believe is '\0')

(You are correct. 😊)
Kelly Hellman replied

06 Nov 2018, 07:08
Thank you so much, Mike. Your code was very helpful. I apologize for the delayed response; it did take a while to run but it was very informative. To find the character positions identified in the Stata file, I opened my text file in Notepad++. The file in Notepad++ displays "NUL" in each position for which Stata indicates a binary zero exists. For my data, it appears that this is only an issue with two specific variables and that Stata is reading in any 'NUL' (which in C I believe is '\0') as a blank for that observation. I feel more confident that this issue is not resulting in any changes to the data when reading it into Stata. Thank you again for all your help.
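
A minimal sketch of how the blanks could be checked after the import. The variable names `var1` and `var2` are hypothetical stand-ins for the two affected variables, which are not named in this thread:

```stata
* Hypothetical check: count blank or missing values in the two variables
* where Notepad++ showed NUL. var1 and var2 are placeholder names.
import delimited "file.txt", delimiter("|") clear
foreach v of varlist var1 var2 {
    count if missing(`v')
}
```

If the counts are in line with the number of fields that hold only NUL characters in the text file, that would support the reading that those fields simply arrive as blanks.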

Mike Lacy replied

02 Nov 2018, 12:38

I reminded myself how to use Stata's binary read function, and came up with the following for you. It does a binary read of your file, and creates a Stata data set in which each observation's value is the position of a binary 0 in your file. You can then browse this, and see where the 0s are occurring.

Note that this clears out your Stata data and anything in Mata, so don't have anything in Stata you want to keep before you run this code.

Code:

clear
mata mata clear
file close _all
// Read the file byte by byte and record the positions of the binary zeros in a Mata column vector
local infile = "c:/temp/serbian.jpg"    // change to fit your file
file open myread using "`infile'", read binary
local charcount = 0
local zerocount = 0
local done = 0
while !`done' {
   file read myread %1bu c
   local done = r(eof)
   // Only count and test the byte if the read did not hit end of file
   if (!`done') {
      local ++charcount
      if (c == 0) {
         local ++zerocount
         if (`zerocount' == 1) {
            // first zero: start the vector at its actual position
            mata: zeros = J(1, 1, `charcount')
         }
         else {
            mata: zeros = zeros \ `charcount'
         }
      }
      // This will be slow, so here's a little echo every 100,000 characters so you know the program is still running.
      if (mod(`charcount', 1e5) == 0) di "`charcount' characters read."
   }
}
file close myread
// Get the results into a Stata data set so you can browse them
getmata zeros = zeros
quietly count
di r(N) " binary zeros detected."
label var zeros "Positions in file at which binary zeros occurred"
describe
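
A possible follow-up, offered as a sketch rather than a tested recipe: once the positions are in the variable `zeros`, the gaps between consecutive positions show whether the zeros sit in contiguous runs or are scattered. The names `gap` and `newrun` are made up for this illustration.

```stata
* Assumes the data set created above is in memory, with positions in -zeros-.
sort zeros
* Distance from the previous binary zero; a gap of 1 means the two are adjacent.
generate double gap = zeros - zeros[_n-1]
* Flag the first zero of each contiguous run (gap is missing in the first observation,
* and missing counts as greater than 1 in Stata, so the first run is flagged as well).
generate byte newrun = (_n == 1) | (gap > 1)
count if newrun
* Spacing between runs, to compare against the line lengths reported by -hexdump-.
summarize gap if gap > 1
```

A small number of long runs would point to padding in a few places; many short runs spaced roughly a line length apart would point to padded fields within each record.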


Kelly Hellman replied

02 Nov 2018, 11:55
Thank you so much for all your help, Mike and Nils. It is comforting to have you both confirm that these binary zeros are likely not a major issue. I appreciate you working within my constraints and will try to look more into the hexdump results.
Nils Enevoldsen replied

02 Nov 2018, 11:21
For what it's worth, based on what I see from the hexdump, I concur with Mike that the binary zeroes here are probably harmless.
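
If the note is still bothersome, one option might be to strip the zeros from a copy of the file before importing. This is only a sketch: the output name `file_nozeros.txt` is hypothetical, and the `\000d` pattern (decimal character code 0) and the empty `to("")` should be checked against `help filefilter` before relying on it.

```stata
* Sketch only: write a copy of the file with every binary zero removed, then import the copy.
* "file_nozeros.txt" is a hypothetical output name.
filefilter "file.txt" "file_nozeros.txt", from(\000d) to("") replace
import delimited "file_nozeros.txt", delimiter("|") encoding(ISO-8859-1) clear
```

Since import delimited already ignores the zeros, the imported data should come out the same either way; the copy just avoids the note and leaves the original file untouched.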
Mike Lacy replied

02 Nov 2018, 07:50
The hexdump summary suggests that your file is a relatively ordinary Windows-formatted text file, except for the presence of a bunch of presumably extraneous but harmless binary 0s. Knowing where the binary 0s occur in your file could substantiate this view. What I'd want to check is whether they are just packed together at the ends of lines or at the end of the file. Without an example file to play with, I'm not immediately coming up with an easy way to do this with Stata tools, but perhaps someone else will. Or perhaps one of your colleagues with permission to view this file can teach you how to read a hex dump.

Kelly Hellman replied

01 Nov 2018, 17:24

Thank you for your responses, Nils and Mike. Unfortunately, a non-disclosure agreement prohibits me from sharing my data, and I do not currently have any information on the preparation or origin of the file, though I have inquired. I was able to run the code you suggested, Mike, which produced the following output:

Code:

 hexdump "file.txt", analyze

  Line-end characters                        Line length (tab=1)
    \r\n         (Windows)     12,801,578      minimum                      119
    \r by itself (Mac)                  0      maximum                      248
    \n by itself (Unix)                 0
  Space/separator characters                 Number of lines         12,801,578
    [blank]                    29,809,615      EOL at EOF?                  yes
    [tab]                               0
    [comma] (,)                 1,903,718    Length of first 5 lines
  Control characters                           Line 1                       193
    binary 0                   19,649,776      Line 2                       188
    CTL excl. \r, \n, \t                0      Line 3                       193
    DEL                                 0      Line 4                       147
    Extended (128-159,255)              0      Line 5                       183
  ASCII printable
    A-Z                       500,615,772
    a-z                                 0    File format                 BINARY
    0-9                       849,111,800
    Special (!@#$ etc.)       670,799,106
    Extended (160-254)                  0
                          ---------------
  Total                     2,097,492,943

  Observed were:
     \0 \n \r blank " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >
     ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \ |

I am sorry that I cannot be more specific but I hope the above information is able to provide some further insight and very much appreciate your help.