
  • Kelly Hellman
    started a topic Text File Import Ignoring Binary Zeros

    Text File Import Ignoring Binary Zeros

    Hi Statalist,

    I am importing a large text file into Stata using the following code:

    Code:
    import delimited "file.txt", delimiter("|") clear
    The file loads successfully but Stata returns the following message:
    Note: 19,649,776 binary zeros were ignored in the source file. The first instance occurred on line 1. Binary zeros are not valid in text data. Inspect your data carefully.

    Following the advice in a previous post (https://www.statalist.org/forums/for...-into-stata-14), I tried specifying the encoding with the following code (my file is encoded as code page 1252 (ANSI Latin I)):

    Code:
    import delimited "file.txt", delimiter("|") encoding(ISO-8859-1) clear
    Running this code still returns the above message regarding binary zeros. It does not appear that any data is being lost in the import; however, I would appreciate any insight on this issue, particularly whether this is something I should be concerned about and if there is a fix.

    Best,
    Kelly

  • Kelly Hellman
    replied
    Good to be sure! Thank you, Nils! I appreciate your help.

  • Nils Enevoldsen
    replied
    Code:
    (which in C I believe is '\0')
    (You are correct. 😊)

  • Kelly Hellman
    replied
    Thank you so much, Mike. Your code was very helpful. I apologize for the delayed response; it did take a while to run but it was very informative. To find the character positions identified in the Stata file, I opened my text file in Notepad++. The file in Notepad++ displays "NUL" in each position for which Stata indicates a binary zero exists. For my data, it appears that this is only an issue with two specific variables and that Stata is reading in any 'NUL' (which in C I believe is '\0') as a blank for that observation. I feel more confident that this issue is not resulting in any changes to the data when reading it into Stata. Thank you again for all your help.
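    One further check on whether anything was lost is to compare the number of lines in the raw file with the number of observations Stata imported. The lines below are only a sketch: they assume that hexdump with the analyze option leaves the line count in r(lnum) (worth confirming in help hexdump) and they reuse the file name and delimiter from the original post.

    ```stata
    * Sketch only: compare the file's line count with the imported observation count.
    * Assumes hexdump, analyze leaves the number of lines in r(lnum); see help hexdump.
    hexdump "file.txt", analyze
    local nlines = r(lnum)

    * Same import command as in the original post
    import delimited "file.txt", delimiter("|") clear
    display "Lines in file: `nlines'"
    display "Observations imported: " _N
    ```

    If the first line of the file holds variable names, the observation count should be one less than the line count. A larger gap would suggest that some lines were merged or dropped during the import.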

  • Mike Lacy
    replied
    I reminded myself how to use Stata's binary read function, and came up with the following for you. It does a binary read of your file, and creates a Stata data set in which each observation's value is the position of a binary 0 in your file. You can then browse this, and see where the 0s are occurring.

    Note that this clears out your Stata data and anything in Mata, so save anything you want to keep before you run this code.

    Code:
    clear
    mata mata clear
    file close _all
    // Read file and record the positions of the binary zeros in a Mata column vector
    local infile =  "c:/temp/serbian.jpg"    // change to fit your file
    file open myread using "`infile'", read binary
    local charcount = 0
    local zerocount = 0
    local done = 0
    while !`done' {
       file read myread %1bu c
       local done = r(eof)
       if !`done' {
          local ++charcount
          if (c == 0) {
             local ++zerocount
             if (`zerocount' == 1) {
                // first zero: start the vector at its actual position, not at 1
                mata: zeros = `charcount'
             }
             else {
                mata: zeros = zeros \ `charcount'
             }
          }
          // This will be slow, so echo progress every 100,000 characters to show the program is still running.
          if (mod(`charcount', 1e5) == 0) di "`charcount' char read"
       }
    }
    file close myread
    //  Get the results into a Stata data file so you can browse them
    getmata zeros = zeros
    quiet count
    di r(N) " binary zeros detected."
    label var zeros "Positions in file at which binary zeros occurred"
    describe
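
    With millions of positions, browsing the list directly may not be practical. As a rough follow-up sketch, the positions can be collapsed into runs of consecutive bytes, which shows whether the zeros sit together in blocks or are scattered one at a time. It assumes the dataset created above is still in memory with its single variable zeros. The names newrun, run_id, and run_length are made up for this illustration.

    ```stata
    * Sketch only: collapse the zero positions into runs of consecutive bytes.
    * Assumes the dataset built above is in memory with the single variable zeros.
    sort zeros
    * A run starts at the first position or wherever the gap to the previous zero exceeds 1
    generate byte newrun = (_n == 1) | (zeros - zeros[_n-1] > 1)
    generate long run_id = sum(newrun)
    * Length of each run, then keep one row per run (zeros is now the run's start position)
    bysort run_id (zeros): generate long run_length = _N
    bysort run_id (zeros): keep if _n == 1
    summarize run_length
    list zeros run_length if _n <= 10
    ```

    Running hexdump with from() and to() set around one of the listed start positions then shows which bytes surround that run.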

  • Kelly Hellman
    replied
    Thank you so much for all your help, Mike and Nils. It is comforting to have you both confirm that these binary zeros are likely not a major issue. I appreciate you working within my constraints and will try to look more into the hexdump results.

  • Nils Enevoldsen
    replied
    For what it's worth, based on what I see from the hexdump, I concur with Mike that the binary zeroes here are probably harmless.

  • Mike Lacy
    replied
    The hexdump summary suggests that your file is a relatively ordinary Windows-formatted text file, except with the presence of a bunch of presumably extraneous but harmless binary 0s. Knowing where the binary 0s occur in your file could substantiate this view. What I'd want to look for would be where the binary 0s occur, i.e., are they just packed together at the ends of lines or the end of file. Without an example file to play with, I'm not immediately coming up with an easy way to do this with Stata tools, but perhaps someone else will. Or, perhaps one of your colleagues with permission to view this file can teach you how to read a hex dump.
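
    Some quick arithmetic on the counts in the hexdump, analyze report posted in this thread gives a sense of scale. This is only a sketch using those reported numbers and says nothing definite about where the zeros sit.

    ```stata
    * Counts copied from the hexdump, analyze report posted in this thread
    * Average number of binary zeros per line
    display %9.4f 19649776/12801578
    * Binary zeros as a percentage of all bytes in the file
    display %9.4f 100*19649776/2097492943
    ```

    That works out to roughly 1.5 zeros per line and a little under 1% of the file. Because the average is not a whole number, the zeros cannot be a fixed amount of padding on every line, which would fit with them appearing only in some lines or fields.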

  • Kelly Hellman
    replied
    Thank you for your responses, Nils and Mike. Unfortunately, a non-disclosure agreement prohibits me from sharing my data. I also do not have any information on the preparation or origin of the file at the moment, but I have inquired. I was able to run the code you suggested, Mike, which produced the following output:


    Code:
     hexdump "file.txt", analyze
    
      Line-end characters                        Line length (tab=1)
        \r\n         (Windows)     12,801,578      minimum                      119
        \r by itself (Mac)                  0      maximum                      248
        \n by itself (Unix)                 0
      Space/separator characters                 Number of lines         12,801,578
        [blank]                    29,809,615      EOL at EOF?                  yes
        [tab]                               0
        [comma] (,)                 1,903,718    Length of first 5 lines
      Control characters                           Line 1                       193
        binary 0                   19,649,776      Line 2                       188
        CTL excl. \r, \n, \t                0      Line 3                       193
        DEL                                 0      Line 4                       147
        Extended (128-159,255)              0      Line 5                       183
      ASCII printable
        A-Z                       500,615,772
        a-z                                 0    File format                 BINARY
        0-9                       849,111,800
        Special (!@#$ etc.)       670,799,106
        Extended (160-254)                  0
                              ---------------
      Total                     2,097,492,943
    
      Observed were:
         \0 \n \r blank " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >
         ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \ |
    I am sorry that I cannot be more specific, but I hope the above information provides some further insight. I very much appreciate your help.

  • Mike Lacy
    replied
    I'd suggest a slight variation on Nils's suggestion: Use the -hexdump- command per below on a reasonable chunk of your file, say the first 2,000 bytes, and post the results here within CODE delimiters (the "#" on the toolbar.) Perhaps someone else might have an insight that permits solving your problem a priori, but I'd say that seeing the hexdump is the best bet for a solution. It also might help if you could describe whatever you know about the preparation/contents/origin of this purported text file (e.g., was it prepared with some word processor and if so which one, under what operating system).

    Code:
    hexdump "YourFilename", from(1) to(2000)
    The results of
    Code:
    hexdump "YourFilename", analyze
    would also be useful.

  • Nils Enevoldsen
    replied
    Is it possible to upload your file for inspection, or is it proprietary? If you delete most of your data (say all but the first observation), and modify whatever proprietary data remains, is it possible to upload that?
