  • Text File Import Ignoring Binary Zeros

    Hi Statalist,

    I am importing a large text file into Stata using the following code:

    Code:
    import delimited "file.txt", delimiter("|") clear
    The file loads successfully but Stata returns the following message:
    Note: 19,649,776 binary zeros were ignored in the source file. The first instance occurred on line 1. Binary zeros are not valid in text data. Inspect your data carefully.

    Following the advice in a previous post (https://www.statalist.org/forums/for...-into-stata-14), I tried to specify the encoding using the following code (my file is encoded as 1252 (ANSI Latin I)):

    Code:
    import delimited "file.txt", delimiter("|") encoding(ISO-8859-1) clear
    Running this code still returns the above message regarding binary zeros. It does not appear that any data is being lost in the import; however, I would appreciate any insight on this issue, particularly whether this is something I should be concerned about and if there is a fix.

    Best,
    Kelly

  • #2
    Is it possible to upload your file for inspection, or is it proprietary? If you delete most of your data (say all but the first observation), and modify whatever proprietary data remains, is it possible to upload that?


    • #3
      I'd suggest a slight variation on Nils's suggestion: Use the -hexdump- command per below on a reasonable chunk of your file, say the first 2,000 bytes, and post the results here within CODE delimiters (the "#" on the toolbar.) Perhaps someone else might have an insight that permits solving your problem a priori, but I'd say that seeing the hexdump is the best bet for a solution. It also might help if you could describe whatever you know about the preparation/contents/origin of this purported text file (e.g., was it prepared with some word processor and if so which one, under what operating system).

      Code:
      hexdump "YourFilename", from(1) to(2000)
      The results of
      Code:
      hexdump "YourFilename", analyze
      would also be useful.
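
A minimal sketch of how the counts that -hexdump- reports can also be picked up from its stored results, which is handy when the output itself cannot be posted. It assumes the results option (which stores r(c0) through r(c255), one count per byte value) is available in your Stata version. The demo file and its contents are made up for illustration and stand in for the real file.

```stata
// Build a tiny demo file (hypothetical contents): bytes A \0 | \0 B \r \n
tempfile demo
file open fh using "`demo'", write binary replace
foreach b in 65 0 124 0 66 13 10 {
    file write fh %1bu (`b')
}
file close fh

// analyze summarizes the file; results additionally stores r(c0)-r(c255)
hexdump "`demo'", analyze results
display "binary zeros: " r(c0)
display "file size:    " r(filesize)
display "file format:  `r(format)'"
```

On a real file, replace the tempfile with the actual filename; r(c0) should then match the "binary 0" row of the analyze table.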


      • #4
        Thank you for your responses, Nils and Mike. Unfortunately, a non-disclosure agreement prohibits me from sharing my data. I also do not have any information on the preparation or origin of the file at the moment, but I have inquired. I was able to run the code you suggested, Mike, which produced the following output:


        Code:
         hexdump "file.txt", analyze
        
          Line-end characters                        Line length (tab=1)
            \r\n         (Windows)     12,801,578      minimum                      119
            \r by itself (Mac)                  0      maximum                      248
            \n by itself (Unix)                 0
          Space/separator characters                 Number of lines         12,801,578
            [blank]                    29,809,615      EOL at EOF?                  yes
            [tab]                               0
            [comma] (,)                 1,903,718    Length of first 5 lines
          Control characters                           Line 1                       193
            binary 0                   19,649,776      Line 2                       188
            CTL excl. \r, \n, \t                0      Line 3                       193
            DEL                                 0      Line 4                       147
            Extended (128-159,255)              0      Line 5                       183
          ASCII printable
            A-Z                       500,615,772
            a-z                                 0    File format                 BINARY
            0-9                       849,111,800
            Special (!@#$ etc.)       670,799,106
            Extended (160-254)                  0
                                  ---------------
          Total                     2,097,492,943
        
          Observed were:
             \0 \n \r blank " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >
             ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \ |
        I am sorry that I cannot be more specific, but I hope the above information provides some further insight. I very much appreciate your help.


        • #5
          The hexdump summary suggests that your file is a relatively ordinary Windows-formatted text file, except for a large number of presumably extraneous but harmless binary 0s. Knowing where the binary 0s occur in your file could substantiate this view, in particular whether they are just packed together at the ends of lines or at the end of the file. Without an example file to play with, I'm not immediately coming up with an easy way to do this with Stata tools, but perhaps someone else will. Or perhaps one of your colleagues with permission to view this file can teach you how to read a hex dump.


          • #6
            For what it's worth, based on what I see from the hexdump, I concur with Mike that the binary zeroes here are probably harmless.


            • #7
              Thank you so much for all your help, Mike and Nils. It is comforting to have you both confirm that these binary zeros are likely not a major issue. I appreciate you working within my constraints and will try to look more into the hexdump results.


              • #8
                I reminded myself how to use Stata's binary read function, and came up with the following for you. It does a binary read of your file, and creates a Stata data set in which each observation's value is the position of a binary 0 in your file. You can then browse this, and see where the 0s are occurring.

                Note that this clears out your Stata data and anything in Mata, so don't have anything in Stata you want to keep before you run this code.

                Code:
                clear
                mata mata clear
                file close _all
                // Read the file and record the positions of the binary zeros in a Mata column vector
                local infile "c:/temp/serbian.jpg"    // change to fit your file
                file open myread using "`infile'", read binary
                local charcount = 0
                local zerocount = 0
                // Read one unsigned byte at a time into scalar c; r(eof) becomes 1 at end of file
                file read myread %1bu c
                while r(eof) == 0 {
                   local ++charcount
                   if (c == 0) {
                      local ++zerocount
                      if (`zerocount' == 1) {
                         // first zero: start the vector at its actual position
                         mata: zeros = J(1, 1, `charcount')
                      }
                      else {
                         mata: zeros = zeros \ `charcount'
                      }
                   }
                   // This will be slow, so here's a little echo every 100,000 characters so you know the program is still running.
                   if (mod(`charcount', 1e5) == 0) di "`charcount' characters read"
                   file read myread %1bu c
                }
                file close myread
                // Get the results into a Stata data file so you can browse them
                getmata zeros = zeros
                quiet count
                di r(N) " binary zeros detected."
                label var zeros "Positions in file at which binary zeros occurred"
                describe
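
Because the byte-by-byte loop above is slow on a file of about 2 GB, here is a hedged sketch of the same idea done in Mata, reading the file in chunks. The function name find_zeros(), the chunk size, and the demo file are all hypothetical; this is one possible variant and has not been tried on the original data. Like the code above, it clears both Stata and Mata.

```stata
clear
mata: mata clear
// Demo file (hypothetical contents): bytes A \0 | \0 B \r \n
tempfile demo
file open fh using "`demo'", write binary replace
foreach b in 65 0 124 0 66 13 10 {
    file write fh %1bu (`b')
}
file close fh

mata:
// Return the 1-based byte positions of every binary zero in fname,
// reading the file in pieces of at most chunk bytes.
real colvector find_zeros(string scalar fname, real scalar chunk)
{
    real scalar    fh, offset
    real colvector pos
    string matrix  buf      // matrix, because fread() returns J(0,0,"") at EOF

    fh     = fopen(fname, "r")
    pos    = J(0, 1, .)
    offset = 0
    while ((buf = fread(fh, chunk)) != J(0, 0, "")) {
        // ascii() gives the byte codes; selectindex() picks out the zero bytes
        pos    = pos \ (offset :+ selectindex(ascii(buf) :== 0)')
        offset = offset + strlen(buf)
    }
    fclose(fh)
    return(pos)
}
end

// A chunk of 4 bytes is only for the demo; something like 1e6 suits a real file
mata: zeros = find_zeros(st_local("demo"), 4)
getmata zeros = zeros
label var zeros "Positions in file at which binary zeros occurred"
list
```

For a real file, pass the actual filename instead of the tempfile. The growing position vector is still copied on every chunk, so with many millions of zeros this is likely faster than the loop above but not instantaneous.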


                • #9
                  Thank you so much, Mike. Your code was very helpful. I apologize for the delayed response; it did take a while to run but it was very informative. To find the character positions identified in the Stata file, I opened my text file in Notepad++. The file in Notepad++ displays "NUL" in each position for which Stata indicates a binary zero exists. For my data, it appears that this is only an issue with two specific variables and that Stata is reading in any 'NUL' (which in C I believe is '\0') as a blank for that observation. I feel more confident that this issue is not resulting in any changes to the data when reading it into Stata. Thank you again for all your help.
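
One further sanity check along the same lines, sketched under assumptions: compare the number of lines -hexdump- counts with the number of observations -import delimited- produces, then list the observations where the affected field came in blank. The demo file, its header row, and the variable names a and b are made up; the sketch assumes the real file also has a single header line and that NUL-only fields import as empty strings, as described above.

```stata
// Demo file (hypothetical contents), three CRLF lines: A|B   1|\0   2|X
tempfile demo
file open fh using "`demo'", write binary replace
foreach b in 65 124 66 13 10  49 124 0 13 10  50 124 88 13 10 {
    file write fh %1bu (`b')
}
file close fh

// Save the line count right away, before other commands can replace r()
hexdump "`demo'", analyze
local nlines = r(lnum)

import delimited using "`demo'", delimiter("|") varnames(1) clear
display "lines in file: `nlines'   observations: " _N
assert _N == `nlines' - 1      // one line is the header

// Which observations ended up with an empty second field?
list if missing(b)
```

If the assert fails on a real file, lines were merged or split during import, which would be a reason to look more closely at where the binary zeros sit.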


                  • #10
                    Code:
                    (which in C I believe is '\0')
                    (You are correct. 😊)


                    • #11
                      Good to be sure! Thank you, Nils! I appreciate your help.
