  • problems importing text files into Stata 14

    Hi Statalist,
    I am having problems importing a certain kind of txt file into Stata 14.
    I have a few tab-delimited files in txt format. The variable names are in the first row of each file and contain embedded spaces.

    I have already tried to import the txt files using the following Stata command:

    import delimited "path/to/my/file/filename.txt", case(preserve)
    When I run this command, Stata gives me the following message:
    "Note: 792.850.604 binary zeros were ignored in the source file. The first instance occurred on line 1. Binary zeros are not valid in text data. Inspect your data carefully"

    What does that mean? How can I solve this problem?
    Thanks a lot in advance for your help.


  • #2
    It means that your "text" file is not actually a text file. You need to find out how the file was produced and what format it really is. If it is one that Stata can read, you can then load it with some other command. If it is a format that Stata cannot read, you will need to get it translated either with a user-written Stata command for that format, or via a utility like Stat/Transfer, or by using software for which the file is a native format and getting it translated into another format that Stata can read.

    Another possibility is that the file is simply corrupted and is not a valid instance of any file format. In that case, you need to go back to the source and get a clean copy.


    • #3
      you can use the hexdump command to investigate the file - read the help file carefully as almost certainly you want one of the summaries and not a full dump

      note that under certain conditions, you might want to follow up with the filefilter command - each of these two commands is very fast in my experience and can "fix" many oddities in one's data
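
      A minimal sketch of that first look, assuming the file is called filename.txt (a placeholder) and sits in the current working directory. The analyze option prints a summary rather than every byte. The from() and to() options are my reading of help hexdump for limiting the dump to a byte range, so check them against the help file for your version:

      ```stata
      * Summary only: line-end characters, control characters, binary zeros, etc.
      hexdump "filename.txt", analyze

      * Dump just the first 32 bytes to see how the file starts
      * (from()/to() as I read help hexdump; verify before relying on them)
      hexdump "filename.txt", from(0) to(31)
      ```

      If the first two bytes are ff fe or fe ff, that is a UTF-16 byte-order mark. In such a file every plain ASCII character is stored together with a 00 byte, which would account for a very large count of binary zeros.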
      Last edited by Rich Goldstein; 23 Mar 2016, 10:50.


      • #4
        Binary zeros are bytes whose value is 0, that is, every bit is 0. In languages like C, and in the .dta file specification, a binary zero typically marks the end of a string. So if you have a string variable whose longest value takes 200 bytes, every value has 200 bytes of space allocated to it, but the parser stops reading a value as soon as it encounters a binary zero. Do you know how the strings are encoded in the file? You may need to specify the string encoding for Stata to handle the data appropriately, but in either case that looks like an absurdly large number of binary zeros to encounter in a file. What's the size of the file on disk?


        • #5
          Hi Chiara,

          I happen to have a similar problem. How did you finally solve it?




          • #6
            Thank you to everybody for your suggestions.

            Laura, the problem was that the tab-delimited text files were encoded in UTF-16.
            Thus I used the following command: import delimited "filename.txt", case(preserve) encoding("utf-16")
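
            With several such files, the same import can be looped over all of them. A sketch, assuming every .txt file in the current working directory is a UTF-16, tab-delimited file of this kind, and that saving each one under the same name with a .dta extension is acceptable (both are assumptions, not something stated in the thread):

            ```stata
            * Collect the names of all .txt files in the current directory
            local files : dir "." files "*.txt"

            foreach f of local files {
                * Same options as the single-file command, plus clear to drop the previous data
                import delimited "`f'", case(preserve) encoding("utf-16") clear

                * Hypothetical naming scheme: data.txt -> data.dta
                local stub = subinstr("`f'", ".txt", "", .)
                save "`stub'.dta", replace
            }
            ```

            Running describe after the first import is a quick way to confirm that the variable names from the first row came through as expected before converting the rest.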