
  • Large datasets from txt files won't load completely, tried the suggested solutions!

    Dear all,

    I'm writing my thesis and using Stata 12 SE to analyze news messages coded for level of sentiment. However, I'm having difficulty loading a whole dataset into Stata and can't figure out why. Here's what I've tried:

    The first dataset is the "2003" file. Using LTF viewer (similar to Notepad, but able to handle more lines), I've checked that the txt file contains 1,049,323 lines/observations (896 MB). However, when I load it into Stata (via Import > Text data created by spreadsheet), the data properties window shows only 527,920 observations, with Size 1,079.43M and Memory 1,216M. So that's only 50.3% of the original number of observations. Furthermore, the data isn't simply cut off halfway through the year.

    Secondly, I tried loading a different file, "2005". LTF viewer shows 1,439,408 lines; the file is 1.25 GB. When I load this file into Stata, I get 778,087 lines, i.e. 54.1% of the original.
    Data properties further show: Size 1,635.46M and Memory 1,856M.
    So similar, but not identical, results to the first dataset.

    Note: both files have 82 variables.

    I tried both files on the following three computers, all running Stata 12 SE (the 32-bit version on the first two, the 64-bit version on the third):
    1 University server computer: Windows 7 32-bit, 4 GB RAM (3.17 available)
    2 My own computer: Windows 7 32-bit, 2 GB RAM (1.87 available)
    3 Friend's computer: MacBook Pro dual-booted into Windows 7 64-bit, 4 GB RAM

    So to recap: I've tried every way I know to get my txt files loaded into Stata, but I only get about half of the observations, while the properties window in Stata shows a larger size than the original file (which in itself doesn't seem odd to me).

    Has anyone encountered similar problems? And does anybody know a solution?

    Many thanks,

    Maurits Munninghoff

  • #2
    This could be all sorts of things. One possibility is unusual characters being misinterpreted. I'd use hexdump, tabulate to get an idea of the strange characters.

    Another tactic is to look at your files with a really good text editor, focusing on the point where Stata stops. I don't know whether LTF viewer (I've never heard of it) can do that.



    • #3
      Thank you Nick Cox for your quick response!

      I got the following output from hexdump, tabulate, but I don't know what to do with it now. Any suggestions?

      Thank you

      Output stata:

      hexdump "G:\20050101-20051231.EQU\TRNA.20100624.1.20050101-20051231.EQU.txt", tabulate

      Line-end characters:
        \r\n (Windows)              1,439,407
        \r by itself (Mac)                  0
        \n by itself (Unix)                 0

      Space/separator characters:
        [blank]                    41,354,550
        [tab]                     116,591,967
        [comma] (,)                   324,314

      Control characters:
        binary 0                            0
        CTL excl. \r, \n, \t                0
        DEL                                 0
        Extended (128-159,255)          2,467

      ASCII printable:
        A-Z                       248,997,219
        a-z                       242,923,829
        0-9                       525,222,720
        Special (!@#$ etc.)       168,154,680
        Extended (160-254)              9,079
                                -------------
        Total                   1,346,459,639

      File summary:
        Line length (tab=1):      minimum 370, maximum 1,915
        Number of lines:          1,439,407
        EOL at EOF?               yes
        Length of first 5 lines:  823, 529, 526, 526, 530
        File format:              BINARY

      Observed were:
      \t \n \r blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < =
      > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a
      b c d e f g h i j k l m n o p q r s t u v w x y z | ~ 128 E^A E^B E^C
      E^D E^E E^F E^G E^H E^I E^M E^Q E^R E^S E^V E^X E^Y E^Z 155 156 157 159
      160 ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « * ® ° ± ² ³ ´ µ ¶ ¸ ¹ º » ¼ ½ ¿ Â Ã â ï

      ...
      Then follows a list of each character with its frequency.



      • #4
        It's difficult in your case, as most of those characters seem possible in a large mass of text. I'd send the output to Stata technical support, but I suspect their first suggestion would be the same as my other one: look at the file at precisely the point where Stata stopped reading it.



        • #5
          I'm sorry, the output above is too confusing; I'll post a screenshot of it instead.
          Attached Files



          • #6
            The issue may be down to a lack of sufficient memory. I usually use R for this kind of dirty work. You could try importing the file into R and then exporting it to Stata using the foreign package. There is a chance that Stata will handle a *.dta file better than this text file. And if you manage to import it into R, you can at least perform some basic operations on the variables, such as removing unwanted characters.
            Kind regards,
            Konrad
            Version: Stata/IC 13.1



            • #7
              Have you verified that the import works (and produces correct data) with smaller data sets?

              Perhaps try the first (or last) 1,000 lines from your file, or a section from the middle that straddles the point of the problem.



              • #8
                Thank you Konrad Zdeb and Brendan Halpin for your feedback.

                @ Brendan: I've now tried smaller data sets as well - the first 1,000, the middle 1,000, and the last 1,000 - and all observations loaded correctly.

                I did find out that in my 2003 sample, the data loaded into Stata jumps from the 26th of December to the 30th! When I checked the text file, I saw that there should be observations between those two dates. I then copied the observations that hadn't come through in Stata, made a new txt file from them, and loaded that into Stata: strangely enough, those observations did load correctly!

                @ Konrad: I downloaded the R software and tried to load the data (I got the command line from the internet; I'm not familiar with the program, though!).
                After a few minutes, it came back with the following message:

                > data1 <- read.delim(file.choose(), header=T)
                Warning message:
                In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
                EOF within quoted string



                So I'm still clueless... What does this warning message mean?
                What is my best option for processing this large dataset? Are there any other statistical software packages I should try, maybe?

                Any help or suggestions are very much appreciated.



                • #9
                  EOF means "end-of-file".



                  • #10
                    I downloaded the Delimit software, opened the files with it, and deleted the unnecessary variables. When I open the data in Stata now, I get all the observations! It must indeed have been due to some unrecognized characters. I do find it strange that Stata doesn't say how many observations were omitted, or why. But for future users encountering this problem: this might be a workable solution!

                    Thank you for the support



                    • #11
                      The evidence so far is consistent with the story that Stata exited on finding an end-of-file character. That's what I'd expect it to do, and I wouldn't expect a message to that effect.



                      • #12
                        An alternative is to use a simple Python program to strip those characters out (e.g., http://stackoverflow.com/questions/2...s-using-python).

                        Interestingly, treating ASCII 26 (0x1A) as an end-of-file character is an ancient Windows/DOS quirk (see http://stackoverflow.com/a/405169); Mac OS X and Linux, for example, do not share this behavior.

