
  • Large datasets from txt files won't load completely, tried the suggested solutions!

    Dear all,

    I'm writing my thesis and using Stata 12 SE to analyze news messages coded for level of sentiment. However, I'm having difficulty loading a whole dataset into Stata and can't figure out why. Here's what I've tried:

    The first dataset is the "2003" file. Using LTF viewer (similar to Notepad, but able to handle more lines), I've checked that the txt file contains 1,049,323 lines/observations (896 MB). However, when I load it into Stata (via Import > Text data created by spreadsheet), the data properties window shows only 527,920 observations, with Size 1,079.43M and Memory 1,216M. So that's only 50.3% of the original number of observations. Furthermore, the data isn't simply cut off halfway through the year.

    Secondly, I tried loading a different file, "2005". LTF viewer shows 1,439,408 lines; the file is 1.25 GB. When I load this file into Stata, I get 778,087 lines, i.e. 54.1% of the original.
    Data properties further show: Size 1,635.46M and Memory 1,856M.
    So similar, but not identical, results to the first dataset.

    Note: both files have 82 variables.

    I tried both files on the following three computers, all running Stata 12 SE (the 32-bit version on the first two, the 64-bit version on the third):
    1 University server computer: Windows 7 32-bit, 4 GB RAM (3.17 available)
    2 My own computer: Windows 7 32-bit, 2 GB RAM (1.87 available)
    3 Friend's computer: MacBook Pro dual-booted into Windows 7 64-bit, 4 GB RAM

    So to recap: I've tried every way I know to get my txt files loaded into Stata, but I only get about half of the observations, while the properties window in Stata shows a larger size than the original file (which in itself doesn't seem odd to me).

    Has anyone encountered similar problems? And does anybody know a solution?

    Many thanks,

    Maurits Munninghoff

  • #2
    This could be all sorts of things. One possibility is unusual characters being misinterpreted. I'd use hexdump, tabulate to get an idea of the strange characters.

    Another tactic is to look at your files with a really good text editor, focusing on the point where Stata stops. I don't know whether LTF viewer (I've never heard of it) can do that.



    • #3
      Thank you Nick Cox for your quick response!

      I got the following output from hexdump, tabulate, but I don't know what to do with it now. Any suggestions?

      Thank you

      Output stata:

      hexdump "G:\20050101-20051231.EQU\TRNA.20100624.1.20050101-20051231.EQU.txt", tabulate

      Line-end characters:
        \r\n (Windows)              1,439,407
        \r by itself (Mac)                  0
        \n by itself (Unix)                 0

      Space/separator characters:
        [blank]                    41,354,550
        [tab]                     116,591,967
        [comma] (,)                   324,314

      Control characters:
        binary 0                            0
        CTL excl. \r, \n, \t                0
        DEL                                 0
        Extended (128-159,255)          2,467

      ASCII printable:
        A-Z                       248,997,219
        a-z                       242,923,829
        0-9                       525,222,720
        Special (!@#$ etc.)       168,154,680
        Extended (160-254)              9,079
                                -------------
        Total                   1,346,459,639

      File summary:
        Line length (tab=1):      minimum 370, maximum 1,915
        Number of lines:          1,439,407
        EOL at EOF?               yes
        Length of first 5 lines:  823, 529, 526, 526, 530
        File format:              BINARY

      Observed were:
      \t \n \r blank ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < =
      > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a
      b c d e f g h i j k l m n o p q r s t u v w x y z | ~ 128 E^A E^B E^C
      E^D E^E E^F E^G E^H E^I E^M E^Q E^R E^S E^V E^X E^Y E^Z 155 156 157 159
      160 ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « * ® ° ± ² ³ ´ µ ¶ ¸ ¹ º » ¼ ½ ¿ Â Ã â ï

      ...
      Then follows a list of each character with its frequency.



      • #4
        It's difficult in your case, as most of those characters seem possible in a large mass of text. I'd send the output to Stata technical support, but I suspect their first suggestion would be the same as my other one: look at the file at precisely the point where Stata stopped reading it.



        • #5
          I'm sorry, the output above is too confusing; I'll post a screenshot of it instead.
          Attached Files



          • #6
            The issue may be down to a lack of sufficient memory. I usually use R for this kind of dirty work. You could try importing the file into R and then exporting it to Stata using the foreign package. There is a chance that Stata will handle a *.dta file better than this text file. And if you manage to import it into R, you can at least perform some basic operations on the variables, such as removing unwanted characters.
            Kind regards,
            Konrad
            Version: Stata/IC 13.1



            • #7
              Have you verified that the import works (and produces correct data) with smaller data sets?

              Perhaps try the first (or last) 1,000 lines from your file, or a section from the middle that straddles the point of the problem.



              • #8
                Thank you Konrad Zdeb and Brendan Halpin for your feedback.

                @ Brendan: I've now tried smaller data sets as well - the first 1,000, the middle 1,000, and the last 1,000 - and all observations loaded correctly.

                I did find out that in my 2003 sample, the data loaded into Stata jumps from the 26th of December to the 30th! When I checked the text file, I saw that there should be observations between those two dates. I then copied the observations that hadn't come through in Stata, made a new txt file from them, and loaded that into Stata: strangely enough, those observations did load correctly!

                @ Konrad: I downloaded the R software and tried to load the data (I got the command line from the internet; I'm not familiar with the program, though!).
                After a few minutes, it came back with the following message:

                > data1 <- read.delim(file.choose(), header=T)
                Warning message:
                In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
                EOF within quoted string



                So I'm still clueless... What does this warning message mean?
                What is my best option for processing this large dataset? Are there any other statistical software packages I should try, maybe?

                Any help or suggestions are very much appreciated.



                • #9
                  EOF means "end-of-file".



                  • #10
                    I downloaded the Delimit software, opened the files with it, and deleted the unnecessary variables. When I open the data in Stata now, I get all the observations! It must indeed have been due to some unrecognized characters. I do find it strange that Stata doesn't say how many observations were omitted, or why. But for future users encountering this problem: this might be a workable solution!

                    Thank you for the support



                    • #11
                      The evidence so far is consistent with the story that Stata exited on finding an end-of-file character. That's what I'd expect it to do, and I wouldn't expect a message to that effect.



                      • #12
                        An alternative is to use a simple Python program to strip those characters out (e.g., http://stackoverflow.com/questions/2...s-using-python).

                        Interestingly, treating ASCII 26 (0x1A) as an end-of-file character is an ancient Windows/DOS quirk (see http://stackoverflow.com/a/405169); Mac OS X and Linux, for example, do not share this behavior.

