  • Read specified number of lines from a CSV file

    Is there any existing Stata program out there for reading part of a CSV file? I'm thinking of offering a client a solution for CSV files potentially bigger than their RAM, so I want to read in lines 1-100,000 (say), write it out to a .dta, and then loop from there through the CSV file. I could use file read but don't want to reinvent the (unpleasant) wheel of interpreting and updating variable types and formats if the code already exists somewhere!

  • #2
    Robert,

    If you are using Stata 13, you can use import delimited with the rowrange option.
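
    A minimal sketch of that (mydata.csv is a stand-in filename; it is worth checking help import delimited for exactly how rowrange counts the header row):

        * Stata 13+: read only the first 100,000 rows of the CSV
        import delimited using mydata.csv, rowrange(1:100000) varnames(1) clear
        save chunk1.dta, replace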

    If you are using Stata 12 or earlier, you can open the CSV file in Excel, save it as an Excel workbook, and then use import excel with the cellrange option.
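
    For the Excel route, a sketch along these lines (mydata.xlsx and the A:D column span are placeholders for your file):

        * Stata 12: variable names from row 1, then rows 2-100,001
        import excel using mydata.xlsx, cellrange(A1:D100001) firstrow clear
        save chunk1.dta, replace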

    Regards,
    Joe

    • #3
      Never mind; infix might solve the problem, but import delimited, mentioned above, is better anyway.
      Last edited by ben earnhart; 13 Nov 2014, 08:33.

      • #4
        I have tried this out in earnest now, and import delimited is wonderful for this task. The only caveat is that when you come to read rows 10,000,000 to 10,100,000, it has to count its way up there each time, so as a looping strategy it might be best to chop up the file first with some kind of shell command and loop within the chops (see the sketch below)... but I'll give that some more experimentation. Anyway, I'm aiming to write this up as a longer blog post before too long, because "big data" fiends should know that Stata is more than adequate for most of those sorts of jobs.
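
        A rough sketch of that chop-and-loop idea, assuming a Unix-like shell with split available (taxi.csv and the chunk names are hypothetical):

            * chop the big CSV into 1,000,000-line pieces: part_aa, part_ab, ...
            !split -l 1000000 taxi.csv part_

            * import each piece and save it as .dta; only part_aa carries the
            * header row, so later pieces are read with varnames(nonames)
            import delimited using part_aa, varnames(1) clear
            save chunk_aa.dta, replace
            foreach s in ab ac ad {    // illustrative list of suffixes
                import delimited using part_`s', varnames(nonames) clear
                save chunk_`s'.dta, replace
            }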

        • #5
          Strange. It might be worth investigating "infix (specifications) using filename in `x1'/`x2'" after all, or there may be something odd about your loop. Infix is a bit of a pain since you need to specify formats, but on a project that large it is worth the time and effort if it is that much faster.
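
          A minimal sketch of that, with made-up variable names and column positions (infix expects a fixed-width layout, so the 1-16 and 18-25 ranges and taxi.raw are placeholders):

              * read only observations 10,000,001-10,100,000 of the raw file
              local x1 = 10000001
              local x2 = 10100000
              infix str pickup 1-16 double fare 18-25 using taxi.raw in `x1'/`x2', clear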

          When I was reading in literally billions of lines of a .csv file (on a 4 GB P4), it didn't slow down as it went along; it was like the honey badger, it just kept going.

          • #6
            Well, I will try it out and report back. I'm using the NYC taxi data as a test case (see chriswhong.com), which is 173 million rows and about 40 GB in total.

            • #7
              Strange. My memory was that it did not slow down as it got deeper into the file. I generated a 100,000,000-observation file just now, and it was noticeably slowing down after about 10,000,000 cases, or even sooner. Weird. Sorry if I got you off on a bad track; I guess you'll know quickly enough. It might have been infile, not infix.

              • #8
                Even stranger. I threw infile at it, even with the "if" qualifier so it would only bother to read certain cases. It still slows down. So either my memory is bad, or something has changed about Stata. But I know I did it about six years ago, and it was amazingly fast. Whatever it was that I did. Hmm...
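
                For reference, the kind of thing tried here, with placeholder variable names (free-format infile treats commas as separators, and this assumes the file has no header row):

                    * read the whole file but keep only cases where fare > 0;
                    * every line is still parsed, so this does not avoid the scan
                    infile id fare dist using taxi.csv if fare > 0, clear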
