
  • Import Large File

    Hello guys,

    I am new to this forum and hope you can help me with a problem I am facing with Stata.

    I have a very large .dta file (25 GB) with 12 variables and more than 200,000,000 observations. My question is simply whether there is a way to open this file without crashing my computer and without using external programs (SAS, etc.). If someone who has had similar experiences could help me, I would be very happy.

    Best Wishes

  • #2
    You can read in only the variables and observations you want.

    So, say your dataset were an individual-level file and you were interested in the ages of women; you could say
    Code:
    loc usevars age f
    loc cond "f == 1"
    use `usevars' if  `cond' using "data.dta", clear
    Another option is to read the dataset in chunks, subset/collapse to a manageable size, and stack.
    Code:
    loc stop 10000
    loc mastersize 200000000
    loc n = 1
    forv start = 1(`stop')`mastersize' {
       use in `start'/`stop' using "data.dta", clear
       // keep relevant vars/ collapse
       save "chunk`n'", replace
       loc stop = min(`start'+`stop',`mastersize')
       loc ++n
    }
    clear
    forv i = 1/`n' {
       append using "chunk`i'"
    }
    I've had a reasonable amount of success combining the two approaches to get the original dataset into memory.
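    A rough sketch of what that combination might look like for a single chunk (the variable names, the condition, and the observation range are only placeholders):
    Code:
    loc usevars age f
    loc cond "f == 1"
    * needed variables only, matching observations only, and only the first chunk of rows
    use `usevars' if `cond' in 1/10000000 using "data.dta", clear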
    http://www.nber.org/stata/efficient/ is a good resource for this sort of problem.
    Last edited by Apoorva Lal; 17 Jul 2017, 16:25.

    • #3
      The other point to make here is that the size of the data set should not be causing Stata to crash, regardless of how big it is. If it is too big to fit in the available memory, Stata will not crash: it will halt with the error message "op. sys. refuses to provide memory." What can happen with a very large data set is that it can take a very long time for Stata to read it. Stata does not issue any "progress reports" while it reads the file, so it can easily appear that your computer is hung and that Stata has crashed. But Stata has never crashed on me when reading a large file. I admit I have never tried to read a 25 GB file, but I have gone up to 20 GB, and Stata has always been able to read the file as long as my computer's memory wasn't taken up too much by other open applications. Reading a file of that size does take a long time, though, and can create the appearance of a hung computer.

      That said, Apoorva Lal's advice to read only the observations and variables you actually need is excellent: not only will it save you time reading the file in, but many of your subsequent commands will also execute more quickly.
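
      As a quick check before committing to a full read, you can also inspect the file without loading it; a minimal sketch, using the "data.dta" name from #2 as a stand-in for the real file:
      Code:
      describe using "data.dta"   // lists variables and storage types without loading the data
      di r(N)                     // number of observations in the file
      di r(N)*r(width)            // rough size in bytes once loaded into memory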

      • #4
        In my experience, it's a bit more complicated than that, due to the existence of "page files". There are three scenarios.

        1. The file is smaller than your physical memory. Loading the dataset will be fine.
        2. The file is larger than your physical memory + page file. Stata will provide an error.
        3. The file is larger than your physical memory, but smaller than physical memory + page file. Loading will take forever.

        The page file is basically a section of your hard drive that Windows calls upon when it runs out of physical memory. The downside is that the page file is very slow, often 1000x slower than memory. So if you are in situation 3, Stata will load the data quite quickly into physical memory, see that it is full, and start filling the page file. That last step takes so long that, for all intents and purposes, Stata "crashes". Working with the data will also be extremely slow.

        The solution is to either move to a cluster/server (which can easily have 256 GB of memory), find a way to do your work in pieces (see #2), or move to a line-by-line processing tool such as SQL.
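
        If you suspect you are in situation 3, one option is to cap Stata's memory at roughly your physical RAM so it stops with an error instead of silently spilling into the page file; a minimal sketch (the 16g value is a placeholder for your machine's actual RAM, and -set max_memory- requires Stata 12 or later):
        Code:
        memory               // report current memory usage
        query memory         // show memory settings (max_memory, segmentsize, ...)
        set max_memory 16g   // placeholder: set to about the size of your physical RAM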

        • #5
          I recently found myself in the situation of reading in datasets that were larger than my physical memory allowed (~10 GB). The main issue I had was that the variable storage types were not optimized, so the file size was much larger than necessary. I found this thread helpful for the ideas presented by Apoorva Lal, because the chunking approach made it possible to compress the data in chunks and then reassemble the dataset at a much more reasonable size.

          The code presented above did not work as anticipated, but I rewrote it slightly and verified that it works as a quick solution. I am posting it here for posterity in case someone else comes along with a similar problem.

          Code:
          version 15.1
          
          * start with some example data
          cd "C:/temp"
          input double id str12 x
          1 a
          2 b
          3 c
          4 d
          5 e
          6 f
          7 g
          8 h
          9 i
          10 j
          end
          save "testin", replace
          * end of example data
          
          
          *** This relevant bit of code starts here.
          * modify the following 4 lines accordingly.
          local indata "testin.dta"
          local outdata "testout.dta"
          scalar stepsize = 3  /* number of records to read in at one time. */
          scalar recordstart = 1
          * Set the above parameters accordingly.
          
          describe using "`indata'"
          scalar nrecords = r(N)
          scalar nchunks = ceil( (nrecords - recordstart + 1) / stepsize )
          
          forvalues chunki = 1/`=nchunks' {
             di "`chunki'"
             scalar start = recordstart + ((`chunki' - 1) * stepsize)
             scalar stop = min(start + stepsize - 1, nrecords)
             use in `=start'/`=stop' using "`indata'", clear
            
             // keep relevant vars/ collapse plus compress
             compress /* optional, but suggested */
             save "chunk`chunki'", replace
          }
          
          * assemble the chunks
          clear
          forv i = 1/`=nchunks' {
             append using "chunk`i'"
          }
          save "`outdata'", replace
          *** end of code segment
          
          * verify reassembled data
          list, clean
          Last edited by Leonardo Guizzetti; 08 Mar 2019, 11:49.

          • #6
            Hi,
            Is there any way to load only 2 variables from a huge dataset that will not load because of memory limits, when the data come from a .csv file rather than a .dta file?

            thanks
            Vishal

            • #7
              Yes; if you look at the help for -use-, you will see that the second syntax shown is for exactly that.

              Comment


              • #8
                Originally posted by Vishal Sharma
                Hi,
                Is there any way to load only 2 variables from a huge dataset that will not load because of memory limits, when the data come from a .csv file rather than a .dta file?

                thanks
                Vishal
                See -help import delimited- and, in particular, the colrange() option.
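
                For example, a minimal sketch (the filename and the column and row numbers are placeholders for your own):
                Code:
                * read only columns 2 and 3 of a large .csv, skipping everything else
                import delimited using "bigfile.csv", colrange(2:3) clear

                * rowrange() can additionally limit how many lines are read
                import delimited using "bigfile.csv", colrange(2:3) rowrange(1:1000000) clear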

                • #9
                  thanks!
