
  • Opening huge data files in Stata

    Dear Statalist

    I am having difficulty opening and working with several huge data files in Stata. These are .txt files, each upwards of 15 GB, with around 300 variables and up to 16 million observations. I have around 20 of them (representing different years of data), which ideally I would like to append into one master database. Does anyone have advice on how to do this within Stata, or on which database software to use so that I can then analyse the data in Stata?

    Best Wishes

    Joe

  • #2
    h infile



    • #3
      Some generic possibilities:
      - run Stata on a server with hardware that can handle data and computation at this scale
      - load your data into a PostgreSQL installation and query it from Stata
      - collapse/summarize each file and append the collapsed files (a sketch of this appears below).
      - take smaller random samples from each file and append those.
      It depends on how much detail you really need from the 15 GB files.
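
      A minimal sketch of the collapse-and-append idea, assuming hypothetical file names (data2000.txt ... data2019.txt), a grouping variable region, and analysis variables income and age; adjust to your own files and the statistics you actually need:

      Code:
      clear
      tempfile combined
      save `combined', emptyok
      forvalues y = 2000/2019 {
          import delimited using "data`y'.txt", clear
          collapse (mean) income age, by(region)
          generate year = `y'
          append using `combined'
          save `combined', replace
      }
      use `combined', clear
      For the random-sample variant, replace the collapse line with something like -sample 1-, which keeps a 1% random sample of the file in memory.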



      • #4
        To explain things a bit more: Enrico is asking you to type

        Code:
        help infile
        to see documentation for the infile command. You can use infile with just a subset of your variables. You can also limit the number of observations.
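
        A hedged sketch of both ideas, assuming (hypothetically) that the three variables you need are the first three fields on each record, that the remaining 297 fields can be skipped, and that the file is one of your yearly .txt files:

        Code:
        * hypothetical variable names and field counts; the -in- range caps how many observations are read
        infile id year income _skip(297) using "data2000.txt" in 1/1000000, clear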

        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.
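
        For instance, with hypothetical variable names, this posts a reproducible extract of three variables and the first 20 observations:

        Code:
        dataex id year income in 1/20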

        When presenting code or results, please use code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



        • #5
          Elaborating on #2 and #4, you might be better off with -import delimited- if it is a delimited text file, or with -infix- if it is a fixed-width file.
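
          As a hedged sketch of the -import delimited- route, its colrange() option lets each import carry only a block of columns rather than all 300 variables (the file name and column numbers here are hypothetical):

          Code:
          * read only the first 50 columns of one yearly file; rowrange() similarly caps the rows
          import delimited using "data2000.txt", colrange(1:50) clear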

          The master database is going to be on the order of 300 GB. You'd better be working on a system with a lot of RAM to handle a file that big.



          • #6
            All good advice above, but I have a perhaps pedestrian suggestion for how you might solve this on an ordinary machine, using Stata alone. Do you need all 300 variables in any one analysis? (Pretty unusual; I usually find that I don't use all 300 variables in *all* my analyses put together.) If not, what about creating master subsets of variables that will be used together? The subsets might overlap, of course. For one subset, I'm thinking of something like this:

            Code:
            local filelist "thisfile thatfile someotherfile"
            local vars1 "x1 z3 q5 r10"
            clear
            save "master1.dta", emptyok
            foreach f of local filelist {
               import delimited using "`f'", clear // ..... other import options as needed
               keep `vars1'
               compress
               // Make "accidental" doubles into floats *if* this is ok
               ds, has(type double)
               if "`r(varlist)'" != "" recast float `r(varlist)'
               append using "master1.dta"
               save "master1.dta", replace
            }
            You could make multiple subsets, as needed.
            Last edited by Mike Lacy; 17 Jun 2019, 15:26.



            • #7
              Limiting the number of variables is a good point, of course, but I'd call collapsing or random samples fairly pedestrian solutions as well.
              And PostgreSQL can be installed locally on a laptop or PC as well. It is not entirely straightforward, but not hugely complicated either, and it does allow you to work quickly with datasets that well exceed your laptop's RAM.
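
              A minimal sketch of querying such a setup from Stata via ODBC, assuming (hypothetically) an ODBC data source named mydb that points at the local PostgreSQL database and a table called bigtable that holds the combined data:

              Code:
              * pull only the columns and rows needed for one analysis, not the whole table
              odbc load, exec("SELECT year, region, income FROM bigtable WHERE year = 2010") dsn("mydb") clear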

