
  • Loop over observations to open a large file

    Dear statalists,

    I am very new to Stata, and I would like to ask for your help because I cannot create a loop that repeatedly opens a very large dataset in smaller portions and saves each portion as a separate file, so that I can work with each part separately.
    Suppose I have a large dataset file (dataset.dta) that I cannot open all at once because it is too large for my PC. I would like to:
    1) open it in the range 1/1000
    2) save the file as dataset_1.dta
    3) close it.
    4) open 1001/2000 and restart the process until the entire dataset has been saved.

    My idea was the following, but I think that it is impossible to use `i' in the range

    Code:
    forval i = 1/100 {
        use "C:\Stata\dataset.dta" in [(`i'-1)*1000+1]/(1000*`i')
        save "C:\Stata\dataset_`i'.dta"
    }
    Many thanks in advance for your help, and I am sorry if this is a silly question.

    Paolo

  • #2
    I am not sure what you are going to do with all these datasets, but I think this answers your question. Note, as explained at http://www.statalist.org/forums/help#stata, that CODE delimiters make code easier to read.

    Code:
    forval i = 1/100 { 
       local i2 = 1000 * `i' 
       local i1 = `i2' - 999 
       use "C:\Stata\dataset.dta" in `i1'/`i2'  
       save "C:\Stata\dataset_`i'.dta" 
    }
    There is a syntax for evaluating functions of locals on the fly, but I recommend something like the above while you are new to Stata.
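    For reference, the on-the-fly syntax mentioned above is `=exp' macro evaluation, which computes an expression inline where the result is substituted. A sketch of the same loop written that way (same file paths as above):

    ```stata
    forval i = 1/100 {
        // `=exp' evaluates the arithmetic inline, so no helper locals are needed
        use "C:\Stata\dataset.dta" in `=(`i'-1)*1000 + 1'/`=1000*`i''
        save "C:\Stata\dataset_`i'.dta"
    }
    ```

    The two-local version in the code above is easier to read and debug, which is why it is the better habit while learning.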



    • #3
      Thank you very much, it works perfectly! Next time I will do my best to make my posts more readable.
      Thanks again



      • #4
        Paolo Rossi given the expense of I/O operations, you might be better off finding some number of observations that can comfortably fit in memory (e.g., 500,000). Then open the file once and take a random sample of the largest size you can work with on the machine. It will take a little longer initially if the system starts to swap memory, but you will reduce the clutter on disk, and you will have a single dataset that best represents the larger file to develop your code against. If you really want to split the file into multiple "shards", I would still load the whole file once and then use export excel or export delimited to write the pieces, rather than reading the data back into memory several times, since both commands allow the if and in qualifiers.
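        A sketch of both suggestions, assuming the full file fits in memory at least once (the paths and the 1,000-observation shard size are carried over from earlier in the thread):

        ```stata
        use "C:\Stata\dataset.dta", clear        // read the source file once

        * One representative random sample to develop code against
        preserve
        sample 500000, count                     // keep a random 500,000 observations
        save "C:\Stata\dataset_sample.dta", replace
        restore

        * Write shards to disk without re-reading the .dta file;
        * export delimited accepts the in qualifier
        forval i = 1/100 {
            export delimited using "C:\Stata\dataset_`i'.csv" in `=(`i'-1)*1000+1'/`=1000*`i'', replace
        }
        ```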
