
  • Very Large CSV file

    Good afternoon,

    I am a new Stata user and am working with a very large .csv file (58 million observations and 24 variables). I am trying to figure out the best way to analyze these data. The statistics I am looking for from this dataset are not complicated (primarily just summary statistics). I was thinking that the best way would be to somehow import the file piece by piece into Stata, run the same code on each piece, and then aggregate the results. Any help would be greatly appreciated!

    Thanks,
    Beckie

  • #2
    Well, I'm not convinced that breaking the data into subsets and then combining the descriptive statistics is really the best way to go here. I think the time it will take you to code the re-combination of the descriptive statistics (and the risk of doing it incorrectly) will outweigh the time it will take you to just run the descriptive statistics on the whole data set. The largest dataset I currently have on my computer has 14,342,307 observations and contains 10 numeric variables. It took Stata 37 seconds to calculate descriptive statistics on all 10 variables.

    Anyway if you want to just break your data set up into, say, 6 chunks of about 10,000,000 observations each, you could do this:

    Code:
    capture program drop save_one_chunk
    program define save_one_chunk
        local chunk_num = chunk[1]
        drop chunk
        save chunk_`chunk_num', replace
        exit
    end
    
    import delimited my_big_csv_file.csv, clear
    gen int chunk = ceil(_n/10000000)
    runby save_one_chunk, by(chunk)
    To use this, you must install the -runby- command, written by Robert Picard and me, available from SSC. This code will create 6 Stata data sets, called chunk_1.dta through chunk_6.dta, each containing 10,000,000 observations (except the last, which will be a bit smaller).
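
    If you do go the chunked route, one way to aggregate afterwards is to loop over the chunk files, accumulate the observation count and sum of each variable from -summarize-, and compute the overall mean at the end. Here is a minimal sketch for a single variable (the variable name myvar is just a placeholder, and the 6 chunks match the example above); note that statistics such as medians and other percentiles cannot be recombined this simply:

    Code:
    scalar total_N = 0
    scalar total_sum = 0
    forvalues i = 1/6 {
        use chunk_`i', clear
        quietly summarize myvar
        scalar total_N = total_N + r(N)
        scalar total_sum = total_sum + r(sum)
    }
    display "overall mean of myvar = " total_sum/total_N
    Scalars are used here rather than locals so that the accumulated sums keep full double precision.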

    But frankly, for the purpose you describe, I'd just save the whole thing as a single data file and run my analyses on that. Unless you had a way to enforce parallel processing on the separate chunks, I don't see any gain from breaking it up.
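
    For that single-file workflow, a minimal sketch (the .dta filename is just illustrative) is to import the csv once, -compress- it to shrink the storage types, and save it as a Stata dataset, so the slow csv import only has to happen once; all later analyses then run from the .dta file:

    Code:
    import delimited my_big_csv_file.csv, clear
    compress
    save my_big_csv_file.dta, replace
    
    use my_big_csv_file.dta, clear
    summarize, detail
    Loading the saved .dta file is typically much faster than re-importing the csv each session.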




    • #3
      Even with Stata/IC, you can have ~2 billion observations, so I don't see a technical reason preventing Stata from analyzing the data all at once, per se. I have not tried to import such a large file into Stata, so I tested the scenario with dummy data to see whether the import facility would have issues. My code was:

      Code:
      set obs 58000000
      gen id = _n
      gen x = rnormal(0, 1)
      export delimited test.csv
      This took about 10 minutes to write out a ~1.0 GB file with the 58M observations and 2 variables. I then imported the dataset using:

      Code:
      drop _all
      import delimited id x using test.csv
      On my reasonably fast machine, this took a few minutes to read in all the data. So that shows there's no issue with importing a large file.

      Summary statistics (and other commands) would then be used as normal, though given the size of the dataset, I expect they will take some time (for example, -summ x, detail- took a minute).
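
      If you want to see how long a particular command takes on your own machine, one option is Stata's built-in -timer- facility. A minimal sketch, using the dummy variable x generated above:

      Code:
      timer clear 1
      timer on 1
      summarize x, detail
      timer off 1
      timer list 1
      The -timer list- output reports the elapsed time in seconds.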
      Last edited by Leonardo Guizzetti; 02 Apr 2018, 12:48. Reason: fixed grammar



      • #4
        Originally posted by Leonardo Guizzetti View Post
        Even with Stata/IC, you can have ~2 billion observations, so I don't see a technical reason preventing Stata from analyzing the data all at once, per se. I have not tried to import such a large file into Stata, so I tested the scenario with dummy data to see whether the import facility would have issues. My code was:

        Code:
        set obs 58000000
        gen id = _n
        gen x = rnormal(0, 1)
        export delimited test.csv
        This took about 10 minutes to write out a ~1.0 GB file with the 58M observations and 2 variables. I then imported the dataset using:

        Code:
        drop _all
        import delimited id x using test.csv
        On my reasonably fast machine, this took a few minutes to read in all the data. So that shows there's no issue with importing a large file.

        Summary statistics (and other commands) would then be used as normal, though given the size of the dataset, I expect they will take some time (for example, -summ x, detail- took a minute).
        Hi,

        I am working with large csv files, some of which are 1 to 2 GB in size. But I cannot import them into Stata in anything like such a short time; for example, with one 1.53 GB file I waited 2 hours, and in the end the process did not complete and Stata stopped responding. I would appreciate your comments on this and on why you think this is happening.

        Best,
        Shadi



        • #5
          Originally posted by Clyde Schechter View Post
          Well, I'm not convinced that breaking the data into subsets and then combining the descriptive statistics is really the best way to go here. I think the time it will take you to code the re-combination of the descriptive statistics (and the risk of doing it incorrectly) will outweigh the time it will take you to just run the descriptive statistics on the whole data set. The largest dataset I currently have on my computer has 14,342,307 observations and contains 10 numeric variables. It took Stata 37 seconds to calculate descriptive statistics on all 10 variables.

          Anyway if you want to just break your data set up into, say, 6 chunks of about 10,000,000 observations each, you could do this:

          Code:
          capture program drop save_one_chunk
          program define save_one_chunk
              local chunk_num = chunk[1]
              drop chunk
              save chunk_`chunk_num', replace
              exit
          end
          
          import delimited my_big_csv_file.csv, clear
          gen int chunk = ceil(_n/10000000)
          runby save_one_chunk, by(chunk)
          To use this, you must install the -runby- command, written by Robert Picard and me, available from SSC. This code will create 6 Stata data sets, called chunk_1.dta through chunk_6.dta, each containing 10,000,000 observations (except the last, which will be a bit smaller).

          But frankly, for the purpose you describe, I'd just save the whole thing as a single data file and run my analyses on that. Unless you had a way to enforce parallel processing on the separate chunks, I don't see any gain from breaking it up.

          Hi,

          I am working with large csv files (500 MB to 2 GB) and I have trouble importing them into Stata. I first tried the loop below, which drops only the unwanted variables and observations and saves the results in separate .dta files, but Stata keeps working and then crashes after several hours, even when the directory contains only two of the csv files (there are over 50 files in total). Here is the loop I tried for importing the files from the directory:



          Code:
          global path "E:/Comtradet1"
          cd "$path"
          local datafiles: dir "." files "*.csv"
          dir
          foreach file of local datafiles {
              import delimited "`file'", clear
              some commands
              save "$path/`file'.dta", replace
          }
          As I said, I cannot even import the data successfully. Please let me know if I have made mistakes in my script.
          To check whether the code works, I even tried this with only one file in the directory, and still the same thing happened.

          Then I tried importing a single csv file using the code below:

          Code:
          global path "E:/Comtradet"
          import delimited "E:\Comtrade\type-C_r-ALL_ps-1977_freq-A_px-S2_pub-20041207_fmt-csv_ex-20190512", clear
          cd "$path"
          some commands
          compress
          save "$path/1977.dta", replace
          This actually worked, but it is my smallest file (280 MB), and when I run the same code on another csv file (1.50 GB) Stata keeps working and then crashes after some hours. After reading the posts here, however, I found that it is not normal for this to take even one hour. So please let me know if you can think of a solution.
          Best,
          Shadi
          Last edited by Shadi Shad; 07 Sep 2021, 03:59.

