Hi,
I have a task that takes in a huge dataset (180GB after merging several files together). Loading the dataset (merging the files) is the main RAM expense, followed by housing the data while I run some analysis. I want to know whether it would be more efficient to load each subsequent file, store it as a matrix, merge the matrices, and then run my analysis in Mata, or whether the improvement in RAM usage would be minimal. Put simply: is the RAM needed to house the data as it is loaded and processed smaller when the data is kept as a matrix compared with keeping it in a data frame, or about the same?
I need the entire dataset, so I cannot subset.
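For reference, a minimal sketch of the matrix-based approach I am describing (the file names file1.dta, file2.dta, etc. are placeholders for my actual files):

```stata
mata:
X = .
for (k = 1; k <= 3; k++) {
    // load the k-th file into Stata's data area
    stata("use file" + strofreal(k) + ".dta, clear")
    // copy all observations of all numeric variables into a Mata matrix
    A = st_data(., .)
    // append below the rows accumulated so far
    X = (k == 1 ? A : X \ A)
}
end
```

Note that with this pattern each file is briefly held twice (once as the Stata dataset, once as the Mata copy from st_data()), which is part of what I am unsure about.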
Thanks!