
  • Memory to store data while processing: Matrix vs Dataset

    Hi,

    I have a task that takes in a huge dataset (180GB after merging several files). The main RAM costs are loading the data (merging the files) and then holding it in memory while I run the analysis. Would it be more memory-efficient to load each file in turn, store it as a matrix, merge the matrices, and run the analysis in Mata, or would the saving be minimal? Put simply: does holding the data as a matrix while it is loaded and processed need less RAM than holding it as a dataset (data frame), or about the same?

    I need the entire dataset, so I cannot subset.

    Thanks!

  • #2
    I would say it uses less memory as a dataset (.dta), because each variable can use the smallest storage type that fits it.
    In Mata, every element of a real matrix is stored in double precision (8 bytes), whether it is a single digit or a long number.
    Speed-wise, I think Mata can be faster, because you can sometimes skip some of the checks Stata does.
    Other than that, I think it depends on the specific task and on how efficient you are with the storage types and variables you use.
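
    To put rough numbers on the storage-type point, here is a back-of-the-envelope sketch in Python (plain arithmetic, not Stata code). It assumes Stata's numeric storage sizes (byte 1, int 2, long 4, float 4, double 8 bytes) and 8 bytes per element of a Mata real matrix. The variable layout and observation count are made up for illustration, and the sketch ignores string variables and any per-dataset or per-matrix overhead.

```python
# Bytes per value for Stata's numeric storage types.
STATA_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}

# A Mata real matrix stores every element as a double.
MATA_REAL_BYTES = 8


def dataset_bytes(n_obs, var_types):
    """Approximate data size when each variable keeps its own storage type."""
    return n_obs * sum(STATA_BYTES[t] for t in var_types)


def mata_matrix_bytes(n_obs, n_vars):
    """Approximate data size when every value is held as a double."""
    return n_obs * n_vars * MATA_REAL_BYTES


# Hypothetical layout: 4 byte, 2 int, 3 float, and 1 double variable.
var_types = ["byte"] * 4 + ["int"] * 2 + ["float"] * 3 + ["double"]
n_obs = 1_000_000  # hypothetical number of observations

as_dataset = dataset_bytes(n_obs, var_types)
as_matrix = mata_matrix_bytes(n_obs, len(var_types))

print(f"dataset: {as_dataset / 1e6:.0f} MB")
print(f"matrix:  {as_matrix / 1e6:.0f} MB")
print(f"ratio:   {as_matrix / as_dataset:.2f}x")
```

    With this made-up layout the matrix copy is almost three times larger than the dataset. If every variable were already a double, the two would be about the same size, so how much you lose by moving to a matrix depends on your actual mix of storage types.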
