Hi all, this is a question about a work project, so I am unfortunately very limited in the information and data I can share. However, I am still hopeful someone may have helpful pointers!
The project I am working on involves a large amount of data (approx. 3 TB). We have a number of different data types, and each data type is split into 500 samples for more manageable file sizes. I am putting together a data summary of the entire dataset (all data types). The largest individual sample file is about 500 MB.
The do-file I have created loops through each sample within each of the data types: it loads the data for sample 1, stores a number of stats in local macros (observation counts, some percentiles, etc.), clears the memory, loads sample 2, and so on until all 500 samples have been read in. This repeats for each of the 16 data types. So, as far as I can tell, the only "carry-over" memory usage between samples should be the info stored in the local macros. Thousands of macros are created, but my understanding is that these should account for only a very minor amount of memory.
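I cannot post the actual code, but the structure is essentially the stripped-down sketch below. The file path, the variable "value", and the stored stats are all placeholders, not the real ones:

    local datatypes "typeA typeB typeC"        // 16 types in the real do-file
    foreach dt of local datatypes {
        forvalues s = 1/500 {
            use "path/`dt'_sample`s'.dta", clear
            quietly count
            local n_`dt'_`s' = r(N)
            quietly summarize value, detail    // "value" is a placeholder variable
            local p50_`dt'_`s' = r(p50)
            local p90_`dt'_`s' = r(p90)
            clear                              // drop the data before the next sample
        }
    }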
However, the task manager on the server I am running the code on shows that memory usage keeps climbing as each loop iteration completes. I have added extra checks to the do-file (e.g., a 'clear' at the end of each loop iteration to ensure the data are removed from memory) as well as memory checks after each sample is processed. All of the memory checks show only a few MB of data in Stata, yet the task manager still shows a significant amount of memory in use. It appears that each sample is somehow being retained in memory even though Stata reports no data in memory and very little usage.
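Concretely, the end of each iteration now looks something like this, and the comments reflect what I actually see:

    clear              // ensure no dataset remains in memory
    describe, short    // confirms 0 observations, 0 variables
    memory             // Stata's memory report shows only a few MB in use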
I have made some adjustments to niceness, segmentsize, and max_memory for other portions of this project, so I double-checked that those adjustments were not saved permanently; all of my session memory settings are the Stata defaults. So I am at a complete loss as to why there is this discrepancy between Stata's memory reports and the task manager's.
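For what it is worth, this is how I verify at the top of each run that everything is back at the defaults:

    query memory              // displays all current memory settings
    display c(niceness)       // default is 5
    display c(segmentsize)    // default is 32m on 64-bit Stata
    display c(max_memory)     // default is . (no limit)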
I know that it will be impossible to diagnose without reviewing the code, but if anyone has any suggestions at all that I might be able to look into, I would really appreciate it.