Hi all, this is a question about a work project, so I am unfortunately very limited in the information and data I can share. However, I am still hopeful someone may have helpful pointers!
The project I am working on involves a large amount of data (approx. 3 TB). We have a number of different data types, and each data type is split into 500 samples for more manageable file sizes. I am putting together a data summary of the entire dataset (all data types). The largest individual sample file is about 500 MB.
The do-file I have created loops through each sample within each of the data types: it loads the data for sample 1, stores a number of stats in local macros (observation counts, some percentiles, etc.), clears the memory, loads sample 2, and so on until all 500 samples have been read in. This repeats for each of the 16 data types. So, as far as I can tell, the only "carry-over" memory usage between samples should be the info stored in the local macros. Thousands of macros are created, but my understanding is that these should account for only a very minor amount of memory.
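I cannot post the actual code, but the structure is essentially the stripped-down sketch below. The file path, the variable "value", and the stored stats are all placeholders, not the real ones:

    local datatypes "typeA typeB typeC"        // 16 types in the real do-file
    foreach dt of local datatypes {
        forvalues s = 1/500 {
            use "path/`dt'_sample`s'.dta", clear
            quietly count
            local n_`dt'_`s' = r(N)
            quietly summarize value, detail    // "value" is a placeholder variable
            local p50_`dt'_`s' = r(p50)
            local p90_`dt'_`s' = r(p90)
            clear                              // drop the data before the next sample
        }
    }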
However, the task manager on the server I am running the code on shows that memory usage keeps climbing as each loop iteration completes. I have added extra checks to the do-file (e.g., a 'clear' at the end of each loop iteration to ensure the data are removed from memory) as well as memory checks after each sample is processed. All of the memory checks show only a few MB of data in Stata, yet the task manager still shows a significant amount of memory in use. It appears that each sample is somehow being retained in memory even though Stata reports no data in memory and very little usage.
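Concretely, the end of each iteration now looks something like this, and the comments reflect what I actually see:

    clear              // ensure no dataset remains in memory
    describe, short    // confirms 0 observations, 0 variables
    memory             // Stata's memory report shows only a few MB in use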
I have made some adjustments to niceness, segmentsize, and max_memory for other portions of this project, so I double-checked that those adjustments were not saved permanently; all of my session memory settings are the Stata defaults. So I am at a complete loss as to why there is this discrepancy between Stata's memory reports and the task manager's.
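For what it is worth, this is how I verify at the top of each run that everything is back at the defaults:

    query memory              // displays all current memory settings
    display c(niceness)       // default is 5
    display c(segmentsize)    // default is 32m on 64-bit Stata
    display c(max_memory)     // default is . (no limit)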
I know that it will be impossible to diagnose without reviewing the code, but if anyone has any suggestions at all that I might be able to look into, I would really appreciate it.