  • What is the best way to reduce the memory usage in Stata?

    Dear Sir,

    We are running Stata on the AWS cloud service. Our setup is very simple:
    1. Each user runs the Stata application installed on an EC2 instance.
    2. A large dataset (8 TB in total) is stored in an S3 bucket.
    3. The S3 bucket and the EC2 instances are connected through a storage gateway.
    Since we are dealing with a very large amount of data (8 TB in total), Stata needs a large amount of memory. In fact, one of our instances is an r5a.8xlarge, which has 32 vCPUs and 256 GB of memory. This is very costly.

    We have always faced this memory issue, and the large cost involved has been a serious obstacle to advancing our research projects.

    I am now looking for ways to reduce memory usage in Stata, although we have not yet tested any alternatives to standard do-file use. What I am thinking of trying is to separate the user interface from Stata's in-memory data structures. I have no idea what the best way to do this would be, and there may be no simple way as long as we use Stata with such a large amount of data. Below is a list of simple ideas I am planning to test:
    • Call Stata commands from JupyterLab.
    • Convert existing Stata commands to the Mata language.
    • Build an RDBMS in the storage layer and connect Stata to it through the ODBC connector.
    If anyone could give me some comments on the memory usage problem in Stata, it would be greatly appreciated.

    Best regards,
    Tatsuru Kikuchi
    Last edited by Tatsuru Kikuchi; 01 Jun 2023, 07:47.

  • #2
    The first thing I thought of was: do your users need all the data? Very often I only need some of the variables. You can ask Stata to open a subset of your dataset, something like this:

    use vara varb varc using big_dataset.dta

    You can also combine this syntax with the if and in qualifiers to limit which observations are loaded.
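    [Editor's illustration, not part of the original reply: the same load-only-what-you-need idea also applies to the RDBMS/ODBC route mentioned in #1. A minimal sketch in Python, using the standard-library sqlite3 module as a stand-in for a full RDBMS; the table, column, and file names are hypothetical.]

    ```python
    # Keep the full dataset in a database and pull out only the variables
    # and observations a user needs, mirroring Stata's
    #   use vara varb if <cond> using big_dataset.dta
    # The subset is written to CSV so Stata can import just that slice.
    import csv
    import sqlite3

    def extract_subset(db_path, out_csv, cols, where):
        """Query the hypothetical 'bigdata' table for cols where the
        condition holds, and write the result to out_csv."""
        con = sqlite3.connect(db_path)
        query = f"SELECT {', '.join(cols)} FROM bigdata WHERE {where}"
        with open(out_csv, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(cols)           # header row with variable names
            w.writerows(con.execute(query))
        con.close()

    if __name__ == "__main__":
        extract_subset("big_dataset.db", "subset.csv",
                       ["vara", "varb"], "vara > 100")
    ```
    
    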

    PS. It is probably safer not to assume we are all male. Most posters just leave the salutation out altogether, and that is probably best.
    Last edited by Maarten Buis; 01 Jun 2023, 08:22.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Stata wants all the data used in a procedure to be in core when the procedure runs. Workarounds for this are very difficult if you can't squeeze the data into the available space. Stata itself takes very little memory, only megabytes, not gigabytes. The in-core restriction is an enormous simplification, so I don't think it will ever be removed. SAS is an out-of-core package with good coverage, but very costly.

      I have posted some information on conserving memory at http://www.nber.org/stata/efficient. In particular, you should look at https://www.nber.org/stata/efficient/memory.html, but there are additional considerations. Most Stata procedures take only the amount of memory required to hold the data plus a few megabytes. However, some will make a copy of the data, promoting all the byte variables to doubles; that factor of 8 can be a problem. You can only find out what your projects require by running programs and observing memory usage with OS commands. The Stata -memory- command only counts the memory used between procedures, not within them.

      If in fact you need to keep more than 250 GB or so in memory for any procedure, you have to weigh the difficulty of processing data out-of-core against the cost of buying a large-memory server yourself. A 2 TB server can easily be had for less than US$20,000. With sufficient memory, most procedures are fast, even with very large datasets.

      As for your specific suggestions, I don't find them very promising. Mata is Turing-complete, so you could rewrite Stata commands to run with data out-of-core, but Mata's built-in statements all assume the data is in core. It would be a massive and thankless job to make them out-of-core if your needs were at all diverse; it might be attractive if you had only one troublesome procedure and could do the rest in Stata. ODBC is a way to feed data to Stata, but it won't reduce the memory required by each procedure. It may be easier than the -if- qualifier for subsetting data, but that is all it can do for you. Jupyter won't even do that.

      You could consider doing some work in Python. In many projects, some initial condensation of the data has to be done before the intensive econometrics. Doing that out-of-core in a general-purpose language is sometimes quite easy.
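      [Editor's illustration, not part of the original reply: a minimal sketch of the out-of-core condensation just described, using only Python's standard library. It streams a CSV one row at a time, so memory use stays roughly constant regardless of file size; the file and column names are hypothetical.]

      ```python
      # Out-of-core condensation: collapse a huge row-level file to one
      # summary row per group, reading the input a single row at a time.
      import csv
      from collections import defaultdict

      def condense(in_path, out_path, group_col, value_col):
          """Compute per-group counts and means of value_col, streaming."""
          totals = defaultdict(float)
          counts = defaultdict(int)
          with open(in_path, newline="") as f:
              for row in csv.DictReader(f):     # one row in memory at a time
                  key = row[group_col]
                  totals[key] += float(row[value_col])
                  counts[key] += 1
          with open(out_path, "w", newline="") as f:
              w = csv.writer(f)
              w.writerow([group_col, "n", "mean_" + value_col])
              for key in sorted(totals):
                  w.writerow([key, counts[key], totals[key] / counts[key]])

      if __name__ == "__main__":
          # Hypothetical names; the condensed file is small enough for Stata.
          condense("big_dataset.csv", "condensed.csv", "firm_id", "revenue")
      ```

      The condensed file can then be imported into Stata for the econometrics proper.
      
      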
      Last edited by Daniel Feenberg; 01 Jun 2023, 09:14.
