
  • General Advice on Upgrading

    (Note: If not allowed, moderators feel free to remove this post.) I just wanted people's opinions on upgrading from Stata 17 SE to Stata 18 MP to deal with large datasets. I am working on my dissertation with data from the Medical Expenditure Panel Survey, and just reshaping the data back and forth takes a long time. My current laptop is still good (in terms of being able to support Stata), but the long wait between commands is one of the reasons I have had a hard time working on my data and have felt very discouraged. I don't know what other solutions I should seek to complete my dissertation. I want to finish by the end of the year, and the only thing holding me back is the slow turnaround time. I would love to hear any advice on this topic - especially since upgrading from SE to MP is $755, even as a student.

  • #2
    I think you can probably get away without upgrading, especially if you switch over to some of the user-written commands such as gtools. gtools speeds up data management with large datasets considerably. The analyses might still take a while depending on what kinds of models you are running - perhaps you can say a bit more about that?
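
    Since reshaping was mentioned as the bottleneck, here is a rough sketch of how gtools could be dropped in. The variable names (personid, round, totexp) are made up for illustration; substitute the MEPS identifiers you actually use:

    ```stata
    * Install gtools once from SSC
    ssc install gtools

    * greshape is a much faster drop-in replacement for reshape
    greshape wide totexp, i(personid) j(round)
    greshape long totexp, i(personid) j(round)

    * gcollapse and gegen similarly replace collapse and egen
    gcollapse (sum) totexp, by(personid)
    ```

    The g-commands take essentially the same syntax as their official counterparts, so switching usually means just prefixing a "g".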



    • #3
      I would add to Erik Ruzek's excellent advice that you can often get a substantial speedup by paring down the data set in memory to only those variables and observations that you will actually need for the computations. For an example, see my post at #191 in https://www.statalist.org/forums/for...tata-19/page13. Sometimes even just -frame put-ting the variables you need for a step into another frame, carrying out that step there, and then sending the results back with -frame get- can save a lot of time - particularly if the step involves -sort- or any command that sorts the data behind the scenes.
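
      A minimal sketch of that frame pattern (variable names are hypothetical; the results come back via -frlink-/-frget-):

      ```stata
      * Copy only the variables needed for this step into a work frame
      frame put personid totexp, into(work)

      * Carry out the sort-heavy step on the small frame
      frame work: collapse (sum) totexp_total = totexp, by(personid)

      * Link back to the default frame and retrieve the result
      frlink m:1 personid, frame(work)
      frget totexp_total, from(work)
      ```

      Because the work frame holds only two variables, the -sort- that -collapse- does behind the scenes moves far less data.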

      Another tip: if your data set includes lengthy encrypted ID string variables (things that look like strong passwords, or many-digit hexadecimal numbers represented as strings), replacing those with consecutive numbers that can be stored as a -long- or -double- will not only free up a lot of memory and disk space but will also speed up many commands. (Of course, be sure to preserve a crosswalk if you do this!)
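
      A sketch of that replacement using -egen, group()-, with the crosswalk saved first (variable and file names are made up):

      ```stata
      * Map each long string ID to a consecutive integer
      egen long id_num = group(encrypted_id)

      * Save the crosswalk before dropping the string version
      preserve
      keep encrypted_id id_num
      duplicates drop
      save id_crosswalk, replace
      restore

      drop encrypted_id
      ```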

      It is hard to overstate the efficiency benefits of working with the smallest possible data set in memory that can do the job.

      That said, if you are fitting multi-level models with lots of cross-level interactions and unstructured covariance on a data set with tens of millions of observations, you are going to wait a long time no matter what you do.
