Stata/MP version?

Manish Gupta

Join Date: Jun 2015

Posts: 11
#1

19 May 2019, 21:01

Dear Stata community,

I am facing an issue and was hoping that some of you might share ideas or experiences.

I work with large data files, exceeding 40 GB. Stata/SE was not enough, so I switched to Stata/MP 2, but it has also been painfully slow. I have been thinking of upgrading to a Stata/MP license with more cores but am not sure which one is optimal. The guidance available online is not very clear, which is why I am reaching out to you.

I have tried splitting the files and working in batches, which helps, but it still takes a lot of time. Some basic commands can take hours or even days.

My system specs are:
CPU: Intel Xeon E5-1620 v4 @ 3.5 GHz
RAM: 64 GB
OS: 64-bit, x64-based processor

I look forward to your suggestions.

Thank you.

Kind regards,
Manish
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#2

20 May 2019, 06:53

Hi Manish, you don't say what exactly you are trying to do or which commands are slow. Moving from SE to an MP license with more cores will help, but maybe not as much as you think. The real bottleneck for speed is often RAM. If your dataset is 40 GB, then you need at least 40 GB just to hold the data in memory. Running some regressions may require as much as 2-3x this amount of additional memory to compute, and when there isn't enough RAM, the operating system starts swapping to disk, which is roughly 100-1000x slower than working in memory.

The few options I see are:
* Refactor your code to perform analyses in chunks (not always possible)
* You don't say how long the code takes to run, so maybe you can just wait longer, or run the code overnight
* Increase available memory if economical
* Invest in time on a cloud computing cluster and run Stata there (if economical, but not always possible with data sharing agreements)
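
To put rough numbers on the memory point, here is a minimal back-of-the-envelope sketch in Python (not Stata). It assumes the 40 GB dataset and 64 GB of RAM from post #1 and treats the "2-3x additional memory" figure as a simple multiplier; the function name and the overhead factors are illustrative assumptions, not measurements of any particular command.

```python
# Rough memory-headroom check. All figures are illustrative assumptions.


def memory_headroom_gb(data_gb, ram_gb, overhead_factor):
    """Return the RAM left over in GB (negative means swapping is likely).

    overhead_factor is the extra working memory a command needs,
    expressed as a multiple of the dataset size (0 = just holding the data).
    """
    needed_gb = data_gb * (1 + overhead_factor)
    return ram_gb - needed_gb


if __name__ == "__main__":
    # 40 GB of data on a 64 GB machine, with 0x, 2x and 3x extra working memory.
    for factor in (0, 2, 3):
        left = memory_headroom_gb(40, 64, factor)
        status = "fits in RAM" if left >= 0 else "likely to swap to disk"
        print(f"overhead {factor}x: headroom {left:+.0f} GB ({status})")
```

Under these assumptions the data alone fits, but any command needing a multiple of the dataset size as working memory would not, which is when more cores stop helping.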
Manish Gupta

Join Date: Jun 2015

Posts: 11
#3

20 May 2019, 10:13

Hi Leonardo, thank you very much. If you know more about cloud computing clusters, please feel free to share the information. In the meantime, I heard from Stata technical support, and they recommend Stata/MP 4 since it matches the number of cores in my PC.
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#4

20 May 2019, 10:48

I'd be inclined to work with a sample (or samples) of the data: If your 40 GB of data arises from (say) 4 KB of data per observation, then you would have about 10 million observations. In most situations I can imagine, you would get quite good precision with a sample of only (say) 40,000 observations, and I don't think you could gain any great advantage from looking at the entire 10 million observations.

If this possibility interests you, why don't you explain the situation so as to clarify why it's necessary to use all the observations? Perhaps you are doing some kind of administrative work, e.g., looking at 40 GB of insurance claims to detect all instances of error, but if you're doing "scientific" work, trying to characterize the patterned features of some kind of process, then I'd think some kind of sample would be fine. It might be that the problem could be recast as one of how to use Stata to draw a sample from a large data file. Even if you are dealing with a rare event, which leads you to need to examine large numbers of observations, there are likely good sampling strategies that could help.

I've only dealt with something like your situation once (about 15 years ago, when 200,000 observations seemed like a lot!), but we had good luck with a sampling strategy, using repeated analyses of samples whose results we averaged.
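
The repeated-sampling idea can be sketched as follows. This is a minimal illustration in Python rather than Stata, on synthetic data, and the "analysis" is just a sample mean standing in for whatever model would actually be fitted; the function name, seed, and sample sizes are assumptions made for the example.

```python
import random
import statistics


def averaged_sample_estimate(values, sample_size, n_samples, seed=12345):
    """Average an estimate (here the mean) over repeated random samples.

    Returns the averaged estimate and the standard deviation of the
    per-sample estimates, a rough guide to how much the samples disagree.
    Requires n_samples >= 2 and sample_size <= len(values).
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    estimates = [
        statistics.fmean(rng.sample(values, sample_size))  # without replacement
        for _ in range(n_samples)
    ]
    return statistics.fmean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    # Synthetic stand-in for the full data: one million draws from N(50, 10).
    data_rng = random.Random(2019)
    population = [data_rng.gauss(50, 10) for _ in range(1_000_000)]

    estimate, spread = averaged_sample_estimate(population, 4_000, 10)
    print(f"full-data mean:        {statistics.fmean(population):.3f}")
    print(f"averaged over samples: {estimate:.3f} (sd across samples {spread:.3f})")
```

Each sample here is well under 1% of the data, so each pass is cheap, and the spread across samples indicates whether the sample size is large enough for the question at hand.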
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#5

21 May 2019, 12:36

Originally posted by Manish Gupta

Hi Leonardo, thank you very much. If you know more about cloud computing clusters, please feel free to share the information. In the meantime, I heard from Stata technical support, and they recommend Stata/MP 4 since it matches the number of cores in my PC.

I have used the guide posted here with success and am planning some statistical work using one of these large clusters in the near future. The guide is somewhat old, but the general instructions still work.
Manish Gupta

Join Date: Jun 2015

Posts: 11
#6

22 May 2019, 10:52

@Leonardo: Thank you very much. Much appreciated.

@Mike: Many thanks for sharing your experience. The thing is that I am dealing with time-sensitive panel data, and it is important that I use all of the information at this stage. That might change later, but for now I must keep every observation.