Run time lag with large datasets

Robert Eldritch

Join Date: Oct 2017

Posts: 12
#1

03 Jun 2018, 10:26

I'm working with Stata IC (v.15), and it's taking forever to open a dataset with 3 million+ observations and 102 variables. If I upgraded to SE (MP is out of my budget range), would I save a significant amount of time? Or do I just need a faster machine with more RAM (my machine has 4 GB of RAM)? It's taking over half an hour to just load in the dataset (that's if the program doesn't crash beforehand), and it's taking 10 minutes to do simple things like dropping observations.

Thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#2

03 Jun 2018, 10:37

If you run -help limits-, you will see that there aren't many limits on Stata IC that are more stringent than SE/MP. Those limits concern the number of variables, maximum matrix size, and a few other things. But most of the limits are the same. Most important, none of the limits that do differ between the Stata flavors appear to have much relevance to your situation. 102 variables is well below the IC limit. At least so far you aren't building any matrices. The limit on the number of time periods per panel in -xt- might be relevant to you, but it has nothing to do with reading in the file or dropping observations.

So I think that upgrading to SE/MP will probably make little or no difference for what you are doing here. I think this is more a hardware issue. I would also add that while CPU speed and RAM are, in general, the most important factors in performance, when the major issue you have is the time spent in reading (and perhaps later writing) a large file, mass storage access time may well be critical in this situation. I would look for a machine with a very fast disk drive.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

04 Jun 2018, 09:57

To add to Clyde's comment, while it probably won't matter for loading data, MP may be much faster when you come to run regressions or do similar analyses. Please note Clyde's entire comment and not just the last sentence. Once the data have loaded, CPU speed and RAM are likely to make a big difference in processing. Stata is fast partly because it loads all the data into RAM. With insufficient RAM, as I understand it, Stata uses the hard drive, which is much slower.
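
To put rough numbers on the RAM question, here is a back-of-the-envelope sketch in Python (not Stata code). It assumes, hypothetically, that all 102 variables are numeric and share a single storage type, which post #1 does not tell us. The byte widths are Stata's standard numeric storage sizes; the function and constant names are made up for illustration.

```python
# Bytes per value for Stata's numeric storage types.
STATA_NUMERIC_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def dataset_bytes(n_obs, n_vars, storage_type):
    """Rough in-memory size if every variable used the same numeric type."""
    return n_obs * n_vars * STATA_NUMERIC_BYTES[storage_type]


N_OBS = 3_000_000         # "3 million+" in post #1, rounded down
N_VARS = 102              # from post #1
RAM_BYTES = 4 * 1024**3   # the 4 GB machine described in post #1

# Assumption: compare the all-float and all-double cases as lower/upper guesses.
for storage_type in ("float", "double"):
    size = dataset_bytes(N_OBS, N_VARS, storage_type)
    print(f"{storage_type}: {size / 1e9:.2f} GB, "
          f"{size / RAM_BYTES:.0%} of installed RAM")
```

Under those assumptions the data alone come to roughly 1.2 GB as floats or 2.4 GB as doubles, before the operating system and other programs take their share of the 4 GB. String variables or a different mix of storage types would change the figure.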
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

04 Jun 2018, 13:23

Following up on Phil's advice, my understanding is not that Stata uses the "hard drive" (whatever that means in these days of SSD storage) when the operating system cannot allocate sufficient RAM. Rather, the operating system allocates virtual memory, swapping blocks of data from RAM to the "hard drive" and reloading them as needed.

If I'm correct, there are two implications.

First, we should avoid thinking of this as a consequence of Stata code that could be improved. StataCorp bet on Moore's law and chose a memory-only design, then devoted their resources to interesting statistical algorithms, leaving hardware and software evolution to take care of performance and capacity improvement. A good bet, I'd argue.

Second, the effect of virtual memory usage is, I think, much less severe on systems with some or all SSD storage for their "hard drive". Our household includes three laptops with SSD storage and two with magnetic media, and the difference in rebooting, switching users, and launching applications is dramatic.

With that said, I agree with Phil, and if you can cheaply add another 4 GB of RAM to your system, that would be a good place to start. A slow hard drive may make loading your data take a long time, but it should not in and of itself make Stata fail.
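
As a rough illustration of how long a plain sequential read "should" take, here is a small Python sketch. Everything numeric in it is an assumption: the file size is the all-doubles guess of 3,000,000 x 102 x 8 bytes, and the throughput figures are round illustrative values, not measurements of any particular drive. The names are hypothetical.

```python
def read_seconds(file_bytes, throughput_mb_per_s):
    """Idealized sequential read time; ignores seeks, caching, and network hops."""
    return file_bytes / (throughput_mb_per_s * 1_000_000)


# Assumption: every one of the 102 variables is an 8-byte double.
FILE_BYTES = 3_000_000 * 102 * 8

# Illustrative round throughput figures (MB/s), not benchmarks of real devices.
ASSUMED_THROUGHPUT_MB_S = {"magnetic disk": 100, "SSD": 500}

for device, mb_per_s in ASSUMED_THROUGHPUT_MB_S.items():
    print(f"{device}: about {read_seconds(FILE_BYTES, mb_per_s):.0f} s")
```

If an actual load takes far longer than an estimate of this kind, that would point toward memory pressure (swapping) or non-local storage rather than raw read speed, though the real file size and drive behavior may differ considerably from these guesses.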
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#5

04 Jun 2018, 13:48

"A slow hard drive may make loading your data take a long time, but it should not in and of itself make Stata fail."

Well, both I and others on this Forum have experienced a situation where, when writing to a remote drive over a network inside a loop, Stata can overwhelm the buffers used by the operating system and end up aborting in the middle of the program. I suppose this isn't, strictly speaking, a "slow hard drive" making "Stata fail." But it is in that spirit, and it would look that way to the end user.

That said, let me be clear that I agree with both Phil Bromiley and William Lisowski that in general, the key to getting good performance out of Stata is to use a machine with plenty of RAM and a good processor. But when problems specifically arise from I/O processes, the nature of the mass storage becomes a factor as well. The flavor of Stata, or Stata code in general, is rarely the issue.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

04 Jun 2018, 17:50

Clyde is correct: in writing about “hard drive” issues I was thinking only about storage local to the computer on which Stata is running. On reflection, post #1 does not tell us exactly how the dataset is being “loaded” or from where, so we may have overlooked some other possibilities.