  • RAMDisk and Stata to (greatly) improve loading times

    Dear all,

    I am following up on one of my previous posts (http://www.statalist.org/forums/foru...ta-bottlenecks) to propose a way to greatly improve Stata's loading times when opening large datasets.

    The solution I have found (and maybe many of you already use it) is to set up a RAM disk and store the large datasets there.
    Of course, one needs enough RAM to 1) hold the data on the RAM disk and 2) still allocate enough RAM to Stata to use that dataset.

    Assuming that this is the case, I wonder whether any problems could arise from this setup. Will Stata work correctly if I use a RAM disk?
    Could the data become corrupted?

    Many thanks for your suggestions
    Best,
    Last edited by Jean-Luc Morin-Chesnel; 26 Oct 2014, 08:51.

  • #2
    Oh, wow! Flashbacks! I had completely forgotten about RAMDisks! They were a fad back in the 90's and early 2000's, but in practice they were seldom useful on 32-bit OS's because of the limits on RAM. But now, with 64-bit machines that have oodles of accessible RAM, they might be feasible, and this sounds like a near-perfect solution to your problem.

    I'm not too worried about data corruption: if data corruption were common, it would barf pretty quickly with strange errors. For important stuff (final models, etc), you could re-run something without the RAMDisk just to be on the safe side.

    What I would be concerned about is running out of RAM. If I remember correctly, you have a 10 GB dataset, with 32 GB of RAM. So you allocate 11 GB to the RAMDisk. Now you have 21 GB left. The data itself chews up 10 GB, so you have 11 GB left. For *some* procedures, you're totally set. For other procedures involving large matrices in the background, you will quickly use up all available RAM. But this might be a concern even without the RAMDisk -- some models just require huge matrices, and 11 GB free vs 22 GB free might not make a difference.
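
As a rough sketch of that arithmetic (Python used only as a calculator here; the function name is made up for illustration, and the figures are the ones quoted above):

```python
def ram_left_gb(total_gb, ramdisk_gb, dataset_gb):
    """RAM left for everything else once the RAM disk is reserved
    and the dataset is loaded into Stata's memory."""
    after_ramdisk = total_gb - ramdisk_gb   # what the OS and Stata can still use
    return after_ramdisk - dataset_gb       # what remains after the data is in memory

# 32 GB machine, 10 GB dataset, 11 GB RAM disk vs. no RAM disk
with_ramdisk = ram_left_gb(32, 11, 10)
without_ramdisk = ram_left_gb(32, 0, 10)
print(with_ramdisk, without_ramdisk)
```

This ignores what the operating system and other programs use, so the real headroom would be somewhat smaller in both cases.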


    • #3
      Thanks, Ben, for your nice suggestions again! Actually, I often use -collapse- on my large datasets to reduce them to the firm-month level (or the like). The problem is that this command is INCREDIBLY slow. Maybe that is because Stata uses the hard disk to temporarily store some data? In your view, would the RAM disk be of any help for this command?


      • #4
        Have you used the -fast- option of -collapse-? It avoids preserving the dataset.
        Also, I wouldn't go down the RAMDisk route. Usually we don't open a dataset that often, and 1 minute (or half that if you have a faster SSD) is not that much for an operation that is seldom done. If you want to be able to ctrl+d fast when writing some code, just work on a sample of the file for some firms/years and then run on the big dataset at the very end.


        • #5
          I do not know. If you were to do this, you might allocate even more RAM to the RAMDisk and follow these instructions http://www.stata.com/support/faqs/da...ary-directory/ to move the temp directory. The only way to know what works is to try it; I dunno how much of the time is simply loading the data, how much is inevitable due to multiple bottlenecks, and how much moving the temp directory will help. It may hurt, since it might be what I'll call "main RAM" that it needs (and every GB dedicated to the RAMDisk eats into main RAM). But try different settings and see what works. If it is storing lots of little tempfiles, then this might improve things a lot. If it needs one ginormous tempfile, then this won't help.
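
One hedged way to "try it" before changing any Stata settings is to time a plain sequential read of the same file from each location. The sketch below is only an illustration: the helper name is made up, and it measures raw file reads, not Stata's -use-, so treat the numbers as a rough upper bound on what the RAMDisk can save.

```python
import time

def time_read(path, chunk_bytes=64 * 1024 * 1024):
    """Read a file from start to finish in chunks.

    Returns (elapsed_seconds, bytes_read)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:          # end of file
                break
            total += len(chunk)
    return time.perf_counter() - start, total
```

Call it once on the copy stored on the regular disk and once on the copy stored on the RAMDisk, then compare the elapsed times. Keep in mind that the operating system caches recently read files, so a second read of the same on-disk file can look much faster than a cold read.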

          As a side-thought, do you really need to run the entire file all at once every time? Especially if you're collapsing it, that seems like a good time to save the data. So steps 1, 2, and 3 take the full 10 GB and are slow and painful, maybe taking ten or twenty hours to run. But you save at the end of step 3, and from then on you just read that in to do steps 4, 5, and 6. But you have almost certainly explored this option.


          • #6
            Hi Ben, hi Sergio. Thanks for your input.
            Well, I am going to try it soon. I'll keep you posted on this thread.


            • #7
              Follow-up: it works. Loading times were faster, but not dramatically faster. The relative performance increase will, however, depend on your computer, Stata version, and data. Still, given that RAM disk programs are very cheap, I recommend this route.

              Have a nice day everyone
