  • Best way to improve processing speed for large datasets (~3 GB)

    How much will more RAM help me with processing speed?

    I am working with a dataset of 87 million records that is about 3 GB in size. It is a 20-year cohort of people, and my datasets contain 7 columns of data corresponding to their health events and the dates of those events. It would be difficult to collapse any further (I could, of course, separate them into years, but I would rather improve processing power than add those extra steps).

    For example, today I tried to run -bsample- to sample 800,000 records out of my 87,000,000, and after an hour it still had not completed. I see similarly long waits when I try to flag whether a string contains any of a list of 20 ICD codes in this database: upwards of an hour while the computer sounds like it is about to take off like an airplane. (A rough sketch of that flagging step follows this post.)

    I have 7 GB of available RAM and it is always running at max during these churns. I currently have only about 50 GB of free space on my C: drive (it is mostly full), and much of what is on it I cannot delete because it consists of records and datasets from other projects.

    What will make the most difference for me? More RAM ($$$) or more HD space?

    This is an ongoing project and I will be working with these datasets for at least a year, so I need a fix. I am using Stata 17 SE.

    Thanks
    Last edited by Richard Golonka; 28 Jan 2023, 17:30.
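
    A minimal sketch of the ICD-flagging step mentioned above, assuming the diagnosis codes sit in a string variable; the variable name diag, the toy data, and the five codes listed are placeholders rather than anything from the actual project:

    Code:
    * toy data so the sketch runs on its own
    clear
    set obs 10
    gen str20 diag = cond(mod(_n, 3) == 0, "I219 E119", "Z000")
    
    * flag any record whose string contains one of the listed ICD codes
    local codes "I21 I22 I50 E11 J44"          // stand-ins for the real 20 codes
    gen byte flag = 0
    foreach c of local codes {
        replace flag = 1 if strpos(diag, "`c'")
    }
    
    * one-pass alternative using a regular expression
    gen byte flag2 = ustrregexm(diag, "I21|I22|I50|E11|J44")

    Both variants produce the same flag on the toy data; which one is faster at 87 million rows would need to be timed.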

  • #2
    What you are describing does not sound right. On a 32 GB Dell laptop, the operation you describe takes about 93 seconds.

    Code:
    . set obs 87000000
    Number of observations (_N) was 0, now 87,000,000.
    
    . for num 1/7: gen xX = rnormal()
    
    ->  gen x1 = rnormal()
    
    ->  gen x2 = rnormal()
    
    ->  gen x3 = rnormal()
    
    ->  gen x4 = rnormal()
    
    ->  gen x5 = rnormal()
    
    ->  gen x6 = rnormal()
    
    ->  gen x7 = rnormal()
    
    . timeit 1: bsample 800000
    
    . timer list
       1:     93.43 /        1 =      93.4260

    The only rational explanation I can think of is that your computer runs out of RAM and starts using the hard drive, at which point everything becomes very slow.
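
    If swapping is the suspected culprit, Stata's own memory report shows how large the workspace actually is, and the relevant settings can be inspected the same way. A minimal sketch; the 6g cap is purely an illustrative value, not a recommendation made in this thread:

    Code:
    memory                 // report how much memory Stata's data and overhead use
    query memory           // show max_memory, segmentsize, and related settings
    * optionally cap Stata's allocation below physical RAM, so it stops with an
    * error instead of growing into the operating system's swap space
    set max_memory 6g

    With only 8 GB installed, the cap mainly serves to make the shortage visible rather than to solve it.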



    • #3
      It does run out of RAM. When it is doing many operations on a dataset of this size, the usable RAM of 7.75 GB runs at its maximum according to Task Manager.

      I heard that when Stata runs out of RAM, it uses HD space, which in my case is also dwindling (though still 50 GB free).

      I just did not know whether using HD space is what makes it run slowly, whether my free HD space is too small, or whether it is simply a RAM issue.

      It sounds like it is RAM. I have made my case for a 16 GB upgrade, so I hope that will do the trick. I am using a Dell Latitude 7440 with an Intel i5-6300 at 2.4 GHz, 8 GB RAM (7.76 GB usable), 64-bit.
      Last edited by Richard Golonka; 31 Jan 2023, 02:01.



      • #4
        Originally posted by Richard Golonka View Post
        It does run out of RAM. [...] It sounds like it is RAM. I have made my case for a 16 GB upgrade, so I hope that will do the trick.
        I am surprised to learn that for a 3 GB dataset (relatively small), with almost 8 GB of RAM available, Stata has to resort to the HD...

        But yes, if Stata resorts to the HD, everything becomes super slow. Working with the HD is orders of magnitude slower than working with RAM.



        • #5
          While you are waiting for your upgrade, you can try doing manually what -bsample- does for you. It might be faster than Stata's -bsample-.

          This note explains how to sample with replacement and how to reshuffle (a rough sketch of the general idea follows the link):

          https://journals.sagepub.com/doi/pdf...867X1101000410
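
          For concreteness, here is one way such a do-it-yourself draw with replacement might look, recording how often each row is selected rather than physically duplicating rows. The sizes and the variable names row, y, and w are placeholders, and this is only a sketch of the general idea, not the procedure from the linked note:

          Code:
          set seed 12345
          clear
          set obs 1000000                                // stand-in for the real dataset
          gen long row = _n
          gen y = rnormal()
          
          local n = 80000                                // bootstrap sample size
          preserve
          clear
          set obs `n'
          gen long row = 1 + floor(runiform()*1000000)   // n random row indices, 1.._N
          contract row, freq(w)                          // w = how often each row was drawn
          tempfile picks
          save `picks'
          restore
          
          merge 1:1 row using `picks', keep(master match) nogenerate
          replace w = 0 if missing(w)                    // rows never drawn get weight 0
          summarize y [fw=w] if w > 0                    // analyze the sample via frequency weights

          Because nothing is expanded or dropped, peak memory stays close to the original dataset plus one weight variable; whether this beats -bsample- in practice would have to be timed.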



          • #6
            I ran Kolev's test while watching "top" display the memory used. The workspace takes 2.9 GB before -bsample- starts but quickly grows to 7.4 GB while -bsample- is running. It is not obvious why -bsample- would require any additional memory, but it may make a copy of the entire dataset at higher precision for some reason. The source code is available for examination in ado/base/b/bsample.ado in every Stata installation, but a quick look does not show me what is happening. Do-it-yourself may be the best way to avoid this resource limitation.
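
            For anyone who wants to take that look themselves, the file can be located and opened from inside Stata:

            Code:
            which bsample                 // show which bsample.ado Stata will run
            viewsource bsample.ado        // display its source in the Viewer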



            • #7
              The user-written -gsample- command (-ssc describe gsample-) might be worth trying here. I tried sampling 1e5 observations from a dataset containing 1e7 observations, and -gsample- ran 20X (!) faster than -bsample-. (This timing was done *after* the first call to -gsample-, which was slow, presumably reflecting some time to compile and load various pieces of it.) I don't know whether this saving would generalize to the memory-bound problem described above. The -generate- option on -gsample-, which creates a frequency variable rather than actually dropping observations, might be useful here too (a sketch of that variant follows the timing code below).

              Code:
              * -gsample- must be installed once: ssc install gsample
              clear
              set obs 10000000
              gen id = _n
              timer clear 1
              timer on 1
              // Un-comment one of the following to compare.
              // bsample 100000
              gsample 100000
              timer off 1
              timer list 1
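
              And a hedged sketch of the -generate()- variant mentioned above, which keeps all observations and records how often each was drawn; the exact option and variable names should be checked against -help gsample- before relying on them:

              Code:
              gsample 100000, generate(freq)
              summarize id [fw=freq] if freq > 0    // analyze the sample through frequency weights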
