Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Computer says 'no' - How to allocate more memory for STATA

    [Please note: I've looked at a few other forums with similar problems, but the solutions don't seem to solve my problem]

    I'm a PhD student working with very large data sets (healthcare data), ranging from 20-50GB .dta files.

    I need to clean my data and as such I am attempting to remove patients in the master file (20GB) that do not appear in my user file (50KB) based on a common variable (drug ID) via the -merge- function. I have 8 different user files to merge separately with the master file (so I can produce 8 different cohort groups). I have successfully done this with 3 user files, but upon attempting with the 4th user file I receive the following error message:

    op. sys. refuses to provide memory r(909);
    Stata's data-storage memory manager has already allocated 90999619608 bytes and it just attempted to allocate
    another 32m bytes. The operating system said no. Perhaps you are running another memory-consuming task and
    the command will work later when the task completes. Perhaps you are on a multiuser system that is
    especially busy and the command will work later when activity quiets down. Perhaps a system administrator
    has put a limit on what you can allocate; see help memory. Or perhaps that's all the memory your computer
    can allocate to Stata.

    r(909);
    • I am not running any other task
    • I am on an unlimited user network
    • -query memory- and max_memory is (.) which means all the memory the operating system is willing to supply.
    Other details:
    • I am using Microsoft Windows 10 Enterprise with a 64-bit operating system and 32GB RAM (31.8GB usable). My local disk has 351GB free
    • I am using STATA SE 16 (64 bit)
    • The master file (20GB is stored and opened on my laptop from an external hard drive)
    • Attempted to compress the master file (0 bytes saved)
    I am aware this is an operating system problem and not STATA specifically. However, my question is has anyone else had this problem in their use of STATA and overcome it? Is there a way to get around this error and allocate more memory? I'm not an IT specialist, but it looks like I still have enough RAM left?
    Any help or advice would be much appreciated.


    Thank you.



  • #2
    Allocating 90999619608 bytes is a little over 84 GB of storage, this plus your mention of a network configuration suggests to me in that Stata is being run on the server and does not use your local machine memory. Either way, it's not really important -- the crux of the issue is that you have exhausted available memory and Windows can't produce any more for Stata to use.

    Generic advice follows, in roughly decreasing order of ease:
    • reduce the dataset to only the variables and subset absolutely necessary using drop/keep;
    • really examine why your program requires 3x more memory for your operations (perhaps there's a very inefficient merge taking place, or you have a lot of data stored in -frames- that aren't essential to maintain in memory);
    • run your program during a less busy time;
    • ask your system administrator to temporarily allocate more memory to the task;
    • investigate whether you can use resampling to produce valid results for estimation commands, rather than using the full sample.

    Comment


    • #3
      The following code will reduce your 20GB primary datasaet to a dataset containing just the drugID and an observation identifier.
      Code:
      use drugID using primary, clear
      generate seq = _n
      save smallprimary, replace
      You can merge this, rather than the 20GB dataset, in your existing code, and then merge the result back to the primary dataset.

      Comment


      • #4
        A minor modification to William's code:

        replace

        Code:
        generate seq = _n
        with

        Code:
        generate `c(obs_t)' seq = _n

        Comment


        • #5
          Thanks Daniel and William, unfortunately. I still have the same r(909) error when I attempted to merge the smaller dataset back into the primary as outline in #3.

          Unless there is another solution, I think I need to examine the merge taking place and why 3x more memory is required. I'm not very familiar with -frames- and if 'removing data stored as frames' is event possible or a method to try.

          Comment


          • #6
            I need to clean my data and as such I am attempting to remove patients in the master file (20GB) that do not appear in my user file (50KB) based on a common variable (drug ID) via the -merge- function.
            If you are able to load the 20GB file, no need to merge. Try appending.

            Code:
            *OPEN 50KB FILE
            use file50kb, clear
            keep drug_ID
            contract drug_ID, freq(master)
            replace master=0
            tempfile using
            save `using', replace
            
            *OPEN 20GB FILE
            use file20gb, clear
            gen master=1
            append using `using'
            bys drug_ID (master): replace master=master[1]
            keep if !master

            Comment


            • #7
              Andrew, this seemed to work! My data is so vast (over 100m observations). It's difficult to check the accuracy, but I'm beginning to do so.

              However, could you please explain/teach me what the steps above have involved exactly and why it is better than my original method?

              Comment


              • #8
                Code:
                webuse grunfeld, clear
                keep if time<=5 & company<=5
                l, sepby(company)
                Consider the dataset below created from the code above as your master dataset.

                Code:
                 
                . l, sepby(company)
                
                     +--------------------------------------------------+
                     | company   year   invest   mvalue   kstock   time |
                     |--------------------------------------------------|
                  1. |       1   1935    317.6   3078.5      2.8      1 |
                  2. |       1   1936    391.8   4661.7     52.6      2 |
                  3. |       1   1937    410.6   5387.1    156.9      3 |
                  4. |       1   1938    257.7   2792.2    209.2      4 |
                  5. |       1   1939    330.8   4313.2    203.4      5 |
                     |--------------------------------------------------|
                  6. |       2   1935    209.9   1362.4     53.8      1 |
                  7. |       2   1936    355.3   1807.1     50.5      2 |
                  8. |       2   1937    469.9   2676.3    118.1      3 |
                  9. |       2   1938    262.3   1801.9    260.2      4 |
                 10. |       2   1939    230.4   1957.3    312.7      5 |
                     |--------------------------------------------------|
                 11. |       3   1935     33.1   1170.6     97.8      1 |
                 12. |       3   1936       45   2015.8    104.4      2 |
                 13. |       3   1937     77.2   2803.3      118      3 |
                 14. |       3   1938     44.6   2039.7    156.2      4 |
                 15. |       3   1939     48.1   2256.2    172.6      5 |
                     |--------------------------------------------------|
                 16. |       4   1935    40.29    417.5     10.5      1 |
                 17. |       4   1936    72.76    837.8     10.2      2 |
                 18. |       4   1937    66.26    883.9     34.7      3 |
                 19. |       4   1938     51.6    437.9     51.8      4 |
                 20. |       4   1939    52.41    679.7     64.3      5 |
                     |--------------------------------------------------|
                 21. |       5   1935    39.68    157.7    183.2      1 |
                 22. |       5   1936    50.73    167.9      204      2 |
                 23. |       5   1937    74.24    192.9      236      3 |
                 24. |       5   1938    53.51    156.7    291.7      4 |
                 25. |       5   1939    42.65    191.4    323.1      5 |
                     +--------------------------------------------------+
                Now, let us pick a dataset with companies 4 and 5


                Code:
                webuse grunfeld, clear
                keep if time<=5
                keep if inrange(company, 4, 5)
                l, sepby(company)
                Code:
                 l, sepby(company)
                
                     +--------------------------------------------------+
                     | company   year   invest   mvalue   kstock   time |
                     |--------------------------------------------------|
                  1. |       4   1935    40.29    417.5     10.5      1 |
                  2. |       4   1936    72.76    837.8     10.2      2 |
                  3. |       4   1937    66.26    883.9     34.7      3 |
                  4. |       4   1938     51.6    437.9     51.8      4 |
                  5. |       4   1939    52.41    679.7     64.3      5 |
                     |--------------------------------------------------|
                  6. |       5   1935    39.68    157.7    183.2      1 |
                  7. |       5   1936    50.73    167.9      204      2 |
                  8. |       5   1937    74.24    192.9      236      3 |
                  9. |       5   1938    53.51    156.7    291.7      4 |
                 10. |       5   1939    42.65    191.4    323.1      5 |
                     +--------------------------------------------------+
                As your goal is to keep observations in dataset 1 of companies in dataset 2, we only need one instance of the company identifier per firm. contract will do this, see

                Code:
                help contract
                Secondly, I create an indicator called master (=1 if the appended observation is from the master dataset and 0 if it is from the using dataset).

                Code:
                webuse grunfeld, clear
                keep if time<=5
                keep if inrange(company, 4, 5)
                contract company, freq(master)
                replace master=0
                So the above yields

                Code:
                . l
                
                     +------------------+
                     | company   master |
                     |------------------|
                  1. |       4        0 |
                  2. |       5        0 |
                     +------------------+
                If I generate the same variable "master" in the master dataset and append the above, the company observations will be appended, all other variables in the master dataset and not in the using dataset will have missing values and the variable master will allow me to identify what observations are from the using dataset and what are from the master dataset. Sorting, 0s are sorted before ones, so if I replace the values of the variable master with the first sorted value, then those in both the master and using will have the value zero and the rest, one.

                Code:
                webuse grunfeld, clear
                keep if time<=5
                keep if inrange(company, 4, 5)
                contract company, freq(master)
                replace master=0
                tempfile using
                save `using'
                webuse grunfeld, clear
                keep if time<=5 & company<=5
                gen master=1
                append using `using'
                bys company (master): replace master= master[1]
                l, sepby(company)

                Code:
                . l, sepby(company)
                
                     +-----------------------------------------------------------+
                     | company   year   invest   mvalue   kstock   time   master |
                     |-----------------------------------------------------------|
                  1. |       1   1936    391.8   4661.7     52.6      2        1 |
                  2. |       1   1937    410.6   5387.1    156.9      3        1 |
                  3. |       1   1938    257.7   2792.2    209.2      4        1 |
                  4. |       1   1939    330.8   4313.2    203.4      5        1 |
                  5. |       1   1935    317.6   3078.5      2.8      1        1 |
                     |-----------------------------------------------------------|
                  6. |       2   1935    209.9   1362.4     53.8      1        1 |
                  7. |       2   1939    230.4   1957.3    312.7      5        1 |
                  8. |       2   1937    469.9   2676.3    118.1      3        1 |
                  9. |       2   1938    262.3   1801.9    260.2      4        1 |
                 10. |       2   1936    355.3   1807.1     50.5      2        1 |
                     |-----------------------------------------------------------|
                 11. |       3   1939     48.1   2256.2    172.6      5        1 |
                 12. |       3   1935     33.1   1170.6     97.8      1        1 |
                 13. |       3   1937     77.2   2803.3      118      3        1 |
                 14. |       3   1936       45   2015.8    104.4      2        1 |
                 15. |       3   1938     44.6   2039.7    156.2      4        1 |
                     |-----------------------------------------------------------|
                 16. |       4      .        .        .        .      .        0 |
                 17. |       4   1939    52.41    679.7     64.3      5        0 |
                 18. |       4   1935    40.29    417.5     10.5      1        0 |
                 19. |       4   1938     51.6    437.9     51.8      4        0 |
                 20. |       4   1936    72.76    837.8     10.2      2        0 |
                 21. |       4   1937    66.26    883.9     34.7      3        0 |
                     |-----------------------------------------------------------|
                 22. |       5      .        .        .        .      .        0 |
                 23. |       5   1938    53.51    156.7    291.7      4        0 |
                 24. |       5   1939    42.65    191.4    323.1      5        0 |
                 25. |       5   1935    39.68    157.7    183.2      1        0 |
                 26. |       5   1937    74.24    192.9      236      3        0 |
                 27. |       5   1936    50.73    167.9      204      2        0 |
                     +-----------------------------------------------------------+
                Therefore, you can just drop observations with master=1 to retain the companies (drugs in your case) in both files. And eventually the observations from the using dataset that have missing values in all other variables.


                why it is better than my original method?
                A combination of including information that is not needed (note that all I needed were the company and dataset identifiers) and the merge code will do some other things that are memory and time-consuming.
                Last edited by Andrew Musau; 02 Jun 2021, 07:48.

                Comment


                • #9
                  Andrew, thank you so much. As mentioned in #1 I had successfully used -merge- on 3 user files. I went back and used your code on these 3 user files, compared the results and it was the same. Thank you again!

                  Comment

                  Working...
                  X