
  • Large database: op. sys. refuses to provide memory

    Hello, my database is 19GB, I'm using Stata 12 SE, and my computer runs Windows 7 64-bit with 3GB of RAM. I'm trying to load this large database in Stata and the program shows this:

    Code:
    op. sys. refuses to provide memory
    Stata's data-storage memory manager has already allocated 4g bytes and it
    just attempted to allocate another 32m bytes. The operating system said no.
    Perhaps you are running another memory-consuming task and the command will
    work later when the task completes. Perhaps you are on a multiuser system
    that is especially busy and the command will work later when activity
    quiets down. Perhaps a system administrator has put a limit on what you
    can allocate; see help memory. Or perhaps that's all the memory your
    computer can allocate to Stata.
    r(909);

    and this is the output of query memory:


    Code:
    Memory settings
        set maxvar         2048    (not settable in this version of Stata)
        set matsize        400     10-800; max. # vars in models
        set niceness       5       0-10
        set min_memory     0       0-1600g
        set max_memory     .       32m-1600g or .
        set segmentsize    32m     1m-32g

    What can I do? I need the entire database.

    Thank you
    Last edited by Lina Anaya; 07 Sep 2015, 16:23.

  • #2
    How exactly were you planning to analyze a 19GB database with a system that has only 3GB of RAM in the first place? With the exception of SAS and some of the newer lower-level data processing libraries, analytic software uses an in-memory model for the storage, manipulation, and processing of data. Another issue is using Windows, since it is an unrelenting sociopath with regard to its consumption of compute resources. Even if you have the physical RAM available, the Windows kernel may not allocate that memory, since it will want to waste a ton of memory caching everything on the system.

    Comment


    • #3
      Welcome to Statalist, Lina!

      If you are at a university, inquire if you can access a Remote Desktop Server or Unix Server with Stata MP installed; or failing that, get access to a workstation with sufficient memory and upgrade it to Stata MP. Even with that much capacity, analyses will take hours.

      In similar situations, I've taken a sample of the data. So I'm curious: what are the goals of your analysis, and exactly why do you require the "entire database" to meet those goals?

      Please also note that Stata etiquette is to register with full real names, first and last; you can do this via the CONTACT US button at the bottom right of the page.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2

      Comment


      • #4
        Hello, thank you for your answers

        I just need to know what options I have. We are working with data on Colombian medicare, with almost 35 million observations.

        What I'm doing now is to load the database in fragments with the command:

        Code:
        use [varlist] [if] [in] using filename [, clear nolabel]

        It takes time, but at least it is working. If you know of something more efficient, let me know.
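
        For example, to read only a few variables and a block of observations at a time (the file and variable names below are hypothetical, just for illustration):

        Code:
        * read three variables and the first 5 million observations only
        use patient_id diagnosis cost in 1/5000000 using claims.dta, clear
        * ... work on this piece, then save it
        save chunk1.dta, replace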

        Thank you

        Comment


        • #5
          As Steve said, you'd need access to a machine with adequate RAM to access the whole dataset. A few years ago, 24 or 32 GB RAM was expensive and unusual, but nowadays, such a computer is well within reach of most institutions/researchers.

          Short of that, your current approach seems to be the best you can do. I'd be mildly concerned about how the database is sorted -- whether chunks taken this way lead to the equivalent of a random sample.

          Comment


          • #6
            I echo the sentiments here. To examine the data set first, I prefer the sample command, which randomly draws a percentage of the observations as a subset. I'm not sure whether this is feasible if you can't even open the data set.

            Code:
            webuse nlswork, clear
            sample 1 // random draw of 1% of the sample
            describe, short

            Nathan E. Fosse, PhD
            [email protected]

            Comment


            • #7
              Thank you so much for your answers! I'll take all your advice into consideration, and I'll let you know if I have any other trouble.
              PS: at least it is working now.

              Comment


              • #8
                Thanks for changing your registration, Lina.


                Although you read in a chunk at a time, the size of the database will still be 19GB, still much too big for your setup, unless there is something about your use statement that you haven't mentioned.

                You also haven't responded to my previous question: what analysis do you plan that cannot be done on a sample? One problem, of course, is the possibility of multiple records per person, so that a simple random or systematic sample of records would sample individuals with probability proportional to their number of records. That number cannot be estimated from the sample alone, so the sampling weight for an analysis of individuals cannot be calculated. However, with a little effort, I believe that problem can be overcome; one way is sketched below.
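
                A minimal sketch of that idea, assuming a numeric person identifier (person_id and the file name are hypothetical) whose last digits are essentially arbitrary: sample id values rather than records, so every record for a sampled person is kept and no record-count weight is needed.

                Code:
                * keep all records for roughly 1% of persons
                use if mod(person_id, 100) == 0 using claims.dta, clear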

                Another possibility: once, when I was confronted with a very large administrative database, I asked the issuing agency to take a sample to my specifications. They did! I suggest that you ask as well.

                Nathan: sample is not a solution, because it requires sorting of the data, something infeasible here. There is a sequential method for drawing a simple random sample without replacement that does not require sorting (Chromy, 1979). However, these databases are frequently ordered informatively (e.g., by date of entry for an individual or date of event), so I'd prefer a systematic sample.
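
                A minimal sketch of such a systematic sample, assuming (as I believe holds) that _n in the if qualifier of use refers to a record's position in the file being read, and using a hypothetical file name:

                Code:
                set seed 20150908
                local start = 1 + int(runiform()*100)    // random start in 1..100
                * keep every 100th record (a 1% systematic sample) without
                * loading or sorting the full file
                use if mod(_n - `start', 100) == 0 using claims.dta, clear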

                Reference:
                Chromy, James R. 1979. Sequential sample selection methods. Proceedings of the Survey Research Methods Section of the American Statistical Association: 401-406.

                http://www.amstat.org/sections/srms/...s/1979_081.pdf
                Last edited by Steve Samuels; 08 Sep 2015, 15:46.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2

                Comment


                • #9
                  Steve Samuels Lina Anaya could sample records from the database itself (depending on the database) by using SQL features (e.g., GROUP BY, window functions, hierarchical queries, etc.) to pull in something a bit more manageable. Another potential issue is that the data simply are stored badly. At a place I formerly worked, I ran into cases where my predecessor had created numerous SAS datasets that were ~4-6GB in size, but after converting a file to Stata with StatTransfer the resulting size was typically around 1GB. That may not be the case here, but it could be worth considering as a solution that would work without additional hardware.
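
                  If the source data do sit in such a database, Stata's odbc load command can pull a manageable subset directly. A minimal sketch, in which the DSN, table, and column names are all hypothetical:

                  Code:
                  * pull one year of records through an ODBC data source named claims_db
                  odbc load, exec("SELECT patient_id, diagnosis, cost FROM claims WHERE year = 2014") dsn("claims_db") clear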

                  That said, getting better hardware could definitely help Lina Anaya and would also provide increased capacity and efficiency in the longer term.

                  Comment


                  • #10
                    The thing is that we want to clean the data before making any calculations in Stata; the database is in .dta format. We want to clean the information first because it's horrible: it has typing errors, illogical values, etc. That's why we need the entire database before taking samples or doing anything else.
                    As a result, I decided to fragment the database so I can clean it piece by piece; I know it takes time, but I don't have a better idea right now.
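
                    A minimal sketch of that fragment-and-clean loop (the file name, variable names, cleaning rules, chunk size, and the assumption of about 35 million observations are all illustrative):

                    Code:
                    local chunk 5000000
                    forvalues i = 1/7 {
                        local first = (`i' - 1)*`chunk' + 1
                        local last = `i'*`chunk'
                        if `i' == 7 local last L    // L = last obs, in case the file is not exactly 35m long
                        use in `first'/`last' using claims.dta, clear
                        * example cleaning rules for typing errors and illogical values:
                        replace cost = . if cost < 0
                        drop if missing(patient_id)
                        save clean_chunk`i'.dta, replace
                    }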

                    Thanks for your comments.

                    Comment


                    • #11
                      Lina Anaya If it is a Stata data file (e.g., it has a .dta file extension), it must have been read into memory at some point in the past. That said, if the big issue at hand is cleaning the data, you'd be better served by exporting the dataset to JSON and using something like Storm/Spark, or by pushing it into a database as a partitioned parallel table (assuming Oracle) to distribute the cleaning workload. Stata does some fairly impressive data cleaning, but if you're trying to clean on a massive scale, your life would be easier leveraging technologies that are more specialized for those types of tasks.

                      Comment
