
  • Import Large File

    Hello guys,

    I am new to this forum and hope you can help me with a problem I am facing with Stata.

    I have a very large .dta file (25 GB) with 12 variables and more than 200,000,000 observations. My question is simply whether there is a way to open this file without crashing my computer and without using external programs (SAS, etc.). If someone who has had similar experiences could help me, I would be very happy.

    Best Wishes

  • #2
    You can read in only the variables and observations you want.

    So, say your dataset were an individual-level file and you were interested in the ages of women; you could say
    Code:
    loc usevars age f
    loc cond "f == 1"
    use `usevars' if  `cond' using "data.dta", clear
    Another option is to read the dataset in chunks, subset/collapse to a manageable size, and stack.
    Code:
    loc stop 10000
    loc mastersize 200000000
    loc n = 1
    forv start = 1(`stop')`mastersize' {
       use in `start'/`stop' using "data.dta", clear
       // keep relevant vars/ collapse
       save "chunk`n'", replace
       loc stop = min(`start'+`stop',`mastersize')
       loc ++n
    }
    clear
    forv i = 1/`n' {
       append using "chunk`i'"
    }
    I've had a reasonable amount of success combining the two approaches to get the original dataset into memory.
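    A rough sketch of what that combination might look like for a single chunk (the variable names, the condition, and the observation range are only placeholders):
    Code:
    loc usevars age f
    loc cond "f == 1"
    * needed variables only, matching observations only, and only the first chunk of rows
    use `usevars' if `cond' in 1/10000000 using "data.dta", clear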
    http://www.nber.org/stata/efficient/ is a good resource for this sort of problem.
    Last edited by Apoorva Lal; 17 Jul 2017, 16:25.

    • #3
      The other point to make here is that the size of the data set should not be causing Stata to crash, regardless of how big it is. If it is too big to fit in the available memory, Stata will not crash: it will halt with the error message "op. sys. refuses to provide memory." What can happen with a very large data set is that it can take a very long time for Stata to read it. Stata does not issue any "progress reports" while it reads the file, so it can easily appear that your computer is hung and that Stata has crashed. But Stata has never crashed on me when reading a large file. I admit I have never tried to read a 25 GB file, but I have gone up to 20 GB, and Stata has always been able to read the file as long as my computer's memory wasn't taken up too much by other open applications. Reading a file of that size does take a long time, though, and can create the appearance of a hung computer.

      That said, Apoorva Lal's advice to read only the observations and variables you actually need is excellent: not only will it save you time reading the file in, but many of your subsequent commands will also execute more quickly.
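
      As a quick check before committing to a full read, you can also inspect the file without loading it; a minimal sketch, using the "data.dta" name from #2 as a stand-in for the real file:
      Code:
      describe using "data.dta"   // lists variables and storage types without loading the data
      di r(N)                     // number of observations in the file
      di r(N)*r(width)            // rough size in bytes once loaded into memory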

      • #4
        In my experience, it's a bit more complicated than that, due to the existence of "page files". There are three scenarios.

        1. The file is smaller than your physical memory. Loading the dataset will be fine.
        2. The file is larger than your physical memory + page file. Stata will provide an error.
        3. The file is larger than your physical memory, but smaller than physical memory + page file. Loading will take forever.

        The page file is basically a section of your hard drive that Windows calls upon when it runs out of physical memory. The downside is that the page file is very slow, often 1000x slower than memory. So if you are in situation 3, Stata will load the data quite quickly into physical memory, see that it is full, and start filling the page file. That last step takes so long that, for all intents and purposes, Stata "crashes". Working with the data will also be extremely slow.

        The solution is to either move to a cluster/server (which can easily have 256 GB of memory), find a way to do your work in pieces (see #2), or move to a line-by-line processing tool such as SQL.
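
        If you suspect you are in situation 3, one option is to cap Stata's memory at roughly your physical RAM so it stops with an error instead of silently spilling into the page file; a minimal sketch (the 16g value is a placeholder for your machine's actual RAM, and -set max_memory- requires Stata 12 or later):
        Code:
        memory               // report current memory usage
        query memory         // show memory settings (max_memory, segmentsize, ...)
        set max_memory 16g   // placeholder: set to about the size of your physical RAM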

        • #5
          I recently found myself in the situation of reading in datasets that were larger than my physical memory allowed (~10 GB). The main issue I had was that the variable storage types were not optimized, so the file size was much larger than necessary. I found this thread helpful for the ideas presented by Apoorva Lal, because the chunking approach made it possible to compress the data in chunks and then reassemble the dataset at a much more reasonable size.

          The code presented above did not work as anticipated, but I rewrote it slightly and verified that it works as a quick solution. I am posting it here for posterity in case someone else comes along with a similar problem.

          Code:
          version 15.1
          
          * start with some example data
          cd "C:/temp"
          input double id str12 x
          1 a
          2 b
          3 c
          4 d
          5 e
          6 f
          7 g
          8 h
          9 i
          10 j
          end
          save "testin", replace
          * end of example data
          
          
          *** This relevant bit of code starts here.
          * modify the following 4 lines accordingly.
          local indata "testin.dta"
          local outdata "testout.dta"
          scalar stepsize = 3  /* number of records to read in at one time. */
          scalar recordstart = 1
          * Set the above parameters accordingly.
          
          describe using "`indata'"
          scalar nrecords = r(N)
          scalar nchunks = ceil( (nrecords - recordstart + 1) / stepsize )
          
          forvalues chunki = 1/`=nchunks' {
             di "`chunki'"
             scalar start = recordstart + ((`chunki' - 1) * stepsize)
             scalar stop = min(start + stepsize - 1, nrecords)
             use in `=start'/`=stop' using "`indata'", clear
            
             // keep relevant vars/ collapse plus compress
             compress /* optional, but suggested */
             save "chunk`chunki'", replace
          }
          
          * assemble the chunks
          clear
          forv i = 1/`=nchunks' {
             append using "chunk`i'"
          }
          save "`outdata'", replace
          *** end of code segment
          
          * verify reassembled data
          list, clean
          Last edited by Leonardo Guizzetti; 08 Mar 2019, 11:49.

          • #6
            Hi,
            Is there any way to load only 2 variables from a huge dataset that will not load because of memory limits, when the data come from a .csv file rather than a .dta file?

            thanks
            Vishal

            • #7
              Yes; if you look at the help for -use-, you will see that the second syntax shown is for exactly that.

              Comment


              • #8
                Originally posted by Vishal Sharma
                Hi,
                Is there any way to load only 2 variables from a huge dataset that will not load because of memory limits, when the data come from a .csv file rather than a .dta file?

                thanks
                Vishal
                See -help import delimited- and, in particular, the colrange() option.
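
                For example, a minimal sketch (the filename and the column and row numbers are placeholders for your own):
                Code:
                * read only columns 2 and 3 of a large .csv, skipping everything else
                import delimited using "bigfile.csv", colrange(2:3) clear

                * rowrange() can additionally limit how many lines are read
                import delimited using "bigfile.csv", colrange(2:3) rowrange(1:1000000) clear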

                • #9
                  thanks!
