  • Importing large csv file into Stata

    Hi,
    I have multiple large (6 GB) csv files that I am trying to import into Stata. Is there a way to import only a sample (random rows) of the original data from a csv file into Stata?

    One option is to write a loop that reads the data in pieces with rowrange() and appends the resulting datasets, as sketched below. But is there a function or package that can do this directly?
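
    A rough sketch of what I mean (the file name, the 1,000,000-row block size, the block count, and the 1% sampling rate are all placeholders; how the header row interacts with rowrange() would also need checking):
    Code:
    clear
    tempfile building                                  // accumulates the sampled rows
    forvalues block = 1/10 {
        local first = (`block' - 1) * 1000000 + 1
        local last  = `block' * 1000000
        import delimited using "bigfile.csv", ///
            rowrange(`first':`last') varnames(1) clear
        keep if runiform() < 0.01                      // keep roughly 1% of this block
        if `block' == 1 {
            save `building', replace
        }
        else {
            append using `building'
            save `building', replace
        }
    }
    use `building', clear                              // the combined random sample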

  • #2
    The -chunky- package might help you with some of this, as it automates the breaking of a large text file into smaller chunks. Try -ssc describe chunky-. With -chunky-, you'd still need to pull samples out of the chunks it creates. You might be better off with a loop as you describe, though, since the programming effort is fairly minimal.
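
    For the sampling step, once the chunks exist as separate csv files, something along these lines would do (the file names piece_1.csv ... piece_20.csv and the 5% sampling rate are hypothetical, not what -chunky- actually produces):
    Code:
    * sample each chunk, then combine the sampled pieces
    forvalues i = 1/20 {
        import delimited using "piece_`i'.csv", varnames(1) clear
        keep if runiform() < 0.05             // keep roughly 5% of each chunk
        save sampled_`i'.dta, replace
    }
    use sampled_1.dta, clear
    forvalues i = 2/20 {
        append using sampled_`i'.dta
    }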



    • #3
      Thank you Mike.

      I solved the problem by importing the csv file into Stata on a computer with more RAM. To reduce the data so that I can use it on my PC, I did the following:
      Code:
      * halve the sample eight times, keeping roughly 1/256 of the observations
      forvalues parse = 1/8 {
          gen sample_`parse' = rnormal(0,1)    // one standard-normal draw per observation
          quietly sum sample_`parse'
          drop if sample_`parse' > r(mean)     // drop about half of the observations
          drop sample_`parse'
      }



      • #4
        I'm glad you have a workable solution. Just as a teaching point, each pass through the loop cuts the sample approximately in half: the normal distribution is symmetrical, so roughly half of the random variates fall at or below the mean and are retained. So you could have accomplished essentially the same thing more simply by:
        Code:
        keep if runiform() < 2^(-8)
        That wouldn't give you the exact same subset you got, but it would give you a subset of essentially the same size, namely 1/256th of the original.
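
        One small usage note: setting the seed beforehand makes the random subset reproducible, for example:
        Code:
        set seed 12345                  // any fixed seed; makes the draw repeatable
        keep if runiform() < 2^(-8)     // retains roughly 1/256 of the observations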



        • #5
          Continuing on this topic: are there any new ways of opening a HUGE csv file that my computer can't handle? I have Stata MP 15 and I get the following error message:

          Code:
          . import delimited D:\Vishal\synthetic_opioid_project\LAB_OUT.CSV
          op. sys. refuses to provide memory
          Stata ran out of room to track where observations are stored. Right now, Stata has 1236m bytes
          allocated to track observations. Stata requested an extra 1m bytes and the operating system said no.
          Stata is currently tracking 648019968 observations and was asked to track 648019968. You are up
          against the memory limits of this computer.
          an error occurred while writing data
          r(198);

          Any thoughts on how to open this up? My plan is to run a few simple commands on only 2 variables in the dataset, but I think I have to do this in chunks.

          thanks
          Vishal



          • #6
            Originally posted by Vishal Sharma
            any new ways of opening a HUGE csv file that my computer can't handle . . . my plan is to run a few simple commands on only 2 variables in the dataset . . .
            Have you tried the colrange() option of the import delimited command?

            If limiting to two variables still isn't enough paring, then you could use it in conjunction with the command's rowrange() option to get digestible portions.
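
            A rough sketch of combining the two (the column range 3:4, the 10-million-row blocks, the block count, and the final summarize are all placeholders; it also keeps only a random 1% from each block so the combined file stays small, and the header row's interaction with rowrange() is worth checking on a small block first):
            Code:
            clear
            tempfile kept                              // accumulates the reduced blocks
            forvalues block = 1/65 {
                local first = (`block' - 1) * 10000000 + 1
                local last  = `block' * 10000000
                import delimited using "D:\Vishal\synthetic_opioid_project\LAB_OUT.CSV", ///
                    colrange(3:4) rowrange(`first':`last') varnames(1) clear
                keep if runiform() < 0.01              // sample each block (or run your commands block by block here)
                if `block' == 1 {
                    save `kept', replace
                }
                else {
                    append using `kept'
                    save `kept', replace
                }
            }
            use `kept', clear
            summarize                                  // or whatever simple commands you need on the 2 variables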



            • #7
              I've tried rowrange() and still get the memory error message.

