  • Importing a large CSV file into Stata

    Dear All,

    The following thread, discussing the same issue, points to a useful command, -chunky-, that can be used to cut a large CSV file into smaller pieces, which may then be processed separately.

    This is helpful, but I wonder whether there is any way to tell it to keep certain groups of observations together (even at the expense of more variable chunk sizes). In other words, I am looking for a by(hhid) or similar option that would say that (in this case) I want the observations (persons) belonging to the same household to stay together in the same chunk. The source data is grouped, but not sorted (meaning you can have A,A,C,C,C,C,D,D,B,B,B,B,B, but not A,C,A,B...).

    I expect the size of any one group (e.g. 1 MB of data) to be negligible compared to the size of a chunk (e.g. 1 GB of data).

    Is there an option I am overlooking in -chunky-, or is there another, more recent command that is usable for this purpose?

    PS: I understand I could re-import each chunk, combine them, and save by groups, but I would rather find a more ready-made solution.

    Thank you, Sergiy

  • #2
    How large is the CSV file? I've imported files on the order of 4 GB in the past without much issue beyond a bit of waiting.



    • #3
      Hello Leonardo,

      you've asked "How large is the CSV file?".

      Unfortunately I can't say, because today the files are one size and tomorrow they will be another. Likewise, I am working with one subset of the data, while the guy next door works with a different one. So the code must handle files of arbitrary size. If it matters, this is machine-generated data.

      To be specific, we can be relatively certain that:

      - the files are going to be larger than the memory size of any one computer processing them,
      - they will fluctuate between 10 GB and 20 GB most of the time, and
      - they will not be more than about 250 GB in size (unpacked) (and if they are, there will be other challenges I'll have to face).

      Thank you, Sergiy



      • #4
        Hi Sergiy,

        I did notice the other day that an older, simpler "chunky8" algorithm is up on SSC. I didn't post at the time, because digging around in the source code isn't really a "ready-made" solution. However, seeing as there is little activity on this thread, I thought I might point this out in case you are looking for a jump start on your own solution.

        I don't really know what your background is, only that you've been posting here much longer than I have and that I've been relying on some of your posts about low-level graphics recently, so I know you know your Stata very well. If you're comfortable with C or C++, I bet you could get a very fast, general solution to this problem that should be orders of magnitude faster than what you would get out of Stata.

        I bet you could even multithread the algorithm, since the order of the rows doesn't matter. Just have each thread treat the row index of the original CSV like a queue: every time a thread finishes, it locks the index, saves it in a temporary variable within the scope of the thread, increments the index, frees the index so that another thread can use it, and then runs the algorithm on the saved row.

        The algorithm itself seems straightforward enough: read each line at the current index into memory one at a time, then (1) convert it to a vector or array, (2) find the element at the index that corresponds to the group affiliation, (3) write the original line to disk in the file associated with that group, and (4) dispose of the data before reading in the next line. The algorithm should run in linear time with respect to the number of lines in the CSV, and since you are already very close to the metal, the big bottleneck is the SATA cable, over which you will need to perform a read and a write for every line. I bet you could get something very fast like this.
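
        To pin down the structure, here is roughly what the sequential core of steps (1) through (4) could look like. I've sketched it in Mata only so it can be run as-is in Stata; a C or C++ version would follow the same pattern. The file names (big.csv, chunk_*.csv) are placeholders, and I'm assuming the group id is the first comma-separated field on every line.

        Code:
        mata:
            // split big.csv into one output file per group id, where the id is
            // the first comma-separated field of each line
            fin     = fopen("big.csv", "r")
            handles = asarray_create()          // maps group id -> open output handle
            while ((line = fget(fin)) != J(0,0,"")) {
                // steps (1)-(2): pull out the group field (by substring here,
                // rather than splitting the whole line into an array)
                id = substr(line, 1, strpos(line, ",") - 1)
                // step (3): write the original line to the file for this group,
                // opening that file the first time the group is seen
                if (!asarray_contains(handles, id)) {
                    asarray(handles, id, fopen("chunk_" + id + ".csv", "w"))
                }
                fput(asarray(handles, id), line)
                // step (4): line is simply overwritten on the next pass
            }
            fclose(fin)
            // close every per-group output file
            for (loc=asarray_first(handles); loc!=NULL; loc=asarray_next(handles, loc)) {
                fclose(asarray_contents(handles, loc))
            }
        end

        One practical caveat: this keeps one output file open per group, so with a huge number of distinct ids you would instead hash each id into a fixed number of chunk files, which still guarantees that all rows of a household land in the same chunk.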

        Just my two cents.

        Edit: Actually, if it were me and I were not under any serious time constraints, I'd probably use this as an excuse to pick up some Rust. Haskell-like functional programming for low-level hardware applications? Sign me up!
        Last edited by Daniel Schaefer; 14 Oct 2022, 14:09.



        • #5
          I'm a very modest Mata programmer, but it's not bad for writing a "while not end of file: read a line, inspect it, write a line" program, as its fget() and fput() functions will read/write one line of a text file at a time. If this is of any interest, I may have some code around that will illustrate that. Since this is a pretty low-level operation, with run time presumably determined by physical read/write factors, I suspect it would be about as fast as something in a low-level language.
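
          For instance, a minimal version of that loop, copying just one household's lines into a new file, might look like the sketch below. It assumes the id is the first comma-separated field of each line; big.csv and hh_9.csv are placeholder names.

          Code:
          mata:
              // copy every line whose first field is "9" from big.csv to hh_9.csv
              fin  = fopen("big.csv", "r")
              fout = fopen("hh_9.csv", "w")
              while ((line = fget(fin)) != J(0,0,"")) {          // fget() returns J(0,0,"") at end of file
                  id = substr(line, 1, strpos(line, ",") - 1)    // inspect the line
                  if (id == "9") fput(fout, line)                // write it (fput appends the newline)
              }
              fclose(fout)
              fclose(fin)
          end

          Since the source data are grouped, you could also break out of the loop as soon as you have passed the last matching line, rather than reading to the end of the file.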
          Last edited by Mike Lacy; 14 Oct 2022, 16:25.



          • #6
            Originally posted by Sergiy Radyakin View Post
            Hello Leonardo,

            you've asked "How large is the CSV file?".

            Unfortunately I can't say, because today the files are one size and tomorrow they will be another. Likewise, I am working with one subset of the data, while the guy next door works with a different one. So the code must handle files of arbitrary size. If it matters, this is machine-generated data.

            To be specific, we can be relatively certain that:

            - the files are going to be larger than the memory size of any one computer processing them,
            - they will fluctuate between 10 GB and 20 GB most of the time, and
            - they will not be more than about 250 GB in size (unpacked) (and if they are, there will be other challenges I'll have to face).

            Thank you, Sergiy
            That's quite the challenge! I tried to play around with -infile- or -infix- to see if there was a solution that might present itself, but I could not come up with one. They generally have the effect of reading in all variables present on a line.

            Mike's suggestion is a good one, and it can even be coded up using Stata, if Mata is not appealing.

            I might first take this straightforward approach: read in consecutive chunks of the file that you know will fit into memory, keep only your ID column, and scan it to see where your ID of interest appears. Since you can be sure the data are already grouped, you can simplify this a bit by finding the first and last observation number with your ID and then reading in just those observations. It's quite possible that this becomes very slow as your dataset grows to the sizes you've suggested.
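
            Something along these lines, roughly (a sketch only; it assumes the ID sits in the first column, as in the test file built below, and the chunk size of 1,000 rows is arbitrary):

            Code:
            * scan the file in fixed-size row chunks, keeping only the first (ID) column,
            * to find the first and last row that belong to a given ID
            local chunksize = 1000      // rows per pass; pick something that fits in memory
            local target    = 9         // the ID we are looking for
            local start     = 1
            local first     = .
            local last      = .
            while (1) {
                local end = `start' + `chunksize' - 1
                capture quietly import delimited using myfile.csv, clear ///
                    rowrange(`start':`end') colrange(1:1) varnames(nonames)
                if (_rc != 0 | _N == 0) continue, break    // nothing left to read
                quietly gen long row = `start' + _n - 1    // row number within the file
                quietly summarize row if v1 == `target'
                if (r(N) > 0) {
                    if (missing(`first')) local first = r(min)
                    local last = r(max)
                }
                if (_N < `chunksize') continue, break      // last (partial) chunk read
                local start = `end' + 1
            }
            display "Observation range for id=`target': `first' to `last'"

            (Each pass presumably still has to scan past the earlier rows to reach its rowrange, which is part of why this can get slow on very large files.)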

            Here's a crude Mata example that offers a path forward, I think. It does work for your specific case, albeit I've tested it on a tiny example.

            Code:
            // create the test file
            sysuse auto, clear
            keep price mpg rep78
            keep in 1/10
            
            gen id = (int(_n/3)+1)^2
            order id, first
            outfile using myfile.csv, comma replace
            type myfile.csv
            
            
            // find id=9 observations only
            clear
            mata:
              filename = "myfile.csv"
              first=last=current=.
              fh = fopen(filename, "r")
              n=0
              while ((line=fget(fh))!=J(0,0,"")) {
                n++
                printf("%2s: %s\n", strofreal(n, "%2.0f"), line)
                id = strtoreal(substr(line, 1, strpos(line, ",")-1))
                current=n
                if (id==9) {
                  if (missing(first)) first=n
                  if (missing(last) | current > last) last=n
                }
                
                if (current > last) {
                  st_numscalar("first", first)
                  st_numscalar("last", last)
                  break
                }
              }
              fclose(fh)
            end
            
            di "Observation range for id=9: `=first', `=last'"
            
            import delim myfile.csv, clear rowrange(`=first':`=last')
            list

