
  • Help with chunky please

    Hi,

    Could I possibly have some help using chunky (or chunky8 - I've tried both)?

    I have a 1.6GB txt file that I'd like to break down into chunks and convert to .dta files.

    I've attempted using chunky and chunky8, but am getting error messages for both. I also considered using the gssplit software, but I'm a bit wary of it, not knowing much about it and in view of the confidentiality of the data.

    Here is my attempt using chunky:

    chunky using "Delivery1_Table1.txt", chunksize(100000KB) header(include) stub("Table1 data/import")

    Chunking using the following settings:

    Chunksize: 100,000,000
    Memory: 33,554,432
    Bites: 3
    Bitesize: 33,333,333

    Include header: patid\x09pracid\x09yob\x09gender\x09followup_start \x09followup_end\x09medcodeid\x09value\x09numunitid\x09obsdate\x0d\x0a

    (for reference: EOL characters 0d0a (CRLF) indicate Windows, 0a (LF) Unix and 0d (CR) Mac. 09 is the TAB character.)

    file Table1 data/import0001.txt could not be opened
    fopen(): 603 file could not be opened
    chunkfile(): - function returned error
    <istmt>: - function returned error [1]

    .................................................. ............................................

    And here is my attempt using chunky8 (before I had heard about chunky). I felt that I'd nearly got this one going, but I have no idea why it objects to an index of 1.

    *Set up chunksize and locals for the loop:
    local part 0
    local index 1
    local chunksize 2000000
    tempfile chunkfile

    *Check var names to see what you want to keep:
    chunky8 using "Delivery1_Table1.txt", list

    *Loop until the end of the file is reached:

    while r(eof) != 1 {
        chunky8 using "Delivery1_Table1.txt" ///
            index(`index') chunk(`chunksize') ///
            saving("`chunkfile'", replace) keepfirst
        if r(eof) {
            continue, break
        }
        else {
            local index `r(index)'
        }
        import delimited "`chunkfile'", clear delimiter("\t")
        keep patid yob gender followup_start medcodeid value numunitid
        save test_`++part', replace
    }

    invalid 'index'
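
    For reference, here is the chunky8 call written out with a comma before the option list, which is how I understand Stata options are normally given - though I haven't been able to confirm that this is where it goes wrong:

    chunky8 using "Delivery1_Table1.txt", ///
        index(`index') chunk(`chunksize') ///
        saving("`chunkfile'", replace) keepfirst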



    Thanks for looking at my query. Sorry if it's really obvious what I'm doing wrong - sadly it's not obvious to me!

    Jemima

  • #2
    Does the folder "Table1 data" exist in advance? If not, that might be the problem, as I don't think that -chunky- will create new folders for stubs while it is working. (Just a quick guess at the problem, as I don't have anything handy on which to try out this idea.)
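
    If the folder is missing, something along these lines should create it before the -chunky- call (the folder name is copied from your stub, and -capture- just keeps -mkdir- from erroring out if the folder already exists):

    * create the output folder first
    capture mkdir "Table1 data"
    * then run chunky exactly as before
    chunky using "Delivery1_Table1.txt", chunksize(100000KB) header(include) stub("Table1 data/import")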



    • #3
      You must have been right, as it seems to be working. What a relief!

      Thank you SO much, this had been frustrating me for ages.

      Jemima



      • #4
        ......although the program has now stopped, having chunked about 1/9 of the data (2 out of 16 GB). I don't suppose you have any idea why this might be? The files were all 97,000 KB, other than the last at 94,000 KB, if that's any clue.

        Thank you



        *************************

        Chunk Table 1 data/import0021.txt saved. Now at position 2,100,002,437
        file Table 1 data/import0022.txt could not be opened
        fopen(): 603 file could not be opened
        chunkfile(): - function returned error
        <istmt>: - function returned error [1]
        r(603);



        • #5
          No, I don't know why. I might guess that you're running into some filesystem limitation on your computer, but that seems unlikely. Or, when asked to process a lot of files consecutively, Stata might sometimes try to open a new file while disk access is still occurring and cause a conflict. I don't know how to solve the latter problem. (One could edit the -chunky.ado- source code, but I'm not sure how one would go about that here.)

          An easy thing to try: run -chunky- on your dataset with a smaller or larger chunk size and see what happens. That might work, or it might fail but show something diagnostically useful.
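
          For instance, keeping everything else from your original call and only changing the chunk size (the 50,000 KB figure is arbitrary, just for illustration):

          chunky using "Delivery1_Table1.txt", chunksize(50000KB) header(include) stub("Table1 data/import")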



          • #6
            Thank you.

            Interestingly, I had the same problem with a loop I later used to import the txt files and convert them to .dta files - it just seemed to cut out after a few files. Starting the loop afresh from a later txt file seemed to work.
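
            For context, that loop was roughly of this form (the folder, file names, and the count of 21 are just placeholders taken from the chunky output above):

            * import each chunk and save it as a .dta file
            forvalues i = 1/21 {
                local f : display %04.0f `i'
                import delimited "Table1 data/import`f'.txt", delimiters("\t") clear
                save "Table1 data/part`i'.dta", replace
            }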

            Running -chunky- with larger chunks caused complete system failure, so I'll go in the other direction and try smaller ones! I might also run this past our IT team....

            Thanks again

            Jemima



            • #7
              Have you tried using -set trace on- to help diagnose the problem? It's messy, but can be very useful.

              Your problem with reading/writing does make me think of a possible file system issue. Are the data files you're working on located on network drives, rather than on your computer? The slowness of access to drives on a network might cause the problem. If that's the case, you should copy the big file to your local computer drive and try -chunky- on that, with the conversion to .dta files also done on your local drive.

              Finally: It's possible you don't need to break this file into chunks. Have you tried to read it directly as one big file, and did that fail? If so, seeing the code for that might help.
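
              To illustrate the copy-to-local suggestion above, something along these lines, where the network and local paths are just placeholders for your actual locations:

              * copy the raw file from the network share to a local drive
              copy "N:\data\Delivery1_Table1.txt" "C:\temp\Delivery1_Table1.txt"
              * then chunk (and later import) the local copy
              chunky using "C:\temp\Delivery1_Table1.txt", chunksize(100000KB) header(include) stub("C:\temp\import")
              * or, to test a direct one-step read of the whole file
              import delimited "C:\temp\Delivery1_Table1.txt", delimiters("\t") clear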



              • #8
                Thanks so much for replying again.

                Yes, I did initially try a standard import of the txt file, but this just seemed to crash my entire system.

                The files are on a network drive, so yes, OK - I'll try copying to my computer and see what happens. If that fails, I think I'll have to access a computer with more RAM and see if that works.



                • #9
                  I very much doubt that RAM is the issue with a 1.6G file, unless your machine had (say) only 4G of RAM. The problem you encountered was a file system problem, which is quite distinct from a RAM issue. I can't see any good reason why a 1.6G file would have resulted in a crash on your initial import, so it might also be worth revisiting that if moving the file to your local computer doesn't solve the problem.
