
  • Help with chunky please

    Hi,

    Could I possibly have some help using chunky (or chunky8 - I've tried both)?

    I have a 1.6GB txt file that I'd like to break down into chunks and convert to .dta files.

    I've attempted using chunky and chunky8, but am getting error messages for both. I also considered using the gssplit software, but I'm a bit wary of it, not knowing much about it and in view of the confidentiality of the data.

    Here is my attempt using chunky:

    chunky using "Delivery1_Table1.txt", chunksize(100000KB) header(include) stub("Table1 data/import")

    Chunking using the following settings:

    Chunksize: 100,000,000
    Memory: 33,554,432
    Bites: 3
    Bitesize: 33,333,333

    Include header: patid\x09pracid\x09yob\x09gender\x09followup_start \x09followup_end\x09medcodeid\x09value\x09numunitid\x09obsdate\x0d\x0a

    (for reference: EOL characters 0d0a (CRLF) indicate Windows, 0a (LF) Unix and 0d (CR) Mac. 09 is the TAB character.)

    file Table1 data/import0001.txt could not be opened
    fopen(): 603 file could not be opened
    chunkfile(): - function returned error
    <istmt>: - function returned error [1]

    .................................................. ............................................

    And here is my attempt using chunky8 (before I had heard about chunky). I felt that I'd nearly got this one going, but I have no idea why it objects to an index of 1.

    *Set up chunksize and locals for the loop:
    local part 0
    local index 1
    local chunksize 2000000
    tempfile chunkfile

    *Check var names to see what you want to keep:
    chunky8 using "Delivery1_Table1.txt", list

    *Loop until the end of the file is reached:

    while r(eof) != 1 {
        chunky8 using "Delivery1_Table1.txt" ///
            index(`index') chunk(`chunksize') ///
            saving("`chunkfile'", replace) keepfirst
        if r(eof) {
            continue, break
        }
        else {
            local index `r(index)'
        }
        import delimited "`chunkfile'", clear delimiter("\t")
        keep patid yob gender followup_start medcodeid value numunitid
        save test_`++part', replace
    }

    invalid 'index'
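
    For reference, here is the chunky8 call written out with a comma before the option list, which is how I understand Stata options are normally given - though I haven't been able to confirm that this is where it goes wrong:

    chunky8 using "Delivery1_Table1.txt", ///
        index(`index') chunk(`chunksize') ///
        saving("`chunkfile'", replace) keepfirst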



    Thanks for looking at my query. Sorry if it's really obvious what I'm doing wrong - sadly it's not obvious to me!

    Jemima

  • #2
    Does the folder "Table1 data" exist in advance? If not, that might be the problem, as I don't think that -chunky- will create new folders for stubs while it is working. (Just a quick guess at the problem, as I don't have anything handy on which to try out this idea.)
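
    If the folder is missing, something along these lines should create it before the -chunky- call (the folder name is copied from your stub, and -capture- just keeps -mkdir- from erroring out if the folder already exists):

    * create the output folder first
    capture mkdir "Table1 data"
    * then run chunky exactly as before
    chunky using "Delivery1_Table1.txt", chunksize(100000KB) header(include) stub("Table1 data/import")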



    • #3
      You must have been right, as it seems to be working. What a relief!

      Thank you SO much, this had been frustrating me for ages.

      Jemima



      • #4
        ......although the program has now stopped, having chunked about 1/9 of the data (2 out of 16 GB). I don't suppose you have any idea why this might be? The files were all 97,000 KB, other than the last at 94,000 KB, if that's any clue.

        Thank you



        *************************

        Chunk Table 1 data/import0021.txt saved. Now at position 2,100,002,437
        file Table 1 data/import0022.txt could not be opened
        fopen(): 603 file could not be opened
        chunkfile(): - function returned error
        <istmt>: - function returned error [1]
        r(603);



        • #5
          No, I don't know why. I might guess that you're running into some filesystem limitation on your computer, but that seems unlikely. Or, when asked to process a lot of files consecutively, Stata might sometimes try to open a new file while disk access is still occurring and cause a conflict. I don't know how to solve the latter problem. (One could edit the -chunky.ado- source code, but I'm not sure how one would go about that here.)

          An easy thing to try: run -chunky- on your dataset with a smaller or larger chunk size and see what happens. That might work, or it might fail but show something diagnostically useful.
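
          For instance, keeping everything else from your original call and only changing the chunk size (the 50,000 KB figure is arbitrary, just for illustration):

          chunky using "Delivery1_Table1.txt", chunksize(50000KB) header(include) stub("Table1 data/import")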



          • #6
            Thank you.

            Interestingly, I had the same problem with a loop I later used to import the txt files and convert them to .dta files - it just seemed to cut out after a few files. Starting the loop afresh from a later txt file seemed to work.
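
            For context, that loop was roughly of this form (the folder, file names, and the count of 21 are just placeholders taken from the chunky output above):

            * import each chunk and save it as a .dta file
            forvalues i = 1/21 {
                local f : display %04.0f `i'
                import delimited "Table1 data/import`f'.txt", delimiters("\t") clear
                save "Table1 data/part`i'.dta", replace
            }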

            Running -chunky- with larger chunks caused complete system failure, so I'll go in the other direction and try smaller ones! I might also run this past our IT team....

            Thanks again

            Jemima



            • #7
              Have you tried using -set trace on- to help diagnose the problem? It's messy, but can be very useful.

              Your problem with reading/writing does make me think of a possible file system issue. Are the data files you're working on located on network drives, rather than on your computer? The slowness of access to drives on a network might cause the problem. If that's the case, you should copy the big file to your local computer drive and try -chunky- on that, with the conversion to .dta files also done on your local drive.

              Finally: It's possible you don't need to break this file into chunks. Have you tried to read it directly as one big file, and did that fail? If so, seeing the code for that might help.
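
              To illustrate the copy-to-local suggestion above, something along these lines, where the network and local paths are just placeholders for your actual locations:

              * copy the raw file from the network share to a local drive
              copy "N:\data\Delivery1_Table1.txt" "C:\temp\Delivery1_Table1.txt"
              * then chunk (and later import) the local copy
              chunky using "C:\temp\Delivery1_Table1.txt", chunksize(100000KB) header(include) stub("C:\temp\import")
              * or, to test a direct one-step read of the whole file
              import delimited "C:\temp\Delivery1_Table1.txt", delimiters("\t") clear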



              • #8
                Thanks so much for replying again.

                Yes, I did initially try a standard import of the txt file, but this just seemed to crash my entire system.

                The files are on a network drive, so yes, OK - I'll try copying to my computer and see what happens. If that fails, I think I'll have to access a computer with more RAM and see if that works.



                • #9
                  I very much doubt that RAM is the issue with a 1.6G file, unless your machine had (say) only 4G of RAM. The problem you encountered was a file system problem, which is quite distinct from a RAM issue. I can't see any good reason why a 1.6G file would have resulted in a crash on your initial import, so it might also be worth revisiting that if moving the file to your local computer doesn't solve the problem.
