I have an estimation procedure that requires bootstrapped standard errors. My dataset has 2.7M observations in ~3500 clusters, and the estimation procedure takes a long time per sample (the supercomputer takes roughly 15 hours to bootstrap 20 replications using Stata's bootstrap). To make better use of my time and supercomputing access, I want to split the work that the bootstrap command does into pieces.
First, I want to draw 500 (or more) samples with replacement in a replicable manner (setting seed, etc.), and save each one. Then, I'll run an "array job" of my estimation procedure on each saved sample. Finally, I'll compute standard errors as Stata's bootstrap does.
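For the final step, a minimal sketch of what I have in mind is below. As far as I understand, bootstrap by default reports the sample standard deviation of the replicate estimates (its mse option uses deviations from the observed statistic instead). The variable names b and dataset_id and the simulated estimates are placeholders of mine so the sketch runs on its own; in practice the rows would come from appending the one-row result files written by the array jobs.

```stata
* Hypothetical sketch: turn 500 replicate estimates into a bootstrap SE.
* The estimates here are simulated stand-ins; in the real workflow each
* row would be the estimate saved by one array job (e.g. via append using).
clear
set seed 1234
set obs 500
gen int dataset_id = _n              // replicate number, one row per sample
gen double b = rnormal(2, 0.5)       // placeholder for the real estimate

* Bootstrap SE = sample standard deviation (n-1 denominator) of replicates
quietly summarize b
scalar nreps = r(N)
scalar bs_se = r(sd)
display "replications = " nreps "   bootstrap SE = " bs_se
```

With several parameters per replicate, the same summarize step would be repeated for each estimate variable.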
On the 3rd sampling of my dataset, I receive this error message:
Code:
I/O error writing .dta file
Usually such I/O errors are caused by the disk or file system being full.
r(693);
My guess is that Stata creates some sort of temporary file for each sample drawn, but I am not sure, and I do not know how to proceed. What's odd is that I've successfully used preserve, restore, and save in this kind of loop on much larger datasets, but never before with the bsample command.
Although my own dataset fails on the 3rd sample, you can replicate the error on the first sample using the code below (at least on my machine). Since this is a "large dataset" problem, I did not use dataex.
Code:
clear all
sysuse auto
expand 37000
global test ""
set seed 1234 // Statalist-seed
forvalues i = 1(1)5 {
    // preserve dataset
    preserve
    // draw a random number to advance the seedstate
    gen random = runiform()
    // bootstrap sample it
    bsample, cluster(make) idcluster(make2)
    // attach seedstate
    gen seedstate = c(seed)
    // save repcount
    gen dataset_id = `i'
    // drop random number
    drop random
    // save dataset
    save "$test/sample_`i'", replace
    // save seedstate
    keep seedstate
    duplicates drop
    save "$test/seedstate_`i'", replace
    // restore dataset
    restore
}
