Creating and saving bootstrap samples for

Torin McFarland

Join Date: May 2023

Posts: 2
#1

29 Aug 2023, 09:35

I have an estimation procedure which requires bootstrapped standard errors. My dataset is 2.7M observations, ~3500 clusters, and the estimation procedure takes a long time per sample (the supercomputer takes roughly 15hr to bootstrap 20 replications using Stata's bootstrap). To make better use of my time and supercomputing access, I want to essentially do the bootstrap command in pieces.

First, I want to draw 500 (or more) samples with replacement in a replicable manner (setting seed, etc.), and save each one. Then, I'll run an "array job" of my estimation procedure on each saved sample. Finally, I'll compute standard errors as Stata's bootstrap does.

On the 3rd sampling of my dataset, I receive this error message:

Code:

I/O error writing .dta file
Usually such I/O errors are caused by the disk or file system being full.
r(693);

I suspect this happens because Stata creates some sort of temporary file for each sample drawn, but I am not sure how to proceed. What's odd is that I've successfully used preserve, restore, and save in this looping manner on much larger datasets, but never before with the bsample command.

Though my own dataset fails on the 3rd sample, you can replicate the error on the first sample using the code below (at least on my machine). This being a "large" dataset problem, I did not use dataex.

Code:

clear all
sysuse auto

expand 37000

global test ""

set seed 1234 // Statalist-seed
forvalues i = 1(1)5 {
    // preserve dataset
    preserve

    // draw a random number to advance the seedstate
    gen random = runiform()

    //bootstrap sample it
    bsample, cluster(make) idcluster(make2)

    // attach seedstate
    gen seedstate = c(seed)

    // save repcount
    gen dataset_id = `i'

    // drop random number
    drop random

    // save dataset
    save "$test/sample_`i'", replace

    // save seedstate
    keep seedstate
    duplicates drop
    save "$test/seedstate_`i'", replace

    // restore dataset
    restore
}
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

29 Aug 2023, 10:08

I cannot replicate your problem. The code you show ran with no error messages on my system, which is neither unusually fast nor especially capacious:

Code:

Stata/MP 18.0 for Windows (64-bit x86-64)
Revision 13 Jul 2023
Copyright 1985-2023 StataCorp LLC

Total physical memory: 32.00 GB
Available physical memory: 22.50 GB

Stata license: Single-user 4-core perpetual

and successfully saved all of the files.

I wonder if the problem you are encountering is not well described by the error message. Although it suggests considering a full disk or file system, it is actually a non-specific write error message. It may be that you are creating these files to be saved faster than the operating system can digest them and pass them on to the disk. If the OS's file buffers are full when another request to write to disk arrives, the OS will refuse and pass an error back to Stata, which Stata reports with r(693). So a possibility is that you are asking the OS to write file n+1 when it is still trying to write file n, and it is choking. This kind of situation is particularly common if you are saving your files to a network drive. You can slow down this process by inserting a -sleep- command before your -save- instruction: this usually requires some experimentation to find out how long a sleep period is needed to relieve the write bottleneck.
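
A minimal sketch of the -sleep- idea, using the auto data from the original example. The 2-second pause and the file names are assumptions for illustration only; the pause length that actually relieves the bottleneck has to be found by experiment.

```stata
clear all
sysuse auto
set seed 1234

// Hypothetical pause length in milliseconds; tune by experiment
local pause_ms 2000

forvalues i = 1/5 {
    preserve

    // draw one cluster bootstrap sample
    bsample, cluster(make) idcluster(make2)

    // give the OS time to finish writing the previous file
    sleep `pause_ms'

    // save to the current working directory
    save "sample_`i'", replace

    restore
}
```

If the error disappears with a long pause, the pause can be shortened step by step until the error returns.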

FernandoRios

Join Date: Apr 2014
Posts: 2534

29 Aug 2023, 10:08

Hi there
The problem is the line

gen seedstate = c(seed)

c(seed) is stored as a long string of length 5,011, so you are asking Stata to store it in every one of the 2.7 million observations of every dataset.
That is too much.

Perhaps you could store it as a note in the dataset instead:

Code:

clear all
sysuse auto

expand 5000 

global test "."

set seed 1234 // Statalist-seed
forvalues i = 1(1)5 {
    // preserve dataset
    preserve
    
    // draw a random number to advance the seedstate
    gen random = runiform()
    
    //bootstrap sample it
    bsample, cluster(make) idcluster(make2)
    
    // attach seedstate
    note : `c(seed)'
    
    // save repcount
    gen dataset_id = `i'
    
    // drop random number
    drop random
    
    // save dataset
    save "$test/sample_`i'", replace
    
  
    // restore dataset
    restore
    
}

Since you are initializing the Seed, I don't think you need to save each individual dataset "state".
F
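
If the state is stored as a note, it can be read back later from the dataset characteristics. The sketch below is only an illustration under some assumptions: it uses c(rngstate) (Stata 14 or later) rather than c(seed), it relies on notes being stored in the characteristics _dta[note0] (the count) and _dta[note#], and the file is a tempfile rather than a real path.

```stata
clear all
sysuse auto
set seed 1234

// copy the current RNG state into a local macro
local state `c(rngstate)'

// draw a sample and attach the state as a dataset note
bsample, cluster(make) idcluster(make2)
note: `state'

tempfile sample1
save `sample1'

// later: reload the sample and read the most recent note back
use `sample1', clear
local n : char _dta[note0]          // number of dataset notes
local stored : char _dta[note`n']   // text of the last note

// the recovered text should match what was stored
assert "`stored'" == "`state'"
```

The auto data already carries a note of its own, which is why the sketch reads the last note rather than assuming the state is in note 1.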


Clyde Schechter

Join Date: Apr 2014
Posts: 30355

29 Aug 2023, 10:37

"Since you are initializing the Seed, I don't think you need to save each individual dataset 'state'."

Well, if everything goes well, it is unnecessary. But bootstrap samples sometimes prove to be unanalyzable, or produce anomalous results that require investigation. It is useful to be able to reproduce those particular data sets without having to re-run the entire bootstrapping process from the beginning. So storing the random number generator state along the way is a good practice.
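
To illustrate, here is a minimal sketch of regenerating one particular sample from a stored state. It assumes Stata 14 or later (for c(rngstate) and -set rngstate-) and uses the auto data with a tempfile purely for demonstration; the macro names are hypothetical.

```stata
clear all
sysuse auto
set seed 1234

// record the RNG state just before the draw of interest
local state_before `c(rngstate)'

// first draw, saved for comparison
preserve
bsample, cluster(make) idcluster(make2)
tempfile first
save `first'
restore

// ... any amount of other random-number use could happen here ...
generate double u = runiform()
drop u

// restore the recorded state and redraw the same sample
set rngstate `state_before'
preserve
bsample, cluster(make) idcluster(make2)

// cf returns a nonzero code if any variable differs from the saved draw
capture cf _all using `first'
local same = (_rc == 0)
restore

display "redrawn sample identical to the first: `same'"
```

Only the state recorded before a given draw is needed to recreate that draw, so a single problematic replication can be rebuilt without rerunning the earlier ones.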

That said, since the data sets themselves are being saved, I don't see how saving the seed on top of that is helpful. And I agree that if you are going to save the seed and the data set, doing it as a note in the data set makes more sense than as a variable. Even so, strL's do a pretty good job of conserving memory. Saving a 5,000 character strL in each observation of the data set does not expand the size of the data set by 5000*_N:

Code:

. clear*

. sysuse auto
(1978 automobile data)

.
. memory

Memory usage
                                         Used                Allocated
----------------------------------------------------------------------
Data                                    3,182               67,108,864
strLs                                       0                        0
----------------------------------------------------------------------
Data & strLs                            3,182               67,108,864

----------------------------------------------------------------------
Data & strLs                            3,182               67,108,864
Variable names, %fmts, ...              4,370                   71,230
Overhead                            1,081,344                1,082,136

Stata matrices                              0                        0
ado-files                               8,873                    8,873
Stored results                              0                        0

Mata matrices                               0                        0
Mata functions                              0                        0

set maxvar usage                    5,281,738                5,281,738

Other                                   4,884                    4,884
----------------------------------------------------------------------
Total                               6,378,879               73,557,725

.
. gen seed = "`c(seed)'" in 1
(73 missing values generated)

. memory

Memory usage
                                         Used                Allocated
----------------------------------------------------------------------
Data                                    3,774               67,108,864
strLs                                   5,092                    5,092
----------------------------------------------------------------------
Data & strLs                            8,866               67,113,956

----------------------------------------------------------------------
Data & strLs                            8,866               67,113,956
Variable names, %fmts, ...              4,711                   71,230
Overhead                            1,081,344                1,082,136

Stata matrices                              0                        0
ado-files                               8,873                    8,873
Stored results                              0                        0

Mata matrices                               0                        0
Mata functions                              0                        0

set maxvar usage                    5,281,738                5,281,738

Other                                   4,884                    4,884
----------------------------------------------------------------------
Total                               6,384,904               73,562,817

.
. replace seed = "`c(seed)'"
(73 real changes made)

. memory

Memory usage
                                         Used                Allocated
----------------------------------------------------------------------
Data                                    3,774               67,108,864
strLs                                  11,912                   11,912
----------------------------------------------------------------------
Data & strLs                           15,686               67,120,776

----------------------------------------------------------------------
Data & strLs                           15,686               67,120,776
Variable names, %fmts, ...              4,711                   71,230
Overhead                            1,081,344                1,082,136

Stata matrices                              0                        0
ado-files                               8,873                    8,873
Stored results                              0                        0

Mata matrices                               0                        0
Mata functions                              0                        0

set maxvar usage                    5,281,738                5,281,738

Other                                   4,884                    4,884
----------------------------------------------------------------------
Total                               6,391,724               73,569,637

Adding 73 copies of the 5,011 byte seed to the data set only required an additional 6,820 bytes of memory. (I think, but do not know, that in fact only a single copy of the seed itself is stored and it is handled by storing pointers to that in the variable values.)


Torin McFarland

Join Date: May 2023

Posts: 2
#5

29 Aug 2023, 10:57

Clyde Schechter, thank you for the sleep suggestion. I think you are likely correct, as I am currently writing to a OneDrive folder. I ran into the same error with different code that had not produced the error before, so I am trying this suggestion there and testing whether it works.

FernandoRios, I wasn't aware of the note function and will definitely make use of that instead. That will work significantly better, thank you. I was mainly saving it so as to have 'proof' of seedstate advancement, and this note is precisely what I need.
FernandoRios

Join Date: Apr 2014

Posts: 2534
#6

29 Aug 2023, 11:12

Thank you, Clyde Schechter.
Point taken. I can see why saving the state needed to regenerate a particular bootstrap sample would be better.
I really have no idea how efficiently the strL format works; I was drawing a parallel with str variables only. In any case, on my computer, that was the bottleneck:
not necessarily the data saved on disk, but the computer memory required, even though it was not allocated within Stata.

Best wishes.
Fernando
