
  • Bootstrapping very large sample

    Hi everyone,

    I have a dataset of 70 million observations and am running the following model.

    First I estimate the probability of belonging to class A with a logit model. Then I calculate fitted probabilities. Call those values x1hat.
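    In Stata terms, the first step is roughly this (classA and z1-z3 are placeholders for my actual class indicator and logit covariates):
    Code:
    logit classA z1 z2 z3    // model membership in class A
    predict x1hat, pr        // fitted probabilities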


    Then I run an OLS regression to explain another variable y, as follows.

    reg y x1hat x2 c.x2#c.x1hat

    where x2 is another explanatory variable and the last term is the interaction of x2 with x1hat.

    I know that the standard errors of the last regression will not reflect the uncertainty in x1hat, so I wanted to bootstrap the standard errors of the entire procedure: first the logit, then the OLS.

    But my sample size is very large so I am afraid it won't be feasible to do 1000 reps with a sample size of 70 million each time.

    Any suggestions on how to do this?

    I noticed the fast wild bootstrap command (boottest), but I think that works only after a single estimation command, so I don't know whether I can use boottest to repeat the joint procedure of logit followed by OLS.

    Appreciate your guidance,

    Laurie


  • #2
    Forgot to mention, I do have a powerful computer with 120 GB of RAM. But even so, this sounds like too much for that machine.



    • #3
      I don't see why you can't write your own program that handles the resampling, estimation, and posting of results, then use -simulate- to run each iteration. If you don't have enough memory to hold at least two copies of the dataset (plus whatever extra the estimation commands need), it could run slowly, but "slow" is relative to how long it takes your computer to do one iteration of work.
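      Untested sketch of what I mean -- classA and z1-z3 are placeholders for whatever your first-stage class indicator and covariates actually are:
      Code:
      * myboot: one bootstrap rep of the two-step procedure
      * (classA, z1-z3 are placeholder names, not real variables)
      capture program drop myboot
      program define myboot, rclass
          preserve
          bsample                               // resample all _N observations with replacement
          capture drop x1hat
          logit classA z1 z2 z3                 // first step
          predict double x1hat, pr
          regress y x1hat x2 c.x2#c.x1hat       // second step
          return scalar b_x1hat = _b[x1hat]
          return scalar b_x2    = _b[x2]
          return scalar b_int   = _b[c.x2#c.x1hat]
          restore
      end
      //
      simulate b_x1hat=r(b_x1hat) b_x2=r(b_x2) b_int=r(b_int), reps(1000) seed(12345): myboot
      summarize b_x1hat b_x2 b_int    // SDs of the simulated coefficients = bootstrap SEs
      The preserve/bsample step is where the second copy of the data comes in.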



      • #4
        Thanks Leonardo. Do you think it would be efficient to use a sample size of 70 million on every rep? Or would it be OK to use smaller sample sizes? I am not sure about the tradeoffs between reps and sample size.

        I could write it myself, but my understanding is that boottest has a particularly efficient way of doing bootstrap inference in some cases. So if I could piggyback on an existing command that has already been through several efficiency checks, that would definitely be better for me.



        • #5
          Efficient is hard to say -- in what sense are you concerned about efficiency? If speed, maybe not.

          Speaking generally, it may be possible to draw random subsamples (bootstrap using a size smaller than _N); see the snippet at the end of this post. That would be faster. However, if your data come from a complex sampling design, it wouldn't be valid.

          I haven't used boottest, so I can't give you specific recommendations there. You might read its documentation and the associated paper to find out whether it's right for your use case. I mention writing your own simulation program because that is what -simulate- is for.
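          For the subsample idea, the resampling line in such a program could be as simple as, say:
          Code:
          bsample 5000000    // 5 million here is arbitrary; draws that many observations with replacement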



          • #6
            Are you sure you need 1000 replications? The bootstrap docs suggest 50 for most applications while admitting that some will require 1000. Is this one of those hard cases?



            • #7
              You could repeatedly use the user-written module -gsample- (available from SSC) to generate the frequencies for a bootstrap (with replacement) sample, or do the same thing on a do-it-yourself basis, and then run each of your regression commands using frequency weights. Here's an illustration of both of these ways to take one bootstrap sample with size = _N:
              Code:
              clear
              set obs 70000000
              gen long id = _n
              //
              local N = _N
              local p = 1/`N'
              // gf will contain the frequency with which each observation was selected for the sample
              gsample `N', gen(gf)    // about 40 sec on my machine
              // Same idea, but taking the do-it-yourself approach using rbinomial()
              // to generate the frequencies, which was much faster on my machine.
              gen int f = rbinomial(`N', `p') // about 7 sec. on my machine
              tab gf
              tab f
              // Now you could do something like this:
              regress y x1 x2 x3 [fw=f]
              I have not seen the use of rbinomial() for this purpose before but it seems to make sense to me: Each of the N picks for the sample occurs with probability 1/N, each pick is independent, and there are N such picks. And, rbinomial() appears to yield the same expected frequency distribution as -gsample-, which I trust.

              My idea, filling out what Leonardo suggested above, would be that you use -simulate-, with a frequency-based sampling routine and your two regression commands inside the program that -simulate- calls. Note that the memory requirements here are not bad: just the frequency variable. Any way you look at it, this is going to be slow, though. Still, if each rep took a minute including your regression commands, you could get quite a few done in a day.
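              Sketching that out (untested, and classA and z1-z3 again just stand in for your first-stage class indicator and covariates):
              Code:
              capture program drop fwboot
              program define fwboot, rclass
                  // classA, z1-z3 are placeholder names for the first-stage variables
                  capture drop f
                  capture drop x1hat
                  gen int f = rbinomial(_N, 1/_N)          // bootstrap frequencies; obs with f==0 are left out
                  logit classA z1 z2 z3 [fw=f]             // first step
                  predict double x1hat, pr
                  regress y x1hat x2 c.x2#c.x1hat [fw=f]   // second step
                  return scalar b_x1hat = _b[x1hat]
                  return scalar b_x2    = _b[x2]
                  return scalar b_int   = _b[c.x2#c.x1hat]
              end
              //
              simulate b_x1hat=r(b_x1hat) b_x2=r(b_x2) b_int=r(b_int), reps(1000) seed(54321): fwboot
              summarize b_x1hat b_x2 b_int    // SDs of the simulated coefficients = bootstrap SEs
              Because the program only regenerates f and x1hat, there is no need to copy or restore the 70-million-observation dataset on each rep.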




              • #8
                Note that using rbinomial() does not guarantee that the sum of the frequencies is _N, although it should be close. (Sorry for the oversight on my part.)
                Nevertheless, perhaps something in that direction might help (e.g., set p a little higher and then randomly adjust some frequencies downward until the total equals _N).
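                For instance, something in this spirit (untested; the 1.001 inflation factor is just a guess at a value that makes the total overshoot _N nearly every time):
                Code:
                local N = _N
                local p = 1.001/`N'                       // selection probability set slightly high
                gen int f = rbinomial(`N', `p')
                quietly summarize f, meanonly
                local excess = r(sum) - `N'               // units to trim to get the total down to _N
                // visit observations in random order and knock one unit off the first `excess'
                // of them that have a positive frequency (if the draw fell short of _N, nothing is trimmed)
                gen double u = runiform()
                sort u
                gen long eligible = sum(f > 0)
                quietly replace f = f - 1 if f > 0 & eligible <= `excess'
                drop u eligible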
