
  • Bootstrapping very large sample

    Hi everyone,

    I have a dataset of 70 million observations and am running the following model.

    First I estimate the probability of belonging to class A with a logit model. Then I calculate fitted probabilities. Call those values x1hat.
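    In Stata terms, the first step is roughly this (classA and z1-z3 are placeholders for my actual class indicator and logit covariates):
    Code:
    logit classA z1 z2 z3    // model membership in class A
    predict x1hat, pr        // fitted probabilities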


    Then I run an OLS regression to explain another variable y, as follows.

    reg y x1hat x2 c.x2#c.x1hat

    where x2 is another explanatory variable and the last term is the interaction of x2 with x1hat.

    I know that the standard errors of the last regression will not reflect the uncertainty in x1hat, so I wanted to bootstrap the standard errors of the entire procedure: first the logit, then the OLS.

    But my sample size is very large so I am afraid it won't be feasible to do 1000 reps with a sample size of 70 million each time.

    Any suggestions on how to do this?

    I noticed the fast wild bootstrap command (boottest), but I think that works only after a single estimation command, so I don't know whether I can use boottest to repeat the joint procedure of logit followed by OLS.

    Appreciate your guidance,

    Laurie


  • #2
    Forgot to mention, I do have a powerful computer with 120 GB of RAM. But even so, this sounds like too much for that machine.



    • #3
      I don't see why you can't write your own program that handles the resampling, estimation, and posting of results, then use -simulate- to run each iteration. If you don't have enough memory to hold at least two copies of the dataset (plus whatever extra the estimation commands need), it could run slowly, but "slow" is relative to how long it takes your computer to do one iteration of work.
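      Untested sketch of what I mean -- classA and z1-z3 are placeholders for whatever your first-stage class indicator and covariates actually are:
      Code:
      * myboot: one bootstrap rep of the two-step procedure
      * (classA, z1-z3 are placeholder names, not real variables)
      capture program drop myboot
      program define myboot, rclass
          preserve
          bsample                               // resample all _N observations with replacement
          capture drop x1hat
          logit classA z1 z2 z3                 // first step
          predict double x1hat, pr
          regress y x1hat x2 c.x2#c.x1hat       // second step
          return scalar b_x1hat = _b[x1hat]
          return scalar b_x2    = _b[x2]
          return scalar b_int   = _b[c.x2#c.x1hat]
          restore
      end
      //
      simulate b_x1hat=r(b_x1hat) b_x2=r(b_x2) b_int=r(b_int), reps(1000) seed(12345): myboot
      summarize b_x1hat b_x2 b_int    // SDs of the simulated coefficients = bootstrap SEs
      The preserve/bsample step is where the second copy of the data comes in.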



      • #4
        Thanks Leonardo. Do you think it would be efficient to use a sample size of 70 million on every rep? Or would it be OK to use smaller sample sizes? I am not sure about the tradeoffs between reps and sample size.

        I could write it myself, but my understanding is that boottest has a particularly efficient way of doing bootstrap inference in some cases. So if I could piggyback on an existing command that has already been through several efficiency checks, that would definitely be better for me.



        • #5
          Efficient is hard to say -- in what sense are you concerned about efficiency? If speed, maybe not.

          Speaking generally, it may be possible to draw random subsamples (bootstrap using a size smaller than _N); see the snippet at the end of this post. That would be faster. However, if your data come from a complex sampling design, it wouldn't be valid.

          I haven't used boottest, so I can't give you specific recommendations there. You might read its documentation and the associated paper to find out whether it's right for your use case. I mention writing your own simulation program because that is what -simulate- is for.
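          For the subsample idea, the resampling line in such a program could be as simple as, say:
          Code:
          bsample 5000000    // 5 million here is arbitrary; draws that many observations with replacement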



          • #6
            Are you sure you need 1000 replications? The bootstrap docs suggest 50 for most applications while admitting that some will require 1000. Is this one of those hard cases?



            • #7
              You could repeatedly use the user-written module -gsample- (available from SSC) to generate the frequencies for a bootstrap (with replacement) sample, or do the same thing on a do-it-yourself basis, and then run each of your regression commands using frequency weights. Here's an illustration of both of these ways to take one bootstrap sample with size = _N:
              Code:
              clear
              set obs 70000000
              gen long id = _n
              //
              local N = _N
              local p = 1/`N'
              // gf will contain the frequency with which each observation was selected for the sample
              gsample `N', gen(gf)    // about 40 sec on my machine
              // Same idea, but taking the do-it-yourself approach using rbinomial()
              // to generate the frequencies, which was much faster on my machine.
              gen int f = rbinomial(`N', `p') // about 7 sec. on my machine
              tab gf
              tab f
              // Now you could do something like this:
              regress y x1 x2 x3 [fw=f]
              I have not seen the use of rbinomial() for this purpose before but it seems to make sense to me: Each of the N picks for the sample occurs with probability 1/N, each pick is independent, and there are N such picks. And, rbinomial() appears to yield the same expected frequency distribution as -gsample-, which I trust.

              My idea, filling out what Leonardo suggested above, would be that you use -simulate-, with a frequency-based sampling routine and your two regression commands inside the program that -simulate- calls. Note that the memory requirements here are not bad: just the frequency variable. Any way you look at it, this is going to be slow, though. Still, if each rep took a minute including your regression commands, you could get quite a few done in a day.
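              Sketching that out (untested, and classA and z1-z3 again just stand in for your first-stage class indicator and covariates):
              Code:
              capture program drop fwboot
              program define fwboot, rclass
                  // classA, z1-z3 are placeholder names for the first-stage variables
                  capture drop f
                  capture drop x1hat
                  gen int f = rbinomial(_N, 1/_N)          // bootstrap frequencies; obs with f==0 are left out
                  logit classA z1 z2 z3 [fw=f]             // first step
                  predict double x1hat, pr
                  regress y x1hat x2 c.x2#c.x1hat [fw=f]   // second step
                  return scalar b_x1hat = _b[x1hat]
                  return scalar b_x2    = _b[x2]
                  return scalar b_int   = _b[c.x2#c.x1hat]
              end
              //
              simulate b_x1hat=r(b_x1hat) b_x2=r(b_x2) b_int=r(b_int), reps(1000) seed(54321): fwboot
              summarize b_x1hat b_x2 b_int    // SDs of the simulated coefficients = bootstrap SEs
              Because the program only regenerates f and x1hat, there is no need to copy or restore the 70-million-observation dataset on each rep.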




              • #8
                Note that using rbinomial() does not guarantee that the sum of the frequencies is _N, although it should be close. (Sorry for the oversight on my part.)
                Nevertheless, perhaps something in that direction might help (e.g., set p a little higher and then randomly adjust some frequencies downward until the total equals _N).
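                For instance, something in this spirit (untested; the 1.001 inflation factor is just a guess at a value that makes the total overshoot _N nearly every time):
                Code:
                local N = _N
                local p = 1.001/`N'                       // selection probability set slightly high
                gen int f = rbinomial(`N', `p')
                quietly summarize f, meanonly
                local excess = r(sum) - `N'               // units to trim to get the total down to _N
                // visit observations in random order and knock one unit off the first `excess'
                // of them that have a positive frequency (if the draw fell short of _N, nothing is trimmed)
                gen double u = runiform()
                sort u
                gen long eligible = sum(f > 0)
                quietly replace f = f - 1 if f > 0 & eligible <= `excess'
                drop u eligible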
