Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • use bsample to generate bootstrap samples

    I am now working on a project in which I need to obtain coefficient SE using bootstrap method. However, the "bootstrap" option is not appropriate for my case. Instead, I am going to generate bootstrap samples first and then do estimations and calculations using these samples.

    My question is whether the built-in command "bsample" generates bootstrap samples. And is there any reference for it? I read Stata's help file, but it seems that it doesn't tell any details about the random number generating process of bsample.

    Thanks!

  • #2
    The first sentence in the helpfile gives you the basics:

    bsample draws bootstrap samples (random samples with replacement) from the data in memory.
    If you want to know more, you need to tell us what you need exactly. Otherwise we'll spend a lot of time writing down all kinds of details of which 99% is useless to you.
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      Hi Maarten,

      Thanks for your response! I want to know whether "bsample" is doing simple case resampling, i.e. resampling individual cases - rows of a data set. I also want to know the algorithm (including key parameters, for example, the probability that each individual case is resampled in each bootstrap iteration, it seems "bsample" uses 2/3) used by bsample.

      I want to know whether the following Matlab code can mimic bsample

      B = 100; % number of boot iteration
      G = 3000; % sample size
      fc_draw = zeros(G,B);

      for b = 1:B

      FC = 1:G;
      FC = FC';
      g1 = random('bino', 1, 2/3, [G,1]);
      g1 = (g1==1);
      FC = FC(g1);

      bfc = length(FC);
      fc_draw(1:bfc,b) = FC;

      for s = bfc+1:G
      fc_draw(s,b) = FC(random('unid', bfc, 1));
      end
      end

      Comment


      • #4
        I don't know Matlab, but the code does not look right.

        bsample just samples (by default) N out of N rows in your dataset with replacement and (by default) each row is equaly likely to be drawn. That is all the parameters you need.

        Conceptually you can think of this as looping over observations. So you open a new dataset and want to fill in the first row and do so by randomingly drawing a row from the original dataset, whereby each row in the original dataset has a probability of 1/N to be drawn. You go to the second row in your new dataset and again draw a random row from your original dataset. etc. Drawing with replacement means that the original dataset is unchanged after each draw; the observation that was drawn in the previous round is immediately placed back in the pool, hence the name "with replacement". As a consequence, you can draw the same original observation twice, three times, or even more, while others won't appear at all. This is how we can randomly draw N out of N observations without just getting the same original datset back again.

        You may also be interested in this: http://blog.stata.com/2012/08/29/usi...h-replacement/
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------

        Comment


        • #5
          Thanks very much for your comment and the link! I think my Matlab code effectively does the same thing as bsample.

          I agree with your explanation on the sampling with replacement process. However, I find that bsample always results in a sample containing (about) 2/3 unique obs of the original data. I write the Matlab code to mimic this idea.

          In the Matlab code, G is the sample size of original data. The sampling process proceeds in following steps:

          1. create a column vector "fc_draw" of all zeros with the same length as the original data, say G-by-1.
          2. make each row of original data having equal prob to be drawn. I set the prob equal to 2/3 as bsample. For each row (ID) of original data, generate a binary indicator with 2/3 prob equal to 1. Keep all IDs with this indicator =1, which results in a g-by-1 vector of IDs with g < G. This is paragraph 1 within the loop.
          3. paragraph 2 replaces the first g components of the fc_draw vector with the sampled IDs get from step 2. This is paragraph 2 within the loop.
          4. all the rest G - g components of the fc_draw vector are drawn uniformly from the first g components (step 3). This is the last paragraph in the loop.
          5. repeat step 1-4 for B times.

          Comment


          • #6
            I agree with Maarten. Your Matlab code is not correctly sampling with replacement.
            However, I find that bsample always results in a sample containing (about) 2/3 unique obs of the original data.
            The actual average proportion differs with \(n\), but for large \(n\) will be close to 0.632. This follows from the theory of simple random sampling with replacement. If the population size is \(n\), the probability that a sample member will not selected in \(n\) draws is:
            \[
            \left(1 - \frac{1}{n}\right)^n
            \]
            which, as \(n\) gets larger, has limit:
            \[
            \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} = .36787944
            \]
            So the probability that a sample member will be selected is \(1- \left(1 - \frac{1}{n}\right)^n\), which will converge to \(1- e^{-1} = \) 0.632. The theory also gives the probability that a population unit wil appear \(k\) times, with \(k = 0\dots n\). Your algorithm must match all those probabilities.

            For the algorithm that bsample uses, type "viewsource bsample.ado" and look for the line.
            Code:
            program SWSR
            Last edited by Steve Samuels; 16 May 2015, 19:57.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2

            Comment


            • #7
              Here is how to calculate the probabilities that an observation will appear k times in a bootstrap sample of size n. You can check your Matlab algorithm against these.

              In simple random sampling with replacement:

              Code:
              p = single draw probability = 1/n  1-p = 1-1/n = (n-1)/n.
              
              p[k] = P{draw a specified element k times in n draws} =
              
              comb(n,k) * p^k * (1-p)^(n-k) (binomial probability; comb(n,k) = n!/(k!(n-k)!)
              
              p[0]=  =
              (1-1/n)^n ~ exp(-1) = .3679 when n is large
              
              p[1] =
              comb(n,1)*1/n * ((n-1)/n)^(n-1) = n * (1/n) * ((n-1)/n)^(n-1) = p[0]* n/(n-1)~ .3679 for large n
              
              p[2] =
              comb(n,2)*1/n^2 * ((n-1)/n)^(n-2) =  n*(n-1)/2 * (1/n)^2 *((n-1)/n)^(n-2) = p[1]/2
              
              p[3] =
              comb(n,3)*1/n^3 * ((n-1)/n)^(n-3) = n*(n-1)*(n-2)/6 * 1/n^3 * ((n-1)/n)^(n-3) ~ p[2]/3. for large n
              
              Here are some exact values for n = 100
                        Exact
              p[0] = 0.3660
              p[1] = 0.3697
              p[2] = 0.1849
              p[3] = 0.0610
              p[4] = 0.0149
              
              In general, for k>2:
              
              p[k]~ p[k-1]/k
              This approximate relationship was brought to my attention by Victor Zamit, who observed it empirically and shared his observation in a 2011 Statalist post
              Last edited by Steve Samuels; 17 May 2015, 14:44.
              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2

              Comment


              • #8
                Here's a more readable version.

                p single draw probability for a single observation \( = \dfrac{1}{n}\).


                p[k] = P{draw a specified element k times in the sample}:


                \[
                p[k] = {n \choose k} \left(\frac{1}{n}\right)^k \left(1-\frac{1}{n}\right)^{n-k}
                \]

                \begin{eqnarray*}
                p[0] & = & \left(1-\frac{1}{n}\right)^n = \left(\frac{n-1}{n}\right)^{n} \\
                \\
                p[1] & = & {n\choose 1} \left(\frac{1}{n}\right) \left(\frac{n-1}{n}\right)^{(n-1)} \\
                & = & \left(\frac{n}{n-1}\right)p[0] \\
                \\
                p[2] & = &
                {n\choose 2} \left(\frac{1}{n}\right)^2\left(\frac{n-1}{n}\right)^{(n-2)} = \frac{p[1]}{2} \\
                \\
                p[3] & = & {n\choose 3}\left(\frac{1}{n}\right)^3 \left(\frac{n-1}{n}\right)^{n-3} \\
                & = & \left(\frac{n-2}{n-1}\right)\frac{p[2]}{3} \approx \frac{p[2]}{3}
                \end{eqnarray*}

                You will naturally want to compare the simulation results to exact values of p[k].

                It is interesting that for large \(n\)
                \[
                p[k] \approx \frac{p[k-1]}{k}
                \]

                This relationship was brought to my attention by Victor Zamit, who observed it empirically and shared his discovery in a 2011 Statalist post.
                Last edited by Steve Samuels; 18 May 2015, 17:34.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2

                Comment


                • #9
                  Thanks very much, Steven! Very useful comment. So the Matlab code becomes super easy as it is just:

                  B = 100; % number of boot replication
                  N = 3000; % number of obs
                  boot_draw = random('unid', N, [N,B]);

                  Am I correct?

                  Alternatively, I also want to use Stata's "bsample" to get bootstrap SE (with 100 replications) for a two-stage estimation manually.

                  My Stata code looks like:

                  use original_data, clear

                  forvalues r = 1/100 {

                  bsample bs_data

                  stage 1 estimation
                  stage 2 estimation with some regressor calculated based on results from stage 1

                  estimates store m`r'_`depvar' /* store results for the rth replication */

                  use original_data, clear /* load original data */
                  }

                  Is this correct?

                  Thanks very much!

                  Comment


                  • #10
                    I suggest that you ask Matlab questions on a Matlab forum. Your Stata code is incomplete at best: how are you going to analyze the stored estimates? In any case, there's never a need to loop bsample. Instead, define a program to do the analysis and return results.. One approach is to bsample within the program; then run the program with simulate. In the other approach, the program contains analytic code only and is run with bootstrap, which does all the sampling. Look for examples in the Manual entries for bootstrap and simulate. There's also a nice example of the bsample-simulate approach at the UCLA web site.http://www.ats.ucla.edu/stat/stata/faq/ownboot.htm. I usually use the bootstrap approach myself, but that's just habit.
                    Last edited by Steve Samuels; 19 May 2015, 10:08.
                    Steve Samuels
                    Statistical Consulting
                    [email protected]

                    Stata 14.2

                    Comment

                    Working...
                    X