
  • Is there a command that is equivalent to "bsample more than _N"?

    The command

    bsample B

    with B > _N

    (e.g., if _N = 100, trying to sample more than 100 times with replacement)

    yields
    resample size must not be greater than number of observations
    r(498);

    Is there a command that can sample with replacement more than _N times? I realize I could loop floor(B/_N) times and handle the remainder in one final pass, but I'm hoping there is a compact command.

    Also, if your time allows, I'm curious: why is bsample limited in this way? It samples with replacement, so why can't it generate a randomly selected sample with more observations than the original dataset?
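
    For concreteness, the loop-and-append workaround described above might be sketched as follows. This is only a rough sketch; the accumulator tempfile and the macro names are all illustrative:

    Code:
    * sketch: draw B observations with replacement by appending bootstrap samples
    local B 155                          // desired number of draws (B > _N)
    local N = _N
    tempfile orig acc
    save `orig'                          // keep the original data
    clear
    save `acc', emptyok                  // accumulator starts empty
    forvalues i = 1/`=floor(`B'/`N')' {
        use `orig', clear
        bsample                          // one full bootstrap sample of size _N
        append using `acc'
        save `acc', replace
    }
    if mod(`B',`N') > 0 {
        use `orig', clear
        bsample mod(`B',`N')             // and once more on the remainder
        append using `acc'
        save `acc', replace
    }
    use `acc', clear                     // _N now equals B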

  • #2
    A bootstrap sample is the same size as the original sample. Stata isn't just being restrictive in practice; in principle the request makes no sense.



    • #3
      If a bootstrap sample is always the same size as the original sample, then why does bsample allow B < _N?

      Also, is there a command that can sample with replacement more than _N times? (I am looking to sample
      with equal probability assigned to each observation in the original dataset.)

      Thank you
      Last edited by Tom Best; 21 Dec 2018, 09:48.



      • #4
        Fair point, but where would you expect Stata to store the extra values?



        • #5
          One option that would work (for my purposes at least) would be to store the counts in my_var, where my_var is submitted to the weight option of bsample (i.e., weight(my_var)). Then one could use expand to add the drawn observations to the dataset.
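
          As a rough sketch of that idea (my_var is only an illustrative name, and the drop line guards against observations that were never drawn):

          Code:
          generate my_var = 0              // weight() fills an existing variable
          bsample, weight(my_var)          // store counts in my_var; data left in place
          drop if my_var == 0              // remove observations never drawn
          expand my_var                    // each remaining row becomes my_var copies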



          • #6
            I understand that the principle of bootstrap estimation is to simulate the distribution of a function of N observations of data by simulating draws of N observations from the empirical distribution of the data.

            It's also the case that one might want to generate draws of, say, M>N observations from the empirical distribution given by the data for purposes other than the bootstrap. The bsample command would seem to be the tool, but for the restriction that B<=N. So generating multiple bsamples and appending them appears to be the approach you will need.

            Depending on your objective, you might find the discussion at the post below helpful. In responding to it, I had thought about recommending bootstrapping but ran into the same issue you did.

            https://www.statalist.org/forums/for...ical-variables

            And based on that, I have had an idea.
            Code:
            clear
            set obs 10
            generate id = 100+_n
            generate f = 0                       // frequency count for each observation
            local B 150                          // desired sample size
            local N = c(N)
            set seed 666
            forvalues i = 1/`B' {
                local obs = runiformint(1,`N')   // draw one observation number at random
                quietly replace f = f+1 in `obs'
            }
            list, clean noobs
            tab id [fweight=f]
            expand f
            tab id
            Code:
            . list, clean noobs
            
                 id    f  
                101   16  
                102   12  
                103    9  
                104   12  
                105   21  
                106   13  
                107   16  
                108   16  
                109   19  
                110   16  
            
            . tab id [fweight=f]
            
                     id |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                    101 |         16       10.67       10.67
                    102 |         12        8.00       18.67
                    103 |          9        6.00       24.67
                    104 |         12        8.00       32.67
                    105 |         21       14.00       46.67
                    106 |         13        8.67       55.33
                    107 |         16       10.67       66.00
                    108 |         16       10.67       76.67
                    109 |         19       12.67       89.33
                    110 |         16       10.67      100.00
            ------------+-----------------------------------
                  Total |        150      100.00
            
            . expand f
            (140 observations created)
            
            . tab id
            
                     id |      Freq.     Percent        Cum.
            ------------+-----------------------------------
                    101 |         16       10.67       10.67
                    102 |         12        8.00       18.67
                    103 |          9        6.00       24.67
                    104 |         12        8.00       32.67
                    105 |         21       14.00       46.67
                    106 |         13        8.67       55.33
                    107 |         16       10.67       66.00
                    108 |         16       10.67       76.67
                    109 |         19       12.67       89.33
                    110 |         16       10.67      100.00
            ------------+-----------------------------------
                  Total |        150      100.00
            Last edited by William Lisowski; 21 Dec 2018, 10:49.



            • #7
              Thank you, this may be faster than my current solution. I also use "if" options, but I think I might still be able to do something with sort and then runiformint.

              In jest, though, I must say I would have chosen another seed; call me superstitious.



              • #8
                I have also been annoyed by the inability of -bsample- to generate bootstrap samples bigger than _N on a couple of occasions.

                I can only speculate, but the reason for this feature of -bsample- may be that whoever programmed the command was familiar with the literature on subsampling (of which drawing bootstrap samples of size _N is just a small sub-field) and did not see applications for drawing samples bigger than _N, or alternatively wanted to save users from harming themselves by attempting to draw samples bigger than _N.

                My understanding of the subsampling literature (of which, again, bootstrap samples of size _N are a sub-field) is as follows:

                1. There are quite a few situations in which the standard bootstrap with samples of size _N fails. In some of these cases, drawing subsamples of size M<_N resolves the problem. (This is called "m out of n" bootstrapping, I think.)

                2. I am unaware of any case in which drawing samples of size M>_N "works", i.e., fixes a problem that drawing samples of size _N cannot fix.

                These issues are explained in the textbook: Politis, D. N., Romano, J. P., and Wolf, M. (1999). Subsampling. Springer Series in Statistics. Springer, New York.



                • #9
                  My choice of seed was a preemptive acknowledgement to readers who feel there is something unholy about wanting to sample M>N.

                  For those unfamiliar with the special meaning of 666 being alluded to, see

                  https://en.wikipedia.org/wiki/Number_of_the_Beast

                  I appreciate the clarification in post #8 about the reason M<N is included in bootstrap methodology, and it inspires me to think that M>N is excluded because, if you used bootstrap commands with M>N to find the empirical distribution and confidence limits for a statistic computed from N observations, the results would be incorrect, much as if you had used the t distribution with 20 degrees of freedom to find the critical value for a test of the mean of 11 observations. It's not hard to imagine a newbie wandering down the wrong path, abusing bootstrap methodology in the search for asterisks.

                  Finally, I learned about the weight() option on bsample, leading to the following revised technique.
                  Code:
                  clear
                  set obs 10
                  generate id = 100+_n
                  generate f = 0
                  generate w = 0
                  local N `c(N)'
                  local M 155
                  set seed 42
                  forvalue i=1(`N')`M' {
                      bsample min(`N',`M'-`i'+1), weight(w)
                      quietly replace f = f+w
                      }
                  list, clean noobs
                  tab id [fweight=f]
                  expand f
                  tab id
                  Code:
                  . list, clean noobs
                  
                       id    f   w  
                      110   13   0  
                      109   18   0  
                      105   13   0  
                      103   20   0  
                      107   23   3  
                      101   12   0  
                      102   15   0  
                      106   10   0  
                      108   17   1  
                      104   14   1  
                  
                  . tab id [fweight=f]
                  
                           id |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                          101 |         12        7.74        7.74
                          102 |         15        9.68       17.42
                          103 |         20       12.90       30.32
                          104 |         14        9.03       39.35
                          105 |         13        8.39       47.74
                          106 |         10        6.45       54.19
                          107 |         23       14.84       69.03
                          108 |         17       10.97       80.00
                          109 |         18       11.61       91.61
                          110 |         13        8.39      100.00
                  ------------+-----------------------------------
                        Total |        155      100.00
                  
                  . expand f
                  (145 observations created)
                  
                  . tab id
                  
                           id |      Freq.     Percent        Cum.
                  ------------+-----------------------------------
                          101 |         12        7.74        7.74
                          102 |         15        9.68       17.42
                          103 |         20       12.90       30.32
                          104 |         14        9.03       39.35
                          105 |         13        8.39       47.74
                          106 |         10        6.45       54.19
                          107 |         23       14.84       69.03
                          108 |         17       10.97       80.00
                          109 |         18       11.61       91.61
                          110 |         13        8.39      100.00
                  ------------+-----------------------------------
                        Total |        155      100.00
                  Last edited by William Lisowski; 21 Dec 2018, 13:06.



                  • #10
                    Originally posted by William Lisowski View Post
                    "M>N is excluded because if you used bootstrap commands with M>N to find the empirical distribution and confidence limits for a statistic computed from N observations, the results would be incorrect - as if you had used the t distribution with 20 degrees of freedom to find the critical value for a test of the mean of 11 observations."

                    Code:
                    forvalue i=1(`N')`M' {
                        bsample min(`N',`M'-`i'+1), weight(w)
                        quietly replace f = f+w
                    }
                    I thought to say this, but apparently I did not, and I meant exactly what you are saying. Bad things happen when one sets M>_N in the lines you quoted: we would be confusing our regressions just as we would if we -expand-ed our sample to "gain more observations." This does not mean that M>_N never makes sense; it may just mean that people have not yet found applications for it, which is why experts regard it as a no-no.

                    William, can you please be kind and explain in plain words what the loop I have quoted from your code does? As you put it in some other thread, this piece of the code is rather "subtle"...



                    • #11

                      Here's a mind dump that I won't take the time to edit into something shorter.

                      At the start of the loop, the local macro N has been set to 10, the number of observations in the dataset, and the local macro M has been set to 155, the number of observations desired in the sample. So the command
                      Code:
                      forvalue i=1(`N')`M' {
                      is interpreted as
                      Code:
                      forvalue i=1(10)155 {
                      Thus as the loop executes, the local macro i will successively take the values 1, 11, 21, ... 151, and then the loop will exit.

                      The first time through the loop, the command
                      Code:
                      bsample min(`N',`M'-`i'+1), weight(w)
                      is interpreted as
                      Code:
                      bsample min(10,155-1+1), weight(w)
                      The expression following the command tells bsample how many observations to sample on this pass through the loop, and the min function will ensure it is never more than the number of observations in the dataset, nor more than the number of observations remaining to add to the sample we want to create. The last time through the loop, it is interpreted as
                      Code:
                      bsample min(10,155-151+1), weight(w)
                      which says, hey, we've already added 150 observations to our cumulative sample, we only need 5 more, not a full sample of 10.

                      The great thing is the weight option (do see the output of help bsample for full details), which tells bsample NOT to replace the data in memory with each new sample, but instead to replace the value of the variable w with the number of times each observation would have been included in the new sample. So if you look at the output in post #9, you'll see that on the final pass through the loop, id 107 was chosen 3 times and ids 108 and 104 were each chosen 1 time (and to my surprise, it appears that bsample changes the order of the dataset; I should have re-sorted it by id when I exited the loop!).

                      The command
                      Code:
                      quietly replace f = f+w
                      takes the counts that each run of bsample stores in w and adds them into a cumulative counter I call f, which counts the total frequency with which each observation will be included in the final sample.

                      When I'm done with the loop, the command
                      Code:
                      tab id [fweight=f]
                      shows you that if you have an analysis that can be done using frequency weights, you don't even need to generate the full sample of M observations, with in this case 12 copies of id 101, and so forth.

                      But if you really do want the full set of observations, then
                      Code:
                      expand f
                      expands your dataset so that f-1 copies of each observation are added to it (and again you'd want to sort afterwards; I didn't, so I think all the "new" observations are at the end of the dataset; help expand will tell you more, I'm sure). So you see that the weighted tab of the 10-observation dataset is identical to the unweighted tab of the 155-observation dataset.

                      The inclusion of the weight option, and the fact that it replaces an existing variable rather than creating a new one (so I didn't have to drop w after adding it into f), leads me to suspect that it was added precisely to allow bsample to be used in the manner I show, to create samples with M>N with minimal effort, while preventing users from doing so in the bootstrap framework.

                      With regard to the purpose of M>N, while I think that will never be appropriate in the bootstrap framework, I think it is perfectly appropriate in the context of simulation. The topic I linked to in post #6, which you and I participated in, was a simulation: the user wanted to simulate data drawn from the same distribution as the sample but with increasing numbers of observations, to see what effect increasing sample size would have.



                      • #12
                        I have on several occasions needed to generate a sample of size M, sampling with replacement, from a starting sample of size N, with M >> N. Not for bootstrapping, but to generate a simulated population from a stipulated joint distribution of variables. The procedure I use is this.

                        The starting sample of size N, which represents the stipulated joint distribution of variables in the population to be simulated, is saved in its own dataset, along with a variable that uniquely identifies observations. Let's call that variable obs_no (usually I create it as the observation number in the dataset), and let's call the dataset pjd.dta.

                        Code:
                        quietly des using pjd
                        local N = r(N)
                        
                        clear
                        set obs `M' // PREVIOUSLY DEFINED TO CONTAIN DESIRED SAMPLE SIZE
                        set seed 13 // STILL FEELING SUPERSTITIOUS?
                        
                        gen obs_no = runiformint(1, `N')
                        merge m:1 obs_no using pjd, keep(match) nogenerate



                        • #13
                          Notwithstanding the great code presented above, from which I learnt a lot in terms of techniques and of how other programmers think about the problem, I believe we are over-thinking the problem and complicating it unnecessarily. (Apart from Clyde, whose solution is reasonably simple and practical.)

                          I think the following is even simpler than Clyde's solution, and would do the trick. If a bootstrap sample of size M>_N is desired, simply

                          1. Expand the sample, with equal weight on each observation, to some size P>M using -expand P/_N-; any P will do really, it just needs to be bigger than M.
                          2. Draw the sample of size M using -bsample M-.

                          In the example that William was using

                          Code:
                          . clear
                          
                          . set obs 10
                          number of observations (_N) was 0, now 10
                          
                          . gen id = 100 + _n
                          
                          . set seed 42
                          
                          . expand 16
                          (150 observations created)
                          
                          . bsample 155
                          
                          . tab id
                          
                                   id |      Freq.     Percent        Cum.
                          ------------+-----------------------------------
                                  101 |         13        8.39        8.39
                                  102 |         26       16.77       25.16
                                  103 |         13        8.39       33.55
                                  104 |         13        8.39       41.94
                                  105 |         18       11.61       53.55
                                  106 |         12        7.74       61.29
                                  107 |         12        7.74       69.03
                                  108 |         21       13.55       82.58
                                  109 |         16       10.32       92.90
                                  110 |         11        7.10      100.00
                          ------------+-----------------------------------
                                Total |        155      100.00
                          
                          .
                          In this example, any expansion (-expand 16-, -expand 25-, -expand 30-, etc.) that results in a sampling population bigger than 155 would do the trick. The key, it seems to me, is that when we expand like this, the relative frequency of each observation in the expanded sample is the same as its relative frequency in the original sample. As long as this holds, bootstrap sampling from the expanded sample gives the correct equal weight to each observation in the size-M bootstrap sample.



                          • #14
                            I personally prefer Clyde's approach, with the caveat that if I were doing it in a loop, I'd probably see if the frequency weighting approach would work for my purposes, and if so, modify his approach to collapse the M-observation dataset into a dataset of frequencies.



                            • #15
                              Belated thanks to all for their thorough responses and suggested solutions. Clyde's approach is what I will use for my application.

