Bootstrapping r(mean) - Always Getting the Same Result

John Grove

Join Date: Aug 2016

Posts: 34
#1

05 Mar 2017, 14:57

I think there is something I don't understand about how -bootstrap- works.

I have a dataset with 200 observations; one of the variables is height. There are no missing values.

The mean height is 60.5.

When I run

Code:

bootstrap meanHeight=r(mean), size(10) reps(10) noisily: summarize height

The Observed Coef. of the output is always 60.5.

Since I have used the noisily option, the 10 random draws of 10 observations and the mean height for each draw (repetition) are displayed on the screen. If I use the "saving" option, the resulting file contains the mean produced by each repetition.

How do I get Stata to display the mean of those means? Do you have to open the saved file and take the mean of means?

Can't Stata display that mean in the output of -bootstrap-? I thought that was the whole point.

Thank you.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

05 Mar 2017, 15:40

"How do I get Stata to display the mean of those means? Do you have to open the saved file and take the mean of means?"

If that is what you want to do, that is how you have to do it.

"Can't Stata display that mean in the output of -bootstrap-? I thought that was the whole point."

Actually, no. Not only is that not the whole point, it isn't even appropriate to do. Bootstrapping's purpose is to give an estimate of the standard error when the usual normal theory approximations are not applicable or no closed-form estimator exists. But bootstrapping does not improve your estimation of the mean. The overall sample mean is actually a less biased estimate of the population mean than the mean of the means in the bootstrap samples. The -bootstrap- section of the online manual explains this in detail. You'll find it in the Introduction section of the -bootstrap- chapter of [R].
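A minimal sketch of this point, assuming the built-in auto.dta and its mpg variable in place of the poster's height data (the seed and the names full_mean and meanMpg are arbitrary choices for illustration). It suggests why the Observed Coef. never changes: it is the statistic computed on the full sample, and only the standard error comes from the resampling.

```stata
sysuse auto, clear
set seed 12345

* Statistic on the full sample
summarize mpg
local full_mean = r(mean)

* Bootstrap the same statistic; resampling only feeds the standard error
bootstrap meanMpg = r(mean), reps(50) nodots: summarize mpg

display "full-sample mean: " `full_mean'
display "observed coef.:   " _b[meanMpg]
display "bootstrap SE:     " _se[meanMpg]
```

Changing the seed or reps() should change the bootstrap SE line but not the first two.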

Last edited by Clyde Schechter; 05 Mar 2017, 15:43. Reason: Correct typos. Reduce verbosity.
John Grove

Join Date: Aug 2016

Posts: 34
#3

05 Mar 2017, 15:46

Thank you for that explanation. I wasn't suggesting that the resulting mean of means is more accurate. This is mostly an exercise.

Is there a way to obtain the mean of the randomly drawn means?

Do I need a loop for that?

Should the mean of each draw be stored in a macro, tempfile, array, etc. and then aggregated?

Thanks again.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

05 Mar 2017, 16:00

So, run your -bootstrap- with the -saving()- option. Whether you use a -tempfile- or a regular file is up to you. If you will have additional use for the individual bootstrap results after your current do-file runs, then a regular file makes sense. If they are of no further use, then a -tempfile- will keep you from cluttering up your file system. No loops, nor anything else complicated needed.

Code:

tempfile bootstrap_output
bootstrap meanHeight=r(mean), saving(`bootstrap_output') size(10) reps(10) noisily: summarize height
use `bootstrap_output'
summarize meanHeight

The other odd thing, by the way, is doing bootstrap samples of size 10 when you have 200 observations in your data. This is perfectly legal, but inefficient. Usually the bootstrap sample size is taken to be the size of the entire sample. Do you have a specific reason for this?
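As a rough illustration of the usual choice, here is a sketch that omits size() so each replicate resamples all _N observations. It assumes the built-in auto.dta rather than the poster's data, and the seed, reps(200), and local names are arbitrary. With a small size(), each replicate mean rests on fewer observations, so the reported standard error would describe a sample of that smaller size rather than the full dataset.

```stata
sysuse auto, clear
set seed 12345

* size() omitted: each replicate draws _N observations with replacement
bootstrap meanMpg = r(mean), reps(200) nodots: summarize mpg
local se_boot = _se[meanMpg]

* Analytic standard error of the mean, for comparison
quietly summarize mpg
local se_analytic = r(sd)/sqrt(r(N))

display "bootstrap SE: " `se_boot' "   analytic SE: " `se_analytic'
```

For a simple mean the two numbers should be close, which is a handy sanity check while learning the command.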
John Grove

Join Date: Aug 2016

Posts: 34
#5

05 Mar 2017, 16:21

Thank you, Clyde. That worked. There isn't really a specific reason to take a sample size of 10 out of 200 observations. I was mostly trying to get the technique down. I chose 10 just for the purpose of exercising this since it's a small enough number that I could easily check.

A followup question has to do with bootstrapping linear regression coefficients and plotting the resulting regression lines all on the same scatter plot. Is that doable?

Should I use bootstrap: regress or regress...vce?
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

05 Mar 2017, 17:39

Use -regress, vce(bootstrap)- to bootstrap linear regression coefficients.
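A short sketch of what that looks like, assuming the built-in auto.dta with price regressed on mpg (the seed, reps(100), and the local names are illustrative choices, not part of the original advice). The point estimates should match the plain regression; only the standard errors are replaced by bootstrap ones.

```stata
sysuse auto, clear

* Plain OLS for reference
regress price mpg
local b_plain  = _b[mpg]
local se_plain = _se[mpg]

* Same model with bootstrap standard errors
set seed 12345
regress price mpg, vce(bootstrap, reps(100) nodots)

display "slope (plain):     " `b_plain' "  SE: " `se_plain'
display "slope (bootstrap): " _b[mpg]   "  SE: " _se[mpg]
```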
John Grove

Join Date: Aug 2016

Posts: 34
#7

05 Mar 2017, 17:48

Ok, I got that to run. Can I superimpose the regression line that results from each sample on one scatter plot?
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#8

05 Mar 2017, 18:31

Not easily. Again, this is not what bootstrapping is for, but I understand you are just doing this as an exercise. So you can specify a -saving()- suboption in the -vce(bootstrap)- option that will save all of the regression coefficients in a temporary or permanent data file, one observation for each replication. Then you can load that file into memory and generate the graphs you want by looping over the reps and using -graph twoway function-. You probably also want to specify a -reps()- suboption in your -vce(bootstrap)- option, because the default is 50 reps and if you try to overlay 50 regression lines on a single graph you will probably just have a mess.
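To see what the saved file holds before building any graph, a sketch along these lines may help. It assumes the built-in auto.dta, and the tempfile name reps_file and the seed are made up for illustration. Each observation is one replication, with the coefficients stored in variables named _b_mpg and _b_cons.

```stata
sysuse auto, clear
set seed 12345

* Save the coefficients from each of 10 replications
tempfile reps_file
regress price mpg, vce(bootstrap, saving(`reps_file') reps(10) nodots)

* Load the saved replications: one row per bootstrap sample
use `reps_file', clear
describe
list _b_mpg _b_cons
```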
John Grove

Join Date: Aug 2016

Posts: 34
#9

05 Mar 2017, 18:36

Thank you. I will give that a try.
John Grove

Join Date: Aug 2016

Posts: 34
#10

05 Mar 2017, 18:47

For context, this is what I am trying to accomplish (without the animation):

https://www.stat.auckland.ac.nz/~wil...otstrap4-1.mp4

https://www.stat.auckland.ac.nz/~wil...otstrap5-1.mp4

I am trying to end up with the graph on the bottom right, again, without the animation. Just a static graph that overlays the regression lines.

I think it can be done in R fairly easily, but I am trying to do it in Stata...
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#11

05 Mar 2017, 20:10

OK. I don't have your data set, so I'll show you with the built-in auto.dta data set.

Code:

set more off
clear*
sysuse auto

summ mpg
local mpg_lo = r(min)
local mpg_hi = r(max)

regress price mpg
local b_cons = _b[_cons]
local b_mpg = _b[mpg]

local nreps 10
tempfile bootstrap_results
regress price mpg, vce(bootstrap, saving(`bootstrap_results') reps(`nreps'))

use `bootstrap_results', clear
local graph_cmd
forvalues i = 1/`nreps' {
    local graph_cmd `graph_cmd' || function y = _b_cons[`i'] + _b_mpg[`i']*x, range(`mpg_lo' `mpg_hi')
}
display `"`graph_cmd'"'

graph twoway function y = `b_cons' + `b_mpg'*x, ///
    range(`mpg_lo' `mpg_hi') legend(off) lwidth(thick) ///
    `graph_cmd'

So the logic is that you first do the sample regression and store the coefficients, as well as the range of the independent variable in local macros. Then do the regression with bootstrap vce and store the coefficients. Load in the coefficients and build up a long chain of -graph twoway function- commands, one for each bootstrap replication. Then run the graph twoway function command, with the long chain included. I distinguished the overall sample regression by making the line thicker, rather than setting the colors. You can use the various -graph twoway- options to modify that or other aspects of the graph. But this is the gist of it.

Joseph Coveney

Join Date: Apr 2014
Posts: 4421

#12

05 Mar 2017, 20:10

You can start with something like the following. It uses the second link's dataset and creates a graph of the slopes looking like the one at the end of the run. I didn't bother, but you can use graph twoway's text box option to write in the numbers shown there. There probably are shortcuts to a lot of what I show below, and others can point you to them.

Code:

version 14.2

clear *
set more off
set seed 1376973

input double Growth int Volume
0.36  22
0.09   6
0.67  93
0.44  62
0.72  84
0.24  14
0.33  52
0.61  69
0.66 104
0.8  100
end

// Get bootstrap slopes
tempfile tmpfil0
bootstrap y1 = _b[Volume], saving(`tmpfil0') reps(30) level(90) nodots: regress Growth c.Volume
estat bootstrap, percentile

// Get slope
tempname B
matrix define `B' = e(b)
drop _all
svmat double `B', name(y)
generate byte bs = 0

// Put slopes together
append using `tmpfil0'
quietly replace bs = missing(bs)
sort bs y1 // Sorting for 90% confidence bounds
quietly replace bs = sum(bs)

// Create the two points for each slope
generate byte x1 = 1
generate byte x0 = 0
generate double y0 = 0
quietly reshape long x y, i(bs) j(seq)

// Graph them
local command graph twoway
summarize bs, meanonly
forvalues i = 1/`=r(max)' {
    local color = cond(inlist(`i', 2, 29), "blue", "red")
    local command `command' line y x if bs == `i', lcolor(`color') ||
}
local command `command' line y x if !bs, lcolor(blue)

`command' ytitle("") xtitle("") ylabel( , angle(horizontal) nogrid) legend(off)

exit


John Grove

Join Date: Aug 2016

Posts: 34
#13

05 Mar 2017, 20:53

Originally posted by Clyde Schechter:

So the logic is that you first do the sample regression and store the coefficients, as well as the range of the independent variable in local macros. Then do the regression with bootstrap vce and store the coefficients. Load in the coefficients and build up a long chain of -graph twoway function- commands, one for each bootstrap replication. Then run the graph twoway function command, with the long chain included. I distinguished the overall sample regression by making the line thicker, rather than setting the colors. You can use the various -graph twoway- options to modify that or other aspects of the graph. But this is the gist of it.

This is brilliant, Clyde. Thank you very much for taking the time to craft that. It works well, and I have learned a lot!

Joseph, thank you as well. Your example is harder for me to grasp but I will try to work through it too.