Bootstrapping questions

Thomas Nielsen

Join Date: Aug 2015

Posts: 11
#1


11 Aug 2015, 00:59

Hi guys,

I'm using Stata 13.1, and let's use the example dataset "auto" for my questions. I have not done bootstrapping before, but I have read the bootstrapping chapter in "Microeconometrics Using Stata".

1) What is the difference between:
regress mpg weight gear foreign, vce(bootstrap, reps(100) seed(1))
bootstrap, reps(100) seed(1): regress mpg weight gear foreign

Both give me the same result (which is not surprising; please see attached), but is the methodology behind the two commands the same? That's my only concern.

2) Isn't it possible to save the "bootstrapped" dataset of, for example, 2,000 reps, i.e. the simulated data?
I would really like this, because I find it easier to do hypothesis testing, etc. if I have the "new" dataset.

3) Same as Q2, just with the residuals bootstrap approach:
With help from the Microeconometrics book, mentioned above, I use the following code:

use auto, clear
quietly regress mpg trunk price
predict uhat, resid
keep uhat
save residuals, replace
program bootresidual
version 11
drop _all
use residuals
bsample
merge using auto.dta
regress mpg trunk price
predict xb
generate ystar=xb+uhat
regress ystar trunk price
end

**
simulate _b, seed(1) reps(400) nodots: bootresidual
summarize

But, as in Q2, I would really like a "new" bootstrapped dataset. Is that possible? And when would you prefer 1) over 2)?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17703
#2

11 Aug 2015, 01:52

Thomas:
as far as your questions 2 and 3 are concerned, perhaps what you're looking for is the -saving()- option of the -bootstrap- command.

Kind regards,
Carlo
(Stata 19.0)
Thomas Nielsen

Join Date: Aug 2015

Posts: 11
#3

11 Aug 2015, 03:08

@Carlo: Thanks, you are probably right. Can you show me how to implement it? For example, how do I save the "bootstrapped" data in a new file?
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#4

11 Aug 2015, 05:07

As for 1), the two commands execute exactly the same procedure. If you run this in Stata 13.1:

Code:

sysuse auto, clear
set trace on
set tracedepth 2
regress mpg weight gear foreign, vce(bootstrap, reps(100) seed(1))

You'll see that the command called to obtain the bootstrapped standard errors is:

Code:

version 13.1: bootstrap , reps(100) seed(1) : regress mpg weight gear_ratio foreign

which is the second command.
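
If you would rather check the equivalence numerically than read the trace, here is a minimal sketch. The matrix names V_opt and V_pre are my own labels, not anything Stata defines. With the same seed and number of replications, the two variance matrices should agree:

```stata
sysuse auto, clear

* Option form: bootstrap requested through vce()
quietly regress mpg weight gear_ratio foreign, vce(bootstrap, reps(100) seed(1))
matrix V_opt = e(V)

* Prefix form: the same estimation wrapped in -bootstrap-
quietly bootstrap, reps(100) seed(1): regress mpg weight gear_ratio foreign
matrix V_pre = e(V)

* Relative difference between the two variance matrices
display "max relative difference in e(V): " mreldif(V_opt, V_pre)
```

A difference at or near zero is what you would expect if both calls run the same resampling.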

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17703

11 Aug 2015, 05:57

Thomas:
elaborating a bit on one of your code examples:

Code:

. use auto.dta, clear
(1978 Automobile Data)

. regress mpg weight gear foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   46.73
       Model |  1629.67805     3  543.226016           Prob > F      =  0.0000
    Residual |  813.781411    70  11.6254487           R-squared     =  0.6670
-------------+------------------------------           Adj R-squared =  0.6527
       Total |  2443.45946    73  33.4720474           Root MSE      =  3.4096

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   -.006139   .0007949    -7.72   0.000    -.0077245   -.0045536
  gear_ratio |   1.457113   1.541286     0.95   0.348    -1.616884     4.53111
     foreign |  -2.221682   1.234961    -1.80   0.076    -4.684735    .2413715
       _cons |   36.10135   6.285984     5.74   0.000     23.56435    48.63835
------------------------------------------------------------------------------

. bootstrap, reps(100) saving(C:\Users\Carlo Lazzaro\Desktop\bootstrap.dta, replace) seed(1) : regress mpg weight gear foreign
(running regress on estimation sample)
(note: file C:\Users\Carlo Lazzaro\Desktop\bootstrap.dta not found)

Bootstrap replications (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

Linear regression                               Number of obs      =        74
                                                Replications       =       100
                                                Wald chi2(3)       =    111.96
                                                Prob > chi2        =    0.0000
                                                R-squared          =    0.6670
                                                Adj R-squared      =    0.6527
                                                Root MSE           =    3.4096

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
         mpg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   -.006139   .0006498    -9.45   0.000    -.0074127   -.0048654
  gear_ratio |   1.457113   1.297786     1.12   0.262    -1.086501    4.000727
     foreign |  -2.221682   1.162728    -1.91   0.056    -4.500587    .0572236
       _cons |   36.10135    4.71779     7.65   0.000     26.85465    45.34805
------------------------------------------------------------------------------
. use "C:\Users\Carlo Lazzaro\Desktop\bootstrap.dta", clear
(bootstrap: regress)
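
As a rough sketch of what can be done with the saved file: it holds one row per replication and one variable per coefficient. The file name bsbetas is a hypothetical choice written to the working directory, and the percentile interval is shown only as an illustration, not as a recommendation over the normal-based interval in the table above:

```stata
sysuse auto, clear

* bsbetas.dta is a hypothetical file name in the current working directory
bootstrap, reps(100) seed(1) saving(bsbetas, replace) nodots: regress mpg weight gear_ratio foreign

* Load the saved replicates: one observation per bootstrap replication
use bsbetas, clear
describe

* The standard deviation of the replicates should match the bootstrap Std. Err.
summarize _b_weight

* Simple percentile interval from the replicates
_pctile _b_weight, percentiles(2.5 97.5)
display "percentile interval for weight: [" r(r1) ", " r(r2) "]"
```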

Kind regards,
Carlo
(Stata 19.0)


Thomas Nielsen

Join Date: Aug 2015

Posts: 11
#6

11 Aug 2015, 06:19

Thanks a lot, @Carlo. It's working, but isn't it just a lot of beta estimates that it saves? Please see attached:

I was thinking more of a "new" dataset. For example, if my data for gear was:
1, 10, 12, 13

It would now be:
1, 10, 12, 13, 23, 2, 17, 4, 2...

- the same goes for my other variables. Does that make sense?
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#7

11 Aug 2015, 06:45

2) Isn't it possible to save the "bootstrapped" dataset of, for example, 2,000 reps, i.e. the simulated data?
I would really like this, because I find it easier to do hypothesis testing, etc. if I have the "new" dataset.

It seems that what you want is a dataset containing each of the bootstrapped copies of the original data, as opposed to what Carlo provided, which is the set of beta estimates from each of the replication datasets. While this should be possible, I don't immediately see the usefulness of having this dataset. It will be a large dataset, with N*reps observations, where N is the number of observations and reps is the number of repetitions.

Can you provide an example of a hypothesis test that would be easier if you had this dataset in memory?

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Thomas Nielsen

Join Date: Aug 2015

Posts: 11
#8

11 Aug 2015, 08:58

Jorge Eduardo Perez Perez : For example, how do I perform one-way ANOVA on the bootstrapped "dataset"? (I have a country-variable in my original dataset)
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#9

11 Aug 2015, 10:02

To avoid confusion, let's call the dataset with the copies of the data for each bootstrap replication the "replication dataset", and the dataset with the beta estimates the "beta dataset".

It is not clear to me why you would want to perform any kind of analysis on the replication dataset as opposed to the beta dataset, so I can't answer that question unless you provide more details. Each replication is a sample, drawn with replacement, of the original dataset. The only purpose of these copies is to obtain new beta estimates from your original estimation command, in order to obtain standard errors for these betas. Running an analysis on the full replication dataset seems meaningless. Moreover, this replication dataset changes with the seed of your random draws.

Here's some code to obtain the replication dataset, however:

Code:

sysuse auto, clear
glo reps=10
reg price mpg, vce(bootstrap, reps($reps) seed(350))
* Label this original dataset repetition 0
gen rep=0
* The "bootstrapped dataset" behind these estimates is the chain of copies of the original dataset
* Each copy will be labeled by its replication number in rep
set seed 350
forv i=1(1)$reps {
    preserve
    sysuse auto, clear
    bsample
    gen rep=`i'
    tempfile b
    save `b'
    restore
    append using `b'
}
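
To illustrate why the replication dataset carries no more information than the beta dataset, here is a hedged sketch that rebuilds a small replication dataset and collapses it back to one row of estimates per replication with -statsby-. The choice of 10 replications and the variable name rep simply follow the code above. I make no claim that these draws reproduce the ones -bootstrap- makes internally:

```stata
sysuse auto, clear
gen rep = 0                      // rep 0 is the original sample
set seed 350
tempfile b
forvalues i = 1/10 {
    preserve
    sysuse auto, clear
    bsample                      // resample observations with replacement
    gen rep = `i'
    save `b', replace
    restore
    append using `b'
}

* One regression per block: the result is a "beta dataset" with 11 rows
statsby _b, by(rep) clear: regress price mpg
list

* Spread of the slope across the resampled blocks (rep 0 excluded)
summarize _b_mpg if rep > 0
```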

Last edited by Jorge Eduardo Perez Perez; 11 Aug 2015, 10:08.

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Thomas Nielsen

Join Date: Aug 2015

Posts: 11
#10

16 Aug 2015, 04:23

Jorge Eduardo Perez Perez: Thanks for providing the code, I really appreciate it. My problem is that I can't figure out the difference between Stata's built-in bootstrap command and the resampled-residuals code (i.e. the difference between 1) and 3) above).

When I use the code for the resampled residuals I only get the beta estimates but not the t-statistics, i.e. I don't know if the beta estimates are significant.

So, in general, would you suggest the built-in Stata command or the resampled-residuals approach? Or does it depend on the purpose of the estimation?
Jorge Eduardo Perez Perez

Join Date: Mar 2014

Posts: 429
#11

17 Aug 2015, 02:01

The bootstrap implemented in Stata is the "pairs" or "design matrix" bootstrap, where the whole data is resampled, as opposed to the "residuals" bootstrap, where only the residuals are resampled and reassigned to the original data observations.

You may want to look at section 13.2 of the Microeconometrics using Stata textbook you referenced. The residual bootstrap makes assumptions about the model, such as linearity and i.i.d errors in your example, whereas the pairs bootstrap does not make these assumptions. Either can be appropriate depending on the application.
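
To make the distinction concrete, here is a minimal sketch of a single draw under each scheme, using the auto data. The variable names xb, uhat, pick and ystar are illustrative choices of mine, and a real bootstrap would repeat each draw many times:

```stata
sysuse auto, clear
quietly regress mpg weight
predict double xb, xb
predict double uhat, residuals

* Residual bootstrap draw: regressors stay fixed, residuals are
* resampled with replacement and added back to the fitted values
set seed 1
gen long pick = 1 + floor(_N*runiform())   // random row index in 1.._N
gen double ystar = xb + uhat[pick]
regress ystar weight

* Pairs bootstrap draw: whole observations (y and x together) are resampled
preserve
bsample
regress mpg weight
restore
```

In the residual draw every original value of weight appears exactly once, while in the pairs draw some observations appear several times and others not at all.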

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com
Thomas Nielsen

Join Date: Aug 2015

Posts: 11
#12

25 Aug 2015, 06:37

Jorge Eduardo Perez Perez: Thanks. I had some time to look into it. However, I still don't know when to use the residual bootstrap and when to use the "pairs" bootstrap. Can you provide an example?

Jorge Eduardo Perez Perez

Join Date: Mar 2014
Posts: 429

#13

26 Aug 2015, 01:14

Here's a simple example: heteroskedasticity. Under heteroskedasticity, the residual bootstrap standard errors are not as close to the correct robust standard errors, because residuals with high error variance may be assigned to observations with low error variance. This example shows that in this setting, pairs bootstrap standard errors are closer to the correct robust standard errors:

Code:

clear
* Generate example data
set seed 946
set obs 100
gen x=uniform()
* Heteroskedasticity: variance of error is larger for 51/100
gen y=x+0.1*rnormal() in 1/50
replace y=x+0.5*rnormal() in 51/100

* OLS s.e are not correct
reg y x
est store ols
* Save data residuals for later
predict res, resid

preserve
keep res
tempfile res
save `res'
restore

preserve
drop res
tempfile data
save `data'
restore

* View heteroskedasticity
gen id=_n
gen ressq=res^2
scatter ressq id

* Should have robust se
reg y x, r
est store robust

* Now we look at the se from both bootstrapping schemes
* See which is closer to robust s.e

* Pairs bootstrap
reg y x, vce(bs, rep(400))
est store pairs

* Residual bootstrap
cap program drop bootresidual
program bootresidual
version 11
use `1', clear
bsample
merge 1:1 _n using `2'
reg y x
predict xb
generate ystar=xb+res
reg ystar x
end

* bootresidual `res' `data'

**
simulate _b, reps(400) nodots: bootresidual `res' `data'
ren _b_x x
est restore ols
mat b=e(b)
bstat, stat(b)
est store residual


* Tab and compare estimates
est tab *, b se keep(x)

Jorge Eduardo Pérez Pérez
www.jorgeperezperez.com


Thomas Nielsen

Join Date: Aug 2015

Posts: 11
#14

27 Aug 2015, 13:52

Jorge Eduardo Perez Perez: Thank you so much for clearing this up, and thank you for providing an example. I have found it very hard to find answers on this topic anywhere else (or, if I were a Ph.D., it might have been easier).