Bootstrapping of mean in a subsample

Troels Kristensen

Join Date: Oct 2021

Posts: 23
#1

Bootstrapping of mean in a subsample

07 Oct 2021, 04:12

I'm analyzing a sample of GPs. I want to describe the GP's with extreme values of a variable. This means the subgroups were the variable is above P90 and below P10. I want to bootstrap the entire sample and use the bootstrapped p90 and p10 results as the definitions of the groups above p90 and below p10. Next I want to know the mean and confidence intervals for the mean for these two groups. How can this be done in stata?

I am able to use: bootstrap r(p90), rep(1000): summarize var, detail

However, this only gives the P90 value. I also want to know the bootstrapped mean and CI for the group above p90 :-) I hope this is possible?
Tags: None

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17714

07 Oct 2021, 04:34

Troels:
welcome to this forum.
As far as your first question is concerned (I'm not clear with the reason why you want to -bootstrap- r(p10) and r(p90), but it's surely my fault), you may want to try:

Code:

. use "C:\Program Files\Stata16\ado\base\a\auto.dta"
(1978 Automobile Data)

. quietly sum price, d

. bootstrap r(p10) r(p90), reps(1000) bca ties : sum price, d

warning: Because summarize is not an estimation command or does not set e(sample), bootstrap has no way to determine which
         observations are used in calculating the statistics and so assumes that all observations are used. This means that no
         observations will be excluded from the resampling because of missing values or other reasons.

         If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure
         that the dataset in memory contains only the relevant data.
(running summarize on estimation sample)

Jackknife replications (74)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
........................

Bootstrap replications (1000)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..................................................   100
..................................................   150
..................................................   200
..................................................   250
..................................................   300
..................................................   350
..................................................   400
..................................................   450
..................................................   500
..................................................   550
..................................................   600
..................................................   650
..................................................   700
..................................................   750
..................................................   800
..................................................   850
..................................................   900
..................................................   950
..................................................  1000

Bootstrap results                               Number of obs     =         74
                                                Replications      =      1,000

      command:  summarize price, d
        _bs_1:  r(p10)
        _bs_2:  r(p90)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P&gt;|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |       3895   99.38821    39.19   0.000     3700.203    4089.797
       _bs_2 |      11385   1344.008     8.47   0.000     8750.793    14019.21
------------------------------------------------------------------------------

.

Once you've defined the GPs group you're interested in, you can -bootstrap- the (subsample) mean of the variable of interest, along with a set of -bootstrap- 95% CIs.

Kind regards,
Carlo
(Stata 19.0)

Comment

Troels Kristensen

Join Date: Oct 2021

Posts: 23
#3

07 Oct 2021, 05:44

Hi Carlo
Glad to join this community. Your response is helpful :-) I am able to do what you have suggested. However, my intention is to use the individual bootstrap results to bootstrap the mean of the variable over P90 and below P10 (and other variables later) - but based on the bootstrap of the entire sample rather than the subsample. This means based on each bootstrap rep of the entire sample I want to describe the group of GPs in the top and bottom - at least via a bootstrapped mean from these extreme groups. So In the end - I want to describe how the average GPs are in these extreme groups. I get at subsample of N= 39 of the data over P90 and N=46 below P10 . I can bootstrap based on these subsample but I believe I need to use the entire sample of about 500 (rather than e.g. 39) and next derive the groups below p10 and above p90 including the estimated means and CIs - that I am looking for. Can you help me to get a bit closer to these bootstrapped descriptives for the subsamples based on the entire sample of around 500? Maybe I need to write a program or so? just looking for the most efficient way?
Comment
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#4

07 Oct 2021, 05:52

Basically I want to make at table with the following heading: High variable GPs Low variable GPs
Mean CI CV CI Mean CI CV CI
Variable..
Variable..

Besides the boostrapped Mean and CI to describe the high and low extreme groups - I also plan to calculate coefficient of variation and related CI. I'm just not able to do it all in one operation?
Comment
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#5

07 Oct 2021, 06:05

In terms of your price variable example. What is the bootstrapped mean price and CI for the group of prices below P10 and above P90, respectively? Does this makes sense?
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 714
#6

07 Oct 2021, 06:19

I understand this as follows, see this example:

Code:

sysuse nlsw88, clear sum wage, det gen flag = 1 if wage < r(p10) bootstrap r(mean), reps(999) seed(123) nodots: sum wage if flag == 1, meanonly estat bootstrap, bc

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Comment
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#7

07 Oct 2021, 07:03

Hi Felix
The challenge is that your suggestion drops a part of the original sample. This means the entire sample is not used for the bootstrap of the mean in the subsample. In case of bootstrapping of the entire sample the subsamples and their means will be different than if the bootstrap procedure is conducted only on a part of the sample - as far as I understand it.
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 714
#8

07 Oct 2021, 07:12

As far as I see it, you either want this solution or Carlo's solution from post #2. Do you have an example of what you intend, in a paper or anything?

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Comment
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#9

07 Oct 2021, 07:26

Just checked you suggestion. The bootstrap will be based on the subsample (N=46) rather than the entire sample that may influence the bootstrapped values for the 46 observations in the bottom(because the can be switched).
Comment
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#10

07 Oct 2021, 11:01

Among other things, the question is whether it is okay to bootstrap subgroups of extreme values rather than the entire sample to create means and confidence intervals for ekstreme groups. HOW CAN THE BOOSTRAPPED MEANS AND CIs BE ESTIMATED BASED ON THE COMPLETE SAMPLE IN STATA - if it makes sense?
Comment
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#11

07 Oct 2021, 11:22

If you bootstrap based on a subsample you can estimate CIs for the mean, coefficient of variation etc via stata. However, It may be more appropriate to bootstrap the entire sample and calculate CI for the extreme groups based on the entire data set?
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#12

07 Oct 2021, 15:17

Seems to me that since you are trying to estimate quantities in the tails of a distribution, and these are based on the empirical CDF, you should be bootstrapping the entire sample first, and then computing quantities in those tails. I think the only way to do this is by writing a custom program.
Comment

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2403

#13

07 Oct 2021, 15:39

This might get you started in a useful direction.

Code:

clear *
cls

set seed 17

sysuse auto

cap progam drop myprog
program define myprog, rclass
  version 17
  syntax varlist(max=1) [if] [in]
 
  unab v : `varlist'
  confirm numeric var `v'
  marksample touse
 
  tempname p10 p90 p10_mean p90_mean
  qui summ `v', det
  scalar `p10' = r(p10)
  scalar `p90' = r(p90)
  return scalar p10 = `p10'
  return scalar p90 = `p90'

  qui mean `v' if `touse' & inrange(`v', ., `p10')
  scalar `p10_mean' = r(table)["b",1]
  return scalar p10_mean = `p10_mean'
 
  qui mean `v' if `touse' & inrange(`v', `p90', .)
  scalar `p90_mean' = r(table)["b",1]  
  return scalar p90_mean = `p90_mean'
end

myprog price
ret list

bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean) : myprog price

Results

Code:

. bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean) : myprog price
(running myprog on estimation sample)

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                                           Number of obs =  8
                                                            Replications  = 50

      Command: myprog price
     p10_mean: r(p10_mean)
     p90_mean: r(p90_mean)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    p10_mean |    3665.75   165.8556    22.10   0.000     3340.679    3990.821
    p90_mean |   13166.63   851.6963    15.46   0.000     11497.33    14835.92
------------------------------------------------------------------------------

Comment

Troels Kristensen

Join Date: Oct 2021

Posts: 23
#14

08 Oct 2021, 08:38

Thank you very much. I hope this will be useful. However, so far - I have encountered a type mismatch error in my version of the program, see below. I use the sysuse auto data set.
The code "ret list" gives some of the values. But something is wrong with the code. Can you help me? I think price is a numeric type. This means it should be okay for bootstrap calculations but does not work yet?
Comment

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17714

#15

08 Oct 2021, 08:58

Troels:
if you do not have Stata 17 yet, change the number 17 with the one of your actual Stata release.
With ths trick, Leonardo's code works for me like a charm:

Code:

. sysuse auto
(1978 Automobile Data)

.
.
.
. cap progam drop myprog

.
. program define myprog, rclass
  1.
.   version 16
  2.
.   syntax varlist(max=1) [if] [in]
  3.
. 
.
.   unab v : `varlist'
  4.
.   confirm numeric var `v'
  5.
.   marksample touse
  6.
. 
.
.   tempname p10 p90 p10_mean p90_mean
  7.
.   qui summ `v', det
  8.
.   scalar `p10' = r(p10)
  9.
.   scalar `p90' = r(p90)
 10.
.   return scalar p10 = `p10'
 11.
.   return scalar p90 = `p90'
 12.
.
.
.   qui mean `v' if `touse' & inrange(`v', ., `p10')
 13.
.   scalar `p10_mean' = r(table)["b",1]
 14.
.   return scalar p10_mean = `p10_mean'
 15.
. 
.
.   qui mean `v' if `touse' & inrange(`v', `p90', .)
 16.
.   scalar `p90_mean' = r(table)["b",1] 
 17.
.   return scalar p90_mean = `p90_mean'
 18.
. end

.
.
.
. myprog price

.
. ret list

scalars:
           r(p90_mean) =  13166.625
           r(p10_mean) =  3665.75
                r(p90) =  11385
                r(p10) =  3895

.
.
.
. bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean) : myprog price
(running myprog on estimation sample)

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                               Number of obs     =          8
                                                Replications      =         50

      command:  myprog price
     p10_mean:  r(p10_mean)
     p90_mean:  r(p90_mean)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    p10_mean |    3665.75    269.372    13.61   0.000     3137.791    4193.709
    p90_mean |   13166.63   927.0146    14.20   0.000     11349.71    14983.54
------------------------------------------------------------------------------

.

Kind regards,
Carlo
(Stata 19.0)

Announcement