Bootstrapping of mean in a subsample

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#16

08 Oct 2021, 09:04

Thanks, Carlo. It's a good idea to begin programs with a -version #- statement, typically the version you are using, so that Stata knows the minimum version needed to run the program and will interpret it the same way in future versions. There's nothing special here that requires version 17, so you can replace 17 with your own version to get it to run (or simply omit the statement altogether for learning purposes, but not for serious work).

Note that I also had a typo in the command; it should be corrected with the text in red.

Code:

cap program drop myprog
program define myprog, rclass
  version 17
  syntax varlist(max=1) [if] [in]

  unab v : `varlist'
  confirm numeric var `v'
  marksample touse

  tempname p10 p90 p10_mean p90_mean
  qui summ `v' if `touse', det
  scalar `p10' = r(p10)
  scalar `p90' = r(p90)
  return scalar p10 = `p10'
  return scalar p90 = `p90'

  qui mean `v' if `touse' & inrange(`v', ., `p10')
  scalar `p10_mean' = r(table)["b",1]
  return scalar p10_mean = `p10_mean'

  qui mean `v' if `touse' & inrange(`v', `p90', .)
  scalar `p90_mean' = r(table)["b",1]
  return scalar p90_mean = `p90_mean'
end
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#17

08 Oct 2021, 16:37

Hi Carlo,
I have Stata 17 on my laptop. I figured out that the program works on my laptop when I copy the code. The problem (mismatch) appears to be that I changed the text when I typed it into our secure server, where I plan to use this program (I still need to figure out what the typo was). I have tried to capture the bootstrapped confidence interval via e(ci_normal) or similar. I need that for my do-files on the server. However, I am not sure all of these "stored" functions work after the customized program. Do you know whether the bootstrapped CIs can be captured via e(?)? Besides the mean CI I also need to bootstrap the coefficient of variation (CV) and the related CI (as was done for the mean). Is it easy to extend the program to do that as well? Thank you for your kind help in advance :-) It is very useful for me and probably others :-) I think the problem has a general nature!
Best Troels
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#18

08 Oct 2021, 16:50

Hi Leonardo
Thank you for pointing out the typo. I will adjust my version accordingly. I forgot that it was you who made the program when I responded to Carlo. Do you think the program can be extended? I am not sure all of these "stored" functions work after the customized program. Do you know whether the bootstrapped CIs can be captured via e(?)? Besides the mean CI I also need to bootstrap the coefficient of variation (CV) and the related CI (as was done for the mean). Is it easy to extend the program to do that as well? (I do not think these programs are easy :-) Thank you for your kind help in advance :-) It is very useful for me and probably others :-) I think the problem has a general nature, as mentioned to Carlo above!
Best Troels
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#19

08 Oct 2021, 17:15

You're welcome.

Originally posted by Troels Kristensen

Do you think the program can be extended? Besides the mean CI I also need to bootstrap the Coefficient of variation(CV) and the related CI (like it was the case for the mean). Is it easy to extend the program to do that as well?

Yes, the program can be extended. In fact, it's a general purpose strategy to create such a custom program for use with -bootstrap- or -simulate- (among others) when the procedures you want are either multiple or require more than just a single command. You can also make more than one program, if that makes more sense. In your case, you could add the CV to the program in #16.

I am not sure all of these "stored" functions work after the customized program.

I'm not sure what you mean. The program persists as long as the Stata session is in existence, so you can run it multiple times.

Do you know whether the bootstrapped CIs can be captured via e(?)?

The output of -help bootstrap- will tell you where things are returned. The display table is returned in -r(table)- and most other statistics are returned in -e()-, such as -e(b_bs)- for the bootstrap point estimates; there are similar matrices for the CIs.
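As a rough sketch of how the CIs could be picked up in a do-file: the example below uses the auto data shipped with Stata and bootstraps a plain -summarize- so that it runs on its own, but the same lines should apply after bootstrapping -myprog-. The names ci, lb and ub and the seed are arbitrary choices for illustration, not something from the posts above.

```stata
sysuse auto, clear

* bootstrap the mean of price; the seed is arbitrary and only makes the run repeatable
bootstrap mean=r(mean), reps(200) seed(12345): summarize price

ereturn list                 // everything -bootstrap- left behind in e()

* normal-based limits: one column per statistic, row 1 = lower, row 2 = upper
matrix ci = e(ci_normal)
matrix list ci
scalar lb = ci[1, 1]
scalar ub = ci[2, 1]
display "95% normal-based CI for the mean: " lb " to " ub

* percentile and bias-corrected limits are stored in the same layout
matrix list e(ci_percentile)
matrix list e(ci_bc)
```

With several statistics in the expression list, each gets its own column in these matrices, in the order they were requested.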

I do not think these programs are easy

It's quite normal to think so when starting out, and I certainly wouldn't fault you for thinking they were difficult. To learn how these programs are made, and later to tweak them or write your own, I recommend that you read the PDF User's Guide manual, paying particular attention to Chapter 18: Programming Stata. It will no doubt take some time, but it will pay dividends.

Leonardo Guizzetti

Join Date: Jul 2016
Posts: 2403

#20

08 Oct 2021, 17:19

This is how you could add the CV estimation in the full sample to the same program that then takes the means of the tails.

Code:

cap program drop myprog
program define myprog, rclass
  version 17
  syntax varlist(max=1) [if] [in]
 
  unab v : `varlist'
  confirm numeric var `v'
  marksample touse
 
  tempname mean var cv p10 p90 p10_mean p90_mean
  qui summ `v' if `touse', det
  scalar `p10' = r(p10)
  scalar `p90' = r(p90)
  scalar `mean' = r(mean)
  scalar `var' = r(Var)
  scalar `cv' = sqrt(`var') / `mean'
  return scalar p10 = `p10'
  return scalar p90 = `p90'
  return scalar cv = `cv'

  qui mean `v' if `touse' & inrange(`v', ., `p10')
  scalar `p10_mean' = r(table)["b",1]
  return scalar p10_mean = `p10_mean'
 
  qui mean `v' if `touse' & inrange(`v', `p90', .)
  scalar `p90_mean' = r(table)["b",1]  
  return scalar p90_mean = `p90_mean'
end


Troels Kristensen

Join Date: Oct 2021

Posts: 23
#21

09 Oct 2021, 02:13

Hi Leonardo,
Thank you - this is useful. However, to describe the extreme groups there should be two CVs and related CIs, one for each extreme group (above p90 and below p10), if possible :-)

Just a reflection question: how can it be seen that we start bootstrapping from the entire sample of prices (N=74), when in the end the extreme groups comprise N=8 based on the price data? I just want to be sure that the results differ from a bootstrap based on the subsample. Best Troels
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#22

09 Oct 2021, 02:53

Apparently, the "ereturn list" is working now (had problems using it after one of the previous versions). This means the results of the program can be captured in a do-file and processed further as I understand it - as required.

In relation to the CV and related confidence interval for the two extreme groups (above p90 and below p10) I tried the following
qui cv `v' if `touse' & inrange(`v', ., `p10')
scalar `p10_cv' = r(table)["b",1]
return scalar p10_cv = `p10_cv'

but this does not work, probably because cv is not a command known to Stata and I do not know the programming language. Hope you can help :-)
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#23

09 Oct 2021, 09:20

This is an edited version to give the CV in each tail to demonstrate how to do it.

Code:

cap program drop myprog
program define myprog, rclass
  version 17
  syntax varlist(max=1) [if] [in]

  unab v : `varlist'
  confirm numeric var `v'
  marksample touse

  tempname p10 p10_mean p10_sd p10_cv ///
           p90 p90_mean p90_sd p90_cv
  qui summ `v' if `touse', det
  scalar `p10' = r(p10)
  scalar `p90' = r(p90)
  return scalar N = r(N)
  return scalar p10 = `p10'
  return scalar p90 = `p90'

  qui summ `v' if `touse' & inrange(`v', ., `p10'), detail
  scalar `p10_mean' = r(mean)
  scalar `p10_sd' = r(sd)
  scalar `p10_cv' = `p10_sd' / `p10_mean'
  return scalar p10_N = r(N)
  return scalar p10_mean = `p10_mean'
  return scalar p10_sd = `p10_sd'
  return scalar p10_cv = `p10_cv'

  qui summ `v' if `touse' & inrange(`v', `p90', .), detail
  scalar `p90_mean' = r(mean)
  scalar `p90_sd' = r(sd)
  scalar `p90_cv' = `p90_sd' / `p90_mean'
  return scalar p90_N = r(N)
  return scalar p90_mean = `p90_mean'
  return scalar p90_sd = `p90_sd'
  return scalar p90_cv = `p90_cv'
end

Apparently, the "ereturn list" is working now (had problems using it after one of the previous versions). This means the results of the program can be captured in a do-file and processed further as I understand it

No, not quite. The above program returns its results in r(), not in e(). You can see this from the fact that it's an r-class program (highlighted in red). You can access results in r() or e() just as easily. After running this several times using -bootstrap-, the bootstrap results are stored in e(), but you have asked -bootstrap- to gather the results of -myprog- from r(). -bootstrap- computes the confidence intervals for you, based on the bootstrap samples.

I strongly recommend reading up more on how to program using Stata from the helpful documentation. I've done enough to show you how you can make such programs and you'll benefit greatly from a more solid foundation after reading the documentation.

How can it be seen that we start bootstraping based on the entire sample of prices (N=74) - in the end the extreme groups comprise N=8 based on the price data. Just to be sure that the results are different than a bootstrap based on the subsample.

I added the estimation sample size to be returned in r(N), but you don't need this in this case. -bootstrap- works on the overall sample, so will bootstrap the whole sample if running something like

Code:

bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean): myprog varname
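To make the whole-sample versus subsample distinction concrete, here is a minimal sketch on the auto data. The helper -tailmean- is a hypothetical cut-down stand-in for -myprog- (lower tail only), included just so the example runs on its own; the seed and the number of replications are arbitrary.

```stata
sysuse auto, clear

capture program drop tailmean      // hypothetical helper, not from the posts above
program define tailmean, rclass
    version 16
    syntax varname(numeric) [if] [in]
    marksample touse
    tempname cut
    quietly summarize `varlist' if `touse', detail
    scalar `cut' = r(p10)
    quietly summarize `varlist' if `touse' & `varlist' <= `cut'
    return scalar mean = r(mean)
    return scalar N    = r(N)
    return scalar cut  = `cut'
end

* (a) resample all cars; the p10 cutoff is re-estimated in every replication
bootstrap mean=r(mean) cut=r(cut), reps(200) seed(2021) nowarn: tailmean price
scalar n_full = e(N)
matrix se_full = e(se)

* (b) fix the cutoff once, keep only the tail, and resample within it
quietly summarize price, detail
local cut = r(p10)
preserve
keep if price <= `cut'
bootstrap mean=r(mean), reps(200) seed(2021) nowarn: summarize price
scalar n_sub = e(N)
matrix se_sub = e(se)
restore

display "obs resampled: (a) " n_full "   (b) " n_sub
display "SE of the tail mean: (a) " se_full[1,1] "   (b) " se_sub[1,1]
```

The "Number of obs" line in the -bootstrap- header (also stored in e(N)) shows which sample was resampled. Note that adding an -if- to the command alone would not restrict the resampling, because -summarize- does not set e(sample); that is why variant (b) drops the other observations first.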
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#24

10 Oct 2021, 02:52

Hi Leonardo
Thank you and well done. From my point of view, we managed to make a useful application via our dialogue. At least it performs the calculations that I wanted. Nevertheless, as stated above, I think the ability to describe extreme groups of a sample is of general interest.

You are right, I need to study the programming materials to try to improve my skills :-) Anyway, I think your help and advice were very helpful! (I am impressed! How are your activities financed?)

. myprog price

. ret list

scalars:
             r(p90_cv) =  .1180803594571282
             r(p90_sd) =  1554.719812837211
           r(p90_mean) =  13166.625
              r(p90_N) =  8
             r(p10_cv) =  .0648969476026494
             r(p10_sd) =  237.8959856744119
           r(p10_mean) =  3665.75
              r(p10_N) =  8
                r(p90) =  11385
                r(p10) =  3895
                  r(N) =  74

. bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean) p10_cv=r(p10_cv) p90_cv=r(p90_cv), reps(50)
> : myprog price
(running myprog on estimation sample)

warning: myprog does not set e(sample), so no observations will be excluded from the resampling
         because of missing values or other reasons. To exclude observations, press Break, save
         the data, drop any observations that are to be excluded, and rerun bootstrap.

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                                          Number of obs =  74
                                                           Replications  =  50

      Command: myprog price
     p10_mean: r(p10_mean)
     p90_mean: r(p90_mean)
       p10_cv: r(p10_cv)
       p90_cv: r(p90_cv)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    p10_mean |    3665.75   119.4233    30.70   0.000     3431.685    3899.815
    p90_mean |   13166.63   801.2914    16.43   0.000     11596.12    14737.13
      p10_cv |   .0648969   .0152537     4.25   0.000     .0350002    .0947937
      p90_cv |   .1180804    .031155     3.79   0.000     .0570177     .179143
------------------------------------------------------------------------------

.
end of do-file
Troels Kristensen

Join Date: Oct 2021

Posts: 23
#25

15 Oct 2021, 04:25

Hi Leonardo,
I think I have realized that my challenge has not been fully solved. The program we have produced can describe p10_mean, p90_mean etc. for a specific variable (price), but it does not describe the other variables in the data set (such as weight, mpg, headroom etc.) based on the same bootstrap procedure. I think the latter is required if you want to describe the cars in p10 and p90 for price. Can this be done? That is, besides the p10_mean, p90_mean etc. for price, the same descriptives for other variables would be included in the same operation/program. I think this is required to be able to describe the extreme groups >p90 and <p10, defined by one variable such as price, as the result of one bootstrapping procedure. At the moment it is possible to describe the extreme groups via bootstrapping one variable at a time. I do not think this gives the same result as if everything is done in one procedure. That would mean, e.g., 1000 new samples that are used to calculate the statistics for all variables, rather than 1000 new samples for each variable.
Do you understand my challenge? (I am wondering what is most appropriate from a statistical point of view.)
Best Troels
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#26

15 Oct 2021, 07:28

I think this is now a different problem, but in theory you should be able to extend the framework in the above program to do what you want with each bootstrap sample. Whether you need to do this, or whether it makes sense, I can't say in your case.
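One possible shape for such an extension, purely as a sketch: the hypothetical program -tailmeans- below takes the tail-defining variable first and then any number of other numeric variables, and returns the tail means of each, so that a single -bootstrap- call uses the same resamples for all of them. The program name, the returned names, the seed, and the choice of weight and mpg are assumptions for illustration; whether this is the statistically appropriate design is a separate question.

```stata
capture program drop tailmeans     // hypothetical name
program define tailmeans, rclass
    version 16
    syntax varlist(min=2 numeric) [if] [in]
    marksample touse                   // drops obs missing on any listed variable
    gettoken key others : varlist      // first variable defines the tails

    tempname p10 p90
    quietly summarize `key' if `touse', detail
    scalar `p10' = r(p10)
    scalar `p90' = r(p90)

    foreach x of local others {
        quietly summarize `x' if `touse' & `key' <= `p10'
        return scalar p10_`x' = r(mean)
        quietly summarize `x' if `touse' & `key' >= `p90'
        return scalar p90_`x' = r(mean)
    }
end

sysuse auto, clear
tailmeans price weight mpg
return list

* one set of resamples feeds every statistic in the expression list
bootstrap p10_weight=r(p10_weight) p90_weight=r(p90_weight) ///
          p10_mpg=r(p10_mpg) p90_mpg=r(p90_mpg),            ///
          reps(200) seed(2021) nowarn: tailmeans price weight mpg
```

The CV and sample size for each variable could be returned from inside the same loop, following the pattern in #23.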
George Ford

Join Date: Aug 2014

Posts: 3152
#27

15 Oct 2021, 07:49

This is a good course to get started with programming Stata.

HTML Code:

https://www.stata.com/netcourse/writing-own-commands-nc251/