Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Bootstrapping of mean in a subsample

    I'm analyzing a sample of GPs. I want to describe the GP's with extreme values of a variable. This means the subgroups were the variable is above P90 and below P10. I want to bootstrap the entire sample and use the bootstrapped p90 and p10 results as the definitions of the groups above p90 and below p10. Next I want to know the mean and confidence intervals for the mean for these two groups. How can this be done in stata?

    I am able to use: bootstrap r(p90), rep(1000): summarize var, detail

    However, this only gives the P90 value. I also want to know the bootstrapped mean and CI for the group above p90 :-) I hope this is possible?

  • #2
    Troels:
    welcome to this forum.
    As far as your first question is concerned (I'm not clear with the reason why you want to -bootstrap- r(p10) and r(p90), but it's surely my fault), you may want to try:
    Code:
    . use "C:\Program Files\Stata16\ado\base\a\auto.dta"
    (1978 Automobile Data)
    
    . quietly sum price, d
    
    . bootstrap r(p10) r(p90), reps(1000) bca ties : sum price, d
    
    warning: Because summarize is not an estimation command or does not set e(sample), bootstrap has no way to determine which
             observations are used in calculating the statistics and so assumes that all observations are used. This means that no
             observations will be excluded from the resampling because of missing values or other reasons.
    
             If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure
             that the dataset in memory contains only the relevant data.
    (running summarize on estimation sample)
    
    Jackknife replications (74)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    ..................................................    50
    ........................
    
    Bootstrap replications (1000)
    ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
    ..................................................    50
    ..................................................   100
    ..................................................   150
    ..................................................   200
    ..................................................   250
    ..................................................   300
    ..................................................   350
    ..................................................   400
    ..................................................   450
    ..................................................   500
    ..................................................   550
    ..................................................   600
    ..................................................   650
    ..................................................   700
    ..................................................   750
    ..................................................   800
    ..................................................   850
    ..................................................   900
    ..................................................   950
    ..................................................  1000
    
    Bootstrap results                               Number of obs     =         74
                                                    Replications      =      1,000
    
          command:  summarize price, d
            _bs_1:  r(p10)
            _bs_2:  r(p90)
    
    ------------------------------------------------------------------------------
                 |   Observed   Bootstrap                         Normal-based
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _bs_1 |       3895   99.38821    39.19   0.000     3700.203    4089.797
           _bs_2 |      11385   1344.008     8.47   0.000     8750.793    14019.21
    ------------------------------------------------------------------------------
    
    .
    Once you've defined the GPs group you're interested in, you can -bootstrap- the (subsample) mean of the variable of interest, along with a set of -bootstrap- 95% CIs.
    Kind regards,
    Carlo
    (StataNow 18.5)

    Comment


    • #3
      Hi Carlo
      Glad to join this community. Your response is helpful :-) I am able to do what you have suggested. However, my intention is to use the individual bootstrap results to bootstrap the mean of the variable over P90 and below P10 (and other variables later) - but based on the bootstrap of the entire sample rather than the subsample. This means based on each bootstrap rep of the entire sample I want to describe the group of GPs in the top and bottom - at least via a bootstrapped mean from these extreme groups. So In the end - I want to describe how the average GPs are in these extreme groups. I get at subsample of N= 39 of the data over P90 and N=46 below P10 . I can bootstrap based on these subsample but I believe I need to use the entire sample of about 500 (rather than e.g. 39) and next derive the groups below p10 and above p90 including the estimated means and CIs - that I am looking for. Can you help me to get a bit closer to these bootstrapped descriptives for the subsamples based on the entire sample of around 500? Maybe I need to write a program or so? just looking for the most efficient way?

      Comment


      • #4
        Basically I want to make at table with the following heading: High variable GPs Low variable GPs
        Mean CI CV CI Mean CI CV CI
        Variable..
        Variable..

        Besides the boostrapped Mean and CI to describe the high and low extreme groups - I also plan to calculate coefficient of variation and related CI. I'm just not able to do it all in one operation?

        Comment


        • #5
          In terms of your price variable example. What is the bootstrapped mean price and CI for the group of prices below P10 and above P90, respectively? Does this makes sense?

          Comment


          • #6
            I understand this as follows, see this example:


            Code:
            sysuse nlsw88, clear
            sum wage, det
            gen flag = 1 if wage < r(p10)
            bootstrap r(mean), reps(999) seed(123) nodots: sum wage if flag == 1, meanonly
            estat bootstrap, bc
            Best wishes

            (Stata 16.1 MP)

            Comment


            • #7
              Hi Felix
              The challenge is that your suggestion drops a part of the original sample. This means the entire sample is not used for the bootstrap of the mean in the subsample. In case of bootstrapping of the entire sample the subsamples and their means will be different than if the bootstrap procedure is conducted only on a part of the sample - as far as I understand it.

              Comment


              • #8
                As far as I see it, you either want this solution or Carlo's solution from post #2. Do you have an example of what you intend, in a paper or anything?
                Best wishes

                (Stata 16.1 MP)

                Comment


                • #9
                  Just checked you suggestion. The bootstrap will be based on the subsample (N=46) rather than the entire sample that may influence the bootstrapped values for the 46 observations in the bottom(because the can be switched).

                  Comment


                  • #10
                    Among other things, the question is whether it is okay to bootstrap subgroups of extreme values rather than the entire sample to create means and confidence intervals for ekstreme groups. HOW CAN THE BOOSTRAPPED MEANS AND CIs BE ESTIMATED BASED ON THE COMPLETE SAMPLE IN STATA - if it makes sense?
                    Click image for larger version

Name:	statalist.jpg
Views:	1
Size:	629.2 KB
ID:	1630798

                    Comment


                    • #11
                      If you bootstrap based on a subsample you can estimate CIs for the mean, coefficient of variation etc via stata. However, It may be more appropriate to bootstrap the entire sample and calculate CI for the extreme groups based on the entire data set?

                      Comment


                      • #12
                        Seems to me that since you are trying to estimate quantities in the tails of a distribution, and these are based on the empirical CDF, you should be bootstrapping the entire sample first, and then computing quantities in those tails. I think the only way to do this is by writing a custom program.

                        Comment


                        • #13
                          This might get you started in a useful direction.

                          Code:
                          clear *
                          cls
                          
                          set seed 17
                          
                          sysuse auto
                          
                          cap progam drop myprog
                          program define myprog, rclass
                            version 17
                            syntax varlist(max=1) [if] [in]
                           
                            unab v : `varlist'
                            confirm numeric var `v'
                            marksample touse
                           
                            tempname p10 p90 p10_mean p90_mean
                            qui summ `v', det
                            scalar `p10' = r(p10)
                            scalar `p90' = r(p90)
                            return scalar p10 = `p10'
                            return scalar p90 = `p90'
                          
                            qui mean `v' if `touse' & inrange(`v', ., `p10')
                            scalar `p10_mean' = r(table)["b",1]
                            return scalar p10_mean = `p10_mean'
                           
                            qui mean `v' if `touse' & inrange(`v', `p90', .)
                            scalar `p90_mean' = r(table)["b",1]  
                            return scalar p90_mean = `p90_mean'
                          end
                          
                          myprog price
                          ret list
                          
                          bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean) : myprog price
                          Results

                          Code:
                          . bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean) : myprog price
                          (running myprog on estimation sample)
                          
                          Bootstrap replications (50)
                          ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
                          ..................................................    50
                          
                          Bootstrap results                                           Number of obs =  8
                                                                                      Replications  = 50
                          
                                Command: myprog price
                               p10_mean: r(p10_mean)
                               p90_mean: r(p90_mean)
                          
                          ------------------------------------------------------------------------------
                                       |   Observed   Bootstrap                         Normal-based
                                       | coefficient  std. err.      z    P>|z|     [95% conf. interval]
                          -------------+----------------------------------------------------------------
                              p10_mean |    3665.75   165.8556    22.10   0.000     3340.679    3990.821
                              p90_mean |   13166.63   851.6963    15.46   0.000     11497.33    14835.92
                          ------------------------------------------------------------------------------

                          Comment


                          • #14
                            Thank you very much. I hope this will be useful. However, so far - I have encountered a type mismatch error in my version of the program, see below. I use the sysuse auto data set.
                            The code "ret list" gives some of the values. But something is wrong with the code. Can you help me? I think price is a numeric type. This means it should be okay for bootstrap calculations but does not work yet?

                            Click image for larger version

Name:	Capture.JPG
Views:	1
Size:	80.6 KB
ID:	1630958

                            Comment


                            • #15
                              Troels:
                              if you do not have Stata 17 yet, change the number 17 with the one of your actual Stata release.
                              With ths trick, Leonardo's code works for me like a charm:
                              Code:
                              . sysuse auto
                              (1978 Automobile Data)
                              
                              .
                              .
                              .
                              . cap progam drop myprog
                              
                              .
                              . program define myprog, rclass
                                1.
                              .   version 16
                                2.
                              .   syntax varlist(max=1) [if] [in]
                                3.
                              . 
                              .
                              .   unab v : `varlist'
                                4.
                              .   confirm numeric var `v'
                                5.
                              .   marksample touse
                                6.
                              . 
                              .
                              .   tempname p10 p90 p10_mean p90_mean
                                7.
                              .   qui summ `v', det
                                8.
                              .   scalar `p10' = r(p10)
                                9.
                              .   scalar `p90' = r(p90)
                               10.
                              .   return scalar p10 = `p10'
                               11.
                              .   return scalar p90 = `p90'
                               12.
                              .
                              .
                              .   qui mean `v' if `touse' & inrange(`v', ., `p10')
                               13.
                              .   scalar `p10_mean' = r(table)["b",1]
                               14.
                              .   return scalar p10_mean = `p10_mean'
                               15.
                              . 
                              .
                              .   qui mean `v' if `touse' & inrange(`v', `p90', .)
                               16.
                              .   scalar `p90_mean' = r(table)["b",1] 
                               17.
                              .   return scalar p90_mean = `p90_mean'
                               18.
                              . end
                              
                              .
                              .
                              .
                              . myprog price
                              
                              .
                              . ret list
                              
                              scalars:
                                         r(p90_mean) =  13166.625
                                         r(p10_mean) =  3665.75
                                              r(p90) =  11385
                                              r(p10) =  3895
                              
                              .
                              .
                              .
                              . bootstrap p10_mean=r(p10_mean) p90_mean=r(p90_mean) : myprog price
                              (running myprog on estimation sample)
                              
                              Bootstrap replications (50)
                              ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
                              ..................................................    50
                              
                              Bootstrap results                               Number of obs     =          8
                                                                              Replications      =         50
                              
                                    command:  myprog price
                                   p10_mean:  r(p10_mean)
                                   p90_mean:  r(p90_mean)
                              
                              ------------------------------------------------------------------------------
                                           |   Observed   Bootstrap                         Normal-based
                                           |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                              -------------+----------------------------------------------------------------
                                  p10_mean |    3665.75    269.372    13.61   0.000     3137.791    4193.709
                                  p90_mean |   13166.63   927.0146    14.20   0.000     11349.71    14983.54
                              ------------------------------------------------------------------------------
                              
                              .
                              Kind regards,
                              Carlo
                              (StataNow 18.5)

                              Comment

                              Working...
                              X