Possible bug in -bootstrap-

daniel klein

Join Date: Mar 2014

Posts: 3859
#1


27 Oct 2019, 03:17

Currently, I only have access to Stata 11 to 14. In these releases, there appears to be a bug in bootstrap. Here is an example:

Code:

// example data
webuse rate2 , clear

// we only keep the relevant variables
keep rada radb
describe

// estimate kappa coefficient
kap rada radb

// correct results with -bootstrap-
bootstrap kappa = r(kappa) : kap rada radb

// incorrect results with -bootstrap-
// probably because -bootstrap- includes its temporary variables
bootstrap kappa = r(kappa) : kap *

bootstrap appears to unabbreviate the passed variable list after it adds its own temporary variables to the dataset. This messes up the estimated coefficient(s). A while ago, I found a similar bug in egen and reported it to tech-support; that one has been fixed. Could someone replicate the above with Stata 15 and/or 16 and confirm that this is a bug?

Best
Daniel
Tags: bootstrap, bug

Martyn Sherriff

Join Date: Mar 2014
Posts: 119

27 Oct 2019, 04:19

Hello Daniel,

Here is the output from Stata 16 IC, Current update level: 16 Oct 2019

Code:

. // estimate kappa coefficient
. kap rada radb

             Expected
Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
-----------------------------------------------------------------
  63.53%      30.82%     0.4728     0.0694       6.81      0.0000

. 
. // correct results with -bootstrap-
. bootstrap kappa = r(kappa) : kap rada radb
(running kap on estimation sample)

Warning:  Because kap is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations
          are used.  This means that no observations will be excluded from the resampling because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50

Bootstrap results                               Number of obs     =         85
                                                Replications      =         50

      command:  kap rada radb
        kappa:  r(kappa)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       kappa |      0.473      0.068    6.985   0.000        0.340       0.605
------------------------------------------------------------------------------

. 
. // incorrect results with -bootstrap-
. // probably because -bootstrap- includes its temporary variables
. bootstrap kappa = r(kappa) : kap *
(running kap on estimation sample)

Warning:  Because kap is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations
          are used.  This means that no observations will be excluded from the resampling because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50

Bootstrap results                               Number of obs     =         85
                                                Replications      =         50

      command:  kap *
        kappa:  r(kappa)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       kappa |      0.062      0.041    1.526   0.127       -0.018       0.142
------------------------------------------------------------------------------

Martyn


daniel klein

Join Date: Mar 2014

Posts: 3859
#3

27 Oct 2019, 04:46

Martyn, thanks for confirming this. In my view, there is no question that this is a bug.

I have not looked into the code, but here is what I think happens: bootstrap creates a temporary variable to mark the estimation sample. In the example, that variable is constant and selects all observations; this is what the warning tells us. That temporary variable is added to the dataset before bootstrap or, more likely, the bootstrapped command expands the variable list (here: *). Indeed, I can replicate the observed coefficient by adding a constant variable that equals 1 for all observations to the dataset:

Code:

// example data
webuse rate2 , clear

// we only keep the relevant variables
keep rada radb

// add a constant
generate byte one = 1
describe

// replicate the wrong kappa coefficient
kap rada radb one

yields

Code:

(output omitted)

. kap rada radb one

There are 3 raters per subject:

         Outcome |    Kappa          Z     Prob>Z
-----------------+-------------------------------
               1 |   -0.0255      -0.41    0.6581
               2 |    0.0628       1.00    0.1579
               3 |    0.1905       3.04    0.0012
               4 |    0.2380       3.80    0.0001
-----------------+-------------------------------
        combined |    0.0622       1.39    0.0816

Obviously, we would neither expect nor want this to happen.

Note that in this example, kap, when combined with bootstrap, even estimates a different coefficient: Fleiss' kappa, whereas the original call to kap estimated Cohen's kappa.

I will now report this to tech-support.

Edit:

Thanks to Carlo, too.

Best
Daniel

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

27 Oct 2019, 04:46

Daniel and Martyn:
this is what I get from Stata 15.1 (all files updated as of 27 Oct 2019):

Code:

. webuse rate2 , clear
(Altman p. 403)

. kap rada radb

             Expected
Agreement   Agreement     Kappa   Std. Err.         Z      Prob>Z
-----------------------------------------------------------------
  63.53%      30.82%     0.4728     0.0694       6.81      0.0000

. bootstrap kappa = r(kappa) : kap rada radb
(running kap on estimation sample)

Warning:  Because kap is not an estimation command or does not set e(sample), bootstrap has no way to
          determine which observations are used in calculating the statistics and so assumes that all
          observations are used.  This means that no observations will be excluded from the resampling
          because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the observations that
          are to be excluded.  Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                               Number of obs     =         85
                                                Replications      =         50

      command:  kap rada radb
        kappa:  r(kappa)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       kappa |   .4727891   .0705139     6.70   0.000     .3345845    .6109937
------------------------------------------------------------------------------

. bootstrap kappa = r(kappa) : kap *
(running kap on estimation sample)

Warning:  Because kap is not an estimation command or does not set e(sample), bootstrap has no way to
          determine which observations are used in calculating the statistics and so assumes that all
          observations are used.  This means that no observations will be excluded from the resampling
          because of missing values or other reasons.

          If the assumption is not true, press Break, save the data, and drop the observations that
          are to be excluded.  Be sure that the dataset in memory contains only the relevant data.

Bootstrap replications (50)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50

Bootstrap results                               Number of obs     =         85
                                                Replications      =         50

      command:  kap *
        kappa:  r(kappa)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       kappa |   .0622045   .0338795     1.84   0.066    -.0041982    .1286071
------------------------------------------------------------------------------

Kind regards,
Carlo
(Stata 19.0)


daniel klein

Join Date: Mar 2014
Posts: 3859

27 Oct 2019, 05:04

I have reported this to tech-support. Here is more evidence that my diagnosis of the cause of the problem is correct:

Code:

// example data
webuse rate2 , clear

// we only keep the relevant variables
keep rada radb
describe

// wrong kappa coefficient
// note that variable __000000 does not (yet) exist in the dataset
bootstrap kappa = r(kappa) : kap rada radb __000000

Best
Daniel


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

27 Oct 2019, 05:06

Thanks Daniel for pointing this out.

Kind regards,
Carlo
(Stata 19.0)
Comment
daniel klein

Join Date: Mar 2014

Posts: 3859
#7

29 Oct 2019, 13:54

Stata tech-support and the developers have looked into this. They basically confirm my diagnosis of the problem, but they do not see a quick and easy fix. They might add a warning message if bootstrap encounters * as part of the arguments of a command. They have also pointed out that jackknife, permute, and statsby will behave in the same way as bootstrap.
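
Until a fix or warning is in place, one possible workaround is to expand the wildcard before the prefix command runs, so that the prefixed command only ever sees explicit variable names. The following is a minimal sketch, not something tech-support suggested in this thread; the local macro name raters is just an illustrative choice. It relies on the fact that local macros are substituted before bootstrap is executed, that is, at a point where only the real variables exist in the dataset.

```stata
// example data
webuse rate2 , clear

// we only keep the relevant variables
keep rada radb

// expand the wildcard now, while no temporary variables exist
// (raters is a hypothetical macro name chosen for this sketch)
unab raters : *
display "`raters'"

// pass the expanded list, not *, to the prefixed command
bootstrap kappa = r(kappa) : kap `raters'
```

Given the diagnosis above, the same pattern should presumably also work with jackknife, permute, and statsby, although I have not verified this for each of them.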

Best
Daniel
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#8

30 Oct 2019, 01:28

Thanks Daniel for sharing.

Kind regards,
Carlo
(Stata 19.0)