Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • bootstrap Cox regression

    Hi All,

    I am trying to replicate Michael Klein' s 1996 paper, "Timing Is All: Elections and the Duration of United States Business Cycles", published in Journal of Money, Credit and Banking, pp.84-101. The author mentioned that bootstrap techniques was used in the study. Specifically, the point estimates reported are the mean value of the respective estimates for 500 re-samples from the original data and bootstrap standard errors are used to test for the significance of the coefficients. I tried vce(bootstrap, reps(500)) but the bootstrap standard error reported by stata is very different from the ones reported in the paper. I am not sure what casues the big difference here. Is there anyone who would kindly share thoughts and ideas to help me solve the issue here? Many thanks in advance!!! Here are my complete code: stset t1, failure(complete) id(id); stcox a9 war, vce(bootstrap, reps(500))
    Attached Files

  • #2
    Fan:
    - -bootstrap- is a pseudo-random resampling with re-introduction technique. Hence, unless you set the same -seed- before -bootstrap- even on the same machine and setting the same number of replications, in all likelihood you will get different results every time you run the code;
    That said, it is difficult to judge about the presumed difference on the ground of a qualitative statement only (
    ...the bootstrap standard error reported by [S]tata is very different from the ones reported in the paper...
    ;
    as per FAQ, please post what you typed and what Stata gave you back to increase your chances of getting helpful replies. Thanks.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Dear Prof. Carlo,

      Thank you for your reply. Here is my code: stset t1, failure(complete) id(id); stcox a9 war, vce(robust) and the results from STATA is same as the ones reported in author's initial version for ESTIMATES FOR HAZARD FUNCTION FOR EXPANSION for post world war II period

      . stcox a9 war, vce(robust)

      failure _d: complete
      analysis time _t: t1
      id: id

      Iteration 0: log pseudolikelihood = -12.801827
      Iteration 1: log pseudolikelihood = -11.22811
      Iteration 2: log pseudolikelihood = -11.013794
      Iteration 3: log pseudolikelihood = -11.013306
      Refining estimates:
      Iteration 0: log pseudolikelihood = -11.013306

      Cox regression -- no ties

      No. of subjects = 9 Number of obs = 41
      No. of failures = 9
      Time at risk = 449
      Wald chi2(2) = 4.40
      Log pseudolikelihood = -11.013306 Prob > chi2 = 0.1110

      (Std. Err. adjusted for 9 clusters in id)

      Robust
      _t Haz. Ratio Std. Err. z P>z [95% Conf. Interval]

      a9 4.222463 2.920978 2.08 0.037 1.088252 16.38334
      war .5063957 .3990092 -0.86 0.388 .1080923 2.372386


      . stcox, nohr

      Cox regression -- no ties

      No. of subjects = 9 Number of obs = 41
      No. of failures = 9
      Time at risk = 449
      Wald chi2(2) = 4.40
      Log pseudolikelihood = -11.013306 Prob > chi2 = 0.1110

      (Std. Err. adjusted for 9 clusters in id)

      Robust
      _t Coef. Std. Err. z P>z [95% Conf. Interval]

      a9 1.440419 .691771 2.08 0.037 .0845723 2.796265
      war -.6804368 .7879396 -0.86 0.388 -2.22477 .8638963


      In the published version, author says that "The relatively small number of business cycles, especially in the subsamples, makes inference based upon standard asymptotic suspect. Therefore we use boot-strap techniques. The point estimates presented in Tables 4 through 8 represent the mean value of the respective estimates for five hundred resamples from the original
      data. We resample "clusters" of observations from the original data set choosing the set of observations corresponding to an entire business cycle as one "draw." Each resample therefore has the same number of business cycles as the original sample."

      To obtain the bootstrap standard error, I tried

      stcox a9 war, vce(bootstrap, reps(500))
      (running stcox on estimation sample)

      Cox regression -- no ties

      No. of subjects = 9 Number of obs = 41
      No. of failures = 9
      Time at risk = 449
      Wald chi2(2) = 0.01
      Log likelihood = -11.013306 Prob > chi2 = 0.9959

      (Replications based on 9 clusters in id)
      ------------------------------------------------------------------------------
      | Observed Bootstrap Normal-based
      _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      a9 | 4.222463 72.92649 0.08 0.934 8.40e-15 2.12e+15
      war | .5063957 6.253905 -0.06 0.956 1.56e-11 1.65e+10
      ------------------------------------------------------------------------------

      . stcox, nohr

      Cox regression -- no ties

      No. of subjects = 9 Number of obs = 41
      No. of failures = 9
      Time at risk = 449
      Wald chi2(2) = 0.01
      Log likelihood = -11.013306 Prob > chi2 = 0.9959

      (Replications based on 9 clusters in id)
      ------------------------------------------------------------------------------
      | Observed Bootstrap Normal-based
      _t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      a9 | 1.440419 17.27108 0.08 0.934 -32.41027 35.29111
      war | -.6804368 12.34984 -0.06 0.956 -24.88567 23.5248
      ------------------------------------------------------------------------------

      . stcox, nohr

      Cox regression -- no ties

      No. of subjects = 9 Number of obs = 41
      No. of failures = 9
      Time at risk = 449
      Wald chi2(2) = 0.01
      Log likelihood = -11.013306 Prob > chi2 = 0.9959

      (Replications based on 9 clusters in id)
      ------------------------------------------------------------------------------
      | Observed Bootstrap Normal-based
      _t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      a9 | 1.440419 17.27108 0.08 0.934 -32.41027 35.29111
      war | -.6804368 12.34984 -0.06 0.956 -24.88567 23.5248
      ------------------------------------------------------------------------------
      the bootstrap std.Err reported by STATA is much larger than ones reported in the paper. The coefficient for a9 is 1.30 and bootstrap S.E is 0.80 and coefficient for war is -0.50 with bootstrap S.E equals to 1.04 in the final version. I am not clear what could cause this big difference in the bootstrap S.E. I am looking for help to understand how to solve to the huge discrepancy here.

      Best Regards,
      Fan


      (p.s. attached the sample data here)

      Attached Files

      Comment


      • #4
        Fan:
        as far as I can see your outcomes (that are difficult to read; in the future, please use CODE delimiters, as recommended by FAQ. Thanks), it seems that you have bootstrapped only the SE, whereas in the paper you quoted point estimates were bootstrapped as well.
        Does the paper you mentioned include supplementary materials reporting Stata (or other software) codes?
        As an aside, please note (as per FAQ again) that attaching spreadsheets is far from being encouraged on this forum (the risk of active contents makes most of te listers -me, too- reluctant to download files coming from unknown sources). Use -dataex- instead (please, type -search dataex- to install it). Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Dear Prof. Carlo,

          Thank you for your quick reply and suggestions. I tried but failed to find any supplementary materials reporting what codes the author used for the study or codes could use to generate the results in the study. I see the difference between what I did and what the author did. I understand that the author bootstrapped the sample first, and then used bootstrap samples to obtain and estimated coefficients. The author repeated the process for 500 times and then take the means of the estimated coefficients and their standard errors. I tried following code to bootstrap the sample first as the author did but received an error message. I know I did something wrong but I am not clear how to fix it. Could you please kindly help to solve the problem?

          Code:
           capture program drop myboot
           program define myboot, rclass
           preserve
           bsample, cluster(id)
          stset t1, failure(complete) id(id)
          stcox a9 war
          return scalar coeff1 = _b[a9]
          return scale coeff2 = _b[war]
          restore
          end
          simulate coeff1=r(coeff1) coeff2=r(coeff2), reps(500) seed(1): myboot
          The error message
          no variables defined
          in option cluster()
          an error occurred when simulate executed myboot
          r(111);
          Best regards

          Comment


          • #6
            Fan:
            I do hope that what follows can be of some help (you should -exp()- the results to convert them into HR metric):

            Code:
            . use http://www.stata-press.com/data/r14/drugtr.dta
            (Patient Survival in Drug Trial)
            
            . stcox drug age
            
                     failure _d:  died
               analysis time _t:  studytime
            
            Iteration 0:   log likelihood = -99.911448
            Iteration 1:   log likelihood = -83.551879
            Iteration 2:   log likelihood = -83.324009
            Iteration 3:   log likelihood = -83.323546
            Refining estimates:
            Iteration 0:   log likelihood = -83.323546
            
            Cox regression -- Breslow method for ties
            
            No. of subjects =           48                  Number of obs    =          48
            No. of failures =           31
            Time at risk    =          744
                                                            LR chi2(2)       =       33.18
            Log likelihood  =   -83.323546                  Prob > chi2      =      0.0000
            
            ------------------------------------------------------------------------------
                      _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                    drug |   .1048772   .0477017    -4.96   0.000     .0430057    .2557622
                     age |   1.120325   .0417711     3.05   0.002     1.041375     1.20526
            ------------------------------------------------------------------------------
            
            . bootstrap _b[drug] _b[age], reps(500) nodots : stcox drug age
            
            Bootstrap results                               Number of obs     =         48
                                                            Replications      =        500
            
                  command:  stcox drug age
                    _bs_1:  _b[drug]
                    _bs_2:  _b[age]
            
            ------------------------------------------------------------------------------
                         |   Observed   Bootstrap                         Normal-based
                         |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                   _bs_1 |  -2.254965   1.636406    -1.38   0.168    -5.462261    .9523312
                   _bs_2 |   .1136186   .0399128     2.85   0.004      .035391    .1918461
            ------------------------------------------------------------------------------
            
            . estat bootstrap, all
            
            Bootstrap results                               Number of obs     =         48
                                                            Replications      =        500
            
                  command:  stcox drug age
                    _bs_1:  _b[drug]
                    _bs_2:  _b[age]
            
            ------------------------------------------------------------------------------
                         |    Observed               Bootstrap
                         |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                   _bs_1 |  -2.2549651  -.1465739   1.6364057   -5.462261   .9523312   (N)
                         |                                      -3.439155  -1.444328   (P)
                         |                                      -3.163508  -1.356078  (BC)
                   _bs_2 |   .11361855   .0059653   .03991276     .035391   .1918461   (N)
                         |                                        .049578   .1995662   (P)
                         |                                       .0469213   .1904171  (BC)
            ------------------------------------------------------------------------------
            (N)    normal confidence interval
            (P)    percentile confidence interval
            (BC)   bias-corrected confidence interval
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7
              Dear Prof. Carlo,

              Thank you for the reply and the example. After reading STATA manual, I see the reason why bootstrap coefficients are not reported. I think I found the reason why the bootstrap standard error reported by STATA is so different from the ones reported by the paper. I guess the difference might be caused by some of the bootstrap samples can't estimate one or more parameters.

              Code:
              bootstrap _b[a9]  _b[war], reps(500) seed(1) : stcox a9 war
              (running stcox on estimation sample)
              
              Bootstrap replications (500)
              ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
              .............x....................................    50
              ...........................x......................   100
              ..................x...............................   150
              .................................x................   200
              ..................x...............................   250
              ..................................................   300
              ...................x..............................   350
              ..................................................   400
              ...........x......................................   450
              ......................x...........................   500
              
              Bootstrap results                               Number of obs      =        41
                                                              Replications       =       492
              
                    command:  stcox a9 war
                      _bs_1:  _b[a9]
                      _bs_2:  _b[war]
              
              ------------------------------------------------------------------------------
                           |   Observed   Bootstrap                         Normal-based
                           |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                     _bs_1 |   1.440419    20.2506     0.07   0.943    -38.25003    41.13086
                     _bs_2 |  -.6804368   14.16383    -0.05   0.962    -28.44103    27.08016
              ------------------------------------------------------------------------------
              Note: one or more parameters could not be estimated in 8 bootstrap replicates;
                    standard-error estimates include only complete replications.
              I guess this is because the covariates used are both dummy variables, some of the bootstrap samples may consist observations with dummy variables have zero values. Is there anyway I can fix this issue? I think dummy variables are commonly used to describe characters of observations in Cox regression such as patient's race, gender and so on. Could you please kindly share your thought and advice with me ? The following are the sample data.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input byte id int t1 byte(a9 a12 a24 complete war _st _d) int(_t _t0)
              1   1 0 1 1 0 0 1 0   1   0
              1  13 0 0 1 0 0 1 0  13   1
              1  37 0 0 0 1 0 1 1  37  13
              2   1 0 1 1 0 1 1 0   1   0
              2  13 0 0 1 0 1 1 0  13   1
              2  37 0 0 0 0 1 1 0  37  13
              2  45 1 1 1 1 1 1 1  45  37
              3   6 0 0 1 0 0 1 0   6   0
              3  30 0 0 0 0 0 1 0  30   6
              3  39 1 1 1 1 0 1 1  39  30
              4   7 0 0 1 0 0 1 0   7   0
              4  24 0 0 0 1 0 1 1  24   7
              5   6 1 1 1 0 1 1 0   6   0
              5   9 0 1 1 0 1 1 0   9   6
              5  21 0 0 1 0 1 1 0  21   9
              5  45 0 0 0 0 1 1 0  45  21
              5  54 1 1 1 0 1 1 0  54  45
              5  57 0 1 1 0 1 1 0  57  54
              5  69 0 0 1 0 1 1 0  69  57
              5  93 0 0 0 0 1 1 0  93  69
              5 102 1 1 1 0 1 1 0 102  93
              5 105 0 1 1 0 1 1 0 105 102
              5 106 0 0 1 1 1 1 1 106 105
              6  24 0 0 0 0 1 1 0  24   0
              6  33 1 1 1 0 1 1 0  33  24
              6  36 0 1 1 1 1 1 1  36  33
              7  20 0 0 0 0 0 1 0  20   0
              7  29 1 1 1 0 0 1 0  29  20
              7  32 0 1 1 0 0 1 0  32  29
              7  44 0 0 1 0 0 1 0  44  32
              7  58 0 0 0 1 0 1 1  58  44
              8   4 0 0 0 0 0 1 0   4   0
              8  12 1 1 1 1 0 1 1  12   4
              9  24 0 0 0 0 0 1 0  24   0
              9  33 1 1 1 0 0 1 0  33  24
              9  36 0 1 1 0 0 1 0  36  33
              9  48 0 0 1 0 0 1 0  48  36
              9  72 0 0 0 0 0 1 0  72  48
              9  81 1 1 1 0 0 1 0  81  72
              9  84 0 1 1 0 0 1 0  84  81
              9  92 0 0 1 1 0 1 1  92  84
              end
              ------------------ copy up to and including the previous line ------------------

              Listed 41 out of 41 observations

              .



              [/CODE]

              Comment


              • #8
                Fan:
                the reason may also rests on -cluster()- option .
                See Example 7 under -bootstrap- entry in Stata .pdf manual.
                Kind regards,
                Carlo
                (Stata 19.0)

                Comment


                • #9
                  Dear Prof. Carlo

                  Thank you for the suggestion. I read the example. I understand that the example is for matched data. I see that the in the example infants are paired matched on mother's age, however, I am not clear aboyt the meaning of "ALL GROUPS, DEFINED BY THE PAIRID VARIABLE, HAVE 1:1 MATCHING." My understanding is that this 1:1 matching design is to make sure in each replication, each grouppair can only be sampled once, but babies within each grouppair can be random selected. For example, baby 1 in group 1 can be selected twice for group 1 and baby 2 in group 2 can be selected twice for group 2 and so on for one replication. Could you please kindly correct me if I am wrong? In addition, I found that if the parameters could not be estimated in 1 or more bootstrap replicates, then the bootstrap standard error worsens significantly. If this is the case, what is the advantage to bootstrap.

                  Code:
                   bootstrap ratio=exp(_b[smoke]-_b[ptd]), seed(1) reps(500) cluster(pairid) idcluster(newpairid): clogit low lwt smoke pt
                  > d ht ui i.race, group(newpairid)
                  (running clogit on estimation sample)
                  
                  Bootstrap replications (500)
                  ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
                  ..................................................    50
                  ..................................................   100
                  ..................................................   150
                  ...................................x..............   200
                  ..................................................   250
                  ..................................................   300
                  ..................................................   350
                  ..................................................   400
                  ..................................................   450
                  ..................................................   500
                  
                  Bootstrap results                               Number of obs      =       112
                                                                  Replications       =       499
                  
                        command:  clogit low lwt smoke ptd ht ui i.race, group(newpairid)
                          ratio:  exp(_b[smoke]-_b[ptd])
                  
                                                   (Replications based on 56 clusters in pairid)
                  ------------------------------------------------------------------------------
                               |   Observed   Bootstrap                         Normal-based
                               |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                         ratio |   .6654095   3.07e+07     0.00   1.000    -6.02e+07    6.02e+07
                  ------------------------------------------------------------------------------
                  Note: one or more parameters could not be estimated in 1 bootstrap replicate;
                        standard-error estimates include only complete replications.
                  P.S. The data set used in Prof. Klein's study is similar to the data set in the HEART TRANSPLANT DATA, EXAMPLE 5, under STCOX, with dummy variables being time-varying. I tried the example 5 and found that amazingly, during the 500 replications all parameters can be estimated even though SURGERY and POSTTRAN are dummy are time-varying dummy variables?
                  Code:
                  (Heart transplant data)
                  stcox age surgery posttran, vce(bootstrap, reps(500) seed(1))
                  (running stcox on estimation sample)
                  Bootstrap replications (500)
                  ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
                  ..................................................    50
                  ..................................................   100
                  ..................................................   150
                  ..................................................   200
                  ..................................................   250
                  ..................................................   300
                  ..................................................   350
                  ..................................................   400
                  ..................................................   450
                  ..................................................   500
                  
                  Cox regression -- Breslow method for ties
                  
                  No. of subjects =          103                     Number of obs   =       172
                  No. of failures =           75
                  Time at risk    =      31938.1
                                                                     Wald chi2(3)    =      4.73
                  Log likelihood  =   -291.12672                     Prob > chi2     =    0.1927
                  
                                                      (Replications based on 103 clusters in id)
                  ------------------------------------------------------------------------------
                               |   Observed   Bootstrap                         Normal-based
                            _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                           age |   1.033265   .0158303     2.14   0.033       1.0027    1.064762
                       surgery |   .3305094   .9521246    -0.38   0.701      .001167     93.6068
                      posttran |   1.001496   .3226724     0.00   0.996     .5325992    1.883208
                  ------------------------------------------------------------------------------
                  Could you please kindly share me your thoughts on why similar data set could have so different result after bootstrap?
                  Best regards
                  Fan

                  Comment


                  • #10
                    Fan:
                    if you try:
                    Code:
                    bootstrap _b[a9]  _b[war], reps(10) seed(1) noisily: stcox a9 war
                    You can see that some of the SEs are not estimated.
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    Comment


                    • #11
                      Dear Prof. Carlo:

                      Thank you for the tip. I also found the same problem. I guess this may be due to data replication process. During the process of data replication, some of the replicates may not be able to estimate the parameter(s) and standard error(s). Is this common in survival data analysis? And what we can use to solve this issue?

                      Thank you.

                      Fan

                      Comment


                      • #12
                        Fan:
                        as far as I can remember, I never came across such an issue.
                        I'm afraid there's nothing you can do to work that around, but using another option for standard errors, such as:
                        Code:
                        . stcox a9 war, vce(cluster id)
                        
                                 failure _d:  complete
                           analysis time _t:  t1
                                         id:  id
                        
                        Iteration 0:   log pseudolikelihood = -12.801827
                        Iteration 1:   log pseudolikelihood =  -11.22811
                        Iteration 2:   log pseudolikelihood = -11.013794
                        Iteration 3:   log pseudolikelihood = -11.013306
                        Refining estimates:
                        Iteration 0:   log pseudolikelihood = -11.013306
                        
                        Cox regression -- no ties
                        
                        No. of subjects      =            9             Number of obs    =          41
                        No. of failures      =            9
                        Time at risk         =          449
                                                                        Wald chi2(2)     =        4.40
                        Log pseudolikelihood =   -11.013306             Prob > chi2      =      0.1110
                        
                                                             (Std. Err. adjusted for 9 clusters in id)
                        ------------------------------------------------------------------------------
                                     |               Robust
                                  _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                  a9 |   4.222463   2.920978     2.08   0.037     1.088252    16.38334
                                 war |   .5063957   .3990092    -0.86   0.388     .1080923    2.372386
                        ------------------------------------------------------------------------------
                        .

                        PS: Please, call me Carlo as the other listers do. Thanks.
                        Kind regards,
                        Carlo
                        (Stata 19.0)

                        Comment


                        • #13
                          HI Carlo,

                          Thanks for the reply. I am wondering whether it is possible to use residual re-sampling method instead? Any advice ?

                          Regards

                          Comment


                          • #14
                            Fan:
                            what is the rationale behind that approach, in your case?
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            Comment


                            • #15
                              Hi Carlo,

                              Since bootstrapping sample will create problem that some of the SEs are not estimated, I am wondering whether I can reconstruct the dependent variable by bootstrapping residuals. The final goal is to obtain better statistical inference. In fact, I am building one of my dissertation chapters based on the idea of Klein's study to test whether we can use the occurrence and the outcome of a presidential election to predict the the turning points in the U.S. stock market. (Klein's study examines the turning points in business cycle and presidential elections.) The problem is that there are not many bulls and bears in my sample from 07/1877 to 03/ 2009, only 19 bulls and 19 bears. The small sample size makes inference based upon standard asymptotic suspect. Therefore I am thinking to use boot-strap techniques to generate better statistical inference for my study

                              Comment

                              Working...
                              X