Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Out of sample survival time prediction--Gompertz regression

    Hi
    I am trying to assist a colleague who has conducted a survival analysis using Gompertz regression.
    After streg, we can use predict to calculate the survival times for any given individual.
    However, we would like to take the regression results, and use a non-Stata tool to provide an estimated survival time for any given individual, even if not in the original sample.For some background, the dataset has 7480 individuals.
    There are 20 predictors, which include age (the only numerical variable) and 19 categorical variables (gender, smoker, etc.).
    Given that number of categorical predictors, we have 524288 (2^19) possible combinations for each year of age.
    This is the output we get from Stata:

    Gompertz reg output:


    Gompertz regression -- log relative-hazard form

    No. of subjects = 7,480 Number of obs = 7,480
    No. of failures = 2,456
    Time at risk = 85978.26441
    LR chi2(21) = 3661.73
    Log likelihood = -4994.5731 Prob > chi2 = 0.0000

    ------------------------------------------------------------------------------
    _t | Coef. Std. Err. z P>|z| [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    1.male | .4559516 .0435201 10.48 0.000 .3706538 .5412495
    age_wv4a | .0912029 .0279085 3.27 0.001 .0365033 .1459025
    agesq | .000071 .0001897 0.37 0.708 -.0003008 .0004428
    1.marital2 | .3445814 .1001494 3.44 0.001 .1482922 .5408706
    r4hibpe | .1241166 .0418009 2.97 0.003 .0421883 .2060448
    r4diabe | .4531136 .0588017 7.71 0.000 .3378644 .5683629
    r4cancre | .2513523 .053305 4.72 0.000 .1468765 .3558282
    r4lunge | .5372277 .0718719 7.47 0.000 .3963614 .6780941
    r4hearte | .2590508 .0491937 5.27 0.000 .1626328 .3554688
    chf_p2y | .3213181 .1025171 3.13 0.002 .1203883 .5222478
    r4psyche | .1324422 .0758445 1.75 0.081 -.0162103 .2810946
    r4memrye | .7246787 .1210191 5.99 0.000 .4874855 .9618718
    walksevblk | .302861 .0533489 5.68 0.000 .1982991 .4074228
    stairssev | .1633768 .0492362 3.32 0.001 .0668757 .2598779
    vigact_no | .1562754 .0454623 3.44 0.001 .0671709 .24538
    currsmoker | .6843489 .0645428 10.60 0.000 .5578473 .8108505
    bmi_lt25 | .1570624 .0433098 3.63 0.000 .0721768 .2419481
    r4hosp | .1700956 .045846 3.71 0.000 .0802391 .259952
    r4shlt3 | .2254694 .0516955 4.36 0.000 .1241481 .3267906
    r4shlt4 | .3441239 .0635972 5.41 0.000 .2194758 .4687721
    r4shlt5 | .7401511 .0826021 8.96 0.000 .578254 .9020482
    _cons | -12.39047 1.016682 -12.19 0.000 -14.38313 -10.39781
    -------------+----------------------------------------------------------------
    /gamma | .1446181 .0056125 25.77 0.000 .1336178 .1556184
    ------------------------------------------------------------------------------


    How do we take those regression coefficients and build an estimating equation to predict survival for any given individual?

    Thanks

  • #2
    Jacques:
    welcome to the list.
    Generalizing coefficients obtained by a given sample to a different one, should be handled with care (see http://www.bmj.com/content/bmj/317/7155/409.full.pdf).
    That said, Stata can give you in-sample and out-of-sample predictions for each individual included in the same dataset, as you can see from the following toy-example:
    Code:
    . use http://www.stata-press.com/data/r14/kva
    (Generator experiment)
    . streg load bearings if failtime>=30, distribution(gompertz)
    
             failure _d:  1 (meaning all fail)
       analysis time _t:  failtime
    
    Fitting constant-only model:
    
    Iteration 0:   log likelihood = -12.127509 
    Iteration 1:   log likelihood = -9.5129998 
    Iteration 2:   log likelihood = -7.3747905 
    Iteration 3:   log likelihood = -7.3675129 
    Iteration 4:   log likelihood =  -7.367507 
    Iteration 5:   log likelihood =  -7.367507 
    
    Fitting full model:
    
    Iteration 0:   log likelihood =  -7.367507 
    Iteration 1:   log likelihood = -4.3798903 
    Iteration 2:   log likelihood =  7.2950569 
    Iteration 3:   log likelihood =  8.9983715 
    Iteration 4:   log likelihood =  9.2486779 
    Iteration 5:   log likelihood =  9.2515131 
    Iteration 6:   log likelihood =  9.2515148 
    Iteration 7:   log likelihood =  9.2515148 
    
    Gompertz regression -- log relative-hazard form
    
    No. of subjects =           11                  Number of obs    =          11
    No. of failures =           11
    Time at risk    =          874
                                                    LR chi2(2)       =       33.24
    Log likelihood  =    9.2515148                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------
              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            load |   2.093645   .4342168     3.56   0.000     1.394331    3.143693
        bearings |   .0434769   .0438715    -3.11   0.002     .0060163    .3141851
           _cons |   1.64e-15   1.43e-14    -3.90   0.000     5.99e-23    4.49e-08
    -------------+----------------------------------------------------------------
          /gamma |   .1742146    .046893     3.72   0.000     .0823061    .2661231
    ------------------------------------------------------------------------------
    
    . drop pred1 pred2
    
    . predict pred1 if e(sample)
    (option median time assumed; predicted median time)
    (1 missing value generated)
    
    . predict pred2
    (option median time assumed; predicted median time)
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Thank you Carlo, and thank you also for the response to the question.
      I am aware that there could be a certain degree of error, and that does lead me to a question which I had intended to ask yesterday as well, which is whether it is possible to calculate a confidence interval around the predicted survival time?
      However, reducing that error is one of the reasons that the Gompertz distribution was chosen, and the ages of individuals for whom the predictions will be calculated will be restricted. Even Altman and Bland finish that note by saying "Nevertheless, it is reasonable to use such models to make predictions for patients whose important characteristics are within the range in the original data" (emphasis added).
      Which brings me back to the question, how do I construct an equation, given only Stata's output, that will give me the same calculation as Stata's predict?

      Comment


      • #4
        Jacques:
        judging whether or not important characteristics do overlap among different patient samples is obviously up to researchers.
        That said, if you're interested in linear prediction after -streg- what follows might be useful:
        Code:
        use "http://www.stata-press.com/data/r14/drugtr.dta", clear
        . streg i.drug c.age, dis(gomp) nohr
        
                 failure _d:  died
           analysis time _t:  studytime
        
        Fitting constant-only model:
        
        Iteration 0:   log likelihood = -61.342985 
        Iteration 1:   log likelihood = -60.975094 
        Iteration 2:   log likelihood = -60.971068 
        Iteration 3:   log likelihood = -60.971068 
        
        Fitting full model:
        
        Iteration 0:   log likelihood = -60.971068 
        Iteration 1:   log likelihood = -54.383548 
        Iteration 2:   log likelihood = -43.480023 
        Iteration 3:   log likelihood = -43.402766 
        Iteration 4:   log likelihood = -43.402668 
        Iteration 5:   log likelihood = -43.402668 
        
        Gompertz regression -- log relative-hazard form
        
        No. of subjects =           48                  Number of obs    =          48
        No. of failures =           31
        Time at risk    =          744
                                                        LR chi2(2)       =       35.14
        Log likelihood  =   -43.402668                  Prob > chi2      =      0.0000
        
        ------------------------------------------------------------------------------
                  _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              1.drug |  -2.305757   .4443439    -5.19   0.000    -3.176655   -1.434859
                 age |   .1136661   .0365283     3.11   0.002      .042072    .1852601
               _cons |  -9.111651   2.129053    -4.28   0.000    -13.28452   -4.938784
        -------------+----------------------------------------------------------------
              /gamma |   .0763246   .0234854     3.25   0.001     .0302941    .1223551
        ------------------------------------------------------------------------------
        
        . list age drug pred in 1
        
             +------------------------+
             | age   drug        pred |
             |------------------------|
          1. |  61      0   -2.178021 |
             +------------------------+
        
        . di -9.111651 + 61*(.1136661) + (-2.305757*0)
        -2.1780189
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Hi Carlo
          Apologies for the late response--I had some other more pressing issues to attend to.
          Using the same variables as listed in my first post, we get:

          . list male age_wv4a agesq marital2 r4hibpe r4diabe r4cancre r4lunge r4hearte chf_p2y r4psyche r4memrye walksevblk stairssev vigact_no currsmoker bmi_lt25 r4hosp r4shlt3 r4shlt4 r4shlt5 predlifeexpec in 1

          +----------------------------------------------------------------------------------------------------------------------------------------------+
          1. | male | age_wv4a | agesq | marital2 | r4hibpe | r4diabe | r4cancre | r4lunge | r4hearte | chf_p2y | r4psyche | r4memrye | walkse~k | stairs~v |
          | 1 | 62 | 3844 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
          |-----------------+----------------------------------------------------------------------------------------------------------------------------|
          | vigact~o | currsm~r | bmi_lt25 | r4hosp | r4shlt3 | r4shlt4 | r4shlt5 | predli~c |
          | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 22.55135 |
          +----------------------------------------------------------------------------------------------------------------------------------------------+

          But that gives is a predicted life expectancy of 22.55, whereas if I calculate the equation, we get -5.52
          If I do your example, then calculating the question manually gives the same answer as Stata.
          In any case I am not sure that the final answer should be negative.
          Can you please elaborate a bit, because I am trying to understand why your example works, but the same method does not work when we apply it to our data?

          thanks

          Comment


          • #6
            Jacques:
            your Stata output is unreadable (by the way, please confirm that the hardly readable coefficients reported in your first post still hold for the calculation).
            Please use CODE deliimiters (as per FAQ) to share it. Thanks.
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7
              Thanks Carlo
              Let me try.
              Here are the coefficients:
              Code:
              Gompertz regression -- log relative-hazard form
               
              No. of subjects =        7,480                  Number of obs    =       7,480
              No. of failures =        2,456
              Time at risk    =  85978.26441
                                                              LR chi2(21)      =     3661.73
              Log likelihood  =   -4994.5731                  Prob > chi2      =      0.0000
               
              ------------------------------------------------------------------------------
                        _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                    1.male |   .4559516   .0435201    10.48   0.000     .3706538    .5412495
                  age_wv4a |   .0912029   .0279085     3.27   0.001     .0365033    .1459025
                     agesq |    .000071   .0001897     0.37   0.708    -.0003008    .0004428
                1.marital2 |   .3445814   .1001494     3.44   0.001     .1482922    .5408706
                   r4hibpe |   .1241166   .0418009     2.97   0.003     .0421883    .2060448
                   r4diabe |   .4531136   .0588017     7.71   0.000     .3378644    .5683629
                  r4cancre |   .2513523    .053305     4.72   0.000     .1468765    .3558282
                   r4lunge |   .5372277   .0718719     7.47   0.000     .3963614    .6780941
                  r4hearte |   .2590508   .0491937     5.27   0.000     .1626328    .3554688
                   chf_p2y |   .3213181   .1025171     3.13   0.002     .1203883    .5222478
                  r4psyche |   .1324422   .0758445     1.75   0.081    -.0162103    .2810946
                  r4memrye |   .7246787   .1210191     5.99   0.000     .4874855    .9618718
                walksevblk |    .302861   .0533489     5.68   0.000     .1982991    .4074228
                 stairssev |   .1633768   .0492362     3.32   0.001     .0668757    .2598779
                 vigact_no |   .1562754   .0454623     3.44   0.001     .0671709      .24538
                currsmoker |   .6843489   .0645428    10.60   0.000     .5578473    .8108505
                  bmi_lt25 |   .1570624   .0433098     3.63   0.000     .0721768    .2419481
                    r4hosp |   .1700956    .045846     3.71   0.000     .0802391     .259952
                   r4shlt3 |   .2254694   .0516955     4.36   0.000     .1241481    .3267906
                   r4shlt4 |   .3441239   .0635972     5.41   0.000     .2194758    .4687721
                   r4shlt5 |   .7401511   .0826021     8.96   0.000      .578254    .9020482
                     _cons |  -12.39047   1.016682   -12.19   0.000    -14.38313   -10.39781
              -------------+----------------------------------------------------------------
                    /gamma |   .1446181   .0056125    25.77   0.000     .1336178    .1556184
              ------------------------------------------------------------------------------
              Then here is the .List output:
              Code:
              . list male age_wv4a agesq marital2 r4hibpe r4diabe r4cancre r4lunge r4hearte chf_p2y r4psyche r4memrye walksevblk stairssev vigact_no currsmoker bmi_lt25 r4hosp r4shlt3 r4shlt4 r4shlt5 predlifeexpec in 1
               
                   +----------------------------------------------------------------------------------------------------------------------------------------------+
                1. | male | age_wv4a | agesq | marital2 | r4hibpe | r4diabe | r4cancre | r4lunge | r4hearte | chf_p2y | r4psyche | r4memrye | walkse~k | stairs~v |
                   |    1 |       62 |  3844 |        0 |       0 |       0 |        0 |       0 |        1 |       0 |        0 |        0 |        0 |        0 |
                   |-----------------+----------------------------------------------------------------------------------------------------------------------------|
                   |    vigact~o     |    currsm~r     |    bmi_lt25     |    r4hosp     |     r4shlt3     |     r4shlt4     |     r4shlt5     |     predli~c     |
                   |           0     |           0     |           0     |         0     |           1     |           0     |           0     |     22.55135     |
                   +----------------------------------------------------------------------------------------------------------------------------------------------+

              Comment


              • #8
                Jacques:
                please show the code for -predlifeexpec-. Thanks.
                Kind regards,
                Carlo
                (Stata 19.0)

                Comment


                • #9
                  HI Carlo
                  It was simply
                  . predict predlifeexpec , time
                  If I may ask, how did you calculate your pred variable shown in your list command? That may be where the discrepancy is in my own calculation.
                  Thanks

                  Comment


                  • #10
                    Jacques:
                    my calculation of -predict- by hand was as follows (#4):
                    -_cons (-9.111651)+age(61 years)*age coefficient(.1136661)+ drug coded 0*drug coefficient (-2.305757)
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    Comment


                    • #11
                      HI Carlo:
                      Yes, but in
                      Code:
                      . list age drug pred in 1
                      You have pred in Stata. Did Stata calculate that for you?
                      My challenge is that when we do your calculation with your coefficients, we get the same answer as you do.
                      But when we do the same linear calculation on our values, we get -5.52, while Stata is giving us 22.55 to
                      Code:
                      . predict predlifeexpec , time
                      It is that discrepancy that we are trying to understand, and, if possible, gain an understanding of what calculation Stata is doing to calculate its prediction.

                      On a secondary note (and at the risk of hijacking my own thread), I still don't grasp how a predicted life expectancy (in terms of added years) can be negative.
                      I understand that it can indicate that an individual has exceeded the mean or median life expectancy of their cohort, but when age itself is provided, surely it should indicate remaining years for the cohort that has survived up to that point, which may approach zero, but should not be less than zero?

                      thank you

                      Comment


                      • #12
                        Jacques:
                        my fault.
                        The missing line of code was:
                        Code:
                        predict predict, xb
                        Kind regards,
                        Carlo
                        (Stata 19.0)

                        Comment


                        • #13
                          Thank you Carlo for your expert input so far! I was working on the same problem as Jacques and your advice has been very useful. I was just wondering: the postestimation for streg provides the predicted median survival time, but is there a way to compute 95% CI?
                          Even better if we could compute the point estimate and 95% CI for the difference between two predicted median survival times

                          I'm asking this question because I'm trying to emulate a recent NEJM paper: https://www.nejm.org/doi/full/10.1056/NEJMoa2002449 which, as you can see, used out-of-sample Gompertz predictions to extrapolate the median life expectancy. And whilst I am able to compute the point estimate of the difference between the two predicted median life expectancies, I can't seem to get a 95% CI around it..

                          Hope you will be able to advise and thank you very much for your help!

                          Comment


                          • #14
                            Nichiolas:
                            welcome to this forum.
                            A temptative answer (that can be tweaked according to your needs):
                            Code:
                            . webuse kva
                            (Generator experiment)
                            
                            . stset failtime
                            
                                 failure event:  (assumed to fail at time=failtime)
                            obs. time interval:  (0, failtime]
                             exit on or before:  failure
                            
                            ------------------------------------------------------------------------------
                                     12  total observations
                                      0  exclusions
                            ------------------------------------------------------------------------------
                                     12  observations remaining, representing
                                     12  failures in single-record/single-failure data
                                    896  total analysis time at risk and under observation
                                                                            at risk from t =         0
                                                                 earliest observed entry t =         0
                                                                      last observed exit t =       140
                            
                            . streg load bearings, d(gompertz)
                            
                                     failure _d:  1 (meaning all fail)
                               analysis time _t:  failtime
                            
                            Fitting constant-only model:
                            
                            Iteration 0:   log likelihood = -13.666193 
                            Iteration 1:   log likelihood = -11.025444 
                            Iteration 2:   log likelihood = -9.6720715 
                            Iteration 3:   log likelihood = -9.6599087 
                            Iteration 4:   log likelihood = -9.6599004 
                            Iteration 5:   log likelihood = -9.6599004 
                            
                            Fitting full model:
                            
                            Iteration 0:   log likelihood = -9.6599004 
                            Iteration 1:   log likelihood = -7.3010298 
                            Iteration 2:   log likelihood =  6.8696578 
                            Iteration 3:   log likelihood =  8.8658999 
                            Iteration 4:   log likelihood =  9.1747027 
                            Iteration 5:   log likelihood =    9.17623 
                            Iteration 6:   log likelihood =  9.1762304 
                            
                            Gompertz PH regression
                            
                            No. of subjects =           12                  Number of obs    =          12
                            No. of failures =           12
                            Time at risk    =          896
                                                                            LR chi2(2)       =       37.67
                            Log likelihood  =    9.1762304                  Prob > chi2      =      0.0000
                            
                            ------------------------------------------------------------------------------
                                      _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                            -------------+----------------------------------------------------------------
                                    load |   2.060335   .4004514     3.72   0.000     1.407656    3.015639
                                bearings |   .0472065    .047521    -3.03   0.002     .0065634    .3395267
                                   _cons |   1.95e-15   1.63e-14    -4.05   0.000     1.47e-22    2.59e-08
                            -------------+----------------------------------------------------------------
                                  /gamma |   .1755645   .0456888     3.84   0.000     .0860161    .2651129
                            ------------------------------------------------------------------------------
                            Note: _cons estimates baseline hazard.
                            
                            . predict time, time
                            (option median time assumed; predicted median time)
                            
                            . centile time, centile(50)
                            
                                                                                   -- Binom. Interp. --
                                Variable |       Obs  Percentile    Centile        [95% Conf. Interval]
                            -------------+-------------------------------------------------------------
                                    time |        12         50    76.39389        38.67568      114.12
                            
                            .
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            Comment

                            Working...
                            X