How to compare survival of a cohort with the age and gender matched population

Stefano Mastrobuoni

Join Date: Sep 2014

Posts: 11
#1

03 Sep 2014, 14:24

Dear all,

I am studying the survival of a cohort of patients who received a specific treatment, and I would like to compare it with that of an age- and gender-matched population from Belgium. I found Belgian mortality rates on the WHO website, but I do not know how to proceed. How do I convert the mortality rate of the general population (which changes roughly every 5 years of age) into a survival curve? Further, how do I do the age and gender matching? Should I find, for every patient in my cohort, a match of the same age and gender from the general population, or should I just take the mean age of the cohort and match it to the same age in the general population? And then how do I plot this curve together with my cohort's survival curve?
Thank you in advance; any help is greatly appreciated.

Regards
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

03 Sep 2014, 15:29

In order for anyone to answer your question, they will need more information about the data. Show us an informative sample of your patient cohort data, and an informative sample of the Belgium data. Do this by -list-ing a reasonable number of observations from each data set and copying and pasting the Stata results into a code block. (Select the underlined A button on the forum's editor, and then click on the # button to create a code block. Paste between the delimiters that appear.)
Svend Juul

Join Date: Apr 2014

Posts: 515
#3

03 Sep 2014, 15:43

You probably should not use this forum for advertising. On the other hand, in section 13.8 of An Introduction to Stata for Health Researchers (Stata Press) we describe indirect standardization, which is what you are asking about.

Stefano Mastrobuoni

Join Date: Sep 2014
Posts: 11
#4

04 Sep 2014, 14:52

Thank you very much, Clyde Schechter, for your help. The dataset for my cohort is pretty simple: for each patient I have a variable for the time since treatment and an indicator for the status (censored or dead). With this I am doing my survival analysis and building my survival curve (with stset etc.).

On the other hand, the life table of the general Belgian population looks like this:

[CODE]
AGE15-19	15-19	BTSX	Both sexes	0.001
AGE15-19	15-19	BTSX	Both sexes	0.001
AGE15-19	15-19	BTSX	Both sexes	0.000
AGE20-24	20-24	BTSX	Both sexes	0.001
AGE20-24	20-24	BTSX	Both sexes	0.001
AGE40-44	40-44	BTSX	Both sexes	0.001
AGE45-49	45-49	BTSX	Both sexes	0.003
AGE45-49	45-49	BTSX	Both sexes	0.003
AGE45-49	45-49	BTSX	Both sexes	0.002
AGE70-74	70-74	BTSX	Both sexes	0.028
AGE70-74	70-74	BTSX	Both sexes	0.021
AGE75-79	75-79	BTSX	Both sexes	0.054
AGE75-79	75-79	BTSX	Both sexes	0.044
AGE100+	100+	BTSX	Both sexes	0.510
AGE100+	100+	BTSX	Both sexes	0.483
AGE100+	100+	BTSX	Both sexes	0.460
[/CODE]
where the last column is the death rate (there are 3 values for each age group, corresponding to the death rates calculated for 1990, 2000, and 2012).
So, I would like to add the survival curve of the matched population to the survival curve of my cohort and check whether they are indeed different. Do you have any suggestions? Thanks a lot.


Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

04 Sep 2014, 15:01

I find the Belgian life table confusing: why are there three rows for each age group that are identical except for the final column? And what is the final column: is it an annual mortality probability, or something else?
Stefano Mastrobuoni

Join Date: Sep 2014

Posts: 11
#6

05 Sep 2014, 14:08

The 3 rows are the estimates for the years 1990, 2000, and 2012 for each age group; the last column is indeed the death rate. These are the data reported on the WHO website (http://apps.who.int/gho/data/view.main.LT61950?lang=en).
So, I know that survival is 1 - death rate, but which command should I use in Stata to get a survival curve? And how can I match it with my cohort?
Thanks, everyone, for your help.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#7

05 Sep 2014, 14:34

The death rate in the final column: is that annual probability of death or, less likely, probability of dying before the end of the age-group? Let me assume the former. The next step is to -reshape- your data so that the three years, 1990, 2000, and 2012 are separate variables: it makes no sense to combine them into a single survival curve. And you will need to also find out what the death rates are between 25 and 40, 50 and 69, and 80 and 99: those seem to be absent. (I don't know what the upper age is for the 100+ group, 110 or 120 is typical--you should find out.) The next step will be to convert your age range into a pair of numeric variables, lower to upper age. So I'm going to assume that you have already re-arranged your data to the following layout:

Code:

age_group lower_age upper_age m1990 m2000 m2012
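
One way to get from a long file (one row per age group and year) to this layout, as a rough sketch. The variable names `year` and `m`, the stub `bound`, and the handful of rows typed in below are assumptions for illustration, not the structure of the actual WHO download. Labels such as "<1" or "100+" do not follow the "lower-upper" pattern and would need their bounds filled in by hand.

```stata
clear
// a few rows in long form: one row per age group and calendar year
input str8 age_group int year double m
"15-19" 1990 .001
"15-19" 2000 .001
"15-19" 2012 0
"75-79" 1990 .054
"75-79" 2000 .044
"75-79" 2012 .034
end

// one row per age group, one death-rate variable per year: m1990 m2000 m2012
reshape wide m, i(age_group) j(year)

// split "15-19" at the hyphen into numeric lower and upper ages
split age_group, parse("-") generate(bound) destring
rename (bound1 bound2) (lower_age upper_age)

sort lower_age
list age_group lower_age upper_age m1990 m2000 m2012, noobs
```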

Then you can generate survival curves for 1990, 2000, and 2012 as follows:

Code:

foreach y of numlist 1990 2000 2012 {
    // CALCULATE PROBABILITY OF DYING DURING AGE PERIOD
    gen period_m`y' = 1 - (1-m`y')^(upper_age-lower_age+1)
}
// NOW CALCULATE KAPLAN-MEIER ESTIMATOR OF SURVIVAL FUNCTION
sort lower_age
gen S`y' = 1
foreach y of numlist 1990 2000 2012 {
    replace S`y' = S`y'[_n-1]*(1-period_m`y'[_n-1]) if _n > 1
}

These survival curves will be step functions because of the width of the age groups for which you have death rates. If you had 1-year death rates by single year of age, you could get something a bit smoother.
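
To make the arithmetic behind the period probability concrete, here is a small worked example. It assumes, as above, that the published figure is an annual probability of death, and it uses the 1990 value for ages 75-79 quoted in this thread; the scalar names are made up for the illustration.

```stata
// annual death probability for ages 75-79 in 1990, as quoted in this thread
scalar annual_m = .054
scalar n_years  = 79 - 75 + 1       // five single years of age in the group

// probability of surviving every one of the five years,
// and its complement: dying at some point within the age group
scalar p_survive = (1 - annual_m)^n_years
scalar p_die     = 1 - p_survive

display "P(survive ages 75-79) = " %6.4f p_survive
display "P(die during 75-79)   = " %6.4f p_die
```

The period probability is well below 5 x .054 = .27 because people who die in an early year are no longer at risk in the later ones.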
Stefano Mastrobuoni

Join Date: Sep 2014

Posts: 11
#8

10 Sep 2014, 14:20

Dear Clyde, I am sorry to bother you again, but I have tried the code you gave me and I am still having problems. I reshaped the data as follows:

age_group lower_age upper_age m1990 m2000 m2012
<1 0 1 .008 .005 .003
1-4 1 4 0 0 0
5-9 5 9 0 0 0
10-14 10 14 0 0 0
15-19 15 19 .001 .001 0
20-24 20 24 .001 .001 0
25-29 25 29 .001 .001 0
30-34 30 34 .001 .001 .001
35-39 35 39 .001 .001 .001
40-44 40 44 .002 .002 .001
45-49 45 49 .003 .003 .002
50-54 50 50 .005 .005 .004
55-59 55 59 .008 .007 .006
60-64 60 60 .012 .01 .009
65-69 65 69 .02 .017 .013
70-74 70 70 .032 .028 .021
75-79 75 79 .054 .044 .034
80-84 80 80 .09 .078 .063
85-89 85 89 .148 .132 .115
90-94 90 90 .234 .217 .199
95-99 95 99 .345 .323 .316
>100 100 120 .516 .483

Then I ran the code, but it returns an error:

[CODE]
. foreach y of numlist 1990 2000 2012 {
2.
. gen period_m`y' = 1 - (1-m`y')^(upper_age-lower_age+1)
3. }
(1 missing value generated)

.
. sort lower_age

. gen S`y' = 1

. foreach y of numlist 1990 2000 2012 {
2. replace S`y' = S`y'[_n-1]*(1-period_m`y'[_n-1]) if _n > 1
3. }
variable S1990 not found
r(111);
[/CODE]

Any further insight?

thanks a lot

SM

Clyde Schechter

Join Date: Apr 2014
Posts: 30101
#9

10 Sep 2014, 14:55

Sorry, my mistake. The gen S`y' = 1 statement needs to be inside, and the first command of, the second foreach loop, thus:

Code:

foreach y of numlist 1990 2000 2012 {
    // CALCULATE PROBABILITY OF DYING DURING AGE PERIOD
    gen period_m`y' = 1 - (1-m`y')^(upper_age-lower_age+1)
}
// NOW CALCULATE KAPLAN-MEIER ESTIMATOR OF SURVIVAL FUNCTION
sort lower_age

foreach y of numlist 1990 2000 2012 {
    gen S`y' = 1
    replace S`y' = S`y'[_n-1]*(1-period_m`y'[_n-1]) if _n > 1
}
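
As a quick way to try the corrected loops and look at the resulting step functions, the sketch below types in three age groups from the table posted earlier in this thread and plots the curves. Because the toy data start at age 65, the curves here are survival conditional on being alive at 65; the axis titles and the choice of a stairstep connection are my own assumptions about how one might draw them.

```stata
clear
// three age groups taken from the table posted in this thread
input lower_age upper_age m1990 m2000 m2012
65 69 .020 .017 .013
70 74 .032 .028 .021
75 79 .054 .044 .034
end

foreach y of numlist 1990 2000 2012 {
    // probability of dying at some point during the age group
    gen period_m`y' = 1 - (1 - m`y')^(upper_age - lower_age + 1)
}
sort lower_age
foreach y of numlist 1990 2000 2012 {
    // survival to the start of each age group; replace works down the rows in order
    gen S`y' = 1
    replace S`y' = S`y'[_n-1]*(1 - period_m`y'[_n-1]) if _n > 1
}

// draw the three curves as step functions
twoway line S1990 S2000 S2012 lower_age, ///
    connect(stairstep stairstep stairstep) ///
    ytitle("Survival from age 65") xtitle("Age at start of interval") ///
    legend(order(1 "1990" 2 "2000" 3 "2012"))
```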


Paul Dickman

Join Date: Apr 2014

Posts: 294
#10

10 Sep 2014, 15:23

Have a look at this post for another approach (using -strs-). You can get general population mortality rates in 1-year ages for the Belgian population from mortality.org. Later in the thread I linked to you can see a description of how to transform the data from mortality.org into the format required by -strs-.
Stefano Mastrobuoni

Join Date: Sep 2014

Posts: 11
#11

12 Sep 2014, 03:00

Thank you very much Clyde Schechter, it works fine now.

Thank you, Paul Dickman. It took me some time to go through all the steps, but it is a very nice approach. However, when I draw the survival curve of my cohort with sts graph, it looks a bit different from the graph I get using your code. Please have a look:

This is the full cohort with sts graph:

and this is the graph I got with the code you wrote:

strs using BelgianPopMort, br(0(0.5)20) mergeby(_year male _age) by(male agegroup) notables save(replace)

use grouped if male==1 & agegroup==3, clear

twoway (line cp end, lw(medthick)) (line cp_e2 end, lw(medthick)), yti("Survival") ylabel(0(0.2)1, format(%3.1f)) xti("Years from diagnosis") xla(0(1)20) legend(order(1 "Observed" 2 "Expected") ring(0) pos(7) col(1))

Why is that? And (I am sorry, but I am a beginner with Stata) how can I draw a smoother observed curve?

Thanks a lot

SM
Paul Dickman

Join Date: Apr 2014

Posts: 294
#12

14 Sep 2014, 01:25

Hi Stefano,

I couldn't see the graphs in your post, but here are some general comments.

sts graph, by default, estimates the survivor function using the Kaplan-Meier method whereas -strs- uses the actuarial (life table) method. The two estimates should be very similar. Both methods effectively divide the follow-up time into subintervals, the difference being the Kaplan-Meier method creates a new subinterval at each event time whereas for the life table approach they are pre-specified (6 months in your example). As such, you can make the approaches more similar (and the observed curve more smooth) by specifying shorter intervals.

Be aware that -strs- requires time to be in years, so use scale(365.24) when you stset if you have time in days. This is because the expected survival proportions are also specified in years.
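
A minimal sketch of the time-scale point, using the scale() option of stset. The three patients and the variable names `id`, `time_days`, and `dead` are invented for the illustration; substitute your own.

```stata
clear
// invented follow-up data: time since treatment in days, dead = 1 if died
input id time_days dead
1  400 1
2 1826 0
3  913 1
end

// scale(365.24) divides the recorded time by 365.24, so analysis time _t is in years
stset time_days, failure(dead==1) id(id) scale(365.24)

list id time_days _t _d, noobs
```

After this, any interval breaks passed to -strs- (such as 0(0.5)20) refer to years of follow-up.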

If you can't find the source of the difference then have a look at the table of survival estimates. sts list will give you the Kaplan-Meier estimates and removing the "notables" option for -strs- will show the life tables. Both tables will show the number at risk, number of events, etc. and you should be able to spot where the difference is.

Paul
Stefano Mastrobuoni

Join Date: Sep 2014

Posts: 11
#13

14 Sep 2014, 13:52

Hi Paul,

Thank you very much again for your help. I also realized that I was using the if qualifier, so I was restricting my analysis to a specific subset while I was interested in the full cohort.
I have one last question and hope to bother you no more. How can I finally test whether the 2 curves are statistically different, as with the log-rank test? I looked into the models.do file and it seems quite difficult for me… Thanks in advance

Stefano
Paul Dickman

Join Date: Apr 2014

Posts: 294
#14

15 Sep 2014, 03:44

An easy way to compare the two curves is to look at the relative survival ratio (RSR), which is simply the ratio of the two curves (observed/expected). This is calculated by strs (together with 95% confidence intervals) and saved in the grouped data file. strs calculates both the cumulative RSR (stored in cr_e2) and the conditional RSR (stored in r). When studying the survival of cancer patients (the typical application of relative survival) it's very common to use the 5-year RSR as a summary of survival.

If your patients have the same survival as the general population then RSR will be 1. If the 95% CI for the cumulative RSR does not contain 1 then this is evidence of a statistically significant difference.
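
The check described above can be sketched in a few lines. The numbers below are made up purely to show the logic and stand in for grouped.dta; only the variable names (end, cr_e2, lo_cr_e2, hi_cr_e2) follow this thread, and the flag variable `excludes_one` is hypothetical.

```stata
clear
// made-up values standing in for grouped.dta: NOT real results
set obs 2
gen end      = cond(_n == 1, 1, 5)          // end of follow-up interval, in years
gen cr_e2    = cond(_n == 1, .97, .90)      // cumulative RSR
gen lo_cr_e2 = cond(_n == 1, .93, .84)      // lower 95% limit
gen hi_cr_e2 = cond(_n == 1, 1.01, .96)     // upper 95% limit

// flag intervals whose 95% CI for the cumulative RSR does not contain 1
gen byte excludes_one = (hi_cr_e2 < 1) | (lo_cr_e2 > 1)

list end cr_e2 lo_cr_e2 hi_cr_e2 excludes_one, noobs
```

With your own results you would replace the first block with `use grouped, clear`, restricted to the stratum of interest.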

You could look at the estimates and CIs in the life tables or use the following code to plot the cumulative RSR based on the data saved to grouped.dta.

Code:

twoway (rarea lo_cr_e2 hi_cr_e2 end, sort) ///
       (line cr_e2 end, sort), ///
       yti("Relative Survival") legend(off) ///
       ylabel(0(0.2)1, format(%3.1f)) ///
       xti("Years from diagnosis") xla(0(1)10)

For conditional RSR, substitute r, lo_r and hi_r. I find the plots of the conditional RSR very informative since they show at what point in the follow-up the differences occur. For example, you might find the RSR is initially less than 1 but then after some time it returns to 1 (the surviving patients now have the same mortality as a comparable general population).

For an example of this type of analysis, have a look at Hultcrantz et al, J Clin Oncol. 2012 30(24):2995-3001.

Note that there is a slight difference between this application and, for example, comparing the survival of patients in two treatment arms. In your application (as is standard in relative survival) we are assuming that the expected survival is fixed and known (i.e., no random error). Like most assumptions made in statistics, we know this is not perfectly true but we are willing to make the assumption.

The modelling is not as hard as it looks (the code in model.do compares different ways of modelling and you only need one in practice). If all you want to do is compare observed to expected for your cohort then you can do without modelling. If, however, you want to compare, for example, if RSR differs across treatments (or by sex or agegroup) then you will need modelling.
Stefano Mastrobuoni

Join Date: Sep 2014

Posts: 11
#15

15 Sep 2014, 13:16

Thanks Paul, it is just awesome!