
  • Time-dependent covariates

    Dear all,

    I am trying to assess associations between follow-up levels of risk factors and the outcome (an event). To do this, first, each participant's follow-up period should be divided into a series of intervals defined by the two-year visits (in my case); second, the presence or absence of the event should be documented for each interval, by group, and coupled to the time-weighted average of the risk factors recorded during the interval before the event develops.

    I have generated time-weighted averages of some risk factors (blood pressure, lipid levels), but I am having difficulty with the further analysis (I am very new to this kind of analysis). Any suggestions on how to calculate hazard ratios (HRs) with time-dependent covariates?

    Any help will be appreciated.
    Thank you.

    Data looks as below.

    HTML Code:
    clear
    input byte(studytime event baseline_age id) float(id1 days_from_baseline) double(fpg screat) int(ldl hdl sbp dbp) float(t_years sumw sumwsbp meansbp sumwdbp meandbp sumwldl meanldl sumwhdl meanhdl gr)
    1 1 61 1  1    0 215  1.9 105 46 115 60 0 0    0 114.44444   0 65.77778   0 101.77778   0 47.11111 1
    1 1 61 1  2  730 198 1.99 122 45 108 62 2 2  216 114.44444 124 65.77778 244 101.77778  90 47.11111 1
    1 1 61 1  3 1095  99  1.6  88 46 114 64 3 5  558 114.44444 316 65.77778 508 101.77778 228 47.11111 1
    1 1 61 1  4 1460  95  1.6 102 49 118 69 4 9 1030 114.44444 592 65.77778 916 101.77778 424 47.11111 1
    3 0 56 2  5    0  88 1.85 109 44 110 60 0 0    0 110.22222   0 61.55556   0  98.77778   0       44 1
    3 0 56 2  6  730  96  1.7 120 51 127 77 2 2  254 110.22222 154 61.55556 240  98.77778 102       44 1
    3 0 56 2  7 1095  79 2.07  91 42 110 56 3 5  584 110.22222 322 61.55556 513  98.77778 228       44 1
    3 0 56 2  8 1460 150 1.94  94 42 102 58 4 9  992 110.22222 554 61.55556 889  98.77778 396       44 1
    2 1 76 3  9    0  41   .6 136 51 121 65 0 0    0     119.4   0       67   0      91.2   0     49.6 2
    2 1 76 3 10  730  52  .52  78 55 123 73 2 2  246     119.4 146       67 156      91.2 110     49.6 2
    2 1 76 3 11 1095  61   .6 100 46 117 63 3 5  597     119.4 335       67 456      91.2 248     49.6 2
    4 1 79 4 12    0 149  .54  88 65 137 70 0 0    0 128.44444   0       71   0  95.88889   0 52.88889 2
    4 1 79 4 13  730  98   .6  87 48 126 73 2 2  252 128.44444 146       71 174  95.88889  96 52.88889 2
    4 1 79 4 14 1095 119   .6 103 48 124 67 3 5  624 128.44444 347       71 483  95.88889 240 52.88889 2
    4 1 79 4 15 1460  75  .61  95 59 133 73 4 9 1156 128.44444 639       71 863  95.88889 476 52.88889 2
    end


  • #2
    Buyada:
    see the -tvc()- option under -stcox-.
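
    For instance, a minimal sketch (assuming the data have already been -stset- and taking -sbp- as the covariate of interest; adapt the variable names to your data):

    Code:
    // sketch only: tvc() interacts the listed covariate with analysis time;
    // texp() chooses the function of time used for that interaction
    stcox i.gr baseline_age sbp, tvc(sbp) texp(ln(_t))
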
    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Carlo's response does not address the issues raised in the first paragraph of #1. I don't blame him. The data presented is full of variables whose relationship to the question at hand is unclear. Is the time variable here studytime, or days_from_baseline, or t_years? What are those sums and means near the end of the data set? Why is there a "variable" gr that is always 1? For that matter, as an economist, Carlo cannot be expected to know which of these variables represent blood pressure and lipids: ldl, hdl, sbp, dbp, and screat are medical jargon. Nor would a non-medical person recognize what fpg is.

      I suggest you clarify for everybody what variables are what here. Perhaps then somebody will be able to help you with the two-year aggregations.



      • #4
        It's always a good idea to provide a minimal example that illustrates a problem. In your case, one time-dependent variable should be sufficient. Other time-dependent variables, and baseline variables like age, can be omitted from the dataex listing and from your stcox models.

        You say you have "difficulty in further analysis", but you don't show the commands and results that illustrate this difficulty. For example, you've not shown your stset statement, the result of stdes, or any of the stcox statements and their results. Be sure to place all commands and results between CODE delimiters. I would also want to see a minimal illustration of the "time-weighted" average for the single time-dependent covariate.
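
        For instance, a minimal illustration might look something like this (a sketch only, with guessed variable names; exit_day, for days from baseline to study exit, is an assumption):

        Code:
        // sketch only: one time-dependent covariate (sbp), multiple records per person
        // t1 = day each record's interval ends (next visit, or study exit on the last record)
        bysort id (days_from_baseline): gen long t1 = ///
            cond(_n < _N, days_from_baseline[_n+1], exit_day)
        bysort id (days_from_baseline): gen byte fail = event & _n == _N
        stset t1, id(id) failure(fail==1)
        stdes
        stcox sbp baseline_age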


        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2



        • #5
          Dear Prof. Schechter and Mr. Samuels,

          Thank you for your responses, and sorry for the unclear post.

          studytime - time (years) to event
          event - 0 = censored; 1 = event
          days_from_baseline - days from baseline until the end (exit) of the study
          fpg - fasting plasma glucose
          screat - serum creatinine
          ldl - low-density lipoprotein
          hdl - high-density lipoprotein
          sbp - systolic blood pressure
          dbp - diastolic blood pressure

          I've generated 3 groups based on baseline albuminuria.

          As for "What are those sums and means near the end of the data set?": those come from my attempt to generate time-weighted averages of sbp, dbp, and the lipoproteins during follow-up.

          As suggested, some unnecessary variables have been removed, and the data now look as below.

          PHP Code:
          clear
          input byte(studytime event baseline_age id) float(days_from_baseline) double(fpg screat) int(ldl hdl sbp dbp) float(meansbp meandbp meanldl meanhdl gr)
          1 1 61 1    0 215  1.9 105 46 115 60 114.44444 65.77778 101.77778 47.11111 1
          1 1 61 1  730 198 1.99 122 45 108 62 114.44444 65.77778 101.77778 47.11111 1
          1 1 61 1 1095  99  1.6  88 46 114 64 114.44444 65.77778 101.77778 47.11111 1
          1 1 61 1 1460  95  1.6 102 49 118 69 114.44444 65.77778 101.77778 47.11111 1
          3 0 56 2    0  88 1.85 109 44 110 60 110.22222 61.55556  98.77778       44 1
          3 0 56 2  730  96  1.7 120 51 127 77 110.22222 61.55556  98.77778       44 1
          3 0 56 2 1095  79 2.07  91 42 110 56 110.22222 61.55556  98.77778       44 1
          3 0 56 2 1460 150 1.94  94 42 102 58 110.22222 61.55556  98.77778       44 1
          2 1 76 3    0  41   .6 136 51 121 65     119.4       67      91.2     49.6 2
          2 1 76 3  730  52  .52  78 55 123 73     119.4       67      91.2     49.6 2
          2 1 76 3 1095  61   .6 100 46 117 63     119.4       67      91.2     49.6 2
          4 1 79 4    0 149  .54  88 65 137 70 128.44444       71  95.88889 52.88889 3
          4 1 79 4  730  98   .6  87 48 126 73 128.44444       71  95.88889 52.88889 3
          4 1 79 4 1095 119   .6 103 48 124 67 128.44444       71  95.88889 52.88889 3
          4 1 79 4 1460  75  .61  95 59 133 73 128.44444       71  95.88889 52.88889 3
          end 
          Last edited by Buyadaa Oyunchimeg; 08 Jul 2018, 18:32.



          • #6
            So with this data set, you can do the following, which will get you your two-year groupings.

            Code:
            //     IDENTIFY TWO-YEAR GROUPINGS OF OBSERVATIONS WITHIN ID
            gen int group2yr = ceil(days_from_baseline/730)
            
            //    VERIFY THAT DAYS_FROM_BASELINE IS ALWAYS A MULTIPLE OF 365
            assert days_from_baseline == 365*int(days_from_baseline/365)
            
            //    VERIFY THAT STUDY GROUP IS CONSTANT WITHIN ID
            by id (gr), sort: assert gr[1] == gr[_N]
            
            //    VERIFY BASELINE AGE IS CONSTANT THROUGH ALL OBSERVATIONS ON AN INDIVIDUAL
            by id (baseline_age), sort: assert baseline_age[1] == baseline_age[_N]
            
            //    CALCULATE TIME WEIGHTED AVERAGES OF VARIABLES IN TWO YEAR INTERVALS
            //    NOTE THAT EACH TWO YEAR INTERVAL WILL CONTAIN EITHER ONE OR TWO
            //    OBSERVATIONS.  IN EITHER CASE, THE OBSERVATIONS ARE EQUALLY SEPARATED
            //    IN TIME, AND THEREFORE THE TIME WEIGHTED AVERAGE IS JUST THE SIMPLE AVERAGE
            //    AND REDUCE TO ONE OBSERVATION PER 2 YEAR PERIOD.  ALSO RETAIN WHETHER
            //    OR NOT AN EVENT OCCURS IN THE TWO-YEAR PERIOD.
            collapse (mean) fpg screat ldl hdl sbp dbp (max) event (first) baseline_age gr ///
                (max) days_from_baseline, by(id group2yr)
            The subject of time-varying covariates is complicated and, I think, too unwieldy to cover in a Forum post. I would refer you instead to the -stcox- section of the PDF manuals, which has some worked examples using either the -tvc()- option or multiple records per person. (Your data are already in multiple-records-per-person form, so apart from speed considerations, that may be the best way for you to go; see the sketch below.)

            All of that said, I do not understand why you are aggregating your data up to two year intervals. You are just throwing information away by doing this, and I think you would be better off analyzing the data as it already is.
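
            In rough outline, the multiple-records approach amounts to something like the following (a sketch only; interval_end and fail stand for variables you would have to construct, marking the day each record's interval ends and whether the event occurred in that interval):

            Code:
            // sketch only: multiple records per person, covariates held at their value for each interval
            stset interval_end, id(id) failure(fail==1)
            stcox sbp dbp ldl hdl fpg screat baseline_age i.gr

            With the data set up this way, the time-dependent covariates enter -stcox- as ordinary variables: their values simply differ across a person's records. The -tvc()- option is only needed if you also want the coefficients themselves to change with time.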



            • #7
              Dear Prof. Schechter,

              Thank you so much for the code and valuable comments. I would like to do this as an additional analysis (to compare with long-term prediction of the event that uses only the baseline data and ignores subsequent repeated measures of the covariates).

              As suggested, I will try the -tvc()- option.

              Last edited by Buyadaa Oyunchimeg; 08 Jul 2018, 20:35.



              • #8
                I too question the two-year interval approach. Moreover, I doubt that your data can be analyzed as shown.

                1. For the people in the sample listing, the event occurs either in every time interval or in no interval. This seems unlikely to me; usually the event occurs, if at all, in the last interval of follow-up. For examples of what multiple-record data should look like, see the examples in the manual entry for stset. (The snapspan command may also be relevant.)

                2. There are only three two-year intervals in your sample listing. If only the interval in which an event took place is known, then the data are too heavily grouped for a valid Cox analysis. For alternatives, consult the entry discrete in the Survival manual and the references to Stephen Jenkins's teaching materials in this post.

                3. If, however, the calendar dates of events are known, then the timescale should be days from baseline, and you should analyze exact days for event occurrence and for exit from the study. For a Cox model, you'll probably want to use stsplit, at(failures) (see the sketch below). For other commands (e.g., stpm2), you may need stfill.

                I don't think that you need the tvc option unless there is evidence of non-proportional hazards.
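
                In outline, that might look something like this (a sketch only; days_to_event_or_exit stands for whatever variable holds days from baseline to the event or to study exit):

                Code:
                // sketch only: one record per person, exact days as the timescale
                stset days_to_event_or_exit, id(id) failure(event==1)

                // split each person's record at every observed failure time, so that
                // time-dependent covariate values can later be merged in at those times
                stsplit, at(failures)
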
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2



                • #9
                  Dear Mr. Samuels,

                  Thank you so much for valuable comments and guidance.
                  As suggested, I've looked at discrete-time models (logistic and cloglog); the author illustrates them using the following example.



                  PHP Code:
                  webuse cancer, clear
                  ge id = _n
                  lab var id "subject identifier"
                  recode drug 1=0 2/3=1
                  (48 changes made)
                  lab var drug "receives drug?"
                  lab def drug 0 "placebo" 1 "drug"
                  lab val drug drug

                  expand studytime
                  (696 observations created)
                  bysort id: ge j = _n
                  lab var j "spell month identifier, by subject"
                  bysort id: ge dead = died==1 & _n==_N
                  lab var dead "binary depvar for discrete hazard model"

                  ge lnj = ln(j)
                  ta j, ge(d)
                  ds d*
                  ge e1 = j <= 9
                  ge e2 = j >= 10 & j <= 17
                  ge e3 = j >= 18 & j < .

                  ta j dead
                  cloglog dead drug age e1 e2 e3, nocons nolog
                  logit dead drug age e2 e3, nolog
                  logit, or
                  1. In the above example there is one observation per person, and studytime (months to death or end of the experiment) is a whole number. But my real dataset has multiple records per person (repeated id), and times to event are given in years with decimals (e.g., 1.23, 4.65, etc.). In this case should I still expand the time to event, and how should I deal with the repeated id? What would you suggest?

                  2. Based on 2 baseline variables I've created 4 groups and would like to compare the risk of developing events across these 4 groups. But when I tested the proportional-hazards assumption using stphplot, by(gr) and estat phtest, I realized that the PH assumption is not met. Is this because of the small sample sizes in groups 3 and 4? What would be the best solution in this case?

                  Here is what I got:

                  HTML Code:
                  estat phtest, detail

                  Test of proportional-hazards assumption

                  Time:  Time
                  ----------------------------------------------------------------
                              |       rho            chi2       df       Prob>chi2
                  ------------+---------------------------------------------------
                        1b.gr |             .            .        1             .
                         2.gr |       0.05908         0.70        1        0.4030
                         3.gr |       0.15640         5.60        1        0.0180
                         4.gr |       0.06555         1.03        1        0.3107
                  ----------------------------------------------------------------


                  Data looks as below:
                  PHP Code:
                  clear
                  input float id int(x6 x7 time) byte(event baseline_age) float gr int(hdl sbp days_from_baseline)
                  1  48 66 1814 1 61 3 51 127    0
                   1  48 66 1814 1 61 3 51 125  349
                   1  48 66 1814 1 61 3 42 102 1823
                   1  48 66 1814 1 61 3 42 110 2558
                   2   0  0 2018 1 61 1 47 109    0
                   2   0  0 2018 1 61 1 46 117  364
                   2   0  0 2018 1 61 1 51 134 1832
                   2   0  0 2018 1 61 1 59 133 2556
                   3   0  0  712 1 61 1 43 141    0
                   3   0  0  712 1 61 1 45 147  371
                   3   0  0  712 1 61 1 49 139 1828
                   3   0  0  712 1 61 1 46 158 2568
                   4   0  0 1807 1 61 1 34 139    0
                   4   0  0 1807 1 61 1 28 126  359
                   4   0  0 1807 1 61 1 29 127 1822
                   4   0  0 1807 1 61 1 33 137 2557
                   6   0 13  448 1 56 4 30 136    0
                   6   0 13  448 1 56 4 40 136  370
                   6   0 13  448 1 56 4 45 148 1834
                   6   0 13  448 1 56 4 40 141 2557
                   7 181  0 2172 0 56 2 33 120    0
                   7 181  0 2172 0 56 2 21 117  357
                   7 181  0 2172 0 56 2 36 127 1825
                   7 181  0 2172 0 56 2 40 137 2562
                   8 192 25 2161 0 56 3 36 159    0
                   8 192 25 2161 0 56 3 39 140  357
                   8 192 25 2161 0 56 3 45 113 1819
                   8 192 25 2161 0 56 3 42 143 2532
                   9   0 59  471 0 76 4 43 129    0
                   9   0 59  471 0 76 4 43 112  366
                   9   0 59  471 0 76 4 41 112 1848
                   9   0 59  471 0 76 4 45 109 2558
                  10   0  3 2014 0 76 1 45 121    0
                  10   0  3 2014 0 76 1 45 125  369
                  10   0  3 2014 0 76 1 49  87 1832
                  10   0  3 2014 0 76 1 50 101 2562
                  end
                  label values event cens
                  label values gr gr
                  label def gr 1 "ER-/Pr-", modify
                  label def gr 2 "ER+/PR-", modify
                  label def gr 3 "ER+/PR+", modify
                  label def gr 4 "ER-/PR+", modify
                  Thank you in advance.
                  Best wishes,
                  Oyun
                  Last edited by Buyadaa Oyunchimeg; 10 Jul 2018, 19:14.



                  • #10
                    Thanks for exploring the alternatives. I need more information to recommend a course of action.

                    1. Please describe your study and cohort: the source of the data; how many people were enrolled; what the event of interest is; how many had the event; how long follow-up was; and the reasons for study exit (end of study? loss to follow-up? death from the primary cause of interest? death from competing causes?).

                    2. You present grouped data, which I recommended if you don't have dates. So I'll ask again: do you have dates for entry, exit, events, tests?

                    3. Is the measurement of albumin good enough for the values to be treated as continuous?

                    You stated that you want your time-dependent analysis to supplement a baseline analysis. My main suggestion is to do the baseline analysis first. That will be difficult enough; then build on that for the time-dependent analysis.
                    Steve Samuels
                    Statistical Consulting
                    [email protected]

                    Stata 14.2



                    • #11
                      Thank you for your reply, Mr. Samuels.

                      I will try to explain my data.

                      1. It is clinical trial data, and around 10,000 people were enrolled in the study. At the end of the study, 8,500 were alive and had not suffered the primary outcome (a cardiovascular event) or death. The mean follow-up period was 5.2 years (maximum 6.9 years).
                      The event of interest is end-stage kidney disease (in my case). There were 285 events recorded during the study period.
                      Reasons for study exit: the event or the end of the study. I have not yet treated death as a competing risk.

                      2. The data do not have exact calendar dates of entry or exit, but they contain the following variables: days from baseline; exit record (days from baseline to exit); days to event; and visit (days from baseline to each visit).


                      I agree with you. First, I would like to solve the problem with the baseline data analysis (question 2), then move to the time-dependent analysis.



                      Thank you.
                      Best wishes,
                      Oyun
                      Last edited by Buyadaa Oyunchimeg; 11 Jul 2018, 17:59.



                      • #12
                        Thanks for the explanation. Days from baseline is fine. How is the onset of end-stage kidney disease ascertained? Only at clinic visits? Do you also know the days from baseline to onset? If not, what information do you have about onset? I'm concerned that if it were ascertained only at clinic visits, then observation must be censored at the last visit.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2



                        • #13
                          Thank you for the response, Mr. Samuels.
                          The onset of end-stage kidney disease was assessed every 4 months and was defined as initiation of dialysis or transplantation, end-stage kidney disease, or a doubling of serum creatinine. We do have days from baseline to onset.

                          Sincerely,
                          Oyun
                          Last edited by Buyadaa Oyunchimeg; 12 Jul 2018, 17:45.



                          • #14
                            Great! Then you won't need a grouped-data analysis. Your next step is to create a data set with one record per person, with days to event or to exit from the study, and the status at end of study. This need not be just 0-1: you may want several non-zero indicators. You can start with 1 for renal failure and 0 for other statuses, but you will eventually want positive integers for statuses like death, or deaths from different causes.
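
                            As a rough sketch (days_to_event and exit_record here are stand-ins for whatever your variables are actually called):

                            Code:
                            // sketch only: reduce to one record per person
                            bysort id (days_from_baseline): keep if _n == _N

                            // days to renal failure if it occurred, otherwise days to study exit
                            gen t_exit = cond(event == 1, days_to_event, exit_record)

                            // status: 0 = censored, 1 = renal failure; later you can add, e.g., 2 = death
                            gen byte status = event == 1

                            stset t_exit, id(id) failure(status==1)
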
                            Last edited by Steve Samuels; 12 Jul 2018, 20:31.
                            Steve Samuels
                            Statistical Consulting
                            [email protected]

                            Stata 14.2



                            • #15
                              Thank you, Mr. Samuels, for your help.

                              I found a similar study in which a Poisson log-linear regression model was used to assess the risk of each outcome associated with baseline kidney function and albuminuria. Associations between follow-up levels of albumin, kidney function, and other risk factors were assessed using the pooling-of-repeated-observations method.

                              http://jasn.asnjournals.org/content/20/8/1813.short
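
                              My rough understanding of that approach is something like the following (a sketch only, assuming one record per person per follow-up interval, with the interval length in years and an event indicator for each interval; the variable names are placeholders):

                              Code:
                              // sketch only: person-interval records; exposure() supplies the person-time offset
                              poisson event_i i.gr baseline_age sbp, exposure(interval_years) irr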

                              Can this be considered a good alternative to the Cox model (when the PH assumption is not met)?

                              Thank you.
                              Regards,
                              Oyun

