Hello, I am looking for guidance on whether a data setup I plan to use is valid. Specifically, I am interested in the effect of time-varying variables on a recurrent dependent variable. I plan to follow the Andersen-Gill gap time setup, and subjects are observed on a weekly basis.
My data closely follow the Andersen-Gill Model setup described here: https://www.stata.com/support/faqs/s...ure-time-data/
where:
To implement the Andersen and Gill model using the results from the bladder cancer study, the data are set up as follows: for each patient there must be one observation per event or time interval. For example, if a subject has one event, then there will be two observations for that subject. The first observation will cover the time span from entry into the study until the time of the event, and the second observation spans the time from the event to the end of follow-up. The data for the nine subjects listed above is
. list if id!=10, noobs
In the original data, subjects 1 through 4 had no tumors recur; thus, each of these 4 patients has only one censored (status=0) observation spanning from time0=0 to end of follow-up (time=futime). Patient 5 (id=5) had one tumor recur at 6 months and was followed until month 10. This patient has two observations in the final dataset: one from time0=0 to tumor recurrence (time=6), ending in an event (status=1), and another from time0=6 to end of follow-up (time=10), ending as censored (status=0).
We stset the data with the command
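. stset time, fail(status) exit(time .) id(id) enter(time0)

                id:  id
     failure event:  status != 0 & status != .
obs. time interval:  (time[_n-1], time]
 enter on or after:  time time0
 exit on or before:  time time

        178  total obs.
          0  exclusions

        178  obs. remaining, representing
         85  subjects
        112  failures in multiple failure-per-subject data
       2480  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        59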
and we fit the Andersen–Gill Cox model as
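. stcox group size number, nohr efron vce(robust) nolog

Cox regression — Efron method for ties

No. of subjects =    85                    Number of obs =    178
No. of failures =   112
Time at risk    =  2480
                                           Wald chi2(3)  =  11.41
Log likelihood  = -449.98064               Prob > chi2   = 0.0097

(standard errors adjusted for clustering on id)

        _t |               Robust
        _d | Coefficient   Std. err.       z    P>|z|    [95% conf. interval]
     group |    -.464687    .2671369   -1.740   0.082    -.9882656   .0588917
      size |   -.0436603    .0780767   -0.559   0.576    -.1966879   .1093673
    number |    .1749604    .0634147    2.759   0.006     .0506699   .2992509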
My data closely follow the data structure displayed above. However, I have a dichotomous time-varying variable of interest that can change at subsequent follow-ups. For instance, it might be 0 at the first event occurrence of the DV, 1 at the second event occurrence of the DV, and then 0 when observed at the end of the observation period. Each subject has the same dataset entry and end time (i.e., follows the same observation period). However, according to the prescribed Andersen-Gill data structure, subjects have more dataset rows if they experience more events.
This Andersen-Gill setup is different from the survival analysis data structure that I'm usually familiar with, which creates a data frame where every row represents an event (or right censoring if there is no event). In normal survival analysis, subjects with 0 events have one row only, subjects with 1 event would have 2 rows, etc. Individual covariates are in columns along with the necessary start and stop time information, patient ID, etc.
The general data structure for time-varying covariates that I’m familiar with splits the rows according to when the time-varying covariate changes from 0 to 1 (or 1 to 0) – that is, the rows are defined by the variable and not by the event as in the data structure for recurrent events in the Andersen-Gill data structure shown above.
My question is: is it valid to use the Andersen-Gill data structure shown above if I am interested in estimating the effect of a time-varying variable (coded as 0/1) whose value is recorded each time the recurrent event is observed?
In other words, my dataset would look like:

But it would have another column where the time-varying variable would be, for example, 0 for the first observation of Patient ID 8, 1 for the second observation of Patient ID 8, 0 for the first observation of Patient ID 9, 1 for the second observation of Patient ID 9, and 0 for the third observation of Patient ID 10?
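To make the layout concrete, here is a minimal sketch of what I have in mind. The times and the values of the extra column are made up purely for illustration, and the variable name tvc is my own placeholder, not something from the FAQ. The sketch assumes that each row's tvc value applies over that row's whole interval (time0, time], and whether that assumption is appropriate is part of what I am unsure about.

```stata
* Hypothetical rows for two patients: times and tvc values are invented for illustration
clear
input id time0 time status tvc
8  0  5 1 0
8  5 12 0 1
9  0  3 1 0
9  3  9 1 1
9  9 12 0 0
end
list, noobs sepby(id)

* Sanity check: within a subject, each row should start where the previous one ended
bysort id (time0): assert time0 == time[_n-1] if _n > 1

* Same stset call as in the FAQ; tvc is assumed constant over each (time0, time] interval
stset time, fail(status) exit(time .) id(id) enter(time0)

* On the full dataset, the model would then add tvc to the FAQ's covariates, e.g.:
* stcox group size number tvc, nohr efron vce(robust) nolog
```

The stcox line is left as a comment because these five toy rows do not include the group, size, and number covariates.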
Hope this is clear. TIA.