Hi, I am at my wit’s end and hoping that somebody here can help with a problem I’m having getting a hazard model to run properly in STATA. The problem involves a time-varying independent dummy 'treatment' variable. Here are some relevant details:
The study is on the relationship between family visits in prison and prison violence. Essentially, we’re trying to look at whether risk of being violent in prison is lower (or higher) after family visits. We are not that interested in the timing between visits and violence, just changes in risk.
The data are prisoner-level data taken over a 2-year observation period (2009-20010). We have left and right censoring (some prisoners were already in prison when the observation period began, and some prisoners were still in prison when the observation period ended). The main problem we seem to be having is that many prisoners have multiple visits, some of them actually a lot, sometimes multiple visits in a week.The mean number of visits is 22, with a maximum of 121, minimum 0.
For every prisoner, we have dates for entry into prison, exit from prison, visits and violent events, plus a bunch of static controls. The data are organized by prisoner-event. So for each prisoner, we have a row for entry data and control variables, start of observation period, visit dates (if any), date of violent event, and exit from observation (either left prison or observation period ended).
So the data are structured as prisoner-event data, which looks like this (with value labels shown instead of numeric values):
IDvar timevar event_type age_at_entry crime_type visit_dummy
Prisoner1 1/1/09 entry 26 violent 0
Prisoner1 2/1/09 visit 26 violent 1
Prisoner1 3/1/09 visit 26 violent 1
Prisoner1 3/15/09 violence 26 violent 0
Prisoner1 6/1/09 release 26 violent 0
Prisoner2 2/16/10 entry 35 property 0
Prisoner2 3/8/10 visit 35 property 1
Prisoner2 9/1/10 release 35 property 0
.
.
There are 28,463 unique prisoners in the data and 366,477 observations (rows). There were 3440 violent events. Some prisoners have repeated risk for violence, but for now we are only including their first event. If we can get it to work then we might include subsequent events.
Here is the stset command and output:
stset timevar, failure(event_type==4) id(IDvar) origin(event_type ==1) enter(event_type ==5) exit(event_type ==6)
id: IDvar
failure event: event_type == 4
obs. time interval: (timevar[_n-1], timevar]
enter on or after: event_type ==5
exit on or before: event_type ==6
t for analysis: (time-origin)
origin: event_type ==1
----------------------------------------------------------------------
366477 total obs.
37241 multiple records at same instant PROBABLE ERROR
(timevar[_n-1]==timevar)
20475 obs. end on or before enter()
34 obs. begin on or after exit
----------------------------------------------------------------------
308727 obs. remaining, representing
28460 subjects
3440 failures in multiple failure-per-subject data
7862465 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 12710
Here are two cox regressions. The first one has two time-constant variables, age and offense type (as shown in data above). The second one adds the visit_dummy variable shown in the example data above (=1 when there is a visit, 0 otherwise).
. stcox age crime_type
failure _d: event_type== 4
analysis time _t: (timevar-origin)
origin: event_type ==1
enter on or after: event_type ==5
exit on or before: event_type ==6
id: IDNT_MAASAR
Iteration 0: log likelihood = -30421.854
Iteration 1: log likelihood = -30084.938
Iteration 2: log likelihood = -30077.562
Iteration 3: log likelihood = -30077.553
Refining estimates:
Iteration 0: log likelihood = -30077.553
Cox regression -- Breslow method for ties
No. of subjects = 28429 Number of obs = 308477
No. of failures = 3440
Time at risk = 7857475
LR chi2(2) = 688.60
Log likelihood = -30077.553 Prob > chi2 = 0.0000
----------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------+--------------------------------------------------
age | .9610433 .0017798 -21.46 0.000 .9575613 .964538
crime| 1.462947 .051784 10.75 0.000 1.364893 1.568046
So far, so good. Higher age associated with lower risk of violence, violent crime conviction associated with higher risk of violence.
Now here is the same model with the time-varying, frequently repeated visit dummy:
. stcox age crime_type visit_dummy
failure _d: event_type == 4
analysis time _t: (timevar-origin)
origin: event_type ==1
enter on or after: event_type ==5
exit on or before: event_type ==6
id: IDNT_MAASAR
Iteration 0: log likelihood = -30421.854
Iteration 1: log likelihood = -25230.114
Iteration 2: log likelihood = -25092.321
Iteration 3: log likelihood = -25049.136
Iteration 4: log likelihood = -25033.56
Iteration 5: log likelihood = -25027.871
Iteration 6: log likelihood = -25025.783
Iteration 7: log likelihood = -25025.015
Iteration 8: log likelihood = -25024.733
Iteration 9: log likelihood = -25024.629
Iteration 10: log likelihood = -25024.591
Iteration 11: log likelihood = -25024.577
Iteration 12: log likelihood = -25024.572
Iteration 13: log likelihood = -25024.57
Iteration 14: log likelihood = -25024.569
.
.
.
Iteration 50: log likelihood = -25024.569
Refining estimates:
Iteration 0: log likelihood = -25024.569
Cox regression -- Breslow method for ties
No. of subjects = 28429 Number of obs = 308477
No. of failures = 3440
Time at risk = 7857475
LR chi2(2) = 10794.57
Log likelihood = -25024.569 Prob > chi2 = 0.0000
-----------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------+---------------------------------------------------------
age | .9489049 .0018013 -27.63 0.000 .9453809 .9524419
crime | 1.910263 .0677006 18.26 0.000 1.782076 2.047671
visit_dummy | 7.90e-24 . . . . .
-----------------------------------------------------------------------------
The hazard ratio coefficient for visit_dummy is tiny, but more problematically it doesn’t compute standard errors. Is it possible that there are simply too many visits for it to deal with? Or too many ties (as indicated by the “PROBABLY ERROR” warning I got in the stset command?).
I have looked everywhere for resources on how to deal with this. It seems that time-varying dummy variables are not frequently used in survival models. I have seen suggestions that data with time-varying variables should be separated into separate observations for each status change. But having a visit doesn't really represent a lasting change in status, so this doesn't seem relevant in this case. Also given the number of visits, doing so would make the dataset very large. I would essentially have to make the data a prisoner-day dataset (or I might get by with prisoner-week). But I thought that STATA’s flexibility with dealing with stdata would obviate this necessity. Or maybe the modeling strategy is wrong and I don't even need a survival model?.
I would have thought that this would be a fairly common modeling issue in medical studies, since it is analogous to patients being treated, sometimes repeatedly, with some medication at particular intervals (except in my case it is visits rather than medications). But I couldn’t find anything.
Any suggestions you wizards of stata have will be extremely appreciated!!
EDIT: Sorry this got posted twice. The first time, it told me there was an error so I re-posted. Can't see how to delete one of them.
The study is on the relationship between family visits in prison and prison violence. Essentially, we’re trying to look at whether risk of being violent in prison is lower (or higher) after family visits. We are not that interested in the timing between visits and violence, just changes in risk.
The data are prisoner-level data taken over a 2-year observation period (2009-20010). We have left and right censoring (some prisoners were already in prison when the observation period began, and some prisoners were still in prison when the observation period ended). The main problem we seem to be having is that many prisoners have multiple visits, some of them actually a lot, sometimes multiple visits in a week.The mean number of visits is 22, with a maximum of 121, minimum 0.
For every prisoner, we have dates for entry into prison, exit from prison, visits and violent events, plus a bunch of static controls. The data are organized by prisoner-event. So for each prisoner, we have a row for entry data and control variables, start of observation period, visit dates (if any), date of violent event, and exit from observation (either left prison or observation period ended).
So the data are structured as prisoner-event data, which looks like this (with value labels shown instead of numeric values):
IDvar timevar event_type age_at_entry crime_type visit_dummy
Prisoner1 1/1/09 entry 26 violent 0
Prisoner1 2/1/09 visit 26 violent 1
Prisoner1 3/1/09 visit 26 violent 1
Prisoner1 3/15/09 violence 26 violent 0
Prisoner1 6/1/09 release 26 violent 0
Prisoner2 2/16/10 entry 35 property 0
Prisoner2 3/8/10 visit 35 property 1
Prisoner2 9/1/10 release 35 property 0
.
.
There are 28,463 unique prisoners in the data and 366,477 observations (rows). There were 3440 violent events. Some prisoners have repeated risk for violence, but for now we are only including their first event. If we can get it to work then we might include subsequent events.
Here is the stset command and output:
stset timevar, failure(event_type==4) id(IDvar) origin(event_type ==1) enter(event_type ==5) exit(event_type ==6)
id: IDvar
failure event: event_type == 4
obs. time interval: (timevar[_n-1], timevar]
enter on or after: event_type ==5
exit on or before: event_type ==6
t for analysis: (time-origin)
origin: event_type ==1
----------------------------------------------------------------------
366477 total obs.
37241 multiple records at same instant PROBABLE ERROR
(timevar[_n-1]==timevar)
20475 obs. end on or before enter()
34 obs. begin on or after exit
----------------------------------------------------------------------
308727 obs. remaining, representing
28460 subjects
3440 failures in multiple failure-per-subject data
7862465 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 12710
Here are two cox regressions. The first one has two time-constant variables, age and offense type (as shown in data above). The second one adds the visit_dummy variable shown in the example data above (=1 when there is a visit, 0 otherwise).
. stcox age crime_type
failure _d: event_type== 4
analysis time _t: (timevar-origin)
origin: event_type ==1
enter on or after: event_type ==5
exit on or before: event_type ==6
id: IDNT_MAASAR
Iteration 0: log likelihood = -30421.854
Iteration 1: log likelihood = -30084.938
Iteration 2: log likelihood = -30077.562
Iteration 3: log likelihood = -30077.553
Refining estimates:
Iteration 0: log likelihood = -30077.553
Cox regression -- Breslow method for ties
No. of subjects = 28429 Number of obs = 308477
No. of failures = 3440
Time at risk = 7857475
LR chi2(2) = 688.60
Log likelihood = -30077.553 Prob > chi2 = 0.0000
----------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------+--------------------------------------------------
age | .9610433 .0017798 -21.46 0.000 .9575613 .964538
crime| 1.462947 .051784 10.75 0.000 1.364893 1.568046
So far, so good. Higher age associated with lower risk of violence, violent crime conviction associated with higher risk of violence.
Now here is the same model with the time-varying, frequently repeated visit dummy:
. stcox age crime_type visit_dummy
failure _d: event_type == 4
analysis time _t: (timevar-origin)
origin: event_type ==1
enter on or after: event_type ==5
exit on or before: event_type ==6
id: IDNT_MAASAR
Iteration 0: log likelihood = -30421.854
Iteration 1: log likelihood = -25230.114
Iteration 2: log likelihood = -25092.321
Iteration 3: log likelihood = -25049.136
Iteration 4: log likelihood = -25033.56
Iteration 5: log likelihood = -25027.871
Iteration 6: log likelihood = -25025.783
Iteration 7: log likelihood = -25025.015
Iteration 8: log likelihood = -25024.733
Iteration 9: log likelihood = -25024.629
Iteration 10: log likelihood = -25024.591
Iteration 11: log likelihood = -25024.577
Iteration 12: log likelihood = -25024.572
Iteration 13: log likelihood = -25024.57
Iteration 14: log likelihood = -25024.569
.
.
.
Iteration 50: log likelihood = -25024.569
Refining estimates:
Iteration 0: log likelihood = -25024.569
Cox regression -- Breslow method for ties
No. of subjects = 28429 Number of obs = 308477
No. of failures = 3440
Time at risk = 7857475
LR chi2(2) = 10794.57
Log likelihood = -25024.569 Prob > chi2 = 0.0000
-----------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------+---------------------------------------------------------
age | .9489049 .0018013 -27.63 0.000 .9453809 .9524419
crime | 1.910263 .0677006 18.26 0.000 1.782076 2.047671
visit_dummy | 7.90e-24 . . . . .
-----------------------------------------------------------------------------
The hazard ratio coefficient for visit_dummy is tiny, but more problematically it doesn’t compute standard errors. Is it possible that there are simply too many visits for it to deal with? Or too many ties (as indicated by the “PROBABLY ERROR” warning I got in the stset command?).
I have looked everywhere for resources on how to deal with this. It seems that time-varying dummy variables are not frequently used in survival models. I have seen suggestions that data with time-varying variables should be separated into separate observations for each status change. But having a visit doesn't really represent a lasting change in status, so this doesn't seem relevant in this case. Also given the number of visits, doing so would make the dataset very large. I would essentially have to make the data a prisoner-day dataset (or I might get by with prisoner-week). But I thought that STATA’s flexibility with dealing with stdata would obviate this necessity. Or maybe the modeling strategy is wrong and I don't even need a survival model?.
I would have thought that this would be a fairly common modeling issue in medical studies, since it is analogous to patients being treated, sometimes repeatedly, with some medication at particular intervals (except in my case it is visits rather than medications). But I couldn’t find anything.
Any suggestions you wizards of stata have will be extremely appreciated!!
EDIT: Sorry this got posted twice. The first time, it told me there was an error so I re-posted. Can't see how to delete one of them.
Comment