
  • Proportional hazard rate Cox model in the discrete setting

    I am using Stata command stcox to run the Cox regression with time-varying covariates over years.

    Although the Cox model is generally a continuous-time duration model, I am basically dealing with a discrete case, since I have a new line in my data for each successive year (1, 2, 3, ... etc.) in which a covariate changes. Since the Cox model, as well as other types of survival models (e.g., Weibull), produces reasonable results in Stata on my data, I assume that the Cox model can also be applied in a setting such as mine.

    Does anyone know whether it is correct to use the stcox command in Stata for a discrete setting? Maybe there are some references I did not manage to find.

    Thanks a lot!
    Maria

  • #2
    Hi, I was considering the same problem. I think the answer basically depends on how many years you have in your data: the more you have, the more it is warranted to treat it as a continuous case. For (recent) examples of applications of Cox models in settings similar to yours (as I understand it), check out these papers:
    • Myers, D. 1997. Racial Rioting in the 1960s: An Event History Analysis of Local Conditions.
    • Braun, R and Koopmans R. 2009. The Diffusion of Ethnic Violence in Germany: The Role of Social Similarity.



    • #3
      From Rabe-Hesketh and Skrondal, Multilevel and Longitudinal Modeling Using Stata (a great reference book): in continuous-time survival data the exact survival and censoring times are recorded in relatively fine time units [...] discrete-time survival data are characterised by relatively few possible survival (or censoring) times, with many subjects sharing the same survival time.



      • #4
        It's not clear what your data look like. Your covariates are (re)measured once a year. How does the event information appear? On different lines? Do you have more exact dates for entry, outcome, and non-event exit?

        A discrete hazards model in Stata is fit by cloglog.

        Cox approximations for heavily tied data have been studied by Hertz-Picciotto and Rockhill (1997) and by Chalita et al. (2002). (I haven't looked at more recent literature.)

        The latter (p. 1228) suggest a rule of thumb for choosing between a continuous-time likelihood approach and a discrete method. I translate it to Stata usage below.

        Define \(f\) = the total number of failure events in the data; \(r\) = the number of distinct failure times (times at which failures occur); and \(n\) = the total number of individuals. The "proportion of ties" \(pt\) is defined as:

        \[
        pt = \frac{f - r}{n}
        \]

        If \(pt\) is:
        < 0.20, you should use stcox with the Efron approximation for ties;
        0.20 - 0.25, you can use either cloglog or stcox with the Efron approximation for ties;
        > 0.25, you should use cloglog.
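
        As a minimal sketch of how \(pt\) might be computed (assuming the data are already stset with an id() variable named id; all variable names here are placeholders):
        Code:
        quietly count if _d == 1                // f = total number of failure events
        local f = r(N)
        quietly levelsof _t if _d == 1          // distinct times at which failures occur
        local r : word count `r(levels)'
        egen byte tagged = tag(id)              // flag one row per individual
        quietly count if tagged                 // n = total number of individuals
        local n = r(N)
        drop tagged
        display "pt = " (`f' - `r') / `n'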


        Both publications used simplified proportional hazards models for their simulations: Hertz-Picciotto & Rockhill studied a two-group comparison in a constant hazard model with no censoring; Chalita et al. studied a single baseline covariate with uniform censoring in a Weibull model.

        Note that Chalita et al. found that the Breslow and Efron methods for dealing with ties were about equivalent, but Hertz-Picciotto and Rockhill found the Efron method superior.

        References:

        Hertz-Picciotto, Irva, and Beverly Rockhill. 1997. Validity and Efficiency of Approximation Methods for Tied Survival Times in Cox Regression. Biometrics 53, no. 3: 1151-1156.

        Chalita, Liciana VAS, Enrico A Colosimo, and Clarice GB Demétrio. 2002. Likelihood approximations and discrete models for tied survival data. Communications in Statistics-Theory and Methods 31, no. 7: 1215-1229.
        Last edited by Steve Samuels; 16 Sep 2015, 16:52.
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2



        • #5
          Thank you very much, Juta and Steven, for such useful replies, the formula, and the references!

          My data is organized in the following way:

          id; existence (i.e., in years); event_tag; covariate
          1; 1; 0; 3
          1; 2; 0; 6
          1; 3; 1; 9
          2; 1; 0; 10
          2; 2; 0; 12
          2; 3; 0; 34


          I have 114 failure events, 12 distinct time periods at which failure can occur, and 455 observations (154 subjects with time-varying covariates). If I apply the formula that Steven suggests, I get (114 - 12)/455 = 0.224, which suggests going for stcox. Does this look correct?

          I have also tried to use cloglog, but then the results are completely different from those with stcox, and many dummy variables are omitted.

          I am also thinking about the difference between the Cox model and the complementary log-log model in terms of the unspecified baseline hazard of the Cox model. Since the Cox model leaves the baseline hazard unspecified, I do not need to make additional assumptions about time, contrary to the complementary log-log model. Thus, I am wondering whether I get different results for stcox and cloglog, and better convergence with stcox, because of the higher flexibility of stcox?

          Thank you very much!
          Maria
          Last edited by Maria Koval; 17 Sep 2015, 07:50.



          • #6
            Since the Cox model leaves the baseline hazard unspecified, I do not need to make additional assumptions about time, contrary to the complementary log-log model.
            I submit that your sentence is a bit misleading. And I note that you didn't show us the specific cloglog model that you fitted. Let's assume that you fitted a cloglog model with a dummy (binary indicator) variable for each of the "12 distinct time periods" (and omitted the constant term), or fitted a cloglog model with a dummy (binary indicator) variable for 11 of the "12 distinct time periods" (and included the constant term). The baseline hazard in these specifications summarises the interval (period-specific) hazard, not the underlying continuous baseline hazard -- which of course is not identifiable without further assumptions. As Steve Samuels indicated earlier, this cloglog model fits the discrete-time analogue of the continuous-time proportional hazards model -- corresponding slope coefficients (coefficients on the explanatory variables) in each of the two models refer to the same coefficients in the underlying continuous-time model. This is the well-known Prentice-Gloeckler result. So, the "additional assumptions about time" refer to the specification of the interval (period-specific) baseline hazard, not to the underlying continuous-time baseline hazard. This may sound like pedantry, but I think the distinctions are important.
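
            As an illustrative sketch of the two specifications just described (assuming person-period data with a binary failure indicator fail, a period index period, and placeholder covariates x1 and x2):
            Code:
            * a dummy for every period, constant omitted
            cloglog fail x1 x2 ibn.period, noconstant
            * a constant plus dummies for all but one period
            cloglog fail x1 x2 i.period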

            On the remarks about the numbers of ties being relevant to the choice of discrete versus continuous-time model: thanks for the interesting discussion and references, Steve. My take, perhaps because of my social science background, is that the choice between approaches is closely related to the width of the intervals in interval-censored survival time data. [I'm thinking of the common situation where the underlying process operates in continuous time, but survival times are only recorded in discrete intervals (often of equal width -- "months" or "years" -- but not necessarily).] Clearly this is related to the ties issue -- wider intervals raise the likelihood of ties. But some reflection from a substantive point of view may be useful, even if it doesn't lead to a formal rule of the kind you relate. Compare modelling longevity data with expected survival times at birth of around 70 years (say). If the observed survival times are measured in years rather than in year-month-day form, there still may not be a problem using a continuous-time model: 1 year / 70 years is a small number. The data are grouped/discrete, but using a Cox model probably won't go too far wrong. But what if the observed times were grouped into decades? Or consider the length of time to reemployment for male workers beginning an unemployment spell. The average unemployment spell is around 50 days (say), but most unemployment histories record spells in months. 30 days / 50 days is a relatively large number. So, I might be comfortable using a Cox model for longevity data measured in years, but definitely not for unemployment spell data measured in months.

            The Breslow and Efron methods for dealing with ties are approximations, and in principle they don't have to be used (see the Stata help and manual entry for stcox, on the exactp option) ... though I agree that this might be impractical.



            • #7
              My own practice is similar to yours, Stephen: close attention has to be paid to how the data are recorded in relation to spell length. I think the value of the rule of thumb is to discourage use of a continuous-data method when the discrete method is clearly called for.

              The Chalita reference did study the exact Cox method and found it the best for continuous data. I erred in not stating this. I would rewrite my translation of their rule of thumb as follows.

              If \(pt\) is:
              < 0.20, use stcox with the exact method if feasible, else Efron or Breslow;
              0.20 - 0.25, use either cloglog or stcox with the exact method if feasible, else Efron or Breslow;
              > 0.25, use cloglog.
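
              As a sketch of the corresponding stcox tie-handling options (assuming the data are already stset; "covariates" is a placeholder varlist):
              Code:
              stcox covariates, exactp     // exact partial-likelihood method for ties
              stcox covariates, efron      // Efron approximation
              stcox covariates, breslow    // Breslow approximation (Stata's default)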

              Maria: I have one question about your data. Take your third data line in which the subject failed in year 3. When was the covariate measured?

              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2



              • #8
                Thank you very much, Stephen, for your explanation of the distinction about baseline hazards in Cox and complementary log-log models.

                I was trying to fit several models to my data: stcox, cloglog and other survival models (streg).

                However, stcox gives me the best model fit (also, many significant effects). Both Efron and Breslow produced similar results.

                streg with the Weibull distribution gave results similar to stcox, but still a worse fit.

                Finally, cloglog gives a very poor model fit with many non-significant effects.

                The model I was fitting in Stata is the following:

                For the Cox model: stcox covariates yeartrend i.industry, nohr vce(robust) (where i.industry are the industry dummies)

                For cloglog: cloglog event_tag covariates i.year i.industry, vce(robust) (where event_tag is the event dummy)

                I used the same data organization for both stcox and cloglog.

                My study is about firms’ risk of going bankrupt, where I observe whether or not firm i went bankrupt after N years of existence. The time-varying covariates are always measured annually and taken from firms’ annual reports (e.g., the firm’s sales, profits, financial leverage, etc.).

                Steve, thank you very much for all your help! In the third data line in which the subject failed in year 3, the covariate was measured for year 3 (i.e., for the whole year 3).

                Thanks a lot!
                Maria



                • #9
                  For Cox model: stcox covariates yeartrend i.industry (i.e., industry dummies), nohr vce(robust)

                  For cloglog: cloglog event_tag (i.e., event dummy) covariates i.year i.industry, vce(robust)
                  It's unclear from what you write that you fitted comparable models. I am surprised that Cox and cloglog models gave such very different results.

                  What is "yeartrend" and is it similar or different to "i.year"? Suppose you have a variable called "stime" which is the number of years the firm is observed at risk of bankruptcy, from when the firm started business until the year last observed (bankruptcy or right censored). This is not the same as calendar time. The non-parametric specification of the baseline interval/discrete hazard should be defined in terms of "stime", e.g. i.stime, not year.

                  Model comparability also depends on your having organised the data appropriately in each case. You haven't told us about that. For the 'easy estimation' method of fitting the cloglog model, each firm should contribute "stime" rows to your data set. The number of rows in the data set used in your stcox regression depends on your time-varying covariates, but you could in principle use the same data organisation as for the cloglog regression. Show us your stset command too. (A sketch of this data setup follows the suggested commands below.)

                  Assuming you have the data organised correctly, try the following:
                  Code:
                  stset                                   // redisplay the current survival-time settings
                  stcox covariates i.industry, nohr       // Cox model; "covariates" = your varlist
                  
                  cloglog event_tag i.stime i.industry    // discrete-time PH model; i.stime = baseline dummies
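
                  And a minimal sketch of the 'easy estimation' data setup described above (assuming a starting data set with one row per firm and the variables id, stime, and event_tag; the derived variable fail is hypothetical):
                  Code:
                  expand stime                        // each firm now contributes stime rows
                  bysort id: replace stime = _n       // re-index: year at risk 1, 2, ..., stime
                  bysort id: gen byte fail = event_tag * (_n == _N)   // event only in the final year
                  cloglog fail i.stime, nolog         // interval baseline via i.stime dummies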



                  • #10
                    Thank you very much, Stephen!

                    stset code is the following:

                    stset stime, failure(event_tag) id(id)

                    where stime is the number of years the firm is observed at risk of bankruptcy, id is the firm id, and event_tag is an indicator variable that equals 1 if the firm goes bankrupt and 0 otherwise.
                    I have stime rows for each firm in the sample. All my time-varying covariates take different values at each stime (i.e., I have all the necessary covariates for each year of the firm’s existence).

                    Thus, the data is organized as

                    id; existence (i.e., stime); event_tag; covariate
                    1; 1; 0; 3
                    1; 2; 0; 6
                    1; 3; 1; 9
                    2; 1; 0; 10
                    2; 2; 0; 12
                    2; 3; 0; 34

                    I have also tried to run the two separate regressions you suggested:

                    stset stime, failure(event_tag) id(id)
                    stcox covariates i.industry, nohr

                    cloglog event_tag covariates i.stime i.industry


                    Some of the results are similar; however, some of them are different. I have a large number of two-way and three-way interactions in my model, and also a lot of industry dummies.

                    For the Cox model I get results for each coefficient of interest, and effects for all industry dummies. However, when I run cloglog, some of the interaction effects are dropped because they “predict failure perfectly”, and the same happens with some of the industry dummies. Thus, when I run cloglog, I cannot test my theory fully.

                    p.s. When I run cloglog, I get all coefficients for the i.stime dummies positive and significant. Two of them are dropped because they “predict failure perfectly”. The constant term in the cloglog model is always positive and significant.

                    Last edited by Maria Koval; 21 Sep 2015, 09:08.



                    • #11
                      Maria, FAQ #12, for good reason, asks that you show not only your commands but all the Stata results, put between CODE delimiters for easy reading -- otherwise the results don't line up. Without these we cannot tell what is going on. CODE delimiters:
                      Top is [C O D E]
                      Bottom is [/C O D E]
                      but remove the spaces.
                      Steve Samuels
                      Statistical Consulting
                      [email protected]

                      Stata 14.2



                      • #12
                        I now think the "proportion of ties" guidelines are not useful in general: the problem is the \(n\), the number of observations, in the denominator. As \(n\) increases, the index will become very small, no matter how small the number of grouping intervals. So, a study with three grouping intervals could have a small index if \(n\) is large enough, leading to the recommendation that one should use stcox.

                        Maria: I think that you are wrong in thinking that a model with many significant variables is the same as one that fits the data well. You have a "high number of two-way and three-way interactions", so even with 114 failure events I suspect serious overfitting, which in turn triggered the "perfect fit" message from cloglog. I, like Stephen, wish that you had followed the direction of FAQ 14 and shown us commands and results, so that we could judge for ourselves.
                        Last edited by Steve Samuels; 27 Nov 2015, 12:23.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2



                        • #13
                          Dear Steve and Stephen, I would like to apologise for my silence. You have helped me a lot with all your suggestions on how to properly run a model using stcox or cloglog. I have now collected a larger sample and reduced the number of interactions in the model. I find consistent results with both stcox and cloglog. Thank you very much for your help!

