  • How to deal with left-truncated data and right censoring

    Hi all,
    I'm working with a parametric hazard model (using streg) to model acquisition of a driver's license, based on 8 years of data. The data are a random sample of all individuals, and all individuals are assumed to become at risk at age 18. My data is a random sample for the years 2003 to 2011, which gives a maximum of 8 rows of data per individual, and I have some 27,000 individuals in the data set.

    I'm stset-ing the data using:
    stset year, failure(drv_lic==1) origin(year18) enter(entry_year) id(id)

    drv_lic = 0 if the individual doesn't have a driver's license, and 1 otherwise.
    year18 is the year in which the individual turned 18 and became at risk.
    entry_year is the first year the individual is observed in the data.

    I've been reviewing the literature on left-truncated data and on left- and right-censored data. In my case I don't have left-censored data, since I know when all individuals became at risk (the year they turned 18); individuals who turned 18 before 2003 (the year my data set starts) are left-truncated but are included in my data. I don't have an interval-censoring problem either. However, there are individuals who leave the data set before they acquire a driver's license (they pass away, migrate to other countries, or simply have not acquired a license by the end of the period, 2011), so I do have right-censored data.
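
    To make the setup concrete, here is the declaration again with a quick check of what stset creates (a minimal sketch; _t0 records the delayed-entry time and _d the failure indicator):

        stset year, failure(drv_lic==1) origin(year18) enter(entry_year) id(id)
        list id year _t0 _t _d in 1/10    // _t0 > 0 flags delayed entry (left truncation)
        stdescribe                        // per-subject summary of entry, exit, and failures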

    My questions are:
    1- does using streg (after my stset) adjust the log likelihood for the left-truncation bias?
    2- if not, how should I proceed to do so?
    3- how is right-censored data treated by streg? Will I have bias from using streg as a result of the right censoring?
    4- should I use a model specification other than streg? (I need a parametric model.)

    Regards
    Sia

  • #2
    Welcome to the Forum. It would be appreciated if you would follow Forum etiquette and re-register using your real name (firstname lastname). It's easy: hit the Contact Us link at the bottom right of the screen and make the request. Also, please re-read the FAQ to learn more (hit the black strip at the top of the page).

    Please read the Manual entry for -stset- (especially around page 486 in [ST]) and you'll see that the enter() option will help you. Note that "left truncation" is called "delayed entry" by some people. After that, the issue is whether the specification of the -stset- command is correct given your data and variable names. At first glance, and given your explanations, your specification is OK. So it appears that the short answer to your Q1 is "yes", and Q2 is then not relevant. Q3: the failure() option is used to identify right-censored spells. But you write ...
    drv_lic = 0 if the individual doesn't have a driver's license, and 1 otherwise
    This is potentially confusing. Your estimation data set should consist of people who do not have a driving license when first observed and who are followed until either they get a license (but not thereafter) or they are lost from the study without being observed to get one. The event variable underlying the failure() specification summarises whether or not someone gets a license over the period he or she is observed.
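
    In a sketch (assuming, as you describe, that drv_lic stays at 1 in every year after acquisition, and using the variable names from your posts), the estimation sample could be built like this:

        bysort id (year): drop if drv_lic[1] == 1    // exclude those already licensed when first observed
        bysort id (year): gen cum_lic = sum(drv_lic) // running sum: 0 before the event year
        drop if cum_lic > 1                          // keep rows up to and including the event year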

    You claim not to have interval-censored data. However, it seems from your data description that you only have annual observations for each person (license status within a given calendar year?). If this is the case, you do have interval-censored data. That is not a problem, because these models are easy to fit. Why do you need a parametric model (by which I presume you mean: a parametric specification for the baseline hazard)?

    My data is a random sample for the years 2003 to 2011
    Be careful: perhaps you mean that you have a random sample each year from the stock of individuals (with or without?) licences. [If this is the case, it's not a random sample from the population, precisely because of the left truncation issue. Your sample of spells will contain over-representation of relatively long spells relative to the population distribution.] If your sample is a random sample from the population (all adults over the age of 18) in each and every year, then you don't have a stock sample but may still have left-truncated data -- for all individuals aged > 18 in the first year of your data window. But if it's this sort of sample, note my remarks about the estimation data set above. (It won't include all individuals and all years from the originally-sampled data base.)

    All in all, there are aspects of your data description that are unclear to me -- and they really matter.

    • #3
      Thanks Stephen for your fast reply.

      Sorry about the user name issue; I thought it should be a short user name, which in my case is the short form of my real name, Siamak Baradaran. I have done as you suggested and think the problem will be resolved soon. Thanks for pointing it out.

      I also want to thank you for your answers. About the interval-censoring issue: my data are as you describe (I think). The data come from annual tax reports plus some additional data from Statistics Sweden (driver's license holding, marital status, vehicle ownership, etc.), which report the variables for each individual for each specific year. You wrote ...

      .."It seems from your data description that you only have annual observations for each person (license status within a given calendar year?). If this is the case, you do have interval-censored data. That is not a problem, because these models are easy to fit"

      You are right about the annual data. In my model I include socio-economic covariates such as income, gender, age, and number of vehicles, as well as a few time-varying covariates on living situation (whether the person lives alone or with one or both parents) and on whether the person moves from a smaller city to a larger one (as a proxy for changed accessibility that might influence demand for a vehicle, and thereby for a license). As I mentioned, I use the streg command for this.

      1- Am I supposed to do something differently, given that I have annual data?
      2- If so, what?

      I need a parametric specification since I want to use the model to forecast future driver's license holding. I could also use a flexible parametric specification for the same aim, but as I understand it, the semi-parametric Cox specification could be problematic for that purpose, since it leaves the shape function (the baseline hazard) unspecified. I have estimated a Cox model (using stcox) as well as a flexible parametric model (using stpm2), but the parametric Weibull specification has so far had the best ll, AIC and BIC statistics.
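
      For reference, a sketch of the kind of comparison I ran (assuming the data are already stset; the covariate names here are illustrative; stpm2 is user-written, from SSC):

          streg income i.sex, dist(weibull)
          estat ic                                  // AIC/BIC from the full likelihood
          stpm2 income i.sex, df(4) scale(hazard)   // flexible parametric baseline
          estat ic
          * caveat: stcox maximizes a partial likelihood, so its ll/AIC/BIC are
          * not directly comparable with those from the fully parametric fits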

      The sample is made by first taking all unique individuals represented in the data set (in any of the years, identifiable by their unique IDs, which they of course keep throughout 2003 to 2011); in other words, I don't draw different samples for different years, if that is what you mean by the stock-sampling problem. From these I draw a random sample. This of course means that all individuals who entered and exited the population before 2003 are not represented. However, my understanding was that they would not matter in my case. Do you think differently?

      Warmest regards
      Siamak Baradaran

      • #4
        Hi again
        I forgot to mention the issue you brought up about drv_lic = 0. In my data set I have excluded all individuals who already had a driver's license, and I have also excluded observations on individuals after they got their license. Will that suffice?

        However, as you brought up the question, I came to think of another possible problem. Around 25 to 30% of all individuals in my data acquire their license in the same year they turn 18. Since I only have one row of data per individual and year, I can't distinguish their covariates before and after they got their licenses. In order to include them in the model, I copied their attributes for the year they got the license, changed the year to the year before, and changed the drv_lic variable to zero. This way I get two rows of data for them, and they are then included in the model. My assumption was that, since the independent variables other than drv_lic have not changed, this should not disturb the model, as their covariate values are constant between the two years. Would that work?
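
        In code, the manipulation amounts to something like this (a sketch only; lic_at_entry is a hypothetical flag for people whose first observed row is the year they got the license):

            bysort id (year): gen byte lic_at_entry = (_n == 1 & drv_lic == 1)
            expand 2 if lic_at_entry, generate(copy)   // duplicate the event-year row
            replace year = year - 1 if copy == 1       // a pseudo-row for the year before ...
            replace drv_lic = 0 if copy == 1           // ... in which no license is yet held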

        • #5
          Thank you for re-registering. Re questions 1 and 2 in post #3, you might like to look at the materials on Survival Analysis Using Stata at http://www.iser.essex.ac.uk/survival-analysis. There is extensive discussion of modelling of interval-censored (also referred to as discrete time, or grouped) survival time data, including what your data set-up should look like, and how to fit the models easily using standard commands in Stata. (There is a draft textbook; and hands-on Lessons.) You will see that it's also straightforward to fit models with a parametric baseline, so you can do post-estimation prediction and extrapolation (examples given).
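
          For orientation, a minimal sketch of the discrete-time set-up those materials describe (one row per person-year at risk, the outcome being whether the license is acquired in that year; variable names follow the earlier posts):

              gen dur = year - year18 + 1                 // year-of-risk number: 1, 2, ...
              cloglog drv_lic i.dur income i.sex, nolog   // duration dummies: flexible baseline
              * for a parametric (Weibull-like) baseline, replace i.dur with
              * c.lndur, where lndur = ln(dur)
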
          The matters that you raise in the last paragraph of post #4 are not very clearly explained to me (and what you say you're doing sounds odd). The issue that you describe seems, in part, related to the fact that you have grouped data. But another fundamental issue is what your underlying model is, in particular whether you think the hazard rate of getting a license in some year t is related to the year-t values of predictors or, say, the year t-1 values. Of course, if a predictor variable is constant (e.g. sex), this is not an issue. But it may be for other characteristics that vary over time. And if lagged predictors are the relevant ones, there is indeed a problem about what to do for the first year in which individuals are at risk! All in all, there is an issue of "science" (related to your substantive problem per se), as much as purely statistical modelling issues.
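
          And if you decide that the year t-1 values of the time-varying predictors are the relevant ones, they are easy to construct once the panel is declared (a sketch):

              xtset id year
              gen lag_income = L.income   // previous-year value; missing in a person's first year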

          • #6
            Hello, Siamak,

            There is an interesting book on the subject: Mario Cleves, William Gould, Roberto Gutierrez, and Yulia Marchenko, An Introduction to Survival Analysis Using Stata, 3rd edition, Stata Press, 2010.

            There is a chapter (Chapter 4) devoted precisely to censoring and truncation.

            Concerning left-truncation, the authors state it is "easy to deal with", not only in parametric models but also in semiparametric models.

            With regard to right-censoring (you said that's your main concern), the authors note the potential difficulty: "All that is known about a [right] censored observation is that failure occurs after time ti, yet exactly when remains unknown by definition".

            The options "origin", "enter" e "exit" for stset apply for these aspects.

            Best regards,

            Marcos

            • #7
              Thanks Stephen and Marcos for your replies; I will certainly follow your suggestions.

                • #9
                  Dear Stephen

                  While working with the driver's license model, a problem came up that I think is rather interesting. Since my data set included all individuals, I also had individuals who were very old and still had not acquired a license, and who were therefore still in the risk set. This of course gives very long left-truncation periods, and I wanted to see how the parameters and the survival functions change if I gradually dropped individuals with long left-truncation. As the standard model I ran a parametric Weibull setup ("streg age sex student, d(weibull)"), and what I could see was that:

                  1- the parameter values of course change a lot between the models, as do the ll, AIC and BIC.
                  2- the survival curve became more and more S-shaped the more I reduced the truncation.
                  3- at the same time, the survival function approached zero, and became zero once I had no left-truncated individuals left.
                  4- the number of iterations increased as more left-truncated individuals were removed from the data, and Stata had more difficulty estimating the model, with lots of "not concave" and "backed up" messages.
                  5- I couldn't see the same problem in models without the age variable. In those models there were some who survived, but not as many as expected.

                  This was of course a problem, since there were many individuals who still had not acquired a license and should have survived. After some further work I found that if I changed the distribution to exponential, the above problems were all solved.

                  I'm really new to survival analysis, but I couldn't stop thinking that it had something to do with the way the data were stset. I used:

                  "stset year, failure(drv_lic==1) id(id) origin(the year the individual became 18) enter(the first year she/he was observed) exit(drv_lic==0 1m the value was "." for all other years but the year the person either got license or was last time observed)"

                  The duration it takes to acquire a license is of course year - origin (origin being the year the individual became at risk), which is exactly (year - enter) plus the truncated years. So when I dropped the left-truncated individuals, the duration was exactly the number of years I had observations, which is also equal to the age variable.

                  So I wonder: is there a reason for the Weibull model, or any model other than the exponential (I later tested all the other distributions as well), to behave like that?

                  Best
                  Siamak

                  • #10
                    I haven't been following this thread, so this is about one statement in your last post.
                    I found that if I changed the distribution to exponential, the above problems were all solved.
                    This is not a solution: the exponential is an unbelievable distribution (it has just one parameter to describe both the mean and the SD), and the results from the Weibull model, a generalization of the exponential, are likely to reject the exponential. So your problems remain.
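
                    Concretely, the exponential is the Weibull with shape parameter p = 1, so it can be tested directly against the Weibull fit (a sketch, using the covariates from your post):

                        streg age sex student, dist(exponential)
                        estimates store expo
                        streg age sex student, dist(weibull)
                        lrtest expo .   // LR test of H0: p = 1 (the exponential); rejection discredits it
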
                    Last edited by Steve Samuels; 21 Jan 2015, 16:26.
                    Steve Samuels
                    Statistical Consulting
                    [email protected]

                    Stata 14.2

                    • #11
                      If you want a parametric model and are having problems with the Weibull model, maybe you should consider a flexible parametric model with restricted cubic splines.

                      There is a review article from the Stata Journal on the matter: http://www.statalist.org/forums/foru...he-return-list

                      Also, I suggest you try "stpm2".
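
                      A sketch of installing it and predicting the survival function afterwards (the spline df is illustrative):

                          ssc install stpm2
                          stpm2 age sex student, df(4) scale(hazard)
                          predict S, survival   // fitted survival function for each observation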

                      Best regards,

                      Marcos

                      • #12
                        Re post #9, addressed to me by Siamak: to be honest, I find your findings annoying rather than "rather interesting"! It is insufficiently clear to me whether the data points you refer to correspond to left-censored or left-truncated spells. It is "well known" that exponential models can be reliably fitted to continuous survival time data with left-censored spells, assuming that the exponential model is correct. But this is virtually always a heroic assumption: a baseline hazard that does not vary with elapsed duration is implausible for most applications, regardless of whether the data are censored or truncated. (This is what Steve Samuels is alluding to.) Second, I see that you are persisting with continuous-time survival models (you cite stset and refer to exponential and Weibull models), even though your data are interval-censored/grouped/discrete and you have been directed to sources that show you how to apply appropriate methods to this kind of data. [I therefore disagree with Marcos Almeida's suggestion to look at a different type of model for continuous-time survival data.] Siamak: you admit to being "really new in survival analysis", so please read -- and digest -- those sources. I think this is a more worthwhile activity for making substantive progress on your research project than "explorations" of the kind you seem to be doing.

                        • #13
                          Dear Stephen,

                          I have considered that my data could be discrete-time, but after reading an article by Trond Petersen, "Analyzing change over time in a continuous dependent variable: specification and estimation of continuous state space hazard rate models", I think that I should use continuous-time models.

                          Specifically, on the second page of his paper he says: "The first distinction to be drawn is between processes in continuous state space and processes in discrete state space. For the latter, which are failure time processes in discrete state space, methods for analysis of duration or event-history data apply, on which the literature is voluminous (see, e.g., Tuma and Hannan 1984, pt. 2). For continuous state space processes, a further distinction must be drawn between diffusion and failure time processes. In diffusion processes, the dependent variable is in constant motion; that is, it changes all the time and in small time intervals only in small amounts (see, e.g., Lamperti 1977, pp. 125-26). The sample paths of diffusion processes are therefore continuous functions of time. In continuous state space failure time processes, in contrast, the dependent variable remains unchanged for finite periods of time, but at the time of a change, it jumps to a new value, and the jump can be of any size and in any direction. The sample paths are hence discontinuous functions of time, but they have a finite number of points of discontinuity, each occurring when the dependent variable changes."

                          My data are in fact annually updated, but the covariates might have jumped to their new values at any time between the observations, which is why I think I have continuous-time data. Please let me know if I should think otherwise.

                          Best
                          Siamak

                          • #14
                            I think you're misinterpreting Petersen. Even if the underlying process is continuous, as it no doubt is in your case, the fact is that the data come to you in interval-censored form -- and you need to take account of this. (This is discussed in my materials.) Don't confuse the underlying data-generating process with the data measurement process. What I am suggesting you do in your modelling of annual data is entirely consistent with the hazard rate of license take-up operating in continuous time (days rather than years, say).

                            For instance, the cloglog version of the discrete-time proportional hazards model fits exactly the same slope coefficients as the corresponding underlying continuous-time proportional hazards model. The relationship between the baseline hazard from the discrete-time model and the underlying continuous-time one is more complicated -- as it should be, because you don't observe what's happening within the intervals (a year, in your case). There are also decisions regarding the treatment of predictors whose values may potentially vary with survival time (i.e. within intervals). This is explained in my materials. (Progress can be made if you're prepared to assume that time-varying predictors are constant within intervals.)
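
                            For the record, a sketch of the algebra behind that claim: if the continuous-time hazard is proportional, h(t|x) = h_0(t) e^{x'\beta}, then the probability of failure within interval j = (t_{j-1}, t_j], conditional on reaching it, is

                                h_j(x) = 1 - \exp[ -e^{x'\beta} \int_{t_{j-1}}^{t_j} h_0(u) du ]
                                       = 1 - \exp[ -e^{x'\beta + \gamma_j} ],  where  \gamma_j = \ln \int_{t_{j-1}}^{t_j} h_0(u) du,

                            so that cloglog[h_j(x)] = \ln[-\ln(1 - h_j(x))] = x'\beta + \gamma_j: the same \beta, with the within-interval baseline absorbed into the interval constants \gamma_j.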

                            So, like a scratched vinyl record, I'm ending up repeating myself myself myself regarding my recommendations to you.

                            That is, I would fit your model using the methods I've already cited. [Note too that the underlying continuous-time baseline hazard can only be estimated from interval-censored data if you make -- potentially strong and unrealistic -- assumptions about the nature of the continuous hazard within the discrete intervals (years, in your case). In addition to the estimation approaches I've repeatedly reminded you of, you could also look at intcens (on SSC), but be aware that it does not allow predictors that vary with survival time.]

                            • #15
                              Thanks Stephen, I believe I understand your point now, even though it meant you had to repeat yourself, for which I apologize. I also want to thank you for your patience with a beginner.
                              Last edited by Siamak Baradaran; 22 Jan 2015, 09:30.
