
  • discrete-time hazard model with xtcloglog

    Dear Statalist,
    I'm trying to fit a discrete-time hazard model using a cloglog link:
    xi: xtcloglog y Dyear6 Dyear8 i.age*t marital urban children if gender==1
    y: a binary outcome (0 or 1)
    Dyear6 (7): dummies for years 6 and 7
    i.age (50 dummies for age)
    t: chronological time (12 years)
    demographics are dummies (0 or 1)
    The dataset is large (1.5 million observations, of which only 3,600 are positive outcomes).
    So I'm modeling a rare outcome with unbalanced individual panel data.
    Is there something in this model that makes ML estimation impossible?
    I've tried running the model with age groups but always get the "not concave" message during the iteration process.
    I would much appreciate it if someone could assist with this question.
    Thorhildur

  • #2
    thorhildo (please read FAQ #6 and re-register, as the forum prefers real first and last names. Thanks):
    At a very first glance, I would say that including 50 age dummies plus their interaction with time among the predictors is probably asking too much of your data.
    You may want to re-run your model with a smaller set of predictors first, then add the others back one at a time and see when the convergence problem appears.
    I would also recommend taking a look at Stephen Jenkins' excellent material on the topic you're interested in at: http://www.iser.essex.ac.uk/survival-analysis
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      "thorhildo": welcome to the Forum! Please re-read the Forum FAQ (hit the black bar at the top of the page), and note the request to register using your full name (firstname familyname) and why. It's easy to do: hit the Contact Us button at bottom right of screen and place your request. Please also note the FAQ recommendations about how to maximize the chances of getting helpful answers, and how to use CODE delimiters.

      You are not 100% clear about the nature of "time" in your specification. You need to clarify for us (a) the variables used to characterise the baseline hazard function, and (b) the variables used to describe calendar time. (What exactly is "chronological time"?)

      Assuming you have your data in person-year form (long format), the variables in (a) will summarise, for each period, how many years the person has been at risk of the event since first entering. (I assume you have no left-truncated spells.) The variables in (b) refer to calendar time, I presume.

      Be aware that each extra year of time at risk also means that the calendar year goes up by one -- so there is a perfect correlation, and it's hard to identify calendar-time and duration effects separately. You are advised to use as your calendar-time predictor a variable describing the calendar year corresponding to the first year at risk of the event. Hopefully there is variation across your sample in this (which you can check).

      My guess is that these sorts of issues are behind the convergence problems you are having. I conjecture they would also be present were you to fit the model using cloglog, just less apparent. (By also trying to fit unobserved heterogeneity, you are asking much more to be identified and, correspondingly, Stata will find it harder to converge.)
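      To make the distinction concrete, here is a minimal sketch of the two kinds of "time" in person-year data. The variable names (id, year, firstriskyear) are hypothetical, not from the original post:

      ```stata
      * duration at risk (for the baseline hazard) vs calendar time:
      sort id year
      by id: gen s = year - firstriskyear + 1   // years at risk so far (duration)
      gen entrycohort = firstriskyear           // calendar year first at risk:
                                                // varies across, not within, persons
      ```

      Unlike the current calendar year, entrycohort does not march in step with s within a person, which is what allows the two effects to be separated.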

      For discussion of discrete time proportional hazards (and other) models, with many Stata examples, see the "Survival Analysis Using Stata" webpages at http://www.iser.essex.ac.uk/survival-analysis

      Note also that xi: is not the best way to proceed in Stata 13 (which is what we assume you have unless told otherwise -- see the FAQ). You should use factor variables: help fvvarlist
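      For example, the factor-variable equivalent of the command in #1 would look something like this (a sketch only, using the variable names from the original post):

      ```stata
      * no xi: prefix needed -- factor-variable operators build the age
      * dummies and the age-by-time interaction directly
      xtcloglog y Dyear6 Dyear8 i.age##c.t marital urban children if gender==1
      ```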



      • #4
        From: Thorhildur Olafsdottir (I've sent a request to the forum to re-register with my full name)

        Dear Stephen,

        Many thanks! I'm using Stata 11.2.

        I will now clarify what variables are used to characterise the baseline hazard function.

        First: my data is in person-year format (unbalanced panel) and I know the duration of each person's spell (spell: years from age 25 without a specific health shock). The observation period is 12 years. I have a combination of a stock and a flow sample and, yes, there are left-truncated spells, since everyone aged 25-75 who is in the initial state of no health shock at the beginning of the observation period is eligible for the sample. Those turning 25 after the beginning of the observation period enter the data in the year they turn 25. In other words, I have a prospective follow-up of multiple cohorts ending in a single calendar year. I want to include calendar time as a continuous variable to account for the time trend of the health shock. I am thus using age and calendar time to characterise the baseline hazard function.

        Here's my idea for modeling the baseline hazard function. By including age as dummies in the model, each age cohort has a separate baseline hazard (intercept), and by interacting age with time (a continuous variable taking on values 1-12) I model the different slopes of the hazard for each cohort (as the health shock is more likely to occur with progressing age).
        I'm particularly interested in whether the conditional probability of the health shock changed in years 6, 7, and 8, and therefore I include dummies for those years to see if they depart from the time trend. This may introduce problems into the model with age modeled as dummies; I'm not sure. I might be forced to use age as a continuous variable in the model, even though the dummy specification of age is more appealing, I think.

        1. Does the characterisation of the baseline hazard as I describe it make sense?

        1. By including unobserved heterogeneity, do you think it is impossible to estimate this model? I'm unfamiliar with analyzing panel data without addressing serial correlation within each individual (FE or RE models).

        2. I can estimate the model with xtreg. By estimating the model with xtreg (a linear probability model) (and preferably xtreg, fe, as this is the first of two models, the latter including mediators that suggest a selection process) -- would I still be able to address the bias in my estimate of interest caused by censoring and left-truncation? I haven't come across many applications using discrete-time linear probability hazard models. (By this question I'm referring to the log-likelihood function that includes information from (a) uncensored and (b) censored individuals and is maximized by ML with the logit or cloglog link functions.)

        I tried changing the code to the following, but it didn't help. cloglog did give results, though, but the coefficients for age were not as expected.
        xtcloglog y i.age##c.t y87 y88 y89 married urban children if gender==1

        Many thanks,
        Thorhildur



        • #5
          Dear Carlo, many thanks for your response. I have tried what you suggested, but it doesn't seem to solve my problem, at this stage at least.
          I'm familiar with the links that you and Stephen suggest as reading/practice material. It is excellent material and I recommend it as a starting point.
          Best wishes,
          Thorhildur



          • #6
            I am thus using age and calendar time to characterise the baseline hazard function.
            Q1. Still unclear to me. To be 100% clear: is everyone aged 25 in the year that they are first observed (not the same as first at risk)? So aged 26 in the second year observed, etc.? If so, age and calendar time are collinear. Given left truncation, the survival-time variable relevant to the specification of the baseline hazard function is not age when first observed, but the number of years at risk of the event. E.g. if someone starts being at risk at age 21 but is first observed at age 25, the duration count in the year they are aged 25 is 5 (or maybe 4, depending on the convention you are using regarding censoring). See my webpages on how to set up the data in the left-truncated case. The variables you include in your cloglog regression to summarise duration dependence are then functions of this integer survival-time count (which we might call s, to distinguish it from calendar time measured by t).
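            In code, the duration count s for the left-truncated example just given might be constructed like this (variable names are hypothetical; the +1 reflects one particular censoring convention):

            ```stata
            * at risk from age 21, first observed at age 25 -> s = 5 in that year
            gen s = age - agefirstatrisk + 1
            ```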

            So, is "year 7 and year 8" referring to an expected effect in years 7 and 8 of the spell (something to do with s), or in calendar years labelled for some reason 7 or 8 (something to do with t)?

            I still maintain that using a set of binary indicators for "calendar year when first at risk" is the way ahead in order to look at the effects of "calendar time", t. I don't see how a "time trend" is properly identified separately from duration dependence (related to s) -- apart from via some peculiarities of functional form.

            If you have a linear baseline dependence function (enter s itself in the regression rather than some function of it) as well as dummies for s = 7 and s = 8, then the baseline hazard can increase or decrease at a constant rate per year, or be constant -- except for shifts in those 2 critical years of the spell.
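            That specification might be sketched as follows (assuming a duration variable s; the other variable names are hypothetical stand-ins for the poster's covariates):

            ```stata
            * linear duration dependence, shifted in spell-years 7 and 8,
            * with "calendar year first at risk" indicators per the advice above
            gen byte s7 = (s == 7)
            gen byte s8 = (s == 8)
            cloglog y c.s s7 s8 i.firstriskyear married urban children if gender==1
            ```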

            1. By including unobserved heterogeneity, do you think it is impossible to estimate this model? I´m unfamiliar with analyzing panel data without addressing serial correlation within each individual (FE or RE models).
            You cannot combine left truncation with unobserved heterogeneity in the way you hope. The people who survive in the risk set until the year first observed are a "selected" sample (you get an over-representation of "long stayers"), so estimates will be wrong. Note that streg no longer allows you to combine left truncation with unobserved heterogeneity/frailty (unless you force it to). It's for the same reason, though in the continuous-survival-time context. So I don't think you can use xtcloglog. Stick with cloglog.

            2. I can estimate the model with xtreg. By estimating the model with xtreg (linear probability model) (and preferably xtreg, fe
            Fixed-effects models (as econometricians think of them) don't work in the survival-time context. (I think there may be a note somewhere in my web MS about this. The exception is when you have repeated spells on the same person -- Google Paul Allison's papers about this case. In short, the observation unit for the FE in the FE model is the individual, not the person-year. With a single spell for each person, you can't do the differencing out.)

            This leads instead to random-effects models (as econometricians think of them), which is precisely what xtreg, re is (and so is xtcloglog). But, as I said above, you can't estimate models with frailty if you have left-truncated data (without (a) writing your own ml maximize, and (b) having data about the period from when first at risk until first observed). So, if you want to fit a linear probability model, you are left with regress, not xtreg. But, as you anticipate, few would entertain use of this model in this context. Stick with cloglog.
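            Following the advice above, one workable sketch is a pooled cloglog on the person-year data, with standard errors clustered on the individual to acknowledge the repeated observations per person (id and s are hypothetical names for the person identifier and duration count):

            ```stata
            * pooled discrete-time PH model, no frailty; cluster-robust SEs
            cloglog y i.s married urban children if gender==1, vce(cluster id)
            ```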

            PS thanks for re-registering.



            • #7
              Thorhildur:
              thanks for re-registering.
              Sorry I wasn't that helpful.
              I do hope that Stephen's comprehensive advice will light your research path.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                Many thanks. The way to address the left-truncation in the data is still unclear to me.
                I did look for an example on how to set up the data in the left-truncated case without luck.

                To answer your question on year7 and year8:
                Year7 and year8 refer to an expected effect in calendar year (t). I therefore suggested entering t in the regression (a linear baseline dependence function). There would be a shift in those 2 critical years of the observed period (12 years). If I were to create binary indicators for calendar years, I'm not sure how I could model the effect of those 2 critical years on the conditional probability of an event occurring.
                1. Does a spell length have to be identical with the number of observations per individual in the data? If that is the case, how does one incorporate information on spell length prior to the beginning of the observation period (t1)?
                For purposes of clarity, an example: if an individual is 64 at the beginning of the observation period (t1), he has a spell length of 39 at that time (25 when first at risk -- which is the lower age restriction of the individuals in the dataset). I know whether the event has occurred prior to time t1 and thus only include those in the initial state at time t1. So the idea was to model duration dependence by creating binary indicators for each age (representing cohorts). If this individual enters the data in year t1 and has an event in, say, t7, his age in the data is 64, 65, ..., 70. He is in the dataset for seven periods: t1, t2, ..., t7.
                Another individual might enter the data in year t4, as he just turned 25 in that year. His spell length will be 8 years and he will be 33 at t12. I have data on this individual in t4, t5, ..., t12, given he is censored. The calendar-time variable will take on the values 4(1)12.
                1. To estimate models with frailty when I have left-truncated data, it is not enough to have data about the period from when first at risk until first observed (that is, spell length, not predictors) -- I would also have to write my own ml maximize?
                2. I wonder what the benefits of fitting a hazard model in the context of duration analysis are, compared to a logistic regression with the same setup of the data (person-year). Maybe the bias due to left-truncation would be ignored but censoring would be accounted for in the likelihood function?
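                The setup described above could be sketched like this, starting from one row per person (all variable names are hypothetical; dropping the pre-sample rows conditions on survival to t1 rather than modelling the pre-sample period):

                ```stata
                * one row per person: riskstart (year first at risk), obsstart (t1),
                * lastyear (event or censoring year), event (1 if spell ends in event)
                gen nyears = lastyear - riskstart + 1
                expand nyears                                 // one row per year at risk
                bysort id: gen year = riskstart + _n - 1      // calendar year
                gen s = year - riskstart + 1                  // duration at risk
                gen byte y = (year == lastyear & event == 1)  // 1 only in the event year
                drop if year < obsstart                       // left truncation: unobserved years
                ```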
                Many thanks,
                Thorhildur



                • #9
                  The way to address the left-truncation in the data is still unclear to me.
                  I did look for an example on how to set up the data in the left-truncated case without luck.
                  This suggests that you didn't look at all the materials on the Survival Analysis website that I directed you to. Look at section 6.2 of "Survival Analysis", an unpublished MS downloadable from the site. [As the FAQ suggests, remarks like "didn't work" or, as here, looked "without luck" are not helpful. You are advised to provide specific information about where you looked and what you did and did not find, especially when you've had particular recommendations.] Related: this is probably my last post on this topic. For further advice from me, see http://www.cemmap.ac.uk/event/id/1054

                  1. Does a spell length have to be identical with the number of observations per individual in the data? If that is the case, how does one incorporate information on spell length prior to the beginning of the observation period (t1)?
                  Please read the materials cited regarding data organisation.

                  1. To estimate models with frailty when I have left-truncated data, it is not enough to have data about the period from when first at risk until first observed (that is, spell length, not predictors) -- I would also have to write my own ml maximize?
                  As I said before, you need both data about the period before the stock-sampling date and to write your own maximisation routine. The point is that your likelihood function in the "left-truncation + frailty" case has to take account of the dynamic self-selection process that leads some subjects who entered the risk set before the stock-sampling date not to survive until that date -- so that, correspondingly, those who did survive long enough to be stock-sampled are a non-random sample from the population of entrants.

                  2. I wonder what the benefits of fitting a hazard model in the context of duration analysis are, compared to a logistic regression with the same setup of the data (person-year). Maybe the bias due to left-truncation would be ignored but censoring would be accounted for in the likelihood function?
                  This appears to confirm that you have not read the related literature before posting again (I'm not referring simply to my own materials*, but to the many others I borrowed from when writing them). If you had read it, you would know that the expansion of the data set to person-year form is a trick that enables one to maximize the correct likelihood for a discrete-time survival model. The choice of logit or cloglog (or indeed probit) is simply a choice about which particular survival model you want. logit models used to be used a lot, I think, because logit commands were widely available in software; but cloglog programs are routinely available now. Using cloglog has the particular advantage (to me) of a proportional-hazards interpretation. [Again, see the literature!]

                  * see also my "Easy ways to estimate discrete time duration models", Oxford Bulletin of Economics and Statistics, 1995. This covers the random sample case, as well as left-truncation (though, unfortunately, the latter was incorrectly referred to as 'left censoring' in the article -- it was before I knew better ... and the editors/referees let it through!)
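                  In other words, on the same expanded person-year data the link function is the only thing that changes (s is the duration count; x1 and x2 stand in for the covariates):

                  ```stata
                  * same data, same likelihood trick, different survival model
                  logit   y i.s x1 x2    // discrete-time logistic hazard
                  cloglog y i.s x1 x2    // discrete-time proportional hazards
                  ```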



                  • #10
                    Many thanks for your very helpful advice. It is much appreciated.
                    I agree about the impression the last question gave.
                    Best regards,
                    Thorhildur



                    • #11
                      Following this discussion, I wish to request assistance on how to implement a discrete-time proportional hazards (cloglog) model to model disadoption of IPM. I collected endline survey data on adoption (dummy), year of adoption, disadoption (dummy), and year of disadoption, together with other covariates, but I do not know how to organize this dataset for duration analysis.
                      Thanks

