Problems with duration / survival analysis data: transformation to time-span data?

Tomas Machalicek

Join Date: Jan 2016

Posts: 2
#1

24 Jan 2016, 03:52

Hello everyone,

I am conducting a survival analysis of unemployment microdata. I receive a monthly extract from a local unemployment database. So far I only have data for two months, November and December 2015, but I will receive more as time goes on. My task is to conduct a duration analysis of individuals' unemployment.

My question is how to correctly and most efficiently transform this kind of data into duration data on which I can use stset.

The data is as follows [picture 1]:

time: yearmonth when data received
id: (date of birth)
id2: combined id with BEGIN_REG EDUC FEMALE (in case people are born on the same day)
FEMALE: gender
age
BEGIN_REG: start of registration at the unemployment office
END_REG: end of registration (i.e. leaving unemployment)
evid_length: length of registration at the unemployment office, i.e. how long the person has been unemployed (in days)
LAST_EMPL: place of last employment
REASON: for leaving unemployment
EDUC: education
censor: censored data (if END_REG = .; then the person has not yet left unemployment).
... other indicators are not related to my question

In my data, the same person appears in the database every month, and evid_length keeps increasing (as the length of his or her unemployment spell increases).

My initial thought was to use snapspan:

snapspan id2 time END_REG evid_length censor, gen(time0) replace

But then I get the following error:

14 subjects have 28 duplicate time values
it is unclear which record to use at the specified time
perhaps
1. id2 is wrong and the records are not really for
the same subject, or
2. time is wrong and one record occurs after the other
r(459);

And I am not sure how to proceed. I tried several variations of the snapspan command, but I think this one is the most correct. I thought of using evid_length as the timevar in snapspan, but I think that is incorrect and that it belongs in the varlist, since it is the variable that is observed at each time and always changes.

My other thought was to always drop the earlier observations and keep only the latest one:

foreach id2 == `n'{
drop if id2 == `n'& time==201512* evid_length > time==201511*evid_length
}

or

drop if time==201511*id2==time==201512*id2 & time==201512* evid_length > time==201511*evid_length

But I am not sure whether this would be the correct procedure for duration analysis, and those attempts were unsuccessful anyway. The loop did not work and the drop command dropped only a few observations.
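For reference, the "keep only the latest record" idea can be written without a loop. The following is only a sketch under stated assumptions: that id2 identifies one person (one spell), that time is a numeric yearmonth such as 201511 or 201512, and that END_REG is missing for spells still in progress. The variable name exited is hypothetical and does not exist in the data described above.

```stata
* Sketch only, not a tested solution for this dataset.
* Assumes id2 identifies one person/spell and time is a numeric yearmonth
* (e.g. 201511, 201512), so sorting on time puts the latest record last.
bysort id2 (time): keep if _n == _N

* exited is a hypothetical name: 1 if the spell has ended, 0 if still open.
generate byte exited = !missing(END_REG)

* Declare single-record survival data with duration measured in days.
stset evid_length, failure(exited)
```

If two records share the same id2 and time, which of them is kept is arbitrary. That is the situation snapspan complains about, so it may be worth running duplicates report id2 time first to see whether id2 really is unique per person.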

This is my first time working with this type of data. I have read up on it, including the stata manuals, etc. I will be happy for any kind of help and/or advice!

1 Photo
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

24 Jan 2016, 11:47

Hello Tomas,

Welcome to the Stata Forum!

It seems you need to do at least 2 tasks before the survival analysis:

1. Generate and format the time variable. Please type:

Code:

. help datetime

2. Perform the - stset - command. For this you'll need to include the time variable (the one you created and formatted) as the time-to-event variable, and select the "failure" variable. I didn't find any in the snapshot of your data set; I kindly ask you to use the CODE delimiters and present the information as suggested in FAQ #12. Please type:

Code:

. help stset

Hopefully that helps.

Best,

Marcos

Stephen Jenkins

Join Date: Apr 2014

Posts: 1438
#3

24 Jan 2016, 16:18

Tomas: you appear to have monthly survival time data, but I presume that individuals can move into and out of unemployment on any given day. Put differently, it seems to me that you have grouped (interval-censored) duration data, not continuous duration data on unemployment spells. If this is the case, I think you should be looking to use discrete-time survival analysis methods (methods for interval-censored/grouped data). All of Stata's st commands are designed for use with continuous-time data; ditto snapspan. See the "Survival Analysis Using Stata" webpages for some freely downloadable materials, including discussion of the distinction and examples: http://www.iser.essex.ac.uk/survival-analysis
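A minimal sketch of the person-period setup commonly used for discrete-time models might look as follows. It is an illustration under assumptions, not a recipe for this particular dataset: it assumes one row per person (id2), evid_length in days, END_REG missing for spells still in progress, and FEMALE and age stored as numeric variables. The names exited, months, period, y and ln_period are hypothetical, the 30.4375 days-per-month conversion is an approximation, and the complementary log-log model with a log-time baseline is just one possible specification.

```stata
* Sketch only: assumes one row per person (id2), evid_length in days,
* and END_REG missing for spells still in progress.
generate byte exited = !missing(END_REG)
* Spell length in monthly intervals (at least one interval per person).
generate int months = max(1, ceil(evid_length/30.4375))

* One row per person per month at risk.
expand months
bysort id2: generate int period = _n

* Outcome is 1 only in the final month of a completed spell.
bysort id2: generate byte y = exited & (_n == _N)

* Complementary log-log hazard model with a log-time baseline.
generate double ln_period = ln(period)
cloglog y ln_period FEMALE age
```

The expanded data have one binary outcome per person-month, so any binary-response command can be used in place of cloglog, and the baseline hazard can be specified more flexibly, for example with period dummies.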
Parijat Joy

Join Date: Nov 2015

Posts: 2
#4

13 Nov 2016, 18:18

Thanks Marcos and Stephen for addressing this problem.

I am getting the same error message for a dataset, in which the time variable 'days' is the number of days counted from a random date before the data collection date. The individual observations represent admission events to hospitals. Each patient has a unique identifier 'visit' and each admission has a unique identifier 'key'. Some patients will have multiple admissions; implying for a single value of 'visit' there may be more than one value of 'key' depending on the number of admissions this patient had. Moreover, the date from which 'days' is counted is randomly selected for each patient with the unique 'visit'.

My assumption is that this is a snapshot dataset, as there is a single time variable 'days' that indicates the n-th day on which an admission took place. The variable 'key' changes with each new admission. So to convert this data from snapshot to time-span form, I ran the command:

snapspan visit days key

This results in an error message:

1291 subjects have 2617 duplicate days values
it is unclear which record to use at the specified time
perhaps
1. visit is wrong and the records are not really for
the same subject, or
2. days is wrong and one record occurs after the other

My thinking is that patients with more than one admission (that is, more than one 'key') should have more than one value of 'days', each counted from the single randomly assigned date before the study period and each marking when one admission occurred. This is how the dataset is structured, so it should not generate an error message.

Anyway, I tried the steps for dropping "duplicate" values described in this Stata FAQ,
http://www.stata.com/support/faqs/da...d-time-values/

and got these:

duplicates report visit days

Duplicates in terms of visit days

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |       913733             0
        2 |         2614          1307
        3 |            3             2
--------------------------------------

bysort visit days: assert _N == 1
1308 contradictions in 915041 by-groups
assertion is false

duplicates tag visit days, gen(isdup)

Duplicates in terms of visit days

. edit if isdup

. drop isdup

However, even after dropping the duplicates (which perhaps is not the right thing to do),
I still get the same error message when trying snapspan again:

snapspan link days key

1291 subjects have 2617 duplicate days values
it is unclear which record to use at the specified time
perhaps
1. visit is wrong and the records are not really for
the same subject, or
2. days is wrong and one record occurs after the other

Please advise on how to convert this dataset from snapshot to timespan.
The subsequent task is to use stset, to perform time to event analyses.

Thank you very much for your help :-)
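One detail in the steps above may explain why the error persists: drop isdup removes the tag variable itself, not the tagged observations, so the data passed to snapspan are unchanged. A hedged sketch of how the duplicates could be inspected and, only if they really are redundant, removed is given below. Whether dropping is appropriate at all depends on why two admissions share the same visit and days values, which the output shown here does not reveal.

```stata
* Sketch only; inspect the tagged records before deciding to drop anything.
duplicates tag visit days, generate(isdup)
sort visit days
list visit days key if isdup > 0, sepby(visit)

* Keeps the first record in each (visit, days) group and drops the rest.
* force is needed because the records differ on other variables (e.g. key).
duplicates drop visit days, force
drop isdup

snapspan visit days key
```

Which record counts as "first" depends on the current sort order, so if the two admissions on the same day are both meaningful, an alternative is to keep both and make the time values distinct rather than discarding one.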

Last edited by Parijat Joy; 13 Nov 2016, 18:23.