Hello everyone,
I am conducting a survival analysis on unemployment microdata. I receive a monthly extract from a local unemployment database. So far I have data for only two months, November and December 2015, but I will receive more as time goes on. My task is to conduct a duration analysis of individuals' unemployment spells.
My question is how to correctly and most efficiently transform this kind of data into duration data on which I can use stset.
The data is as follows [picture 1]:
time: yearmonth when data received
id: (date of birth)
id2: combined id with BEGIN_REG EDUC FEMALE (in case people are born on the same day)
FEMALE: gender
age
BEGIN_REG: start of registration at the unemployment office
END_REG: end of registration (i.e. leaving unemployment)
evid_length: length of registration at the unemployment office, i.e. how long the person has been unemployed (in days)
LAST_EMPL: place of last employment
REASON: for leaving unemployment
EDUC: education
censor: censoring indicator (if END_REG == ., the person has not yet left unemployment)
... other indicators are not related to my question
My data are structured so that the same person appears in the database every month, with evid_length increasing each time (as the unemployment spell gets longer).
My initial thought was to use snapspan:
snapspan id2 time END_REG evid_length censor, gen(time0) replace
But then I get the following error:
14 subjects have 28 duplicate time values
it is unclear which record to use at the specified time
perhaps
1. id2 is wrong and the records are not really for
the same subject, or
2. time is wrong and one record occurs after the other
r(459);
I am not sure how to proceed. I tried several variations of the snapspan command, but I think this one is the most correct. I considered using evid_length as the timevar in snapspan, but I think that is incorrect and that it belongs in the varlist, since it is the variable that is measured at each time point and always changes.
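To see which records are behind the duplicate-time message, something like the following might help. It is only a sketch using the variable names listed above; dup is a hypothetical helper variable that does not exist in the data.

```stata
* Count how many id2-time combinations occur more than once.
duplicates report id2 time

* Flag the offending records (dup is a hypothetical helper variable).
duplicates tag id2 time, generate(dup)

* List the flagged records side by side to see whether they are
* different people sharing an id2 or repeated rows for one person.
sort id2 time
list id2 time BEGIN_REG END_REG evid_length if dup > 0, sepby(id2)
```

If the flagged rows turn out to be different people, id2 would need an additional distinguishing variable before snapspan can be used.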
My other thought was to drop the earlier observations and keep only the latest one for each person:
foreach id2 == `n'{
drop if id2 == `n'& time==201512* evid_length > time==201511*evid_length
}
or
drop if time==201511*id2==time==201512*id2 & time==201512* evid_length > time==201511*evid_length
But I am not sure whether this would be the correct procedure for duration analysis, and those attempts were unsuccessful anyway. The loop didn't work and the drop command dropped only a few observations.
This is my first time working with this type of data. I have read up on it, including the Stata manuals. I would be grateful for any help or advice!
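For reference, the "keep only the latest record" idea described above could perhaps be sketched as follows. This is a rough sketch under several assumptions, not a verified solution: it assumes id2 identifies exactly one person, that evid_length in the last record equals the completed spell length for people who have left, and it introduces a hypothetical event indicator named exited.

```stata
* Keep only the most recent monthly record for each person.
* Assumes id2 uniquely identifies a person.
bysort id2 (time): keep if _n == _N

* Hypothetical event indicator: 1 if the spell has ended,
* 0 if the person is still registered (censored).
generate byte exited = !missing(END_REG)

* Declare the data as single-record survival-time data,
* with duration measured in days by evid_length.
stset evid_length, failure(exited) id(id2)
```

Note that collapsing to one record per person discards any covariate values that change from month to month, so it only suits an analysis with time-invariant covariates.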