A special case of survival analysis?

Monica Muller

Join Date: Jul 2014

Posts: 226
#1

A special case of survival analysis?

20 Feb 2016, 20:25

Hi All,

I have a dataset on the employees of an organization between 2013 and 2015, with several variables. Using this dataset, I want to model the probability of retention over time for future job applicants. In other words, I want to predict the probability that any given future applicant is retained at each point in his or her tenure.

The big problem in my data is that performance is one of the predictors of retention in this particular case, but I only have performance data for 2013-2015. Also, some employees started working in this company many years ago, some started after 2013, and some left their jobs at different points between 2013 and 2015.
So, I don't have an equal number of performance observations for everyone. I also don't have a common starting point for all of the employees.

The attached sample data is made up and simply shows all the different possibilities that occur in my dataset.

Is it possible to model retention with my data? Do you have any suggestions for me? I really appreciate your help.

Code:

input int female int age int start int left_2014 int left_2015 int perf2013 int perf2014 int perf2015 int tenure
1 54 2000 1 . 3 . . 16
1 50 2002 0 1 3 4 . 14
1 43 2008 0 0 4 4 5 8
0 36 2011 0 0 3 4 3 5
1 29 2013 1 . 2 . . 3
1 32 2013 0 1 3 3 . 3
0 30 2013 0 0 2 3 3 3
1 27 2014 . 1 . 4 4 2
0 26 2014 . 0 4 3 5 2
0 28 2015 . . 3 4 4 1
end

start shows the year each employee started at the company
left_2014 and left_2015 show whether the employee left in 2014 or in 2015
the perf variables show performance in each year
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

21 Feb 2016, 06:42

Hello Monica,

The theme (retention over time) is far from my field of expertise. That said, I fear some of your questions relate fundamentally to the theoretical background of survival analysis.

For these "general" aspects, I would stress that the Stata Manual is surely the best starting point. Still, I believe I can give you some help.

To name a few issues, with the respective recommendations:

You must create the "id" variable.
Data should be in long format for this sort of analysis. Therefore, you will have "performance" and "year" instead of "perf2013", "perf2014", etc.
If your "event" is "leaving the job", then instead of "left_2014" and "left_2015" I guess you'll need to create one variable - say, "left".
To cope with gaps, you may think about delving into the "counting process" (CP) data layout.
Consider including a time-varying covariate in the model.
Finally, I suggest you catch up with the options "start", "stop" (as well as "origin", "enter" and "exit") and have fun playing with them, until you get the desired effect.
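
A minimal sketch of the first three recommendations above, using the made-up data from #1. The names id, year, perf and left are illustrative choices, and treating left_2014 and left_2015 as year-specific event flags is an assumption about how the data are coded:

```stata
clear
input int female int age int start int left_2014 int left_2015 int perf2013 int perf2014 int perf2015 int tenure
1 54 2000 1 . 3 . . 16
1 50 2002 0 1 3 4 . 14
1 43 2008 0 0 4 4 5 8
0 36 2011 0 0 3 4 3 5
1 29 2013 1 . 2 . . 3
1 32 2013 0 1 3 3 . 3
0 30 2013 0 0 2 3 3 3
1 27 2014 . 1 . 4 4 2
0 26 2014 . 0 4 3 5 2
0 28 2015 . . 3 4 4 1
end

* one identifier per employee (one row per employee in the wide data)
generate long id = _n

* wide -> long: perf2013-perf2015 become perf, indexed by year
reshape long perf, i(id) j(year)

* single event indicator: the flag for that year, missing where
* no leaving information exists (2013, or years not observed)
generate byte left = cond(year == 2014, left_2014, cond(year == 2015, left_2015, .))

* perf is now a time-varying covariate, one value per employee-year
list id year start perf left if id <= 3, sepby(id) noobs
```

After this step each row is one employee-year, which is the layout the remaining recommendations build on.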

Hopefully that helps!

Best,

Marcos

Best regards,

Marcos
Stephen Jenkins

Join Date: Apr 2014

Posts: 1433
#3

21 Feb 2016, 07:08

You ask two things: (1) how to model retention with existing data; (2) how to predict retention with other data. I focus on (1).

You appear to have grouped ("discrete") survival time data. Marcos provides helpful information about how to think about your data. More information on discrete time survival analysis is at http://Survival Analysis Using Stata...vival-analysis. (These are not "continuous" survival times, so the st suite in Stata is not relevant -- and this is where I differ from Marcos.)

The "event" is "leaving the job", so you are modelling the lengths of time (spell lengths) since each individual became at risk of leaving their job. To define spell lengths (and thence get the data set up correctly), you need to know when people started their job, and hence first became at risk of leaving. If you know this for everyone, you have left-truncated ("delayed entry") data, and you can model it straightforwardly -- see the materials at the URL above.

If you don't know when they started their jobs, the data on spell lengths are left-censored and modelling is really difficult. (Because the chances of leaving a job in a given year depend on how long you've been in the job already, then if you don't know how long elapsed duration is, you can't model duration dependence! There simply isn't the information available. You can make progress only by making assumptions -- which may not be plausible -- e.g. that the chances of leaving the job do not vary with elapsed duration.)
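
A rough sketch of the person-year setup described above, using the made-up data from #1. Several things here are assumptions for illustration only: that 2014 and 2015 are the only years in which leaving is recorded, that elapsed duration can be measured as year - start + 1, and that a logit with log duration is an acceptable hazard specification. With ten made-up employees the estimates mean nothing; the point is the data layout.

```stata
clear
input int female int age int start int left_2014 int left_2015 int perf2013 int perf2014 int perf2015 int tenure
1 54 2000 1 . 3 . . 16
1 50 2002 0 1 3 4 . 14
1 43 2008 0 0 4 4 5 8
0 36 2011 0 0 3 4 3 5
1 29 2013 1 . 2 . . 3
1 32 2013 0 1 3 3 . 3
0 30 2013 0 0 2 3 3 3
1 27 2014 . 1 . 4 4 2
0 26 2014 . 0 4 3 5 2
0 28 2015 . . 3 4 4 1
end

generate long id = _n

* two candidate risk years per employee: 2014 and 2015
expand 2
bysort id: generate int year = 2013 + _n

* event indicator for the year in question
generate byte left = cond(year == 2014, left_2014, left_2015)

* delayed entry: keep only the years in which the employee
* was actually observed at risk of leaving
drop if missing(left)

* elapsed duration in the job, counted from the start year
generate int dur = year - start + 1
generate double lndur = ln(dur)

* discrete-time hazard model with duration dependence
logit left lndur

* fitted hazard of leaving in each observed person-year
predict double hazard, pr
list id year start dur left hazard, sepby(id) noobs
```

Someone who started in 2000 contributes rows only from 2014 on, but with dur equal to 15 and 16 rather than 1 and 2. That is how left truncation is handled in this layout.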
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

21 Feb 2016, 07:45

I thank Stephen - arguably a leading author on the subject - for the insightful remarks in #3.

I wrote my reply to #1 on the spur of the moment, basically on the main tenets of preparing a dataset for survival analysis, and I confess I didn't notice that "time" was a discrete variable in the aforementioned model.

To take advantage of the theme and the discussion, I wish to share a didactic text from the Stata Manual that is closely related to the topic (http://www.stata.com/manuals13/stdiscrete.pdf), and which I have found very clarifying ever since I first read it.

By the way, the author - guess who - of the initial draft was Stephen Jenkins.

Best,

Marcos

Last edited by Marcos Almeida; 21 Feb 2016, 07:48.

Best regards,

Marcos
Monica Muller

Join Date: Jul 2014

Posts: 226
#5

21 Feb 2016, 10:36

Dear Marcos and Stephen,

I really appreciate the valuable information you shared with me. I learned a lot and the sources are super useful. I will read them. Thanks.

Fortunately, I know when each person started to work. The other problem I have is that their retention also depends on their performance. For a person who started working in 2000 and never left, I have three performance observations, for 2013, 2014, and 2015. For someone who started in 2014 and never left, I only have two performance observations, for 2014 and 2015, and no data for 2013. For someone who started in 2008 and left in 2014, I only have one performance observation, for 2013, and so on. My concern is that I have missing values for performance for 3 completely different reasons:
(1) for some people I don't have their performance over their whole tenure, because my data starts in 2013 and I have no information from before 2013.
(2) for some people I don't have performance data because they left.
(3) for some people I don't have performance data for a particular year because they hadn't started their job in that year.
Also, any combination of the 3 reasons is possible!
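
To make the three reasons concrete, here is a rough sketch that tags them in the made-up data from #1. The names leftyear, yrs_unrecorded and why are hypothetical, and the rules used (missing from the leaving year onward counts as "left", missing before the start year counts as "not yet hired") are assumptions about the coding:

```stata
clear
input int female int age int start int left_2014 int left_2015 int perf2013 int perf2014 int perf2015 int tenure
1 54 2000 1 . 3 . . 16
1 50 2002 0 1 3 4 . 14
1 43 2008 0 0 4 4 5 8
0 36 2011 0 0 3 4 3 5
1 29 2013 1 . 2 . . 3
1 32 2013 0 1 3 3 . 3
0 30 2013 0 0 2 3 3 3
1 27 2014 . 1 . 4 4 2
0 26 2014 . 0 4 3 5 2
0 28 2015 . . 3 4 4 1
end

generate long id = _n

* year of leaving, missing if the employee never left
generate int leftyear = cond(left_2014 == 1, 2014, cond(left_2015 == 1, 2015, .))

* reason (1): years of tenure before 2013, which have no rows at all
generate int yrs_unrecorded = max(0, 2013 - start)

reshape long perf, i(id) j(year)

* reasons (2) and (3): tag each missing employee-year
generate byte why = .
replace why = 3 if missing(perf) & year < start
replace why = 2 if missing(perf) & year >= leftyear & !missing(leftyear)
label define why 2 "left" 3 "not yet hired"
label values why why

tabulate why year
summarize yrs_unrecorded if year == 2013
```

Reasons (2) and (3) are structural: there is no performance to observe in those years, so those rows simply fall outside the risk period. Only reason (1) is missing information about years the person actually worked.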

How can I deal with the performance variable as a predictor of retention?

I really appreciate your time.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1433
#6

21 Feb 2016, 12:22

Sorry, but I have no recommendations for dealing with this missing data problem. You might have to model "missingness" jointly with survival time.
Monica Muller

Join Date: Jul 2014

Posts: 226
#7

21 Feb 2016, 22:53

Thanks. I hope it's not completely impossible. Thanks for your help.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17702
#8

23 Feb 2016, 06:50

Monica:
as far as the missingness issue is concerned, I would recommend that you take a look at http://www.missingdata.org.uk/ which is maintained by Jonathan Bartlett, whose posts appear on this list from time to time.

Kind regards,
Carlo
(Stata 19.0)