Identifying coworkers in spell data

Sebastien Willis

Join Date: Mar 2017

Posts: 26
#1

Identifying coworkers in spell data

17 Jul 2019, 06:27

I have a dataset of worker employment spells, and I would like a general way to identify an individual's co-workers when the individual starts a job. Consider the following example dataset:

Code:

input worker_id firm_id start_date end_date 1 1 1 4 1 2 5 10 2 1 2 8 2 2 9 10 3 1 6 7 4 3 2 7 end

When worker 1 starts work in firm 1, she has no co-workers, same when she starts in firm 2. When worker 2 starts in firm 1, worker 1 is her co-worker until time period 4, and worker 3 is her co-worker from time period 6 on. When worker 3 starts, worker 2 is her co-worker for the time worker 3 is in . When worker 4 starts, she has no co-workers.

The following code illustrates something like what I'm after

Code:

levelsof firm_id, local(firms) tempfile worker_ds save "`worker_ds'" clear gen worker_id = . tempfile coworker_ds save "`coworker_ds'" foreach j of local firms { di "" di "results for firm `j'" di "" use "`worker_ds'", clear keep if firm_id == `j' count local obs = r(N) total_coworkers = . * identify set of coworkers for each worker in firm j forvalues i = 1/`obs' { local coworkers`i' "" local n_coworkers = 0 forvalues ii = 1/`obs' { if start_date[`ii'] < end_date[`i'] & end_date[`ii'] > start_date[`i'] & `i' != `ii' { local nextworker = worker_id[`ii'] local coworkers`i' `coworkers`i'' `nextworker' local n_coworkers = `n_coworkers' + 1 } } replace total_coworkers = `n_coworkers' if _n == `i' } * create a worker-coworker level dataset expand total_coworkers_all, generate(expand_obs) gen coworker_id = . sort worker_id local n = 1 * fill in values for coworker_id variable forvalues i = 1/`obs' { di "`i''s coworkers are `coworkers`i''" foreach id of local coworkers`i' { replace coworker_id = `id' if _n == `n' local n = `n' + 1 } } append using "`coworker_ds'" save "`coworker_ds'", replace }

I say "general way" because while this code would do what I want on test data, it would not in general work. I have 40 million spells, so looping, replacing, or sorting repeatedly will take far too long. The code could also fail in various ways: (i) I could easily hit the size limit for the macro -coworkers`i'-; (ii) the code assumes that each worker works a single spell in a firm; (iii) I'm sure people can come up with other reasonble cases in which this code would fail.

I realise this a broad, somewhat vague question, but it would already be great if anyone has suggestions for commands or features of stata's syntax I could exploit to radically speed this up and make is more robust, or could point me towards references that might have suggestions for how to go about solving this. Even improving individuals steps would be helpful.

There are many things I would like to subsequently do with these data. Two important ones are calculate the number of coworkers with certain charactersitics a worker has during a given spell, and see whether a worker has former coworkers in a firm in a subsequent spell with a different firm. I am using Stata 14.1
Tags: foreach, network, spell data
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

17 Jul 2019, 11:45

Your post describes two different goals. The first is to identify all coworkers who are present when the given worker starts. That can be done with the following code:

Code:

clear input worker_id firm_id start_date end_date 1 1 1 4 1 2 5 10 2 1 2 8 2 2 9 10 3 1 6 7 4 3 2 7 end tempfile original save `original' rename worker_id coworker_id rename *_date * rangejoin start_date start end using `original', by(firm_id) // NOW BRING BACK PEOPLE WITH NO COWORKERS AT START merge m:1 firm_id worker_id start_date using `original', assert(match using) nogenerate drop start end replace coworker_id = . if coworker_id == worker_id by worker_id firm_id start_date (coworker_id), sort: gen _j = _n reshape wide coworker_id, i(worker_id firm_id start_date) j(_j) // OPTIONAL order worker_id firm_id start_date end_date, first

The above code requires the -rangejoin- command, written by Robert Picard, and available from SSC. -rangejoin-, in turn, requires the -rangestat- command, written by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Note: It is probably a better idea for you to omit the -reshape wide- command and leave the data in long layout, as it will, I think, simplify your later work. I put the -reshape wide- in there just so you could more easily visualize the results.

On the other hand, you also seem to want to identify all coworkers who overlap with the employee at any time during their tenure. That is a different matter.

Code:

clear input worker_id firm_id start_date end_date 1 1 1 4 1 2 5 10 2 1 2 8 2 2 9 10 3 1 6 7 4 3 2 7 end assert start_date <= end_date capture program drop one_firm program define one_firm tempfile original save `original' rename worker_id coworker_id rename *_date coworker_= joinby firm_id using `original' // DROP SELF MATCHES drop if worker_id == coworker_id // RETAIN ONLY OVERLAPS drop if coworker_start_date > end_date | coworker_end_date < start_date // IDENTIFY OVERLAP PERIOD gen overlap_start = max(start_date, coworker_start_date) gen overlap_end = min(end_date, coworker_end_date) drop coworker_start_date coworker_end_date sort firm_id worker_id start_date coworker_id order firm_id worker_id coworker_id start_date, first exit end runby one_firm, by(firm_id) status

The above code leaves all overlapping pairs of coworkers in the data, along with the start and end dates of the worker, and the start and end dates of the overlap period. Each pair appears twice: with each member of the pair appearing once as the worker and once as the coworker. I did not reshape the data into wide layout at the end this time, and I recommend you leave the data in long layout for your next steps, as I am confident it will prove easier this way.

This code requires the -runby- command, written by Robert Picard and me, also available from SSC.

Breaking the data into chunks by firm_id and processing each firm_id separately is not logically necessary. However, given the size of your data set, trying to do it all at once will probably lead to memory problems (and will also be much slower), so I have pre-emptively done it that way.
2 likes
Comment
Sebastien Willis

Join Date: Mar 2017

Posts: 26
#3

18 Jul 2019, 00:16

Hi Clyde, thanks very much for you detailed answer. Rangejoin and the program you provided look exactly like what I was looking for.

You are right, I am indeed interested in identifying both the coworkers present when a worker starts with a firm, and the coworkers present at any point, and clearly I hadn't thought enough about how the two operations are different.
Comment

Announcement

Identifying coworkers in spell data

Comment

Comment