Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying coworkers in spell data

    I have a dataset of worker employment spells, and I would like a general way to identify an individual's co-workers when the individual starts a job. Consider the following example dataset:
    Code:
    input worker_id firm_id start_date end_date
    1 1 1 4
    1 2 5 10
    2 1 2 8
    2 2 9 10
    3 1 6 7
    4 3 2 7
    end
    When worker 1 starts work in firm 1, she has no co-workers, same when she starts in firm 2. When worker 2 starts in firm 1, worker 1 is her co-worker until time period 4, and worker 3 is her co-worker from time period 6 on. When worker 3 starts, worker 2 is her co-worker for the time worker 3 is in . When worker 4 starts, she has no co-workers.

    The following code illustrates something like what I'm after
    Code:
    levelsof firm_id, local(firms)
    
    tempfile worker_ds
    save "`worker_ds'"
    clear
    gen worker_id = .
    tempfile coworker_ds
    save "`coworker_ds'"
    
    foreach j of local firms {
    
        di ""
        di "results for firm `j'"
        di ""
        
        use "`worker_ds'", clear
        keep if firm_id == `j'
        count
        local obs = r(N)
        total_coworkers = .
        
        * identify set of coworkers for each worker in firm j
        forvalues i = 1/`obs' {
            local coworkers`i' ""
            local n_coworkers = 0
            forvalues ii = 1/`obs' {
                if start_date[`ii'] < end_date[`i'] & end_date[`ii'] > start_date[`i'] & `i' != `ii' {
                    local nextworker = worker_id[`ii']
                    local coworkers`i' `coworkers`i'' `nextworker'
                    local n_coworkers = `n_coworkers' + 1
                }
            }
            replace total_coworkers = `n_coworkers' if _n == `i'
        }
        
        * create a worker-coworker level dataset
        expand total_coworkers_all, generate(expand_obs)
        gen coworker_id = .
        sort worker_id
        local n = 1
        
        * fill in values for coworker_id variable
        forvalues i = 1/`obs' {
            di "`i''s coworkers are `coworkers`i''"
            foreach id of local coworkers`i' {
                replace coworker_id = `id' if _n == `n'
                local n = `n' + 1
            }
        }
        
        append using "`coworker_ds'"
        save "`coworker_ds'", replace
    
    }
    I say "general way" because while this code would do what I want on test data, it would not in general work. I have 40 million spells, so looping, replacing, or sorting repeatedly will take far too long. The code could also fail in various ways: (i) I could easily hit the size limit for the macro -coworkers`i'-; (ii) the code assumes that each worker works a single spell in a firm; (iii) I'm sure people can come up with other reasonble cases in which this code would fail.

    I realise this a broad, somewhat vague question, but it would already be great if anyone has suggestions for commands or features of stata's syntax I could exploit to radically speed this up and make is more robust, or could point me towards references that might have suggestions for how to go about solving this. Even improving individuals steps would be helpful.

    There are many things I would like to subsequently do with these data. Two important ones are calculate the number of coworkers with certain charactersitics a worker has during a given spell, and see whether a worker has former coworkers in a firm in a subsequent spell with a different firm. I am using Stata 14.1

  • #2
    Your post describes two different goals. The first is to identify all coworkers who are present when the given worker starts. That can be done with the following code:

    Code:
    clear
    input worker_id firm_id start_date end_date
    1 1 1 4
    1 2 5 10
    2 1 2 8
    2 2 9 10
    3 1 6 7
    4 3 2 7
    end
    
    tempfile original
    save `original'
    
    rename worker_id coworker_id
    rename *_date *
    
    rangejoin start_date start end using `original', by(firm_id)
    // NOW BRING BACK PEOPLE WITH NO COWORKERS AT START
    merge m:1 firm_id worker_id start_date using `original', assert(match using) nogenerate
    drop start end
    
    replace coworker_id = . if coworker_id == worker_id
    
    by worker_id firm_id start_date (coworker_id), sort: gen _j = _n
    reshape wide coworker_id, i(worker_id firm_id start_date) j(_j) // OPTIONAL
    
    order worker_id firm_id start_date end_date, first
    The above code requires the -rangejoin- command, written by Robert Picard, and available from SSC. -rangejoin-, in turn, requires the -rangestat- command, written by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

    Note: It is probably a better idea for you to omit the -reshape wide- command and leave the data in long layout, as it will, I think, simplify your later work. I put the -reshape wide- in there just so you could more easily visualize the results.

    On the other hand, you also seem to want to identify all coworkers who overlap with the employee at any time during their tenure. That is a different matter.

    Code:
    clear
    input worker_id firm_id start_date end_date
    1 1 1 4
    1 2 5 10
    2 1 2 8
    2 2 9 10
    3 1 6 7
    4 3 2 7
    end
    
    assert start_date <= end_date
    
    capture program drop one_firm
    program define one_firm
        tempfile original
        save `original'
    
        rename worker_id coworker_id
        rename *_date coworker_=
    
        joinby firm_id using `original'
    
        //  DROP SELF MATCHES
        drop if worker_id == coworker_id
    
        //  RETAIN ONLY OVERLAPS
        drop if coworker_start_date > end_date | coworker_end_date < start_date
    
        //  IDENTIFY OVERLAP PERIOD
        gen overlap_start = max(start_date, coworker_start_date)
        gen overlap_end = min(end_date, coworker_end_date)
        drop coworker_start_date coworker_end_date
    
        sort firm_id worker_id start_date coworker_id
        order firm_id worker_id coworker_id start_date, first
        exit
    end
    
    runby one_firm, by(firm_id) status
    The above code leaves all overlapping pairs of coworkers in the data, along with the start and end dates of the worker, and the start and end dates of the overlap period. Each pair appears twice: with each member of the pair appearing once as the worker and once as the coworker. I did not reshape the data into wide layout at the end this time, and I recommend you leave the data in long layout for your next steps, as I am confident it will prove easier this way.

    This code requires the -runby- command, written by Robert Picard and me, also available from SSC.

    Breaking the data into chunks by firm_id and processing each firm_id separately is not logically necessary. However, given the size of your data set, trying to do it all at once will probably lead to memory problems (and will also be much slower), so I have pre-emptively done it that way.

    Comment


    • #3
      Hi Clyde, thanks very much for you detailed answer. Rangejoin and the program you provided look exactly like what I was looking for.

      You are right, I am indeed interested in identifying both the coworkers present when a worker starts with a firm, and the coworkers present at any point, and clearly I hadn't thought enough about how the two operations are different.

      Comment

      Working...
      X