Counting unique coworkers with discontinuous spells

Sebastien Willis

Join Date: Mar 2017

Posts: 30
#1

Counting unique coworkers with discontinuous spells

Today, 02:59

Hi Statalisters

I am trying to calculate the number of unique coworkers a worker has in a given time window, as well as the number of coworkers with certain characteristics. The following code illustrate what I would like to do.

Code:

clear input id firm year male 1 1 1 1 1 2 2 1 1 1 3 1 1 1 4 1 2 1 1 0 2 1 2 0 2 3 3 0 2 0 4 0 3 1 2 1 3 1 3 1 3 4 4 1 4 . 2 1 4 4 4 1 5 1 3 0 5 1 4 0 6 5 1 1 6 5 2 1 end tempfile mydata save `mydata' // keep track of all workers keep id duplicates drop tempfile idlist save `idlist' // create all pairs of coworkers use `mydata' rename (id male) i_= joinby firm year using `mydata' // create all pairs of coworker-years drop if id == i_id // drop self-matches keep *id *male duplicates drop // keep one observation per unique coworker pair. // count coworkers with given characteristics gen own_sex = male == i_male // characteristic of interest collapse (count) id (sum) male own_sex, by(i_id) rename (id male own_sex) (N N_male N_own_sex) // add individuals with no coworkers back in rename i_id id merge 1:1 id using `idlist', nogen foreach var of varlist N* { replace `var' = 0 if missing(`var') }

The problem with this code is the -joinby- step. My dataset is way too large to efficiently form all pairwise combinations like this. I have considered splitting it into smaller chunks but even then the computations would take weeks, if nothing goes wrong. I am wondering if there is a way to perform the same calculations on the original dataset, without the -joinby- step, perhaps using -rangestat-? The difficulty I keep running into is that a worker can have multiple discontinuous spells in the same firm during the time period and I can't think of an elegant way around it.

Thanks for any suggestion
Tags: Joinby, panel data, rangestat, spell data

Nick Cox

Join Date: Mar 2014
Posts: 35782

Today, 06:16

This is a very interesting question to me, but I have difficulty following exactly what you want. Perversely, or otherwise, I would rather start from your data example than try to follow what your complicated code is doing.

You don''t seem to define "a given time window" so I take it to mean same year. Mentioning spells may mean that you want something quite different.

Similarly a coworker presumably works for the same firm.

Then the number of coworkers is just

the number of workers in a given firm and year MINUS 1

-- as a person is not their own coworker,

The number of males is just a sum over an indicator variable 1 for males, and the number of females follows from that.

The number of coworkers who are male is the number of males MINUS the variable male, which works either way as you need to subtract 1 if any person is male, and do not need to subtract 1 if that person is female, but subtracting 0 gets you the right result.

Code:

clear
input id firm year male
1 1 1 1
1 2 2 1
1 1 3 1
1 1 4 1
2 1 1 0
2 1 2 0
2 3 3 0
2 0 4 0
3 1 2 1
3 1 3 1
3 4 4 1
4 . 2 1
4 4 4 1
5 1 3 0
5 1 4 0
6 5 1 1
6 5 2 1
end

sort firm year male 

egen nmales = total(male), by(firm year)

bysort firm year : gen nothers = _N - 1  

order firm year 

list, sepby(firm year) 

     +--------------------------------------------+
     | firm   year   id   male   nmales   nothers |
     |--------------------------------------------|
  1. |    0      4    2      0        0         0 |
     |--------------------------------------------|
  2. |    1      1    2      0        1         1 |
  3. |    1      1    1      1        1         1 |
     |--------------------------------------------|
  4. |    1      2    2      0        1         1 |
  5. |    1      2    3      1        1         1 |
     |--------------------------------------------|
  6. |    1      3    5      0        2         2 |
  7. |    1      3    1      1        2         2 |
  8. |    1      3    3      1        2         2 |
     |--------------------------------------------|
  9. |    1      4    5      0        1         1 |
 10. |    1      4    1      1        1         1 |
     |--------------------------------------------|
 11. |    2      2    1      1        1         0 |
     |--------------------------------------------|
 12. |    3      3    2      0        0         0 |
     |--------------------------------------------|
 13. |    4      4    4      1        2         1 |
 14. |    4      4    3      1        2         1 |
     |--------------------------------------------|
 15. |    5      1    6      1        1         0 |
     |--------------------------------------------|
 16. |    5      2    6      1        1         0 |
     |--------------------------------------------|
 17. |    .      2    4      1        1         0 |
     +--------------------------------------------+

Writing MINUS here is deliberate shouting: one of many still useful details in Tukey's Exploratory Data Analysis (1977) is that such emphasis can help make definitions clear.

Comment

Sebastien Willis

Join Date: Mar 2017

Posts: 30
#3

Today, 06:39

Hi Nick, thanks for the quick reply.

Sorry for not making it clear. In the example, "a given time window" is simply the four years for which each individual is observed. In the real world, I am interested in the number of coworkers in a rolling window (e.g. the number of unique coworkers in the past 4 years), but I will only load the data one "window" at a time, i.e. I will only load 4 years of data at a time. A coworker is indeed someone who works for the same firm in the same year.

If all workers had at most one spell in a given firm, the problem would be simple, and a solution along the lines of your example would work. For example I could simply count the number of coworkers in a worker's first year in a firm, and the number of new coworkers in each subsequent year. But it is workers leaving and coming back that cause problems. I don't want to double-count these individuals.
Comment
Sebastien Willis

Join Date: Mar 2017

Posts: 30
#4

Today, 06:56

Maybe the simplest way of putting what I’m after is as follows. For each individual, I want to know how many unique coworkers that individual had in the last four years. That means that I don’t want to double count coworkers with whom the individual shares a firm in multiple years, or co-workers they work with across multiple firms.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35782
#5

Today, 07:10

Let's take this one problem at a time. In your 4 year period

Code:

egen tag = tag(firm id) egen ndistinct = total(tag), by(firm)

first tags each worker for that firm just once and then the number of distinct workers is a total over that and finally the number of distinct co-workers is that MINUS 1. There is more on this in the paper underlying distinct from the Stata Journal.

The number of workers who are males is the total(tag * male) as an expression fed to egen and again you need to subtract the indicator male.
Code would need to be more complicated if workers change gender.

If you want a moving window I think rangestat (from SSC, as you are asked to explain: FAQ Advice =12) can do it with a custom script.
Comment
Sebastien Willis

Join Date: Mar 2017

Posts: 30
#6

Today, 12:27

Thanks for the suggestion. Focusing just on the problem of counting coworkers, I'm not sure that the proposed solution will work. For a given worker, say 1, ndistinct will count all workers at the firm at any point in the four-year window, and not just those workers in the years where 1 was working at the firm. Furthermore, I cannot assume that a worker only works a single spell (i.e. a closed interval of years, e.g. years 1, 2, & 3) at a firm during my observation window, which motivated my original solution with joinby. I hope the following shorter example makes this clear.

Code:

clear input id firm year 1 1 1 1 2 2 1 2 3 2 1 1 2 1 2 2 3 3 3 4 1 3 4 2 3 1 3 4 1 1 4 5 2 4 1 3 end egen tag = tag(firm id) egen ndistinct = total(tag), by(firm) replace ndistinct = ndistinct - 1 // don't count self drop tag year duplicates drop // keep one observation per id-firm collapse (sum) N = ndistinct, by(id) list /* +--------+ | id N | |--------| 1. | 1 3 | 2. | 2 3 | 3. | 3 3 | 4. | 4 3 | +--------+ correct answer: id N 1 2 2 2 3 2 4 3 */

To be clear, I am not sure that a simpler solution than the brute-force approach of joinby exists, but I thought others might have better ideas.
Comment

Announcement