Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting unique coworkers with discontinuous spells

    Hi Statalisters

    I am trying to calculate the number of unique coworkers a worker has in a given time window, as well as the number of coworkers with certain characteristics. The following code illustrate what I would like to do.

    Code:
    clear
    input id firm year male
    1 1 1 1
    1 2 2 1
    1 1 3 1
    1 1 4 1
    2 1 1 0
    2 1 2 0
    2 3 3 0
    2 0 4 0
    3 1 2 1
    3 1 3 1
    3 4 4 1
    4 . 2 1
    4 4 4 1
    5 1 3 0
    5 1 4 0
    6 5 1 1
    6 5 2 1
    end
    
    tempfile mydata
    save `mydata'
    
    // keep track of all workers
    keep id
    duplicates drop
    tempfile idlist
    save `idlist'
    
    // create all pairs of coworkers
    use `mydata'
    rename (id male) i_=
    joinby firm year using `mydata' // create all pairs of coworker-years
    drop if id == i_id                 // drop self-matches
    keep *id *male
    duplicates drop // keep one observation per unique coworker pair.
    
    // count coworkers with given characteristics
    gen own_sex = male == i_male    // characteristic of interest
    collapse (count) id (sum) male own_sex, by(i_id)
    rename (id male own_sex) (N N_male N_own_sex)
    
    // add individuals with no coworkers back in
    rename i_id id
    merge 1:1 id using `idlist', nogen
    foreach var of varlist N* {
        replace `var' = 0 if missing(`var')
    }
    The problem with this code is the -joinby- step. My dataset is way too large to efficiently form all pairwise combinations like this. I have considered splitting it into smaller chunks but even then the computations would take weeks, if nothing goes wrong. I am wondering if there is a way to perform the same calculations on the original dataset, without the -joinby- step, perhaps using -rangestat-? The difficulty I keep running into is that a worker can have multiple discontinuous spells in the same firm during the time period and I can't think of an elegant way around it.

    Thanks for any suggestion

  • #2
    This is a very interesting question to me, but I have difficulty following exactly what you want. Perversely, or otherwise, I would rather start from your data example than try to follow what your complicated code is doing.

    You don''t seem to define "a given time window" so I take it to mean same year. Mentioning spells may mean that you want something quite different.

    Similarly a coworker presumably works for the same firm.

    Then the number of coworkers is just

    the number of workers in a given firm and year MINUS 1

    -- as a person is not their own coworker,

    The number of males is just a sum over an indicator variable 1 for males, and the number of females follows from that.

    The number of coworkers who are male is the number of males MINUS the variable male, which works either way as you need to subtract 1 if any person is male, and do not need to subtract 1 if that person is female, but subtracting 0 gets you the right result.

    Code:
    clear
    input id firm year male
    1 1 1 1
    1 2 2 1
    1 1 3 1
    1 1 4 1
    2 1 1 0
    2 1 2 0
    2 3 3 0
    2 0 4 0
    3 1 2 1
    3 1 3 1
    3 4 4 1
    4 . 2 1
    4 4 4 1
    5 1 3 0
    5 1 4 0
    6 5 1 1
    6 5 2 1
    end
    
    sort firm year male 
    
    egen nmales = total(male), by(firm year)
    
    bysort firm year : gen nothers = _N - 1  
    
    order firm year 
    
    list, sepby(firm year) 
    
         +--------------------------------------------+
         | firm   year   id   male   nmales   nothers |
         |--------------------------------------------|
      1. |    0      4    2      0        0         0 |
         |--------------------------------------------|
      2. |    1      1    2      0        1         1 |
      3. |    1      1    1      1        1         1 |
         |--------------------------------------------|
      4. |    1      2    2      0        1         1 |
      5. |    1      2    3      1        1         1 |
         |--------------------------------------------|
      6. |    1      3    5      0        2         2 |
      7. |    1      3    1      1        2         2 |
      8. |    1      3    3      1        2         2 |
         |--------------------------------------------|
      9. |    1      4    5      0        1         1 |
     10. |    1      4    1      1        1         1 |
         |--------------------------------------------|
     11. |    2      2    1      1        1         0 |
         |--------------------------------------------|
     12. |    3      3    2      0        0         0 |
         |--------------------------------------------|
     13. |    4      4    4      1        2         1 |
     14. |    4      4    3      1        2         1 |
         |--------------------------------------------|
     15. |    5      1    6      1        1         0 |
         |--------------------------------------------|
     16. |    5      2    6      1        1         0 |
         |--------------------------------------------|
     17. |    .      2    4      1        1         0 |
         +--------------------------------------------+




    Writing MINUS here is deliberate shouting: one of many still useful details in Tukey's Exploratory Data Analysis (1977) is that such emphasis can help make definitions clear.

    Comment


    • #3
      Hi Nick, thanks for the quick reply.

      Sorry for not making it clear. In the example, "a given time window" is simply the four years for which each individual is observed. In the real world, I am interested in the number of coworkers in a rolling window (e.g. the number of unique coworkers in the past 4 years), but I will only load the data one "window" at a time, i.e. I will only load 4 years of data at a time. A coworker is indeed someone who works for the same firm in the same year.

      If all workers had at most one spell in a given firm, the problem would be simple, and a solution along the lines of your example would work. For example I could simply count the number of coworkers in a worker's first year in a firm, and the number of new coworkers in each subsequent year. But it is workers leaving and coming back that cause problems. I don't want to double-count these individuals.

      Comment


      • #4
        Maybe the simplest way of putting what I’m after is as follows. For each individual, I want to know how many unique coworkers that individual had in the last four years. That means that I don’t want to double count coworkers with whom the individual shares a firm in multiple years, or co-workers they work with across multiple firms.

        Comment


        • #5
          Let's take this one problem at a time. In your 4 year period

          Code:
          egen tag = tag(firm id) 
          
          egen ndistinct = total(tag), by(firm)
          first tags each worker for that firm just once and then the number of distinct workers is a total over that and finally the number of distinct co-workers is that MINUS 1. There is more on this in the paper underlying distinct from the Stata Journal.

          The number of workers who are males is the total(tag * male) as an expression fed to egen and again you need to subtract the indicator male.
          Code would need to be more complicated if workers change gender.

          If you want a moving window I think rangestat (from SSC, as you are asked to explain: FAQ Advice =12) can do it with a custom script.

          Comment


          • #6
            Thanks for the suggestion. Focusing just on the problem of counting coworkers, I'm not sure that the proposed solution will work. For a given worker, say 1, ndistinct will count all workers at the firm at any point in the four-year window, and not just those workers in the years where 1 was working at the firm. Furthermore, I cannot assume that a worker only works a single spell (i.e. a closed interval of years, e.g. years 1, 2, & 3) at a firm during my observation window, which motivated my original solution with joinby. I hope the following shorter example makes this clear.

            Code:
            clear
            input id firm year
            1 1 1
            1 2 2
            1 2 3
            2 1 1
            2 1 2
            2 3 3
            3 4 1
            3 4 2
            3 1 3
            4 1 1
            4 5 2
            4 1 3
            end    
            
            egen tag = tag(firm id) 
            egen ndistinct = total(tag), by(firm)
            replace ndistinct = ndistinct - 1 // don't count self
            drop tag year
            duplicates drop // keep one observation per id-firm
            collapse (sum) N = ndistinct, by(id)
            
            list
            /*
                 +--------+
                 | id   N |
                 |--------|
              1. |  1   3 |
              2. |  2   3 |
              3. |  3   3 |
              4. |  4   3 |
                 +--------+
            
            correct answer:
                id N
                1 2
                2 2
                3 2
                4 3
            
            */
            To be clear, I am not sure that a simpler solution than the brute-force approach of joinby exists, but I thought others might have better ideas.

            Comment

            Working...
            X