Events in intervals, rangejoin and rangerun - staffing decision

Amin Sofla

Join Date: May 2018
Posts: 67

18 May 2018, 09:32

Dear Statalist,

I would like to analyze the effect of workload on the probability of being staffed on new clients. In 'Dataset 1' (below), we observe all the branches with their corresponding employees and clients. In addition, we observe every employee who was staffed on a specific client, along with the corresponding starting and ending dates. I limit the sample to one year for the sake of brevity. Variable definitions: br_id is the branch ID; emp_id is the employee ID; cl_id is the client's ID; assign_id is the assignment decision's ID; emclstdate is the starting date; emclendate is the ending date; newassign is an indicator that shows whether there is an assignment in the current year or not; clsize is the client's size.

Dataset 1:

yeare	br_id	emp_id	cl_id	newassign	assign_id	emclstdate	emclendate	clsize
2005	1	1	12	0		09/01/2001	23/05/2007	60
2005	1	1	2	0		31/10/2003	22/09/2014	80
2005	1	1	5	0		01/11/2003	06/02/2006	20
2005	1	2	7	0		01/11/2003	05/05/2009	90
2005	1	2	4	0		03/08/2004	16/05/2006	90
2005	1	2	8	1	3	01/12/2005	02/05/2006	60
2005	1	3	11	0		24/01/2004	31/03/2017	50
2005	1	3	6	0		24/11/2004	02/05/2006	80
2005	1	3	1	1	1	01/01/2005	16/05/2006	30
2005	2	4	3	0		14/12/2004	09/03/2010	70
2005	2	4	10	1	4	20/01/2005	12/03/2010	30
2005	2	5	13	0		14/12/2004	27/12/2006	20
2005	2	5	9	1	2	20/12/2005	27/12/2005	10

The unit of observation for the final analysis will be the employee and new client dyad, which represents a staffing opportunity. In order to test my hypotheses, in each branch and for each staffing (assignment) decision, I need to identify the set of employees who could have been assigned to the client. I directly observe every employee who was staffed in 'Dataset 1' and include them in the opportunity set. To identify non-staffed employees who could have been staffed on a specific transaction, I would like to consider all the non-staffed employees who worked for the branch on the specific assignment (staffing) date. Next, for each transaction (staffing decision), I would like to calculate the workload (aggregated clients' size) for all the potential employees in the branch, measured one day before the staffing decision (see the workload variable). The final dataset should look like 'Dataset 2'. The dependent variable (staffed in 'Dataset 2') is an indicator variable that takes a value of one if the employee is assigned to the client (from 'Dataset 1') and zero otherwise.

Dataset 2:

yeare	br_id	assign_id	staffed	emp_id	emclstdate	workload
2005	1	1	0	1	01/01/2005	160
2005	1	1	0	2	01/01/2005	180
2005	1	1	1	3	01/01/2005	130
2005	2	2	0	4	20/12/2005	100
2005	2	2	1	5	20/12/2005	20
2005	1	3	0	1	01/12/2005	160
2005	1	3	1	2	01/12/2005	180
2005	1	3	0	3	01/12/2005	160
2005	2	4	1	4	20/01/2005	70
2005	2	4	0	5	20/01/2005	20
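
For anyone checking the workload column by hand, here is a minimal sketch (not code posted by the participants in this thread) that types in the branch 1 rows of Dataset 1 and sums the size of the clients active on 31 December 2004, the day before assignment 1. The names s_start, s_end, active and size_active are introduced here for illustration only.

```stata
* branch 1 rows of Dataset 1, typed in by hand (dates kept as strings first)
clear
input emp_id cl_id str10 s_start str10 s_end clsize
1 12 "09/01/2001" "23/05/2007" 60
1  2 "31/10/2003" "22/09/2014" 80
1  5 "01/11/2003" "06/02/2006" 20
2  7 "01/11/2003" "05/05/2009" 90
2  4 "03/08/2004" "16/05/2006" 90
2  8 "01/12/2005" "02/05/2006" 60
3 11 "24/01/2004" "31/03/2017" 50
3  6 "24/11/2004" "02/05/2006" 80
3  1 "01/01/2005" "16/05/2006" 30
end
gen emclstdate = date(s_start, "DMY")
gen emclendate = date(s_end, "DMY")
format %td emclstdate emclendate

* a client counts toward the workload if its interval covers the day
* before the staffing decision (assignment 1 starts on 01/01/2005)
local day = mdy(1,1,2005) - 1
gen active = emclstdate <= `day' & emclendate >= `day'
gen size_active = clsize * active
collapse (sum) workload = size_active, by(emp_id)
list, noobs
```

This gives workloads of 160, 180 and 130 for employees 1, 2 and 3, which match the rows for assign_id 1 in Dataset 2. It handles a single decision date only, so it is a check on the definition rather than a solution for the full data.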

To summarize:
1. How can I create the staffing opportunity set, that is, for each staffing decision in each branch, find all the staffed and non-staffed employees in the branch and combine them (Dataset 2)?
2. How can I calculate the aggregated workload of each staffed and non-staffed employee at each staffing (assignment) date, or more precisely one day before the decision (the workload variable in Dataset 2)?
3. More importantly, how can I make the program efficient? Please note that in each year there are in total about 1,000 branches, 50,000 assignment (staffing) decisions, 5,000 employees, and 250,000 clients.

I truly appreciate your time and consideration,

Last edited by sladmin; 28 May 2018, 05:36. Reason: Title/subject change


Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

21 May 2018, 14:35

You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. It is a good idea to show us exactly what you entered and what you got. Also, simplify everything to the minimum needed to generate the problem. You're asking us to puzzle through a lot of text to figure out your problem.

First, you almost certainly need to merge the two datasets. You will want to xtset your data. Then, you can use egen with by statements to calculate what you want for each office.

Given your apparent skill level in Stata, don't worry about speed. You are a long way from even getting it to do what you want.

Robert Picard

Join Date: Mar 2014
Posts: 1536
#3

21 May 2018, 19:11

Here's how I would approach the problem. Unfortunately, the data example presented is a bit too thin to develop and test code for a solution so I created a demonstration dataset with a similar structure. I use the term contract to refer to each assignment:

Code:

* create a demonstration dataset
clear all
set seed 3213
set obs 10
gen br_id = _n
gen empl_count = runiformint(3,10)
expand empl_count
bysort br_id: gen emp_id = _n
gen cl_count = runiformint(1,20)
expand cl_count
bysort br_id emp_id: gen cl_id = _n
gen emclstdate = runiformint(mdy(1,1,2001), mdy(12,31,2017))
gen emclendate = runiformint(emclstdate, emclstdate + 365*10)
format %td emclstdate emclendate
gen clsize = runiformint(10,90)
drop empl_count cl_count

* only one contract initiation per client per date
isid br_id emclstdate emp_id cl_id, sort
gen contract = _n
save "contracts.dta", replace
list in 1/20

and here are the first 20 observations:

Code:

. list in 1/20

     +--------------------------------------------------------------------+
     | br_id   emp_id   cl_id   emclstd~e   emclend~e   clsize   contract |
     |--------------------------------------------------------------------|
  1. |     1        3       3   22mar2001   25jan2004       56          1 |
  2. |     1        3       2   04apr2001   25feb2010       29          2 |
  3. |     1        6      11   12oct2001   10dec2008       65          3 |
  4. |     1        6       8   17nov2001   06feb2010       35          4 |
  5. |     1        5       3   18mar2002   13oct2005       74          5 |
     |--------------------------------------------------------------------|
  6. |     1        6       1   26sep2002   01apr2011       67          6 |
  7. |     1        3      10   15jan2003   13jan2006       47          7 |
  8. |     1        7       4   21jan2003   22oct2005       30          8 |
  9. |     1        2      10   11mar2003   17apr2008       85          9 |
 10. |     1        1       7   13mar2003   05may2008       15         10 |
     |--------------------------------------------------------------------|
 11. |     1        6       4   29mar2003   13feb2013       31         11 |
 12. |     1        3       9   06may2003   05jan2012       70         12 |
 13. |     1        3      11   06jul2003   25sep2004       72         13 |
 14. |     1        1      11   02aug2003   18jun2005       74         14 |
 15. |     1        6       2   04mar2004   28may2012       48         15 |
     |--------------------------------------------------------------------|
 16. |     1        6      14   21mar2004   08nov2012       41         16 |
 17. |     1        1      12   26may2004   16apr2009       65         17 |
 18. |     1        1      16   03jun2005   04nov2007       42         18 |
 19. |     1        3       4   05jul2005   07oct2008       61         19 |
 20. |     1        1       1   26jan2006   14aug2008       32         20 |
     +--------------------------------------------------------------------+

.

The first step is to track the workload of each employee by branch. The technique used is well explained in:

SJ-13-1 dm0068 . . . . . Stata tip 114: Expand paired dates to pairs of dates
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
Q1/13 SJ 13(1):217--219 (no commands)
tip on using expand to deal with paired dates

http://www.stata-journal.com/article...article=dm0068

and applied to the problem at hand:

Code:

use "contracts.dta", clear
expand 2
bysort contract: gen date = cond(_n == 1, emclstdate, emclendate)
by contract: gen inout = cond(_n == 1, clsize, -clsize)
format %td date
bysort br_id emp_id (date contract): gen load = sum(inout)
// daily workload is measured after considering all events of the day
by br_id emp_id date: keep if _n == _N
keep br_id emp_id date load
save "load.dta", replace
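
As a toy illustration of the paired-dates step, the sketch below applies the same commands to two made-up contracts for a single employee. The data values and the string variables s_in and s_out are hypothetical and exist only to keep the example self-contained.

```stata
* two hypothetical, overlapping contracts for one employee
clear
input contract emp_id str9 s_in str9 s_out clsize
1 1 "01jan2005" "31mar2005" 60
2 1 "01feb2005" "30jun2005" 20
end
gen emclstdate = date(s_in, "DMY")
gen emclendate = date(s_out, "DMY")

* one observation per event: +clsize on the start date, -clsize on the end date
expand 2
bysort contract: gen date = cond(_n == 1, emclstdate, emclendate)
by contract: gen inout = cond(_n == 1, clsize, -clsize)
format %td date

* running sum of events gives the load after each event date
bysort emp_id (date contract): gen load = sum(inout)
by emp_id date: keep if _n == _N
list emp_id date load, noobs
```

The load is 60 from 01jan2005, 80 from 01feb2005, 20 from 31mar2005 and 0 from 30jun2005. Note that with this construction a contract no longer counts toward the load on its own end date.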

The next step is to make an inventory of employee availability per branch. Each employee's tenure at the branch is bounded by the date of her first contract to the close date of her most recent contract. With that in hand, I use rangejoin (from SSC) to match each employee to all branch contracts that occur within her tenure. The net effect is to form, for each contract, all pairwise combinations of employees available on the day the contract starts. I then append the workload data and order these two sets of observations by date within each branch. I then use rangerun (from SSC) to look back, one day before the contract starts, for the most recent workload for each available employee.

Code:

* branch employee inventory, use earliest and latest date to determine job tenure
clear all
use "contracts.dta"
collapse (min) day1=emclstdate (max) dayN=emclendate, by(br_id emp_id)

* match each branch employee to potential staffing decisions during their tenure
rangejoin emclstdate day1 dayN using "contracts.dta", by(br_id) keep(emclstdate contract emp_id)
rename emp_id_U emp_chosen
gen chosen = emp_id == emp_chosen
gen date = emclstdate - 1
isid br_id contract emp_id, sort
format %td date

* combine with workload data
append using "load.dta"
isid br_id date contract emp_id, sort missok

* define a program to get the load for each potential employee on the day before the contract
program do1
    drop if mi(load)
    sort date
    keep if _n == _N
    rename load load2use
end

* use a date in the future to ignore load observations (low bound will be > date)
gen low = cond(!mi(contract), ., mdy(12,31,2099))
format %td low

rangerun do1, interval(date low date) by(br_id emp_id)

isid br_id date contract emp_id, sort missok

* final clean-up, drop load observations
drop if mi(contract)
drop day1 dayN load low

list if contract <= 10, sepby(contract)

and here are the results for the first 10 contracts of the first branch:

Code:

. list if contract <= 10, sepby(contract)

      +----------------------------------------------------------------------------------+
      | br_id   emp_id   emp_ch~n   emclstd~e   contract   chosen        date   load2use |
      |----------------------------------------------------------------------------------|
   1. |     1        3          3   22mar2001          1        1   21mar2001          . |
      |----------------------------------------------------------------------------------|
   2. |     1        3          3   04apr2001          2        1   03apr2001         56 |
      |----------------------------------------------------------------------------------|
   3. |     1        3          6   12oct2001          3        0   11oct2001         85 |
   4. |     1        6          6   12oct2001          3        1   11oct2001          . |
      |----------------------------------------------------------------------------------|
   5. |     1        3          6   17nov2001          4        0   16nov2001         85 |
   6. |     1        6          6   17nov2001          4        1   16nov2001         65 |
      |----------------------------------------------------------------------------------|
   7. |     1        3          5   18mar2002          5        0   17mar2002         85 |
   8. |     1        5          5   18mar2002          5        1   17mar2002          . |
   9. |     1        6          5   18mar2002          5        0   17mar2002        100 |
      |----------------------------------------------------------------------------------|
  10. |     1        3          6   26sep2002          6        0   25sep2002         85 |
  11. |     1        5          6   26sep2002          6        0   25sep2002         74 |
  12. |     1        6          6   26sep2002          6        1   25sep2002        100 |
      |----------------------------------------------------------------------------------|
  13. |     1        3          3   15jan2003          7        1   14jan2003         85 |
  14. |     1        5          3   15jan2003          7        0   14jan2003         74 |
  15. |     1        6          3   15jan2003          7        0   14jan2003        167 |
      |----------------------------------------------------------------------------------|
  16. |     1        3          7   21jan2003          8        0   20jan2003        132 |
  17. |     1        5          7   21jan2003          8        0   20jan2003         74 |
  18. |     1        6          7   21jan2003          8        0   20jan2003        167 |
  19. |     1        7          7   21jan2003          8        1   20jan2003          . |
      |----------------------------------------------------------------------------------|
  20. |     1        2          2   11mar2003          9        1   10mar2003          . |
  21. |     1        3          2   11mar2003          9        0   10mar2003        132 |
  22. |     1        5          2   11mar2003          9        0   10mar2003         74 |
  23. |     1        6          2   11mar2003          9        0   10mar2003        167 |
  24. |     1        7          2   11mar2003          9        0   10mar2003         30 |
      |----------------------------------------------------------------------------------|
  25. |     1        1          1   13mar2003         10        1   12mar2003          . |
  26. |     1        2          1   13mar2003         10        0   12mar2003         85 |
  27. |     1        3          1   13mar2003         10        0   12mar2003        132 |
  28. |     1        5          1   13mar2003         10        0   12mar2003         74 |
  29. |     1        6          1   13mar2003         10        0   12mar2003        167 |
  30. |     1        7          1   13mar2003         10        0   12mar2003         30 |
      +----------------------------------------------------------------------------------+


Amin Sofla

Join Date: May 2018

Posts: 67
#4

23 May 2018, 14:16

Dear Robert, I truly appreciate your time and consideration. Your suggested solution indeed is creative and efficient. Now, I would like to ask another question:
Previously, I assumed that the client size does not change over time. I would like to relax this assumption. I can still construct the staffing-set data; however, I have difficulty modifying your code so that the workload calculation reflects the changing client size. I observe a given client's size annually, at the end of the client's reporting period.
To sum up, how can I calculate the workload at each contract date when client sizes change from year to year?

The demonstration datasets are presented below:

Code:

* create a demonstration dataset for contracts
clear all
set seed 3213
set obs 10
gen br_id = _n
label var br_id "Branch Id"
gen empl_count = runiformint(3,10)
expand empl_count
bysort br_id: gen emp_id = _n
label var emp_id "Employee Id"
gen cl_count = runiformint(1,20)
expand cl_count
bysort br_id emp_id: gen cl_id = _n
label var cl_id "Client ID"
gen emclstdate = runiformint(mdy(1,1,2001), mdy(12,31,2017))
format %td emclstdate
label var emclstdate "Employee-Client starting date"
gen emclendate = runiformint(emclstdate, emclstdate + 365*10)
label var emclendate "Employee-Client ending date"
format %td emclendate
drop empl_count cl_count
isid br_id emclstdate emp_id cl_id, sort
gen contract = _n
label var contract "Contract (Assignment) Id"
label data "Employees and their contracts (assignments)"
sa "contracts.dta", replace

and here are the first 10 observations:

Code:

. list in 1/10

     br_id   emp_id   cl_id   emclstdate   emclendate   contract
  1.     1        3       3    22mar2001    25jan2004          1
  2.     1        3       2    04apr2001    25feb2010          2
  3.     1        6      11    12oct2001    10dec2008          3
  4.     1        6       8    17nov2001    06feb2010          4
  5.     1        5       3    18mar2002    13oct2005          5
  6.     1        6       1    26sep2002    01apr2011          6
  7.     1        3      10    15jan2003    13jan2006          7
  8.     1        7       4    21jan2003    22oct2005          8
  9.     1        2      10    11mar2003    17apr2008          9
 10.     1        1       7    13mar2003    05may2008         10

Code:

* create a demonstration dataset for clients and their sizes over the years
clear all
set obs 20
set seed 3213
gen c = runiformint(1,20)
gen cl_id = _n
label var cl_id "Client ID"
expand 28
bysort cl_id: gen yeare=_n+1999
label var yeare "Client Reporting Period - Year"
tab year
gen cl_size = (runiformint(1,10)/10+c)*10
label var cl_size "Client size in this Year"
drop c
tempfile clsize
sa `clsize' , replace

use "contracts.dta" , clear
keep cl_id emclstdate emclendate
sort cl_id emclstdate
by cl_id, sort: egen mindate = min(emclstdate)
by cl_id, sort: egen maxdate = max(emclendate)
format %td mindate maxdate
sort cl_id
duplicates drop cl_id, force
gen clrsdate=mindate-runiformint(90,180)
gen clredate=maxdate+runiformint(1,30)
format %td clrsdate clredate
label var clrsdate "Min reporting starting date"
label var clredate "Max reporting starting date"
format %td clrsdate clredate
keep cl_id clrsdate clredate
label data "Employees and their contracts (assignments)"

merge 1:m cl_id using `clsize', keep(match) nogen
gen clrsdated=day(clrsdate)
gen clrsdatem=month(clrsdate)
gen clrsdatey=year(clrsdate)
gen clredated=day(clredate)
gen clredatem=month(clredate)
gen clredatey=year(clredate)
drop if yeare<clrsdatey
xtset cl_id yeare
gen clrds=mdy(clrsdatem, clrsdated, yeare)
format %td clrds
label var clrds "Client's Reporting Period - Start"
gen clrde=clrds[_n+1]-1 if cl_id==cl_id[_n+1]
replace clrde=clrds+365 if clrde==.
label var clrde "Client's Reporting period - End"
format %td clrds clrde
keep cl_id yeare cl_size clrds clrde
label data "Client Size in Each Reporting Period"
sa "clientinfo.dta", replace

and here are the first 10 observations:

Code:

. list in 1/10

     cl_id   yeare   cl_size       clrds       clrde
  1.     1    2000       187   15nov2000   14nov2001
  2.     1    2001       185   15nov2001   14nov2002
  3.     1    2002       183   15nov2002   14nov2003
  4.     1    2003       181   15nov2003   14nov2004
  5.     1    2004       186   15nov2004   14nov2005
  6.     1    2005       188   15nov2005   14nov2006
  7.     1    2006       184   15nov2006   14nov2007
  8.     1    2007       187   15nov2007   14nov2008
  9.     1    2008       186   15nov2008   14nov2009
 10.     1    2009       185   15nov2009   14nov2010

* End of creating the demonstration datasets
Attached Files

dofile-dem data.do (2.6 KB, 1 view)

Last edited by Amin Sofla; 23 May 2018, 14:18.

Robert Picard

Join Date: Mar 2014
Posts: 1536
#5

24 May 2018, 10:17

In retrospect, it makes sense that the client size changes through time. I'm pretty sure that there's a way to retrofit the code to accommodate this twist but I'm going to change tack a bit. I'm not sure I understand the rules of when the client size is measured annually. Here's some code that assumes that the census is done at the end of the calendar year except that it is observed on the date of the first day of the contract and the last day of the contract. The code starts from the demonstration dataset created in #3:

Code:

* create annual data on client size
clear all
use "contracts.dta"
collapse (min) day1=emclstdate (max) dayN=emclendate (min) clsize, by(br_id cl_id)
gen years = year(dayN) - year(day1) + 1
expand years
bysort br_id cl_id: gen year = year(day1) + _n - 1
by br_id cl_id: replace clsize = clsize + runiformint(-clsize+5,clsize)
gen clsize_date = mdy(12,31,year)
replace clsize_date = day1 if year == year(day1)
replace clsize_date = dayN if year == year(dayN)
format %td clsize_date
drop years
isid br_id cl_id year, sort
save "clsize.dta", replace

The "clsize.dta" contains one observation per year per branch client, as measured on clsize_date. We can now return to the contracts data and match each contract to the client size for each year the contract is in effect:

Code:

use "contracts.dta", clear
drop clsize
gen year1 = year(emclstdate)
gen yearN = year(emclendate)
rangejoin year year1 yearN using "clsize.dta", by(br_id cl_id)
isid br_id emclstdate emp_id cl_id year, sort
save "contracts_annual.dta", replace
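
To make the year-range match concrete, here is a small sketch on made-up data. It deliberately swaps in official Stata's joinby followed by an inrange() filter instead of rangejoin, so that it runs without installing anything from SSC. The data values and the tempfile name sizes are hypothetical.

```stata
* hypothetical annual sizes for one client
clear
input cl_id year clsize
1 2001 50
1 2002 55
1 2003 40
1 2004 45
end
tempfile sizes
save `sizes'

* one hypothetical contract in effect from 2002 through 2003
clear
input contract cl_id year1 yearN
1 1 2002 2003
end

* form all contract-by-year pairs for the client, then keep the years
* that fall within the contract's span
joinby cl_id using `sizes'
keep if inrange(year, year1, yearN)
sort contract year
list, noobs
```

The contract ends up with two observations, for 2002 (size 55) and 2003 (size 40). On data of the size described in #1 this is not a substitute for rangejoin, because joinby first forms every pairwise combination within a client before anything is filtered out.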

As before, we create an inventory of branch employees with their first and last day on the job.

Code:

clear all
use "contracts.dta"
collapse (min) day1=emclstdate (max) dayN=emclendate, by(br_id emp_id)

Each employee is then matched to all contracts that started during the employee's tenure:

Code:

* match each branch employee to potential staffing decisions during their tenure
rangejoin emclstdate day1 dayN using "contracts.dta", by(br_id) keep(emclstdate contract emp_id)
rename emp_id_U emp_chosen
gen chosen = emp_id == emp_chosen
gen clsize_date = emclstdate - 1
isid br_id contract emp_id, sort
format %td clsize_date

At this point, the data in memory contains the set of potential employees to consider when awarding each contract. The clsize_date is the date to consider when calculating the load when the contract is awarded (the next day). In order to be able to calculate the load, we append the annual contract data:

Code:

append using "contracts_annual.dta"
isid br_id contract clsize_date emp_id, sort missok

We only want to look back and calculate the load for observations where chosen is not missing, so for those we set the lower bound to a missing value, which picks up any observation from any point in the past up to the upper bound (clsize_date). For observations where chosen is missing (those that come from the "contracts_annual.dta" dataset), we use a lower bound far in the future; this makes low > high, so no observations fall within the interval. For each potential awardee, the do1 program keeps the most recent size for each client and then sums the load across all clients.

Code:

* define a program to get the load for each potential employee on the day before the contract
program do1
    drop if !mi(chosen)
    bysort cl_id (year): keep if _n == _N
    gen load = sum(clsize)
end

* to run -do1- only on decision observations, create low > high for contract obs
gen low = cond(!mi(chosen), ., mdy(12,31,2099))
format %td low

rangerun do1, interval(clsize_date low clsize_date) by(br_id emp_id)

isid br_id contract clsize_date emp_id, sort missok

* final clean-up, drop contract observations
drop if mi(chosen)
keep br_id emp_id contract *chosen clsize_date load

list if contract <= 10, sepby(contract)

And here are the results from the final list command:

Code:

. list if contract <= 10, sepby(contract)

      +------------------------------------------------------------------+
      | br_id   emp_id   emp_ch~n   contract   chosen   clsize_~e   load |
      |------------------------------------------------------------------|
   1. |     1        3          3          1        1   21mar2001      . |
      |------------------------------------------------------------------|
   2. |     1        3          3          2        1   03apr2001     21 |
      |------------------------------------------------------------------|
   3. |     1        3          6          3        0   11oct2001     32 |
   4. |     1        6          6          3        1   11oct2001      . |
      |------------------------------------------------------------------|
   5. |     1        3          6          4        0   16nov2001     32 |
   6. |     1        6          6          4        1   16nov2001     72 |
      |------------------------------------------------------------------|
   7. |     1        3          5          5        0   17mar2002     32 |
   8. |     1        5          5          5        1   17mar2002      . |
   9. |     1        6          5          5        0   17mar2002     81 |
      |------------------------------------------------------------------|
  10. |     1        3          6          6        0   25sep2002     32 |
  11. |     1        5          6          6        0   25sep2002      . |
  12. |     1        6          6          6        1   25sep2002     81 |
      |------------------------------------------------------------------|
  13. |     1        3          3          7        1   14jan2003     18 |
  14. |     1        5          3          7        0   14jan2003      5 |
  15. |     1        6          3          7        0   14jan2003     79 |
      |------------------------------------------------------------------|
  16. |     1        3          7          8        0   20jan2003     48 |
  17. |     1        5          7          8        0   20jan2003      5 |
  18. |     1        6          7          8        0   20jan2003     79 |
  19. |     1        7          7          8        1   20jan2003      . |
      |------------------------------------------------------------------|
  20. |     1        2          2          9        1   10mar2003     30 |
  21. |     1        3          2          9        0   10mar2003     48 |
  22. |     1        5          2          9        0   10mar2003      5 |
  23. |     1        6          2          9        0   10mar2003    137 |
  24. |     1        7          2          9        0   10mar2003     58 |
      |------------------------------------------------------------------|
  25. |     1        1          1         10        1   12mar2003      . |
  26. |     1        2          1         10        0   12mar2003     30 |
  27. |     1        3          1         10        0   12mar2003     48 |
  28. |     1        5          1         10        0   12mar2003      5 |
  29. |     1        6          1         10        0   12mar2003    137 |
  30. |     1        7          1         10        0   12mar2003     58 |
      +------------------------------------------------------------------+

.

You can manually spot check results by listing the annual contracts that have occurred up to the decision date. Take observation 29 from the example above:

Code:

use "contracts_annual.dta", clear
isid br_id emp_id clsize_date cl_id, sort
list cl_id contract clsize year clsize_date if br_id == 1 & emp_id == 6 & clsize_date <= mdy(3,12,2003)

and the results. The most recent size for each client was highlighted in blue in the original post; it is the last row listed for that client (rows 292 to 295):

Code:

. list cl_id contract clsize year clsize_date if br_id == 1 & emp_id == 6 & clsize_date <= mdy(3,12,2003)

      +----------------------------------------------+
      | cl_id   contract   clsize   year   clsize_~e |
      |----------------------------------------------|
 290. |    11          3       72   2001   12oct2001 |
 291. |     8          4        9   2001   17nov2001 |
 292. |     1          6       15   2002   26sep2002 |
 293. |     8          4       55   2002   31dec2002 |
 294. |    11          3        9   2002   31dec2002 |
      |----------------------------------------------|
 295. |     4         11       58   2003   21jan2003 |
      +----------------------------------------------+

.
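
The same spot check can be carried through to the load itself. A minimal sketch, assuming the six listed rows are typed in by hand (s_date is a hypothetical helper variable) and then passed through the two commands that do1 applies after dropping the decision observations:

```stata
* the six annual contract rows listed above for br_id 1, emp_id 6
clear
input cl_id contract clsize year str9 s_date
11  3 72 2001 "12oct2001"
 8  4  9 2001 "17nov2001"
 1  6 15 2002 "26sep2002"
 8  4 55 2002 "31dec2002"
11  3  9 2002 "31dec2002"
 4 11 58 2003 "21jan2003"
end
gen clsize_date = date(s_date, "DMY")
format %td clsize_date

* keep the most recent size per client, then take the running sum;
* the last observation holds the total load
bysort cl_id (year): keep if _n == _N
gen load = sum(clsize)
display load[_N]
```

This displays 137 (15 + 58 + 55 + 9), which agrees with the load reported for observation 29.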
