Summing up durations over a 5-year-window

Sebastian Geiger

Join Date: Oct 2015

Posts: 124
#1

Summing up durations over a 5-year-window

26 Apr 2016, 04:10

Hello,

I'm using episode-based data with daily information (see screenshot below, or download sample file here: http://doku.iab.de/fdz/siab/SIAB7510...aten_Stata.zip) . Broadly speaking, the database contains employment and unemployment spells for many different individuals over a period of about 35 years.

Now, I would like to generate a variable that contains the days of unemployment over a 5-year-window before the corresponding observation. I already created a variable which shows the unemployment duration of a specific unemployment spell ("unempl_dauer" in the screenshot) and a variable which sums up the duration of all unemployment spells up to this point ("tage_unempl" in the screenshot). Unfortunately, I'm struggling to generate a variable that is like "unempl_dauer" but ignores the duration of unemployment spells older than 5 years.

I would be very grateful for any suggestions how I can solve my problem.

Best regards,
Sebastian

1 Photo
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35657
#2

26 Apr 2016, 04:39

Please do read http://www.statalist.org/forums/help#stata and act upon it to give us a data example file using dataex (SSC) . It's asking a lot to expect people to open up a .zip file and to flip back and forth between a screenshot photo and your post. I can't easily read the screenshot any way.
Comment

Sebastian Geiger

Join Date: Oct 2015
Posts: 124

27 Apr 2016, 10:27

Thank you very much for your quick answer. You're right, the output of dataex is much better than the screenshot:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(persnr betnr) int(begepi endepi) float(unempl_dauer tage_unempl)
90593000       .n 14077 14172  96  96
90593000       .n 14173 14847   0  96
90593024       .n 15212 15255   0   0
90593024       .n 15300 15333  34  34
90593030 50284864 14757 14930   0   0
90593030 50139714 14947 14975   0   0
90593030 50139714 14976 15070   0   0
90593030 50139714 16118 16172   0   0
90593030       .n 16191 16436 246 246
90593030 50017841 16437 16539   0 246
90593030       .n 16437 16539   0 246
90593030 50017841 16540 16801   0 246
90593030 50017841 16802 17166   0 246
90593030 50017841 17167 17439   0 246
90593030 50017841 17440 17531   0 246
90593030 50017841 17532 17897   0 246
90593030 50017841 17898 18262   0 246
90593030 50017841 18263 18370   0 246
90593030 50435853 18371 18420   0 246
90593030 50017841 18421 18442   0 246
90593030 50017841 18443 18572   0 246
end
format %d begepi
format %d endepi
label values betnr miss_de
label def miss_de .n ".n  fehlt syst.", modify

begepi = the start of the spell (in days)
endepi = the end of the spell (in days)
persnr = an id for each individuals
betnr = an id for each firm; if it is not an employment spell the value is missing (.n)
unempl_dauer = the duration of the unemployment spell (in days), it is 0 if it is not an unemployment spell (but an employment spell)
tage_unempl = the sum of the days in unemployment before the start of the current observation

tage_unempl is almost what I want. but I'd like to generate an additional variable which sums up only the unemployment duration within a five-year window before the observation and not for the entire career of each individual.

Thanks again.

Best regards,
Sebastian

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#4

27 Apr 2016, 12:15

Thanks for posting with -dataex-. This is a one-liner if you use -rangestat-, a new command by Robert Picard, Roberto Ferrer, and Nick Cox. You can get it with -ssc install rangestat-. Then all you need is:

Code:

local five_years_ago = int(-5*365.25) rangestat (sum) unempl_dauer, interval(begepi `five_years_ago' 0) by(persnr)

I interpreted "five year window before the observation" to mean starting from the begepi date and working back five years, including the current episode. See the help for rangestat for details of how the -interval()- option works, and you can modify this if you the reference date should be the endepi, or if you want to exclude that observation itself, etc.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

27 Apr 2016, 12:40

Clyde, I'm not sure this correctly manages the time windows. More than likely, there will be some observations where the 5 year window starts after the beginning of the employment spell and before the end of it. You can't target this directly with rangestat without filling in the data with daily observations. Here's an alternative that uses rangejoin, rangestat's little cousin. To install it, type in Stata's Command window:

Code:

ssc install rangejoin

I tweaked the example to split the employment spell with 246 days. I also added an observation that require pro-rating the days.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(persnr betnr) int(begepi endepi) float(unempl_dauer tage_unempl)
90593000       .n 14077 14172  96  96
90593000       .n 14173 14847   0  96
90593024       .n 15212 15255   0   0
90593024       .n 15300 15333  34  34
90593030 50284864 14757 14930   0   0
90593030 50139714 14947 14975   0   0
90593030 50139714 14976 15070   0   0
90593030 50139714 16118 16172   0   0
90593030       .n 16191 16400 210 246
90593030       .n 16401 16436  36 246
90593030 50017841 16437 16539   0 246
90593030       .n 16437 16539   0 246
90593030 50017841 16540 16801   0 246
90593030 50017841 16802 17166   0 246
90593030 50017841 17167 17439   0 246
90593030 50017841 17440 17531   0 246
90593030 50017841 17532 17897   0 246
90593030 50017841 17898 18200   0 246
90593030 50017841 18201 18370   0 246
90593030 50017841 18263 18370   0 246
90593030 50435853 18371 18420   0 246
90593030 50017841 18421 18442   0 246
90593030 50017841 18443 18572   0 246
end
format %d begepi
format %d endepi
label values betnr miss_de
label def miss_de .n ".n  fehlt syst.", modify

* you should an observation identifier
gen obsid = _n

* save observations with spells of unemployments
preserve
keep obsid persnr endepi unempl_dauer
keep if unempl_dauer != 0
save "unempl_dauer.dta", replace
restore

* pair each observations with employment spells in 5 year window
gen low = begepi - 365 * 5
gen high = begepi - 1
format %td low high
rangejoin endepi low high using "unempl_dauer.dta", by(persnr)

* the number of days between endepi_U and the target window
gen days = endepi_U - low

* the number of days to count
gen days2use = cond(days > unempl_dauer_U, unempl_dauer_U, days)

* add up employment spells within the window and reduce to the
* original observations
bysort obsid: egen unemp5yr = total(days2use)
by obsid: keep if _n == 1

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#6

27 Apr 2016, 13:00

I think it's unclear exactly what Sebastian wants. Since he described his problem as involving summing up values of unempl_dauer, I took his goal to be simply restricting that sum to those episodes that began within the five year window. (Even that is somewhat ambiguous, because he doesn't say whether he wants to include the current observation, but that is a refinement detail.) Robert Picard's approach is more fine-grained and counts actual days of unemployment in the five year period, including days from episodes that began prior to 5 years ago, but extend into it. I think for many purposes Robert Picard's approach produces a more useful statistic, but I can also think of situations where the beginning-of-episode based total would be desired.

Well, ultimately, it's up to Sebastian which he needs.
Comment
Sebastian Geiger

Join Date: Oct 2015

Posts: 124
#7

28 Apr 2016, 05:19

Sorry for being ambiguous in my question. However, both of your replies were very helpful nevertheless. Robert's code gives me exactly what I want. Thank you both!
Comment
Sebastian Geiger

Join Date: Oct 2015

Posts: 124
#8

08 Jun 2016, 03:55

Unfortunately, I have to re-open this thread. After using the code on my test file, everything was fine. However, I tried to run my analysis on the real data now, and since this dataset is large (and by that I mean very large - about 60 million observations) my operating system does not provide enough memory for the calculation. I cannot modify the available memory because I have to run the analysis on a computer/server of the German Federal Employment Agency.

My observation period ranges from 2000 to 2014. I already dropped every individual that has no observations in this period. I also excluded spells ending before the 1st January of 1995 from the "unempl_dauer.dta" file because earlier information will not be needed for the 5-year window. The -rangejoin- command, though, still needs to much memory.

I cannot exclude all older observations from the main file at this stage because information from, for example, 1980 will be necessary to extrapolate/impute values for some variables later.

Maybe someone here knows an approach which solves the problem.

Any answer will be appreciated. Thanks in advance!

Best regards,
Sebastian

Last edited by Sebastian Geiger; 08 Jun 2016, 04:10.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#9

08 Jun 2016, 08:51

Assuming you have enouigh memory to hold the final result and that the problem arises due to need for temporary memory during the calculations, what about breaking up the file? For some appropriate value of N, put the first N people into one file, the second N people into the next, etc. Then process each file separately, and at the end, append all the results together.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#10

08 Jun 2016, 09:33

I concur with Clyde, the rangejoin approach can be done in parts.

I had a second look at the problem and I think that I came up with a solution that works with rangestat. The approach isolates dates from unemployment spells. The c(minfloat) value is used for observations with employment spells. This way, these observations are never picked-up by rangestat as they are never within the 5 year window. The unemployments spells are summed based on the end date of the spell. The earliest unemployment spell start date is also collected. If an unemployment spell starts before the start of the 5 year window, then the days to remove can be calculated using the offset:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(persnr betnr) int(begepi endepi) float(unempl_dauer tage_unempl)
90593000       .n 14077 14172  96  96
90593000       .n 14173 14847   0  96
90593024       .n 15212 15255   0   0
90593024       .n 15300 15333  34  34
90593030 50284864 14757 14930   0   0
90593030 50139714 14947 14975   0   0
90593030 50139714 14976 15070   0   0
90593030 50139714 16118 16172   0   0
90593030       .n 16191 16400 210 246
90593030       .n 16401 16436  36 246
90593030 50017841 16437 16539   0 246
90593030       .n 16437 16539   0 246
90593030 50017841 16540 16801   0 246
90593030 50017841 16802 17166   0 246
90593030 50017841 17167 17439   0 246
90593030 50017841 17440 17531   0 246
90593030 50017841 17532 17897   0 246
90593030 50017841 17898 18200   0 246
90593030 50017841 18201 18370   0 246
90593030 50017841 18263 18370   0 246
90593030 50435853 18371 18420   0 246
90593030 50017841 18421 18442   0 246
90593030 50017841 18443 18572   0 246
end
format %d begepi
format %d endepi
label values betnr miss_de
label def miss_de .n ".n  fehlt syst.", modify

* the start and end of unemployment spells
gen ustart = cond(unempl_dauer > 0, begepi, c(minfloat))
gen uend = cond(unempl_dauer > 0, endepi, c(minfloat))
format %td ustart uend

* define a 5 year window for each observation
gen low = begepi - 365 * 5
gen high = begepi - 1
format %td low high

* find all unemployment spells that end within the 5 year window;
* sum the unemployed days and pick-up the start of the earliest unemployed spell
rangestat (sum) unemp5yr = unempl_dauer (min) ustart1 = ustart, interval(uend low high) by(persnr)
format %td ustart1

* the offset in days between the 5 year window start and the earliest unemployment spell start
gen offset = low - ustart1
replace unemp5yr = unemp5yr - offset - 1 if offset > 0 & !mi(offset)
replace unemp5yr = 0 if mi(unemp5yr)

Comment

Sebastian Geiger

Join Date: Oct 2015

Posts: 124
#11

09 Jun 2016, 13:41

Clyde and Robert, thank you both for your answers. I implemented Robert's approach with my test data first. I ran both approaches and compared both variables. They are almost identical. There is, however, a small difference for some observations. To be exact, the -rangejoin- approach yields one day less for some observations. As far as I can see, the difference occurs when the first (i.e. oldest) spell that counts to the 5-year window starts just one calendar day after the spell for which the value is calculated. To give you an example: When I calculate the 5-year window for a spell starting on the 15th of October 2010, the -rangestat- approach includes the duration of a spell starting on the 16th of October 2005 completely (as I would say it is correct). On the other side, the approach using -rangejoin- appears to cut one day off from those spells. Even though, -rangestat- is apparently doing the right thing, I was just wondering why -rangejoin- leads to different results?!

If the approach using -rangestat- also works with the original data will be determined in a few hours when the server of the Federal Employment Agency is done .
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#12

09 Jun 2016, 16:15

Picky, picky, picky...

When I crafted #10, the results replicated the ones in #5 so I stopped there. The one day difference you report has to do with how to adjust the results if the earliest unemployment spell starts before the start of the 5 year window. I think that both #5 and #10 could be off by one day and need tweaking. Here's a short example that incorporates the case you report in #11 and a few others from the original data example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(persnr) int(begepi endepi) float(unempl_dauer)
1        16725 16820  96
1        18645 18650   0
90593000 16725 16820  96
90593000 18550 10000   0
90593030 16191 16400 210
90593030 18201 18370   0
end
format %d begepi
format %d endepi

* you should an observation identifier
gen obsid = _n

* save observations with spells of unemployments
preserve
keep obsid persnr endepi unempl_dauer
keep if unempl_dauer != 0
save "unempl_dauer.dta", replace
restore

* pair each observations with employment spells in 5 year window
gen low = begepi - 365 * 5
gen high = begepi - 1
format %td low high
rangejoin endepi low high using "unempl_dauer.dta", by(persnr)

* the number of days between endepi_U and the target window
gen days = endepi_U - low + 1

* the number of days to count
gen days2use = cond(days > unempl_dauer_U, unempl_dauer_U, days)

* add up employment spells within the window and reduce to the
* original observations
bysort obsid: egen unemp5yr1 = total(days2use)
by obsid: keep if _n == 1

list persnr begepi endepi unempl_dauer low high days endepi_U unempl_dauer_U days2use unemp5yr


* repeat, this time using -rangestat-
keep persnr begepi endepi unempl_dauer unemp5yr1

* the start and end of unemployment spells
gen ustart = cond(unempl_dauer > 0, begepi, c(minfloat))
gen uend = cond(unempl_dauer > 0, endepi, c(minfloat))
format %td ustart uend

* define a 5 year window for each observation
gen low = begepi - 365 * 5
gen high = begepi - 1
format %td low high

* find all unemployment spells that end within the 5 year window;
* sum the unemployed days and pick-up the start of the earliest unemployed spell
rangestat (sum) unemp5yr = unempl_dauer (min) ustart1 = ustart, interval(uend low high) by(persnr)
format %td ustart1

* the offset in days between the 5 year window start and the earliest unemployment spell start
gen offset = low - ustart1
replace unemp5yr = unemp5yr - offset if offset > 0 & !mi(offset)
replace unemp5yr = 0 if mi(unemp5yr)

list persnr-unemp5yr, noobs

Last edited by Robert Picard; 09 Jun 2016, 16:22.

Comment

Sebastian Geiger

Join Date: Oct 2015
Posts: 124

#13

21 Jun 2016, 10:43

Sorry for my late response. I implemented your suggestion and it appears to work very well. Thank you, Robert.

Unfortunately, I'm not done yet. I work with two different types of datasets. One is spell-based and one is a panel with one observation per person per month. The latter dataset is generated from the spell-based dataset by keeping only one spell for each person at every reference date. Since the variables of the panel should refer to the reference date and not to the start of a spell, I need to create the 5-year-windows for the panel separately.

I took Robert's approach as an example and wrote the code for the panel. Maybe someone could check whether I did it correctly or if I missed something. Any remarks are appreciated.

Here is the code (I use a loop to calculate the 5-year-windows for employment, unemployment, and temporary agency employment):

Code:

[...]

scalar cdate = date("`c'-`y'", "MDY")

[...]

foreach type in empl unempl temp {

    * Define for what kind of spells the calculation is done
    global type `type'                                                            // Calculation must be done for the panel seperately because the 5-year-window shifts to the reference date and does not refer to the start of the spell anymore
    if "${type}"=="empl"    local ifcond "quelle ==1 & trainee!=1"                // Define if conditions to identify "live" spells
    if "${type}"=="unempl"    local ifcond "unempl==1"
    if "${type}"=="temps"    local ifcond "temps==1"
    
    * Drop variable from the spell-based dataset
    drop ${type}5yr
    
    * The start and end of the spells (but only for the kind we run the calcuation here, i.e. ${type})
    sort persnr spell
    gen start = cond(${type}_dauer > 0, begepi, c(minfloat))
    gen end = cond(${type}_dauer > 0, endepi, c(minfloat))
    format %td start end

    * Define a 5-year window for each observation (note: here it refers to the reference date of the cross-section = cdate)
    gen low = cdate - 365.25 * 5 + 1                                // Procedure counts day of reference date to the window (see also correction for the right end of the window), but excludes the reference point 5-years ago                
    gen high = cdate - 1
    format %td low high

    * Sum the days within the 5-year window & pick up the start of the earliest spell that is of ${type} kind
    rangestat (sum) ${type}5yr = ${type}_dauer (min) start1 = start, interval(end low high) by(persnr)
    format %td start1

    * Add time of current/live spell until the reference date (right end of the window)
    replace ${type}5yr = ${type}5yr + ${type}_dauer - (endepi - cdate)    if `ifcond'    & begepi<=cdate & endepi>=cdate            // Current value + duration of current spell - days that the current spell exceeds the reference date; only if live spell (see if condition)
    
    * Offset in days between the 5-year window start and the earliest ${type} spell start (left end of the window)
    gen offset = low - start1
    replace ${type}5yr = ${type}5yr - offset if offset>0 & !missing(offset)
    replace ${type}5yr = 0 if missing(${type}5yr)
    
    * Replace all spell for which the calculations makes no sense (i.e. those with no intersection with the reference date) with missings
    replace ${type}5yr = . if begepi>cdate | endepi<cdate                        // Spells that start after the reference date (begepi>cdate) or end before the reference date (endepi<cdate) are dropped later anyway
    
    * Drop aux. variables
    drop start end low high start1 offset
    }

Later, as already mentioned, only those spells with an intersection with the reference date

Code:

[...]
keep if begepi <= cdate & endepi >= cdate
[...]

Comment

Sebastian Geiger

Join Date: Oct 2015
Posts: 124

#14

21 Jun 2016, 16:43

I already recognized one mistake. I should set the missings of empl5yr to zero directly after the -rangestat- command. Otherwise the correction procedure on the right end of the window (i.e. adding up the time of the current spell until the reference date) would not work, since it is not possible to add anything to a missing value.

The new code would be:

Code:

    foreach type in empl unempl temp {
    
    * Define for what kind of spells the calculation is done
    global type `type'                                                            // Calculation must be done for the panel seperately because the 5-year-window shifts to the reference date and does not refer to the start of the spell anymore
    if "${type}"=="empl"    local ifcond "quelle ==1 & trainee!=1"                // Define if conditions to identify "live" spells
    if "${type}"=="unempl"    local ifcond "unempl==1"
    if "${type}"=="temps"    local ifcond "temps==1"
    
    * Drop variable from the spell-based dataset
    drop ${type}5yr
    
    * The start and end of the spells (but only for the kind we run the calcuation here, i.e. ${type})
    sort persnr spell
    gen start = cond(${type}_dauer > 0, begepi, c(minfloat))
    gen end = cond(${type}_dauer > 0, endepi, c(minfloat))
    format %td start end

    * Define a 5-year window for each observation (note: here it refers to the reference date of the cross-section = cdate)
    gen low = cdate - 365.25 * 5 + 1                                // Procedure counts day of reference date to the window (see also correction for the right end of the window), but excludes the reference point 5-years ago                
    gen high = cdate - 1
    format %td low high

    * Sum the days within the 5-year window & pick up the start of the earliest spell that is of ${type} kind
    rangestat (sum) ${type}5yr = ${type}_dauer (min) start1 = start, interval(end low high) by(persnr)
    format %td start1

    * Set to zero if no values have been picked up by rangestat (yet)
    replace ${type}5yr = 0 if missing(${type}5yr)
    
    * Add time of current/live spell until the reference date (right end of the window)
    replace ${type}5yr = ${type}5yr + ${type}_dauer - (endepi - cdate)    if `ifcond'    & begepi<=cdate & endepi>=cdate            // Current value + duration of current spell - days that the current spell exceeds the reference date; only if live spell (see if condition)
    
    * Offset in days between the 5-year window start and the earliest ${type} spell start (left end of the window)
    gen offset = low - start1
    replace ${type}5yr = ${type}5yr - offset if offset>0 & !missing(offset)
    
    * Replace all spell for which the calculations makes no sense (i.e. those with no intersection with the reference date) with missings
    replace ${type}5yr = . if begepi>cdate | endepi<cdate                        // Spells that start after the reference date (begepi>cdate) or end before the reference date (endepi<cdate) are dropped later anyway
    
    * Drop aux. variables
    drop start end low high start1 offset
    }

Announcement