Spell data - values between two spells, imputation

Inna Petrunyk

Join Date: Oct 2014

Posts: 31
#1

Spell data - values between two spells, imputation

16 Mar 2019, 12:05

Dear community,

I have spell data on employment with the corresponding firm id if employed.
I have calculated the tenure duration over time (tenure_cum) and the total tenure duration in a firm (tenure_max) according to the rule I imposed, which is: consider spell of unemployment (with no firm id) as employment spell if the unemployment duration between 2 employment spells in the same firm ist < 6 months.

I now have the dataset of this form:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long persnr int begepi long endepi float(spell duration employed same_firm tenure_cum tenure_max) 1 10768 10769 32 2 1 0 2 43 1 10856 10868 33 13 0 . . . 1 10869 10924 34 56 0 . . . 1 10932 10957 35 26 1 1 28 43 1 10958 10972 36 15 1 1 43 43 end format %d begepi format %d endepi

And would like to obtain this:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long persnr int begepi long endepi float(spell duration employed same_firm tenure_cum tenure_max) 1 10768 10769 32 2 1 0 2 43 1 10856 10868 33 13 0 1 . 43 1 10869 10924 34 56 0 1 . 43 1 10932 10957 35 26 1 1 28 43 1 10958 10972 36 15 1 1 43 43 end format %d begepi format %d endepi

This refers to each firm id, keeping in mind that for the same firm id I can have different tenures for the same person over time.
Do you have any suggestions on this?

Thanks in advance.

Last edited by Inna Petrunyk; 16 Mar 2019, 12:21.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30179

16 Mar 2019, 13:33

I'm not completely sure I understand, but if I do, what you want is:

Code:

by persnr (begepi), sort: gen spell2 = sum(employed != employed[_n-1])
by persnr spell2 (begepi), sort: gen duration2 = endepi[_N] - begepi[1]

by persnr (begepi), sort: replace tenure_max = tenure_max[_n-1] ///
    if missing(tenure_max) & duration2 <= 183 // 6 MONTHS = 183 DAYS

Comment

Inna Petrunyk

Join Date: Oct 2014
Posts: 31

17 Mar 2019, 07:45

Sorry, let me start from the original dataset that I have. I think the auxiliary variables I created may have confused you.

This is an example of the dataset in which I have spells with labor market status (status), only for employment spells I have also the corresponding establishment identifier (firmid).

I want to calculate tenure over time, i.e, for each persnr and firmid over time I want to create a variable with information on total employment duration in a firmid, allowing interruptions < 6 months.
It is here to consider that a person may be employed in the same firmid at differnt points of time with employment relationship in other firms in between. This implies that for a person having been employed in a certain firm I can have different tenures, depending at what time point I observe the person. So the variable tenure will be constant for person-firmid if the interruption of employment in that firm is short, and then there may another value for tenure for the same person-firmid several years later, for example.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(persnr firmid) int begepi long endepi byte status float(spell duration)
337837 50249512  5479  5843  4 20  365
337837 50249512  5844  6209  4 21  366
337837      503  6210  6574  4 22  365
337837      503  6575  6939  4 23  365
337837      503  6940  7304  4 24  365
337837 50249512  7305  7452  4 25  148
337837        .  7453  7670  7 26  218
337837 50249512  7881  8035  4 27  155
337837 50249512  8036  8127  4 28   92
337837        .  8128  8400  7 29  273
337837        .  8401  8486 10 30   86
337837 50249512  8487  8765  4 31  279
337837      503  8766  9131  4 32  366
337837      503  9132  9496  4 33  365
337837 50249512  9497  9861  4 34  365
337837 50249512  9862 10047  4 35  186
337837        . 10133 10147 19 36   15
337837 50249512 10227 10297  4 37   71
337837        . 10303 10419 19 38  117
337837        . 10420 10422 19 39    3
337837        . 10423 10471 19 40   49
337837 50249512 10472 10513  4 41   42
337837 50249512 10514 10552  4 42   39
337837        . 10553 10592 10 43   40
337837        . 10593 10762 10 44  170
337837        . 10839 10957 10 45  119
337837        . 10958 11314 10 46  357
337837        . 11315 11319 10 47    5
345476 50250994  5479  5843  4  9  365
345476 50250994  5844  6008  4 10  165
345476        .  6009  7302  1 11 1294
345476        .  7303  8763  1 12 1461
345476 50509895  9315  9496  4 13  182
345476 50509895  9497  9695  4 14  199
345476        .  9696  9703 20 15    8
345476        .  9704  9852 20 16  149
345476        .  9853  9861 20 17    9
345476 50509895  9862 10176  4 18  315
437068 50484498  5479  5843  4  4  365
437068 50484498  5844  6209  4  5  366
437068 50380719  6210  6574  4  6  365
437068 50380719  6575  6939  4  7  365
437068 50484498  6940  7304  4  8  365
437068 50484498  7305  7357  4  9   53
437068 50303829  7668  7670  4 10    3
437068      503  7671  8035  4 11  365
437068      503  8036  8083  4 12   48
437068 50487075  8084  8200  4 13  117
437068 50268647  8401  8588  4 14  188
437068 50487075  8589  8678  4 15   90
437068 50469889  8681  8696  4 16   16
437068 50433174  8749  8750  4 17    2
437068        .  8760  8765 10 18    6
437068        .  8766  8993 10 19  228
437068 50298965  8994  9025  4 20   32
437068 50298965  9132  9496  4 21  365
437068 50298965  9497  9861  4 22  365
437068 50488955  9912 10091  4 23  180
437068        . 10110 10113 10 24  104
437068        . 10114 10124 10 25   13
437068        . 10128 10134 10 26   30
437068 50488955 10154 10330  4 27   74
437068        . 10436 10492 10 28   57
437068        . 10493 10592 10 29  100
437068        . 10593 10702 10 30  110
437068 50488955 10703 10721  4 31   19
437068 50380719 10768 10769  4 32    2
437068        . 10856 10868 10 33   13
437068        . 10869 10924 10 34   56
437068 50380719 10932 10957  4 35   26
437068 50380719 10958 10972  4 36   15
437068 50247116 11220 11322  4 37  103
437068 50247116 11323 11687  4 38  365
437068 50247116 11688 11763  4 39   76
437068 50452297 12139 12293  4 40  155
437068        . 12327 12391  7 41   65
437068 50452297 12618 12671  4 42   54
437068        . 12672 12783  7 43  112
437068        . 12991 13112  7 44  122
end
format %d begepi
format %d endepi
label values firmid betnr_de
label values status status
label def status 1 "1 school", modify
label def status 4 "4 employed", modify
label def status 7 "7 unemployed", modify
label def status 10 "7 unemployed", modify
label def status 19 "19 ill", modify
label def status 20 "20 maternity", modify

With the following code I calculated tenure durations and finally date of notice period (you're fired).

Code:

sort persnr firmid spell

keep if firmid < .

gen        tenure = duration
replace    tenure = tenure + tenure[_n-1] if (persnr == persnr[_n-1] & ///
                     firmid == firmid[_n-1] & begepi < endepi[_n-1] + 182)

gen tenure_d = 0
replace tenure_d = 1 if (persnr == persnr[_n-1] & firmid == firmid[_n-1] & ///
                          begepi < endepi[_n-1] + 182)


gen tenure_fine = tenure if (tenure_d[_n+1] == 0)
gsort persnr firmid -begepi
replace tenure_fine = tenure_fine[_n-1] if (persnr == persnr[_n-1]) & ///
                       (tenure_d[_n-1] == 1)

sort persnr spell
save data_small, replace


use data, clear

sort persnr spell
merge 1:1 persnr firmid begepi using data_small
drop _merge
sort persnr spell

save data_final, replace


gen tenure_y = tenure_fine/365.25
gen tenure_years = round(tenure_y, .001)
drop tenure_y

gen        notice_period_years = .
replace notice_period_years = 0.039 if (tenure_years <= 0.5)
replace notice_period_years = 0.077 if (tenure_years >= 0.583    &    tenure_years < 2)
replace notice_period_years = 0.083 if (tenure_years >= 2        &    tenure_years < 5)
replace notice_period_years = 0.167 if (tenure_years >= 5        &    tenure_years < 8)
replace notice_period_years = 0.250 if (tenure_years >= 8        &    tenure_years < 10)
replace notice_period_years = 0.333 if (tenure_years >= 10        &    tenure_years < 12)
replace notice_period_years = 0.417 if (tenure_years >= 12        &    tenure_years < 15)
replace notice_period_years = 0.500 if (tenure_years >= 15        &    tenure_years < 20)
replace notice_period_years = 0.583 if (tenure_years >= 20        &    tenure_years <  .)

gen notice_period_days = notice_period_years*365.25


gen notice_period_date= endepi - notice_period_days + 1 if (tenure == tenure_fine)
format notice_period_date %td
gsort persnr firmid -begepi
replace notice_period_date = notice_period_date[_n-1] if (persnr == persnr[_n-1]) & ///
                       (tenure_d[_n-1] == 1)

sort persnr spell

save data_notice, replace

Now I want to paste this notice date (or tenure, the code will be the same), which is a constant within an employment relationship (including periods of interruptions < 6 months), to all the observations with missings (e.g. due to short-term illness) that refer to that employment relationship.

Last edited by Inna Petrunyk; 17 Mar 2019, 07:56.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30179
#4

17 Mar 2019, 12:44

I'm sorry, but I don't think I understand. If I am understanding you correctly, there aren't any such short interruptions in your example data. If there are, can you point them out? If not, can you provide an example where there are some, and say where they are in the example?

As an aside, looking at the code you've used, you can save yourself a lot of keystrokes (and probably a lot of errors and some time) by making effective use of Stata's -by:- prefix. For example:

Code:

replace tenure = tenure + tenure[_n-1] if (persnr == persnr[_n-1] & /// firmid == firmid[_n-1] & begepi < endepi[_n-1] + 182)

can be much more efficiently coded as:

Code:

by persnr firmid, sort: replace tenure = tenure + tenure[_n-1] if begepi < endepi[_n-1] + 182

If you are not familiar with -by:- you are missing out on one of Stata's best and most important features. -help by-
Comment

Inna Petrunyk

Join Date: Oct 2014
Posts: 31

17 Mar 2019, 13:09

Thanks. You're right, I can also use the prefix by.

Here is an example with employment spell interruptions due to illness (status=19):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(persnr firmid) int begepi long endepi byte status float(spell duration tenure tenure_d tenure_fine)
337837 50249512  9497  9861  4 34 365 365 0 703
337837 50249512  9862 10047  4 35 186 551 1 703
337837        . 10133 10147 19 36  15   . .   .
337837 50249512 10227 10297  4 37  71 622 1 703
337837        . 10303 10419 19 38 117   . .   .
337837        . 10420 10422 19 39   3   . .   .
337837        . 10423 10471 19 40  49   . .   .
337837 50249512 10472 10513  4 41  42 664 1 703
337837 50249512 10514 10552  4 42  39 703 1 703
end
format %d begepi
format %d endepi
label values firmid betnr_de
label values status status

Then, if I want to include these short interruptions between employment spells in the same firm in the calculation of tenure duration, I suppose I should change the corresponding part of the code from above as follows:

Code:

* gen tenure = duration
  by persnr firmid, sort : replace tenure = tenure + tenure[_n-1] + begepi-endepi[_n-1]-1 ///
                                     if (begepi < endepi[_n-1] + 182))

Last edited by Inna Petrunyk; 17 Mar 2019, 13:47.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30179
#6

17 Mar 2019, 17:55

I don't think your code in the last box will work, because in the observations you want to extend the tenure into, tenure starts out as missing, so it will remain missing. Also, if I understood you correctly, you do not want to increase the tenure by the amount of the (brief) absence--you want to keep it where it was previously, am I right?

If I have that right, then this code should do it:

Code:

gen long last_employer = firmid, after(firmid) by persnr (begepi), sort: replace last_employer = last_employer[_n-1] /// if missing(last_employer) gen low = begepi - 182 gen high = begepi - 1 format low high %td rangestat (max) last_tenure = tenure, by(persnr last_employer) interval(endepi low high) replace tenure = last_tenure if missing(tenure) drop low high last_employer

-rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer, and is available from SSC.
Comment

Inna Petrunyk

Join Date: Oct 2014
Posts: 31

18 Mar 2019, 03:35

Your're right, the box works only with non-missing values in tenure, which is the case if I first keep if betnr < . and then merge with the previously saved dta.
Anyway, your code is perfect if brief interruptions should not be included in the tenure. In fact, I looked for a solution for both cases.
Thanks a lot for your help, also for showing how rangestat works.

Could you, please, explain me why I get different results when using this code:

Code:

gen        tenure = duration
replace    tenure = tenure + tenure[_n-1] if (persnr == persnr[_n-1] &  ///
                     firmid == firmid[_n-1] &  begepi < endepi[_n-1] + 182)

and the one you suggested:

Code:

gen        tenure = duration
by persnr firmid, sort: replace  tenure = tenure + tenure[_n-1] if ///
                        begepi < endepi[_n-1] + 182

starting from this dataset with no missings:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(persnr firmid) int begepi long endepi byte status float(spell duration)
337837      503  6210  6574 4 22 365
337837      503  6575  6939 4 23 365
337837      503  6940  7304 4 24 365
337837      503  8766  9131 4 32 366
337837      503  9132  9496 4 33 365
337837 50249512  5479  5843 4 20 365
337837 50249512  5844  6209 4 21 366
337837 50249512  7305  7452 4 25 148
337837 50249512  7881  8035 4 27 155
337837 50249512  8036  8127 4 28  92
337837 50249512  8487  8765 4 31 279
337837 50249512  9497  9861 4 34 365
337837 50249512  9862 10047 4 35 186
337837 50249512 10227 10297 4 37  71
337837 50249512 10472 10513 4 41  42
337837 50249512 10514 10552 4 42  39
345476 50250994  5479  5843 4  9 365
345476 50250994  5844  6008 4 10 165
345476 50509895  9315  9496 4 13 182
345476 50509895  9497  9695 4 14 199
345476 50509895  9862 10176 4 18 315
437068      503  7671  8035 4 11 365
437068      503  8036  8083 4 12  48
437068 50247116 11220 11322 4 37 103
437068 50247116 11323 11687 4 38 365
437068 50247116 11688 11763 4 39  76
437068 50268647  8401  8588 4 14 188
437068 50298965  8994  9025 4 20  32
437068 50298965  9132  9496 4 21 365
437068 50298965  9497  9861 4 22 365
437068 50303829  7668  7670 4 10   3
437068 50380719  6210  6574 4  6 365
437068 50380719  6575  6939 4  7 365
437068 50380719 10768 10769 4 32   2
437068 50380719 10932 10957 4 35  26
437068 50380719 10958 10972 4 36  15
437068 50433174  8749  8750 4 17   2
437068 50452297 12139 12293 4 40 155
437068 50452297 12618 12671 4 42  54
437068 50469889  8681  8696 4 16  16
437068 50484498  5479  5843 4  4 365
437068 50484498  5844  6209 4  5 366
437068 50484498  6940  7304 4  8 365
437068 50484498  7305  7357 4  9  53
437068 50487075  8084  8200 4 13 117
437068 50487075  8589  8678 4 15  90
437068 50488955  9912 10091 4 23 180
437068 50488955 10154 10330 4 27  74
437068 50488955 10703 10721 4 31  19
end
format %d begepi
format %d endepi
label values firmid betnr_de
label values status status
label def status 4 "4 employed", modify

Last edited by Inna Petrunyk; 18 Mar 2019, 04:31.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30179
#8

18 Mar 2019, 08:58

There is a difference between them, and both of them are actually wrong in two ways! Sorry about what I wrote in #4--too quick and I didn't test it!

The difference between them arises because the version with -by- sorts the data before it runs, and because persnr and firmid do not uniquely identify observations, the sort order is indeterminate, and following Stata's general rule, it randomizes the order of observations within combinations of persnr and firmid. So the order of the data is different when these codes run, which implies that varname[_n-1] has different values. In order for either of these codes to be correct, it is important that the sort order also account for the begepi variable.

The other problem is that the reference to varname[_n-1] will always be missing value when _n == 1. Not only does that get things wrong in the first observation, but the code will then continue to propagate missing values into subsequent observations until the -if- condition finally fails due to a long episode.

So the correct ways to do this are:

Code:

sort persnr firmid begepi gen tenure = duration replace tenure = tenure + tenure[_n-1] if (persnr == persnr[_n-1] & /// firmid == firmid[_n-1] & begepi < endepi[_n-1] + 182 & _n > 1) gen tenure2 = duration by persnr firmid (begepi), sort: replace tenure2 = tenure2 + tenure2[_n-1] if /// begepi < endepi[_n-1] + 182 & _n > 1 assert tenure == tenure2
Comment
Inna Petrunyk

Join Date: Oct 2014

Posts: 31
#9

18 Mar 2019, 11:32

Ok, now I see. I actually have done sort by persnr firmid spell as in #3, that's why I got correct results.
I will add the option _n > 1 to my code. Thanks a lot.
Comment

Announcement