Hi all,
I am confused by the variables -stsplit- generates and I am not sure I checked the transformation properly. Let me ask for some independent review here.
I have spell data for individuals and am analyzing their retirement decisions. In one setup, the spells start at birth and end with "failure" at retirement (or without failure when my sample runs out at a specific point in calendar time). As I "only" observe all jobs after 1985, I also specify enter() in -stset-. The exit() time is the same as the time for the censored observations, but I set it just in case. Time is measured at monthly resolution, but I want to work in years.
All this said, I think this is the proper -stset- line:
Code:
stset retmonth, origin(time birthmonth) enter(time ym(1985,1)) exit(time ym(2011,12)) id(PNR) failure(retired) scale(12)
This results in lines like this one:
Code:
     +---------------------------------------------------------------------------------------------------------------+
     | PNR   year   retired   lastmo~h   lyear   cohort     age   birthm~h   retmonth   _st   _d      _t         _t0 |
     |---------------------------------------------------------------------------------------------------------------|
  1. |   9   2011         0         12    2011     1951   60.75     1951m3    2011m12     1    0   60.75   33.833333 |
     +---------------------------------------------------------------------------------------------------------------+
Now I want to analyze the covariates' effect on retirement and, as is common, most covariates are available in annual panels in calendar time. Shouldn't the following lines let me merge in (1:1 PNR year) data for 2000-2007?
Code:
stsplit y, after(time = ym(1999,12)) at(0(1)8) trim
replace year = y + 1999
But the generated y looks different:
Code:
. l
     +--------------------------------------------------------------------------------------------------------------------+
     | PNR   year   retired   lastmo~h   lyear   cohort   Alder   birthm~h   retmonth   _st   _d      _t         _t0    y |
     |--------------------------------------------------------------------------------------------------------------------|
  1. |   9   2011         .         12    2011     1951   60.75     1951m3    1999m12     0    0   48.75   33.833333   -1 |
  2. |   9   2011         .         12    2011     1951   60.75     1951m3    2000m12     1    0   49.75       48.75    0 |
  3. |   9   2011         .         12    2011     1951   60.75     1951m3    2001m12     1    0   50.75       49.75    1 |
  4. |   9   2011         .         12    2011     1951   60.75     1951m3    2002m12     1    0   51.75       50.75    2 |
  5. |   9   2011         .         12    2011     1951   60.75     1951m3    2003m12     1    0   52.75       51.75    3 |
     |--------------------------------------------------------------------------------------------------------------------|
  6. |   9   2011         .         12    2011     1951   60.75     1951m3    2004m12     1    0   53.75       52.75    4 |
  7. |   9   2011         .         12    2011     1951   60.75     1951m3    2005m12     1    0   54.75       53.75    5 |
  8. |   9   2011         .         12    2011     1951   60.75     1951m3    2006m12     1    0   55.75       54.75    6 |
  9. |   9   2011         .         12    2011     1951   60.75     1951m3    2007m12     1    0   56.75       55.75    7 |
 10. |   9   2011         0         12    2011     1951   60.75     1951m3    2011m12     0    0   60.75       56.75    8 |
     +--------------------------------------------------------------------------------------------------------------------+
E.g., the failure variable is not filled in, even though -stsplit- could understand from the original data that the new spells end without failure. More importantly, I am confused by the scaling of the generated `y`. From the age information, shouldn't I infer that `year = y + 2000` instead? Why is that, and why did I get a y = -1 record for 1999 when I was splitting after December 1999, at 0? In any case, do I read it correctly that spell -1 starts in 1985 (the original enter time), and that the last record, y = 8, starts on January 1, 2008, 12 a.m. and ends at exit or failure?
Constructively, would replacing `retired` with 0 when missing and using `year = y + 2000` give the correct data to then use with time-varying covariates? (OK, it is also relevant how my covariates are measured. They are measured at end-of-calendar-year, so maybe they should predict only the next spell. That said, some flow covariates describe what happened over the same year, which affects the hazard in the same year, not the following one. And I will merge in leads and lags to capture the timing of effects anyway.)
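For anyone who wants to check the numbers in the listings by hand, here is a minimal sketch of the date arithmetic implied by origin(time birthmonth) and scale(12). It assumes only what the listed record shows (birthmonth is 1951m3 for PNR 9) and that ym() returns months since 1960m1. It does not call -stsplit- itself, so it says nothing about how y is coded before the first cutpoint.

```stata
* Sanity check of the analysis times for the listed record.
* Assumption: birthmonth = ym(1951,3), as shown for PNR 9.
local birth = ym(1951,3)

display (ym(1985,1)  - `birth') / 12   // entry age: 406 months / 12, compare with _t0
display (ym(1999,12) - `birth') / 12   // age at the after() date: 585 months / 12 = 48.75
display (ym(2011,12) - `birth') / 12   // age at exit: 729 months / 12 = 60.75

* Calendar month and age at each cutpoint in at(0(1)8),
* counted in years from ym(1999,12)
forvalues k = 0/8 {
    local m   = ym(1999,12) + 12*`k'
    local age = (`m' - `birth') / 12
    display "cutpoint " `k' ": " %tm `m' "  age " %5.2f `age'
}
```

If these line up with _t0 and _t in the listing, then the record with y = k runs from 1999m12 + 12k months to 1999m12 + 12(k+1) months, so the y = 0 record ends in 2000m12. Whether that record should be matched to calendar year 1999 or 2000 in the merge is exactly the open question above.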