Hello Statalisters,
I'm assisting a project on the determinants of divorce, and we would like to run a time-to-event analysis on our longitudinal data in Stata SE (15 is most likely the earliest version that we will use). The -dataex- code in this post creates dummy data that is structurally similar to ours.
The observation window is 1995 to 2010. The variables of interest are the years of marriage and divorce, two variables tagging those events within the observation window, and a single time-varying variable, "wage" (let's assume that we want to estimate the effect of wage on the risk of divorce). There are five ids that cover different patterns of truncation and censoring in this data, and I'd like to go over them briefly.
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int(year year_of_marriage year_of_divorce) byte(marriage_event divorce_event) double wage
1 1995 1980 2004 . .  2827
1 1996 1980 2004 . .  3577
1 1997 1980 2004 . .  3146
1 1998 1980 2004 . .  2573
1 1999 1980 2004 . .  2919
1 2000 1980 2004 . .  2805
1 2001 1980 2004 . .  2505
1 2002 1980 2004 . .  3377
1 2003 1980 2004 . .  3971
1 2004 1980 2004 . 1  2880
1 2005 1980 2004 . .  3620
1 2006 1980 2004 . .  3805
1 2007 1980 2004 . .  3502
1 2008 1980 2004 . .  2931
1 2009 1980 2004 . .  3250
1 2010 1980 2004 . .  3475
2 1995    .    . . .  6771
2 1996    .    . . .  7645
2 1997    .    . . .  5991
2 1998    .    . . .  5524
2 1999    .    . . .  5602
2 2000    .    . . .  7000
2 2001    .    . . .  7007
2 2002    .    . . .  7132
2 2003    .    . . .  6709
2 2004    .    . . .  5133
2 2005    .    . . .  6517
2 2006    .    . . .  6546
2 2007    .    . . .  5799
2 2008    .    . . .  6850
2 2009    .    . . .  5082
2 2010    .    . . .  7558
3 1995 2000 2012 . .  9235
3 1996 2000 2012 . . 11263
3 1997 2000 2012 . .  7593
3 1998 2000 2012 . . 10815
3 1999 2000 2012 . .  8490
3 2000 2000 2012 1 .  7771
3 2001 2000 2012 . .  7960
3 2002 2000 2012 . . 11056
3 2003 2000 2012 . . 11119
3 2004 2000 2012 . .  9010
3 2005 2000 2012 . .  9871
3 2006 2000 2012 . . 10827
3 2007 2000 2012 . . 10356
3 2008 2000 2012 . .  8673
3 2009 2000 2012 . .  9036
3 2010 2000 2012 . . 11140
4 1995 2003 2007 . . 14384
4 1996 2003 2007 . . 14804
4 1997 2003 2007 . . 11376
4 1998 2003 2007 . . 14948
4 1999 2003 2007 . . 10370
4 2000 2003 2007 . . 11220
4 2001 2003 2007 . . 13678
4 2002 2003 2007 . . 14306
4 2003 2003 2007 1 . 10334
4 2004 2003 2007 . . 14982
4 2005 2003 2007 . . 13948
4 2006 2003 2007 . . 11298
4 2007 2003 2007 . 1 11978
4 2008 2003 2007 . . 11398
4 2009 2003 2007 . . 15072
4 2010 2003 2007 . . 11734
5 1995 1966 1990 . . 17162
5 1996 1966 1990 . . 17712
5 1997 1966 1990 . . 17587
5 1998 1966 1990 . . 17567
5 1999 1966 1990 . . 16107
5 2000 1966 1990 . . 17127
5 2001 1966 1990 . . 17750
5 2002 1966 1990 . . 19077
5 2003 1966 1990 . . 14807
5 2004 1966 1990 . . 13915
5 2005 1966 1990 . . 19117
5 2006 1966 1990 . . 16587
5 2007 1966 1990 . . 17312
5 2008 1966 1990 . . 14907
5 2009 1966 1990 . . 19647
5 2010 1966 1990 . . 17237
end
id=1: As I understand it, this is a typical case of left truncation: we know the year this person became at risk (year of marriage) as well as the year of divorce, but a large part of the at-risk period falls before our observation window, so the time-varying variable wage is unobserved for that period.
id=2: This is simply a never-married person. I assume -stset- will correctly tag all observations for this person as irrelevant, since they provide no useful information for an analysis of divorce.
id=3: Both the year of marriage and the year of divorce are known, but the last two years before the divorce fall after the end of the observation window, so we have no wage data for them. If it matters, I don't think this is a typical case of right censoring, where the time of the failure event is not known, but the effect of wage can still be properly estimated from the available data.
id=4: I guess this is the least problematic case and needs no special treatment or consideration, because all of the required data are available and fall within the observation window.
id=5: Both events, marriage and divorce, happen before the observation window, and I suspect this id also provides no useful information for estimating the effect of wage.
(Obviously there are more possible patterns of censoring and truncation, but I think it is enough to focus on the ones above.)
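To make the five patterns concrete, here is a minimal sketch that labels each id by where its marriage and divorce years fall relative to the 1995 to 2010 window. It uses only one row per id with the marriage and divorce years from the dummy data. The variable name "pattern", the locals w0 and w1, and the label wording are my own illustrative choices, not standard terminology.

```stata
* One row per id, with the marriage and divorce years from the dummy data.
clear
input byte id int(year_of_marriage year_of_divorce)
1 1980 2004
2    .    .
3 2000 2012
4 2003 2007
5 1966 1990
end

local w0 = 1995    // first year of the observation window
local w1 = 2010    // last year of the observation window

* Start from the "everything observed" case, then overwrite the others.
generate str40 pattern = "marriage and divorce inside window"
replace pattern = "never married" if missing(year_of_marriage)
replace pattern = "marriage and divorce before window" if year_of_divorce < `w0'
replace pattern = "married before window, divorce inside" ///
    if year_of_marriage < `w0' & inrange(year_of_divorce, `w0', `w1')
replace pattern = "married inside window, divorce after" ///
    if inrange(year_of_marriage, `w0', `w1') & year_of_divorce > `w1' ///
    & !missing(year_of_divorce)

list, noobs
```

The explicit !missing() check matters because Stata treats missing values as larger than any number, so a missing divorce year would otherwise satisfy year_of_divorce > 2010.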
So, given this data, its various truncation and censoring patterns, and my goal of estimating the effect of wage, is this the correct way to use the -stset- command?
stset year, id(id) failure(divorce_event==1) origin(time year_of_marriage) enter(year==1995) exit(time year_of_divorce)
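For anyone who wants to see what this call produces, here is a minimal sketch of how the result could be inspected. To keep it short it rebuilds only the id/year structure of the dummy data (wage is omitted, which is my simplification), then lists the system variables that -stset- creates. I make no claim here about what the output should look like.

```stata
* Rebuild the id/year structure of the dummy data (wage omitted).
clear
input byte id int(year_of_marriage year_of_divorce)
1 1980 2004
2    .    .
3 2000 2012
4 2003 2007
5 1966 1990
end
expand 16                                   // 16 yearly records per id
bysort id: generate int year = 1994 + _n    // 1995, ..., 2010
generate byte divorce_event = 1 if year == year_of_divorce

* The -stset- call exactly as written in the post.
stset year, id(id) failure(divorce_event==1) origin(time year_of_marriage) ///
    enter(year==1995) exit(time year_of_divorce)

* _st flags the records kept for analysis, _t0 and _t bound each record's
* interval in analysis time, and _d flags the failure event.
sort id year
list id year _st _t0 _t _d, sepby(id) noobs

* Per-subject summary of entry times, exit times, gaps, and failures.
stdescribe
```

Checking _st, _t0, and _t record by record for ids 1, 2, 3, and 5 is one way to see how each truncation or censoring pattern was handled.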
Can I simply run -stcox wage- after the above setup, knowing that left truncation and the other censoring/truncation phenomena are accounted for in the estimation? Or are there additional steps that need to be taken before the estimation?
The answer of German Rodriguez in an older thread on the topic (which addresses single-record data), the -stset- output, the Stata help files, and some additional literature that I've read on the topic suggest that I'm on the right path, but I would still like to be sure I'm not missing anything crucial. Any comments and suggestions would be very welcome.
Thanks!