  • Discrete Time Survival Analysis - Choice of time scale

    Hi,

    I'm performing a discrete time survival analysis (DTSA) on longitudinal panel data with data collected every 2 years from 1994/95 through 2010/2011. My primary interest is in examining arthritis as a risk factor for developing heart disease. I created a person-period dataset and performed all of my initial analyses with a non-parametric specification of calendar time as my time-scale. I therefore created 8 indicator variables (d1-d8) for each 2-year wave of data collection. I then extended my model to include the exposure of interest (x) and covariates (z) which include age.

    The model looks something like this

    logit Y d1-d8 x z, nocons or

    Upon further consideration, given that my primary research question relates to predicting chronic disease onset in a longitudinal population-based survey, I think attained age is a better choice of time scale, for two reasons: defining the risk sets in terms of age is more relevant in the present context than calendar time, and age is an important confounder for heart disease onset that needs to be carefully controlled for. This is summarized in the excerpt below from Thiebaut & Benichou, Stat Med, 2004 (link for reference: http://www.ncbi.nlm.nih.gov/pubmed/?...mulation+study)

    "In most epidemiologic cohort studies, subjects are followed up prospectively for the occurrence of a given disease. Upon analysing such data, the effect of age needs to be tightly controlled because the incidence of most diseases, especially chronic diseases, is strongly determined by age. The natural time-scale W is then (attained) age. Using time-on-study as the time-scale would generally not be relevant, especially when the inclusion into the cohort coincides with an interview, which is not supposed to modify one’s risk. Indeed in epidemiologic cohort studies, contrary to clinical studies, the time when a subject comes under observation usually does not coincide with the time when the subject becomes at risk for the disease of interest."

    Question 1

    My first question is: what would be a preferred method for specifying age as the time scale in DTSA? The age range in my sample is 18-105 years, so specifying a dummy indicator for each year of age is both cumbersome and problematic because an event does not occur in each year of age. So far I have specified attained age as the time scale in 2 ways:

    1. Indicators for attained age categories (18-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75+). Note: the collapsed 18-44 category was chosen because events are sparse in this age range. This grouping is useful for comparison with other studies.
    2. Polynomial specification of age centered at the mean age at baseline, i.e. mean age at baseline is 45 years, so I created two centered age variables:
    - cage = age-45
    - cage2 = cage^2

    How do I go about choosing between the two specifications? I was reviewing Stephen Jenkins' Lesson 6 - Estimation: (ii) discrete time models (logistic and cloglog), and while there are examples of various specifications of time, I did not see how to select the most appropriate one. (Apologies if I just missed this in the lecture notes.) I cannot use a likelihood ratio test because these are non-nested models. Can I choose based on AIC/BIC criteria? Note that in practice the estimated effect of my exposure is much the same whichever age specification I use, but I feel that I should have a clear decision-making process as these analyses are part of my doctoral thesis.
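
    A minimal sketch of the AIC/BIC comparison asked about here, assuming both models are fitted to exactly the same person-period records. The variable agegrp (a categorical coding of the eight attained-age bands) and the stored-estimate names are hypothetical; Y, x, z, cage and cage2 are as described above.

    ```stata
    * agegrp: hypothetical categorical variable coding the eight attained-age bands
    logit Y ibn.agegrp x z, nocons or
    estimates store m_groups

    * cage, cage2: centered age and its square, as defined above
    logit Y cage cage2 x z, or
    estimates store m_poly

    * Log likelihood, df, AIC and BIC for the two non-nested fits, side by side
    estimates stats m_groups m_poly
    ```

    By default the BIC reported here uses the number of person-period records as the sample size, so it is worth stating that choice when reporting the comparison.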

    Question 2

    My second question also relates to specifying the time scale in discrete time survival analysis (DTSA). If I go with age as the time scale, I still want to account for potential calendar period effects over the follow-up period from 1994/95-2010/11. The publication I cite below discusses choice of time scale in longitudinal surveys using continuous time survival analysis with the Cox proportional hazards model, but I think the discussion is relevant to DTSA. Korn et al. AJE 1997 (link for reference: http://www.ncbi.nlm.nih.gov/pubmed/8982025). Here is the relevant excerpt:

    The recommended continuous time proportional hazards model which controls for period effects as well as age and cohort effects is given as:
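
    The equation itself did not come through in the post. Judging only from the symbol definitions that follow, it is presumably the proportional hazards model with age as the time scale and a separate baseline hazard for each birth-cohort interval, which can be written as:

    ```latex
    h(a \mid b_0, z) = h_{0j}(a)\,\exp(\beta' z), \qquad b_0 \in B_j
    ```

    Here h_{0j}(a) denotes the baseline hazard at age a for birth-cohort interval B_j; that notation is an assumption and may differ from the article's.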



    where
    A = a is the age of the individual during the follow-up period
    b0 is the birth cohort of the individual, with Bj the birth cohort intervals, e.g., 1906-1910, 1911-1915, etc.
    B'z is the linear predictor, with B a vector of regression parameters for the covariates z

    If my age time scale was the polynomial specification above (cage, cage2), how would the DTSA model statement with a logit link look in Stata?

    Suppose I create 2 birth cohorts
    BC190610
    BC191115

    Would I create interactions with cage & cage2?

    cageBC190610=cage*BC190610
    cageBC191115=cage*BC191115
    cage2BC190610=cage2*BC190610
    cage2BC191115=cage2*BC191115

    Would I then specify the 4 interaction terms, 2 birth cohort indicators along with exposure and other covariates as follows:

    logit Y cageBC190610 cageBC191115 cage2BC190610 cage2BC191115 BC190610 BC191115 x z, or
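
    One way to write the same idea without building the product terms by hand is Stata's factor-variable notation. This is a sketch under assumptions, not a recommended final model: bcohort is a hypothetical categorical variable indexing the birth-cohort intervals, and the specification lets the intercept and the quadratic age curve differ by cohort.

    ```stata
    * bcohort: hypothetical categorical variable indexing birth-cohort intervals
    * Expands to cohort indicators, cage, cage squared, and their cohort interactions
    logit Y i.bcohort##c.cage##c.cage x z, or
    ```

    Unlike the hand-built command, this keeps the main effects of cage and its square, so the reference cohort also has an age curve. Calendar-period indicators (for example i.time) could be added in the same way, subject to the usual identification problems when age, period and cohort all enter one model.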

    Thanks in advance

    Orit








  • #2
    Sorry, the equation did not post. Here is a picture of the equation that should be included above.
    Last edited by Orit Schieir; 30 Jun 2015, 10:52.



    • #3
      Judging from the graphs in the Korn and Graubard article you linked to, their examples had age at event to the nearest year.

      Your proposed age intervals look too broad to me. At worst, you know an event took place in the two-year interval between interviews, which will be at one of three ages a person had during the two years (unless all interviews were on birthdays).

      My major question is: can you do better than a two-year interval? Do the data have an approximate date of "onset", however you are defining it? I'm thinking of year and approximate month or even year and season. If the answer is "yes", then you can have more exact ages and use the Korn-Graubard approach.

      I note that "Onset" of chronic disease is not easy to define: Diagnosis is possible before clinical symptoms appear, so better-screened people will have earlier diagnosis. On the other hand, for some, the first "symptom" might be death. So, I'm curious: how do you intend to define onset?


      Last edited by Steve Samuels; 30 Jun 2015, 20:31.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2



      • #4
        Hi Steve,

        Thank you for your reply and general interest.

        The panel dataset I use provides updated values for age (age in years at each interview date) as well as date of interview (dd/mm/yyyy) for each survey cycle that is at least partially completed by respondents. I also have information on cause of death (COD) with fairly complete data on year of death.

        As you note, "disease onset" is difficult to define in panel data like mine, so I have to make a simplifying assumption. I take as the disease onset date the cycle where the survey respondent first reports heart disease as present, or the cycle where there is a death with cause of death reported as ischemic heart disease or heart failure. I could also perform a sensitivity analysis where I consider the cycle just before the reported cycle as the disease onset date.

        My actual syntax is housed on a secure server with the dataset but just to give you an idea of how I construct the variables I provide code from memory below.

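        * Variables in dataset
        * HD_     = time-dependent self-reported doctor-diagnosed heart disease
        * CODHD_  = time-dependent cause of death due to ischemic heart disease or heart failure
        * time    = survey cycle (1 through 8)
        * age_    = time-dependent age in years
        * baseage = age in years at baseline

        * First cycle with reported heart disease or heart disease death (missing if never)
        bysort id (time): egen firstHD = min(cond(HD_==1 | CODHD_==1, time, .))
        * Last cycle with a non-missing heart disease response
        bysort id (time): egen timeobs = max(cond(HD_!=., time, .))

        * Follow-up ends at the event cycle if there is one, otherwise at the last observed cycle
        gen studytime = timeobs
        replace studytime = firstHD if firstHD!=.

        * Drop records after the end of follow-up
        drop if time > studytime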

        If I didn't have missing values for time-dependent age over the survey, I could use age instead of time in the code above, but I do have missing age due to skipped visits. I then fill in missing values for age at skipped visits as follows:

        * Create time-dependent years since baseline
        * (cycles are 2 years apart; assumes cycle 1 is the baseline interview)
        gen PYtime = (time - 1)*2

        *create imputed time dependent continuous age in years

        gen ageE_ = age_
        replace ageE_ = baseage+PYtime if age_==.


        I have used ageE_ centered at the mean age of the baseline visit as the time scale and have also performed the analysis with 8 separate indicators for grouped age (18-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75+) based on ageE_. The latter is mostly to compare hazard rates with age groups in other studies. My understanding, based on your response above, is that you would suggest using the centered continuous time-dependent age variable, as Korn & Graubard do, rather than the indicator variables. Is that correct?

        Any insight on the period/cohort effect issue I raise above?

        Thanks again for your response






        • #5
          I'm sorry to say that I don't see a way in Stata to use age as the primary time scale. In terms of age, you have "interval censored" data, which means that for each person you know only that the event took place between the ages at two consecutive interviews. If you have birth date, you could refine this to "biological age". I believe that the latest version of SAS has a command to do Cox regression with such data; Stata doesn't. In Stata, my best suggestion is to use follow-up time as the primary time scale in a cloglog model, and to add age at interview as a flexible polynomial with the fp prefix.
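
          A rough sketch of what that suggestion might look like, using the variable names from post #4 (Y, x and z are the outcome, exposure and covariates from the first post). It assumes time indexes the survey cycle in the person-period file and that ageE_ is strictly positive, which fractional polynomials need by default.

          ```stata
          * Follow-up time (survey cycle) as the discrete time scale, complementary log-log link
          cloglog Y i.time x z, eform

          * Same model with age at interview entered as a fractional polynomial;
          * <ageE_> is the fp prefix's placeholder for the term to be transformed
          fp <ageE_>: cloglog Y <ageE_> i.time x z
          ```

          With the cloglog link the exponentiated coefficients can be read as hazard ratios for the underlying continuous-time proportional hazards model.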
          Steve Samuels
          Statistical Consulting
          [email protected]

          Stata 14.2



          • #6
            I retract my previous post: you can analyze interval-censored survival data in Stata. Patrick Royston's stpm command can fit proportional hazards or proportional odds models, and Jamie Griffin's intcens command can fit a variety of parametric distributions. Both are available from SSC.

            References:

            Royston, Patrick. 2001. Flexible parametric alternatives to the Cox model, and more. Stata Journal 1, no. 1: 1-28.

            available at: http://www.stata-journal.com/...iclenum=st0001



            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2



            • #7
              Thanks for the advice. I understand I don't know the exact age of onset because the data are interval censored but I wonder if it's still a good idea, at least as a sensitivity analysis, to compare results using age as the time scale either with the assumption that age at interview where the condition is first reported is the age of onset or alternatively a more realistic assumption, assuming that age of onset is the midway age between age at interview where the condition is first reported (age(t)) and the age at the interview just preceding it (age(t-1)).
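
              A sketch of how the two assumptions could be set up with age as the time scale, starting from the person-period file built in post #4 (after records beyond studytime have been dropped). The names age_prev, age_entry, last, event and age_exit are hypothetical, and the sketch treats x and z as fixed over follow-up, which is an assumption.

              ```stata
              * Sketch only: age_prev, age_entry, last, event and age_exit are hypothetical names
              bysort id (time): gen double age_prev  = ageE_[_n-1]   // age at the preceding record
              bysort id (time): gen double age_entry = ageE_[1]      // age at the first record
              bysort id (time): gen byte   last      = (_n == _N)    // final record per person
              gen byte event = (firstHD < .)                         // 1 if onset was ever recorded

              * Exit age: age at the final interview, replaced by the interval midpoint for events
              gen double age_exit = ageE_
              replace age_exit = (age_prev + ageE_)/2 if event & age_prev < .

              * Collapse to one record per person and declare age as the time scale
              keep if last
              stset age_exit, failure(event) enter(time age_entry)
              stcox x z
              ```

              Skipping the replace line gives the first assumption (onset at the age of the interview where the condition is first reported), so the two versions can be run back to back and the exposure estimates compared.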



              • #8
                If you have date of birth, you should use biological age (in months or days); with age-last birthday, too many tied intervals can bother the interval-censoring algorithms.

                You can certainly try "mid-point" assignment, but know that it is probably biased if risk of heart disease increases with age: more people will tend to have an event earlier in an interval. If you have time-varying covariates, then you must also estimate their values at the "mid-points".

                I like the idea of a sensitivity analysis which selects models that objectively seem to fit the data. Then assess how sensitive the arthritis-heart disease association is to model choice.

                I suspect that you may want to start an age analysis at an age much older than 18; the data are not likely to be informative for risks at younger ages.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2
