  • Survival Curve for Discrete Panel Data

    Hi all,

    I want to plot a standard Kaplan-Meier survival curve. However, my data have a discrete-time panel structure (as the title says), which causes two problems at once.

    There are standard commands such as sts for the continuous-time case, but I have not found anything that works for my purpose.
    Furthermore, I would like to customize the length of the x-axis, which the sts commands do not seem to allow.
    There are some hints on the internet like this:
    https://www.stata.com/statalist/arch.../msg00277.html
    (but with no application for the discrete case)
    or this:
    https://www.stata.com/meeting/spain1...te-spain16.pdf
    (with slides 37 and 38 being very interesting, but with no application for the panel case)

    In detail, I have yearly firm data with a binary variable equal to one if a firm exits the market in the corresponding year (my dependent variable), and I would like to use the cloglog command to estimate survival/hazard rates.

    Is there any help out there? :-)

    Best regards,

    Tim

  • #2
    Prof. Jenkins' (excellent) course material on survival analysis should be what you're looking for (in particular, see lesson 6)

    Code:
    https://www.iser.essex.ac.uk/resources/survival-analysis-with-stata


    • #3
      I am currently working with a panel dataset with around 227 variables and 6233 observations. The data cover 1967-1987, and I need to aggregate all the variables over this time period.


      • #4
        Akhilesh:
        Re-posting your query across different threads is by no means the way to increase your chances of getting helpful replies.
        I replied to your original post here: https://www.statalist.org/forums/for...tional-dataset.
        Probably you need to be more detailed about what you're after. Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          @ Andrea:
          Thanks, I read this publication carefully and tried to apply it to my problem.
          I think the problem is the following: in the cancer.dta example the author generates a variable j indicating the spell month, but j is not equal to the patient's age.
          I do not know if this makes sense in my case. In a drug study it might be interesting to see in which month/year after starting a new drug the patient dies (or not). For me (or any other analysis of firm survival) the start of the spell is the day the company is founded. One seldom has data reaching back to the year of incorporation. So in a nutshell j!=age, and I do not know how to solve this if we do not have information for the whole life cycle of a patient/company.

          Or maybe I mixed things up, as I can't see the wood for the trees after hours of trial and error.
          sts does work, but not with a custom x-axis length... Generating the variable by hand is much more complicated than expected.


          • #6
            "the start of the spell is the day the company is founded. One seldom has data reaching back to the year of incorporation. So in a nutshell j!=age, and I do not know how to solve this if we do not have information for the whole life cycle of a patient/company."
            At the heart of survival analysis is the idea that the chance of an event occurring depends on elapsed duration (time since the unit first became at risk of the event occurring = start date). If you don't know when the start date was, then you don't know elapsed duration. How, then, would you fit a model if you don't have the crucial ingredient? Sometimes you may first observe a unit part way through a spell but you do also know the spell start date. This is the case of left-truncated data (also known as "delayed entry"). Survival analysis with left-truncated data is straightforward. Survival analysis when you don't know the spell start date is really difficult -- near impossible. You could assume that the hazard rate is constant (i.e. doesn't vary with elapsed duration), but that's simply assuming the problem away. For a model that addresses the problem of unknown start dates, see Nickell, Econometrica 1979, with an application to the length of unemployment spells. (Note how his solution involves modelling the chances of entry at particular dates. Very complicated. And I've not seen the method applied anywhere else.)
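
            A minimal sketch of how left-truncated ("delayed entry") firm-year data could be declared in Stata. The variable names id, age and failure are assumptions for illustration (one row per firm-year, age measured in years since founding), and age0 is a hypothetical helper variable; this is one possible setup, not a prescription.

            ```stata
            * Sketch only: assumes one row per firm-year with variables
            *   id (firm), age (years since founding), failure (1 = exit in that year)
            * Each row is taken to cover the interval (age-1, age]
            generate age0 = age - 1

            * time0() marks the start of each row's interval, so a firm first
            * observed at age 5 is not counted in the risk sets for ages 1-4
            stset age, id(id) failure(failure) time0(age0)

            * Kaplan-Meier estimates that take the delayed entry into account
            sts list
            ```

            The point of time0() here is that the elapsed duration (firm age) must be known even though the early years are unobserved; if the founding date were unknown, this setup would not be available.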


            • #7
              Hi Stephen,

              ok, the point you make indeed causes problems, but I think (hope) this is not the case for my analysis.
              Here is a sample form of my data structure:

              Code:
              id    year    age    failure
              1     2010     5     0
              1     2011     6     0
              1     2012     7     0
              1     2013     8     1
              2     2008     1     0
              2     2009     2     0
              2     2010     3     0
              2     2011     4     0
              3     2011     12    0
              3     2012     13    0
              3     2013     14    0
              3     2014     15    1
              Thus I do not observe most companies' full history. There is indeed left and right censoring (not truncation) because I only get a peek into company history. But as far as I know this should not be a problem when using discrete-time modelling.
              I can expand the data so that they reach back to the year of incorporation, since I know observation year and age, and thus at least the binning of the spell.
              This does not give extra information on the covariates X needed for the regression, but I suppose the spell length can be identified to estimate a survival curve.

              Furthermore, I tried to reproduce the example in Andrea's post #2, lesson 6.
              The data at the point where estimation starts in lesson 6, section 6 then look similar to my example (year=j; failure=dead; age should be constant within individuals, but that doesn't change anything). However, when I run the cloglog estimation and predict hazard and survival rates, there is one difference:
              In Prof. Jenkins' example each individual of a particular age has a constant hazard rate, independent of the spell length, which is necessary for Kaplan-Meier curves. When I do the estimation, hazard rates differ for each year. I have no idea why this is the case. I must have missed something simple, though I do not know what.
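
              For reference, a sketch of the standard discrete-time relationships behind such predictions from a cloglog model. The symbols are assumptions for illustration and are not taken from the Lessons: h_j is the hazard in interval j, \gamma_j an interval-specific intercept, x a covariate vector, and S(j | x) the probability of surviving through interval j.

              ```latex
              h_j(x) = 1 - \exp\!\bigl[-\exp(\gamma_j + x'\beta)\bigr],
              \qquad
              S(j \mid x) = \prod_{k=1}^{j} \bigl[1 - h_k(x)\bigr]
              ```

              With a separate \gamma_j for each interval, the predicted hazard changes from one interval to the next by construction; it is the same in every interval only if all \gamma_j are restricted to be equal.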

              Sorry for bothering you, but hopefully this will also help the community.


              • #8
                You appear to have left-truncated data with right censoring. You could estimate the baseline hazard with cloglog failure i.age and then follow the steps in my Lessons to derive predicted discrete hazard rates and thence the survival function. To do this you will need observations at all values of age from 1 upwards (and also have at least one event at each age -- else how can one identify the hazard and survivor functions?). If you do not have such observations then non-parametric methods of estimation cannot work. You'd have to use some parametric model in order to identify the hazard function and then the survivor function.

                NB The cloglog command will fail with your data snippet because of 'perfect prediction'. (NB2 thanks for the sample data, but please use dataex to show them in future.)
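
                The steps described above can be sketched as follows. This is an illustration under assumptions, not the code from the Lessons: it assumes person-period data with the variables id, age and failure from post #7, rows at every age from 1 upwards, and at least one exit at each age. The names h and S are hypothetical.

                ```stata
                * Sketch only: baseline hazard with one dummy per age, no other covariates
                cloglog failure i.age

                * Predicted discrete-time hazard, restricted to the estimation sample
                * (rows dropped because of perfect prediction get a missing value)
                predict h if e(sample), pr

                * One row per age, then cumulate: S(j) = (1-h_1)*(1-h_2)*...*(1-h_j)
                preserve
                collapse (mean) h, by(age)
                sort age
                generate S = 1 - h in 1
                replace S = S[_n-1] * (1 - h) in 2/L
                list age h S, noobs
                restore
                ```

                If some age has no exits, h is missing there and S is missing from that age onwards, which is the identification problem mentioned above showing up in the output.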

                Good luck


                • #9
                  Thanks again for the comment!
                  OK, I will use dataex from now on. You will find a real data example below. It is very long because I observe companies over decades, with over 2 million observations and therefore large ids, but few failures. The old data snippet was artificially created, hence the possible collinearity.
                  I also artificially expanded the dataset so that age (the spell length, labelled j in your Lessons; max_age should correspond to your age variable) reaches back to 1.
                  The variable emp is the number of employees in the corresponding year, just as an illustration of a possible covariate x, which is missing quite often. This is due to left truncation, as I do not observe a company until a certain point in time.

                  I still have not been able to solve the problem. The goal was to get a Kaplan-Meier curve, which is (as far as I understand the concept) nonparametric.
                  The sts command works with such a data structure, so maybe we are veering away from the initial subject. Or are the sts commands inadequate, even though they run, because of the discrete-time structure? In your Lesson 4 you apply sts commands to the cancer dataset. So why can't I, given that my data are structured similarly?
                  I could simply type sts list if... to get the survivor function for a certain subsample and somehow plot it myself. The problem with sts graph is that you cannot expand the x axis, which was my initial point. It only gives (in my case) results for age<=20, and I am interested in long-term survival.

                  However, I was able to estimate baseline hazard and survival rates using cloglog failure i.max_age for my data, but it did not really help to solve my problem. I am quite confused now...
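
                  On the plotting side, one possible route is to save the Kaplan-Meier estimate as a variable and draw it with twoway, which gives full control over the axes. This is a sketch under assumptions: the data are taken to be already stset, km is a hypothetical variable name, and the 40-year horizon is purely illustrative.

                  ```stata
                  * Sketch only: assumes the data have already been -stset-
                  * Save the Kaplan-Meier survivor function as a variable
                  sts generate km = s

                  * Plot it as a step function with a user-chosen x axis
                  twoway line km _t, sort connect(stairstep) ///
                      xscale(range(0 40)) xlabel(0(5)40)     ///
                      ytitle("Survivor function") xtitle("Firm age (years)")
                  ```

                  Widening the axis does not extend the estimate itself: the curve still ends at the largest observed analysis time, so nothing beyond that age can be read from the plot.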
                  ---
                  Could there be a typo at the top of page 9 of Lesson 6, where you write
                  ge h0 = p if age == 55 & drug == 0
                  but mean
                  ge h0 = h if age == 55 & drug == 0
                  ?
                  ---


                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input double id  float(year age max_age failure) double emp
                  2010197963 1988  1 19 0 59
                  2010197963 1989  2 19 0 59
                  2010197963 1990  3 19 0 59
                  2010197963 1991  4 19 0 59
                  2010197963 1992  5 19 0 59
                  2010197963 1993  6 19 0 59
                  2010197963 1994  7 19 0 59
                  2010197963 1995  8 19 0 59
                  2010197963 1996  9 19 0 59
                  2010197963 1997 10 19 0 59
                  2010197963 1998 11 19 0 59
                  2010197963 1999 12 19 0 59
                  2010197963 2000 13 19 0 59
                  2010197963 2001 14 19 0 59
                  2010197963 2002 15 19 0 59
                  2010197963 2003 16 19 0 59
                  2010197963 2004 17 19 0 59
                  2010197963 2005 18 19 0 59
                  2010197963 2006 19 19 1 59
                  2010210763 1989  1 20 0  2
                  2010210763 1990  2 20 0  2
                  2010210763 1991  3 20 0  2
                  2010210763 1992  4 20 0  2
                  2010210763 1993  5 20 0  2
                  2010210763 1994  6 20 0  2
                  2010210763 1995  7 20 0  2
                  2010210763 1996  8 20 0  2
                  2010210763 1997  9 20 0  2
                  2010210763 1998 10 20 0  2
                  2010210763 1999 11 20 0  2
                  2010210763 2000 12 20 0  2
                  2010210763 2001 13 20 0  2
                  2010210763 2002 14 20 0  2
                  2010210763 2003 15 20 0  2
                  2010210763 2004 16 20 0  2
                  2010210763 2005 17 20 0  2
                  2010210763 2006 18 20 0  2
                  2010210763 2007 19 20 0  .
                  2010210763 2008 20 20 1  .
                  2010218411 1989  1 20 0  .
                  2010218411 1990  2 20 0  .
                  2010218411 1991  3 20 0  .
                  2010218411 1992  4 20 0  .
                  2010218411 1993  5 20 0  .
                  2010218411 1994  6 20 0  .
                  2010218411 1995  7 20 0  .
                  2010218411 1996  8 20 0  .
                  2010218411 1997  9 20 0  .
                  2010218411 1998 10 20 0  .
                  2010218411 1999 11 20 0  .
                  2010218411 2000 12 20 0  .
                  2010218411 2001 13 20 0  .
                  2010218411 2002 14 20 0  .
                  2010218411 2003 15 20 0  .
                  2010218411 2004 16 20 0  .
                  2010218411 2005 17 20 0  .
                  2010218411 2006 18 20 0  .
                  2010218411 2007 19 20 0  .
                  2010218411 2008 20 20 1 17
                  2010226153 1989  1 19 0 22
                  2010226153 1990  2 19 0 22
                  2010226153 1991  3 19 0 22
                  2010226153 1992  4 19 0 22
                  2010226153 1993  5 19 0 22
                  2010226153 1994  6 19 0 22
                  2010226153 1995  7 19 0 22
                  2010226153 1996  8 19 0 22
                  2010226153 1997  9 19 0 22
                  2010226153 1998 10 19 0 22
                  2010226153 1999 11 19 0 22
                  2010226153 2000 12 19 0 22
                  2010226153 2001 13 19 0 22
                  2010226153 2002 14 19 0 22
                  2010226153 2003 15 19 0 22
                  2010226153 2004 16 19 0 22
                  2010226153 2005 17 19 0 22
                  2010226153 2006 18 19 0 22
                  2010226153 2007 19 19 1  .
                  2010234040 1989  1 19 0 45
                  2010234040 1990  2 19 0 45
                  2010234040 1991  3 19 0 45
                  2010234040 1992  4 19 0 45
                  2010234040 1993  5 19 0 45
                  2010234040 1994  6 19 0 45
                  2010234040 1995  7 19 0 45
                  2010234040 1996  8 19 0 45
                  2010234040 1997  9 19 0 45
                  2010234040 1998 10 19 0 45
                  2010234040 1999 11 19 0 45
                  2010234040 2000 12 19 0 45
                  2010234040 2001 13 19 0 45
                  2010234040 2002 14 19 0 45
                  2010234040 2003 15 19 0 45
                  2010234040 2004 16 19 0 45
                  2010234040 2005 17 19 0 45
                  2010234040 2006 18 19 0 45
                  2010234040 2007 19 19 1  1
                  2010235165 1993  1 20 0 17
                  2010235165 1994  2 20 0 17
                  2010235165 1995  3 20 0 17
                  end
