Help on some fundamentals of survival analysis

Man Yang

Join Date: Mar 2016

Posts: 183
#1


28 Apr 2017, 16:43

Hi all, I am currently working with a dataset that contains information on each subject's age, whether they went to college (binary), and many other variables. I will need to use survival analysis in the near future, but I am stuck at the very first step of declaring the dataset. Below is a snapshot of what the data look like.

Code:

id   censor   college   time   BirthYear   age
10        0         0      .        1990    17
15        0         0      .        1991    16
21        0         0      .        1990    17
34        0         0      .        1990    16
36        1         1      3        1990    16
39        0         0      .        1990    17
41        0         0      .        1990    17
42        0         0      .        1991    17
43        0         0      .        1991    16
44        0         0      .        1991    17
50        0         0      .        1991    16
59        0         0      .        1991    16
70        1         1      5        1991    16

Basically, I want to examine when participants went to college. I created the censor variable as an exact copy of the binary college variable because I assumed they are the same. I also created the time variable, which gives the number of years before each participant went to college (if they did); it is the difference between the age variable and the age at which they went to college. I have read some documentation about this model but am really confused about what censoring means in my case. So, first of all, how can I declare this dataset as st data (Stata gives me an error that says "data is not st")?

Thanks!

Last edited by Man Yang; 28 Apr 2017, 16:45.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

29 Apr 2017, 02:41

Man:
as far as I can understand your query, right censoring in your case refers to those students who are at risk of going to college but do not do so within the time span of your analysis.
Put differently, what -stset- classifies as -failure- is the event "gone to college".
I would recommend that you take a thorough look at the -stset- (and, more generally, -st-) entries in the Stata .pdf manual, as well as the valuable http://www.stata.com/bookstore/survi...-introduction/
As a next step, you should decide whether your survival analysis will be nonparametric, semiparametric or parametric.
As an aside, and as the FAQ reminds us, please post what you typed and what Stata gave you back. Thanks.

Kind regards,
Carlo
(Stata 19.0)
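
To make the censoring idea in #2 concrete, here is a minimal sketch of how the snapshot in #1 might be declared with -stset-. It uses a few rows from that snapshot. The variable `followup` and the 9-year observation window for subjects who never enrolled are assumptions made for illustration only; the real censoring time has to come from the study design.

```stata
* A few rows from the snapshot in #1 (time is missing when college == 0)
clear
input id college time
10 0 .
15 0 .
36 1 3
59 0 .
70 1 5
end

* Assumption (hypothetical): subjects who never enrolled were observed for 9 years.
* Censored subjects still need a time: the length of time they were followed.
generate double followup = cond(college == 1, time, 9)

* failure(college) treats college == 1 as the event; college == 0 is right-censored
stset followup, failure(college) id(id)

* _t is the analysis time and _d the failure indicator created by -stset-
list id followup _t0 _t _d _st
```

Once -stset- has run without error, commands such as -stsum-, -sts graph- and -stcox- no longer complain that the data are not st. A separate censor variable is not needed, because the failure variable already distinguishes events from censored observations.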
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

29 Apr 2017, 03:12

Carlo gave excellent advice. Just as a side note, it seems one of your difficulties relates to formatting the dataset. You may wish to see how survival datasets are laid out by opening the example datasets available 'by chapter' as well as 'by command' from the Stata manual.

Best regards,

Marcos
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#4

29 Apr 2017, 06:38

In addition to the wise words from Carlo and Marcos, also consider which approach to survival analysis you should be using -- whether treating your survival time data as continuous (which leads to Stata's st suite) or as discrete (for which the suite is not required, but which is easy to implement using other Stata commands). See http://www.iser.essex.ac.uk/survival-analysis for some free downloadable resources.
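
As a rough illustration of the discrete-time route mentioned in #4, the sketch below expands one-row-per-subject data into one row per subject per year at risk and fits a binary-outcome model to the yearly event indicator. The 9-year window for non-enrollees, the variable names `followup`, `period` and `event`, and the linear term for duration are all assumptions chosen to keep the example short; they are not taken from the thread.

```stata
* A few rows from the snapshot in #1
clear
input id college time
10 0 .
15 0 .
36 1 3
59 0 .
70 1 5
end

* Assumption (hypothetical): non-enrollees were observed for 9 yearly intervals
generate followup = cond(college == 1, time, 9)

* One row per subject per year at risk
expand followup
bysort id: generate period = _n

* The event indicator is 1 only in the final year of subjects who enrolled
bysort id: generate byte event = (college == 1 & _n == _N)

* Discrete-time hazard model; duration enters linearly here purely for brevity
cloglog event period
```

With real data the duration term would normally be more flexible (for example period dummies), and covariates are added to the same -cloglog- or -logit- call.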

Man Yang

Join Date: Mar 2016

Posts: 183
#5

01 May 2017, 12:21

Hello, thank you all for the suggestions! I chose to set the data as st, and I was able to get the model running as well as the survival function graph. However, I feel the graph I get is somehow incorrect. Could you please help me check where the problem is? First of all, I stset the data using the following command:

Code:

stset cderyear, failure(POSTEDU) time0(BirthYear) id(UCIn) exit(time .)

in which cderyear is the year when the survey was conducted, POSTEDU is the outcome variable, BirthYear is the year when the participant was born, and UCIn is the participant ID.

The data come from a health survey that ran from 2006 to 2015; each year, subject information such as IQ, age, and disability was recorded. My outcome is whether they go to college or not (which is what Carlo pointed out as the failure). I ran univariate analyses on all the categorical and continuous predictors and finalized the model as follows:

Code:

stcox age devlvl i.MR i.AUT i.ELL ib5.ethnicrace i.center c.age#c.devlvl c.age#i.AUT, nohr

And the model result is:

Code:

. stcox age devlvl i.MR i.AUT i.ELL ib5.ethnicrace i.center c.age#c.devlvl c.age#i.AUT, nohr

         failure _d:  POSTEDU
   analysis time _t:  cderyear
  exit on or before:  time .
                 id:  UCIn

Iteration 0:   log likelihood = -5330.0512
Iteration 1:   log likelihood = -4909.5545
Iteration 2:   log likelihood = -4784.0392
Iteration 3:   log likelihood = -4771.1325
Iteration 4:   log likelihood = -4770.7201
Iteration 5:   log likelihood = -4770.7194
Refining estimates:
Iteration 0:   log likelihood = -4770.7194

Cox regression -- Breslow method for ties

No. of subjects =        55380                     Number of obs   =     55380
No. of failures =          544
Time at risk    =       946297
                                                   LR chi2(32)     =   1118.66
Log likelihood  =   -4770.7194                     Prob > chi2     =    0.0000

-----------------------------------------------------------------------------------
               _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
              age |   .9485752   .1857542     5.11   0.000     .5845036    1.312647
           devlvl |   .2165812   .0438219     4.94   0.000     .1306918    .3024705
             1.MR |  -.7305106   .1049244    -6.96   0.000    -.9361586   -.5248626
            1.AUT |  -1.608428   .6651524    -2.42   0.016    -2.912103   -.3047537
            1.ELL |   .0022744   .1291604     0.02   0.986    -.2508754    .2554242
                  |
       ethnicrace |
African American  |   .3814536   .1451871     2.63   0.009      .096892    .6660151
  Asian American  |   .1479484   .1688891     0.88   0.381    -.1830682     .478965
        Latino/a  |   .0166968   .1273573     0.13   0.896    -.2329189    .2663124
           Other  |   .2885472   .2391377     1.21   0.228    -.1801541    .7572484
         Unknown  |   .0780732   .2098118     0.37   0.710    -.3331503    .4892967
                  |
           center |
               2  |   .0339904   .3303481     0.10   0.918    -.6134799    .6814607
               3  |   .1803126   .3272488     0.55   0.582    -.4610833    .8217085
               4  |   .6606672   .2994035     2.21   0.027     .0738471    1.247487
               5  |  -.1198088   .4243618    -0.28   0.778    -.9515426    .7119249
               6  |   .5665492     .31731     1.79   0.074    -.0553671    1.188465
               7  |   .5393665   .2599586     2.07   0.038     .0298569    1.048876
               8  |  -.3248821   .2740199    -1.19   0.236    -.8619512    .2121869
               9  |  -.1618291   .4020385    -0.40   0.687    -.9498101    .6261519
              10  |   .5574623    .321429     1.73   0.083    -.0725271    1.187452
              11  |   .7748065   .2294648     3.38   0.001     .3250638    1.224549
              12  |  -.2092461   .2866152    -0.73   0.465    -.7710015    .3525093
              13  |   1.289953    .225901     5.71   0.000     .8471956    1.732711
              14  |   .4058454   .4537355     0.89   0.371    -.4834598    1.295151
              15  |   .6648066    .256777     2.59   0.010     .1615329     1.16808
              16  |   .3420969   .2899587     1.18   0.238    -.2262116    .9104055
              17  |   .5705365   .2442936     2.34   0.020     .0917298    1.049343
              18  |   1.169656   .2552728     4.58   0.000     .6693303    1.669981
              19  |   .4604306    .293384     1.57   0.117    -.1145914    1.035453
              20  |   .1228869   .3380036     0.36   0.716    -.5395881    .7853619
              21  |   .4943846   .3373477     1.47   0.143    -.1668048    1.155574
                  |
   c.age#c.devlvl |  -.0072298   .0022599    -3.20   0.001    -.0116592   -.0028004
                  |
        AUT#c.age |
               1  |   .0781749   .0340879     2.29   0.022     .0113638     .144986
-----------------------------------------------------------------------------------

Then I followed the example on this website: http://stats.idre.ucla.edu/stata/sem...tata-survival/ to define a case, and I was able to get the survival graph, but the graph looks odd to me. Does this graph say that when subjects are aged 16, all of them can go to college? I was expecting a curve similar to the one in the above link. Can anyone tell me where I could be wrong? Thanks a lot!!

[Attached image: Graph.png]


Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

01 May 2017, 12:34

You didn't show the command you used to produce the graph.

That said, and considering the explanation is already in the link you shared, it seems you have a survival function for an individual aged 16. Again, the interpretation of this graph will depend on the command.

I strongly believe the recommendations posted in #2, #3 and #4 will be rewarding to you.

Best regards,

Marcos

Man Yang

Join Date: Mar 2016

Posts: 183
#7

01 May 2017, 12:35

To give you an idea of what the data look like, below are the first 10 observations. Another question: do I have to have a censor variable in the data? What is the difference between discrete-time and continuous-time survival analysis? What should I consider when deciding between them?

Code:

. list UCIn cderyear age MR AUT ethnicrace ELL center devlvl BirthYear in 1/10, nodisplay

     +------------------------------------------------------------------------------------------+
     |   UCIn   cderyear   age   MR   AUT         ethnicrace   ELL   center   devlvl   BirthY~r |
     |------------------------------------------------------------------------------------------|
  1. | 315891       2006    16    1     1              White     0       21       63       1990 |
  2. |   7575       2006    16    1     0   African American     0        4       30       1990 |
  3. | 108872       2006    16    1     0           Latino/a     0        1       72       1990 |
  4. | 179904       2006    16    0     1              White     0       13       90       1990 |
  5. | 243264       2006    16    1     0           Latino/a     0        8       76       1990 |
     |------------------------------------------------------------------------------------------|
  6. | 154608       2006    16    1     0              White     0        2       47       1990 |
  7. | 255895       2006    16    1     0           Latino/a     1       16       60       1990 |
  8. | 345717       2006    16    1     0     Asian American     0       20       64       1990 |
  9. | 154789       2006    16    1     0           Latino/a     1        2       67       1990 |
 10. | 300219       2006    16    1     0              White     1       15       55       1990 |
     +------------------------------------------------------------------------------------------+


Man Yang

Join Date: Mar 2016

Posts: 183
#8

01 May 2017, 12:37

Hi Marcos, below is the command I used for the graph:

Code:

*Case1: AUT=1, age=25, ethnicrace=2(Asian), devlvl=74, center=7(HRC)
stcox age devlvl i.MR i.AUT i.ELL ib5.ethnicrace i.center c.age#c.devlvl c.age#i.AUT, nohr basesurv(surv0)
gen surv1 = surv0^exp((0.9485752*25+0.2165812*74-1.608428*1+0.1479484+0.5393665-0.0072298*25*74+0.0781749*1*25)) 
*drop if cderyear<2006
line surv1 _t, sort ylab(0 .1 to 1) xscale(range(2006(1)2015)) xlabel(2006 "16" 2007 "17" 2008 "18" 2009 "19" ///
2010 "20" 2011 "21" 2012 "22" 2013 "23" 2014 "24" 2015 "25", labsize(small))
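
For comparison, the hand calculation in #8 (baseline survivor function raised to exp(xb)) can also be written with stored coefficients and checked against -stcurve-. The sketch below is not a rerun of the model in this thread: it uses Stata's example dataset drugtr (downloaded by -webuse-), and the covariate values age = 50 and drug = 1 are arbitrary choices for illustration.

```stata
* Example data shipped with Stata; declared explicitly so the sketch is self-contained
webuse drugtr, clear
stset studytime, failure(died)

stcox age drug, nohr

* Baseline survivor function (all covariates equal to zero)
predict double s0, basesurv

* Survivor function for one covariate pattern: S(t|x) = S0(t)^exp(xb)
* _b[] picks up the fitted coefficients, so nothing has to be retyped by hand
generate double s_case = s0^exp(_b[age]*50 + _b[drug]*1)
line s_case _t, sort connect(stairstep) name(manual, replace)

* The same curve produced directly by -stcurve-
stcurve, survival at(age=50 drug=1) name(builtin, replace)
```

Referring to `_b[age]` rather than typing 0.9485752 and the like avoids transcription slips and keeps the curve in step with the model if it is refitted.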


Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

01 May 2017, 13:28

It strikes me as odd that you have written such an extensive command while still facing difficulties with introductory interpretation.

That said, I still fail to understand why you call the y axis "autism", since you said the outcome is "going to College".

Also, I see that you relabelled the time variable, hence 16 = 2006, 17 = 2007, [...] and 25 is 2015. This also strikes me as odd, to say the least.

Arcane as the edits and the command may be, and now just hazarding a guess, I'd say the graph shows the survival function from 2006 to 2015 for an individual aged 25, of Asian ethnicity, from center = 7, with autism, etc. As far as I can tell, it seems that, by 2015 (follow-up = 9 years), around 25% of individuals with such a profile have "yet to go to College". You may also put it the opposite way: around 75% have already reached college.

Again, I insist you should follow the previous recommendations. Survival analysis is surely a sophisticated method, demanding a reasonable amount of time to get (minimally) acquainted with.

Hopefully that helps.

Best regards,

Marcos
Man Yang

Join Date: Mar 2016

Posts: 183
#10

01 May 2017, 13:42

Hi Marcos, to clarify the points you found odd: 1. The "autism" on the y axis is, I guess, a result of how the graph was named; sorry that it is confusing, but the y axis should be whether they go to college. 2. I relabelled the time variable on the x axis because I basically want to understand when people with autism made it to college, and age might be a better way to visualize that. What you explained makes sense to me, and I will read more of the resources that you suggested. Thanks a lot.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#11

01 May 2017, 16:34

A couple of things. First, your -stset- command looks wrong. I don't see any role for a -time0()- option here. What you really need here is -origin()-.
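
One possible reading of the -origin()- suggestion is sketched below on made-up records that reuse the variable names from #5 (the values are invented for illustration). With `origin(time BirthYear)`, analysis time becomes years since birth, that is, age, while `id()` lets -stset- chain the yearly records within each subject.

```stata
* Hypothetical records: two subjects surveyed yearly from 2006
clear
input UCIn cderyear BirthYear POSTEDU
1 2006 1990 0
1 2007 1990 0
1 2008 1990 1
2 2006 1991 0
2 2007 1991 0
end

* Analysis time _t = cderyear - BirthYear
stset cderyear, failure(POSTEDU) id(UCIn) origin(time BirthYear)

* Each record covers the interval (_t0, _t]; _d flags the record where the event occurs
list UCIn cderyear _t0 _t _d, sepby(UCIn)
```

In this sketch the first record of each subject starts at _t0 = 0, so subjects are treated as under observation from birth; whether an `enter()` option is also needed to reflect the 2006 start of the survey is a design decision this example does not settle.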

Next, the graph you are plotting is based on the predicted hazards following the -stcox- model. But the graph in the link you cite is a graph of the Nelson-Aalen (I hope I spelled that correctly) cumulative hazard. The two are related, but very different. Even fixing that up, the N-A cumulative hazard graph is calculated from the data directly, non-parametrically: the results you will get from a Cox model, which adjusts for many variables, will be different in any case.
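
The distinction between the two kinds of graph can be seen side by side in the sketch below. It uses Stata's example dataset drugtr rather than the data in this thread, and the covariate values in `at()` are arbitrary.

```stata
* Example data shipped with Stata; declared explicitly so the sketch is self-contained
webuse drugtr, clear
stset studytime, failure(died)

* Nelson-Aalen cumulative hazard: computed directly from the data, no covariates
sts graph, cumhaz name(na, replace)

* Cox model, then the model-based cumulative hazard for one covariate pattern
stcox age drug
stcurve, cumhaz at(age=50 drug=1) name(cox, replace)
```

The first graph describes the whole sample; the second describes a hypothetical subject with the stated covariate values, so the two need not look alike even on the same data.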

I think that you need to back off from this project long enough to carefully study the basics in the [ST] manual, then return to it when you have a clearer understanding of what kinds of questions you want to ask of your data, and which commands are likely to help you get those answers, and how to use them.
Man Yang

Join Date: Mar 2016

Posts: 183
#12

01 May 2017, 17:07

Hi Clyde, thanks, and yes, you are right that I might have stset the data wrongly, but the graph I generated is NOT a Nelson-Aalen graph. If you scroll down close to the bottom of the link I cited, you will see they have a curve based on the model. That is how I got mine. But again, you are right that I need to learn more about how to declare the dataset.
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#13

01 May 2017, 17:30

Ah, yes, I see the curves near the bottom you are talking about now. I was thinking of the first graph at that link. Sorry about that.
