  • Help on some fundamentals of survival analysis

    Hi all, I am currently working with a dataset that contains each subject's age, whether they went to college (binary), and many other variables. I will need to use survival analysis in the near future, but I am stuck at the very first step of declaring the dataset. Below is a snapshot of what the data look like.

    Code:
    id    censor    college    time    BirthYear    age
    10    0         0          .       1990         17
    15    0         0          .       1991         16
    21    0         0          .       1990         17
    34    0         0          .       1990         16
    36    1         1          3       1990         16
    39    0         0          .       1990         17
    41    0         0          .       1990         17
    42    0         0          .       1991         17
    43    0         0          .       1991         16
    44    0         0          .       1991         17
    50    0         0          .       1991         16
    59    0         0          .       1991         16
    70    1         1          5       1991         16
    Basically, I want to examine when participants went to college. I created the censor variable as an exact copy of the binary college variable because I suppose they are the same. I also created the time variable, which gives the number of years before each participant went to college (if they did); it is the difference between the age variable and the age at which they went to college. I have read some documentation about this model but am really confused about what censoring means in my case. So, first of all, how can I declare this dataset as st data (Stata gives me an error that says "data is not st")?

    Thanks!
    Last edited by Man Yang; 28 Apr 2017, 16:45.

  • #2
    Man:
    as far as I understand your query, right censoring in your case refers to those students who are at risk of going to college but do not do so within the time span of your analysis.
    Put differently, what -stset- classifies as -failure- is the event "gone to college".
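    As a minimal sketch only (not the definitive setup for this study), the layout in #1 could be declared along the following lines. The four toy rows reuse ids from the snapshot; the variable followup (years each subject was observed) is hypothetical and its value of 9 is invented for illustration. Subjects who never went to college keep their follow-up time and are treated as right-censored rather than dropped.

    ```stata
    * Toy rows in the layout of #1; followup is a HYPOTHETICAL variable
    clear
    input id college time followup
    10 0 . 9
    15 0 . 9
    36 1 3 9
    70 1 5 9
    end

    * Analysis time: years until college if it happened, else end of follow-up
    generate t = cond(college == 1, time, followup)

    * failure() marks the event; records with college == 0 are right-censored
    stset t, failure(college == 1) id(id)

    * Inspect the st variables that -stset- creates
    list id t _t0 _t _d _st
    ```

    Once -stset- has run, st commands no longer complain that the data are not st; whether this is the right time scale for the real data is a separate question.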
    I recommend taking a thorough look at the -stset- (and, more generally, -st-) entries in the Stata .pdf manual, as well as the valuable http://www.stata.com/bookstore/survi...-introduction/
    As a next step, you should decide whether your survival analysis will be nonparametric, semiparametric or parametric.
    As an aside, and as the FAQ reminds us, please post what you typed and what Stata gave you back. Thanks.
    Kind regards,
    Carlo
    (Stata 18.0 SE)

    • #3
      Carlo gave excellent advice. As a side note, one of your difficulties seems to relate to formatting the dataset. You can see how survival datasets are laid out by opening the example datasets available 'by chapter' as well as 'by command' from the Stata Manual.
      Best regards,

      Marcos

      • #4
        In addition to the wise words from Carlo and Marcos, also consider which approach to survival analysis you should be using -- whether treating your survival time data as continuous (which leads to Stata's st suite) or as discrete (for which the suite is not required, but which is easy to implement using other Stata commands). See http://www.iser.essex.ac.uk/survival-analysis for some free downloadable resources

        • #5
          Hello, thank you all for the suggestions! I chose to set the data as st and I was able to get the model running as well as the survival function graph. However, I feel the graph I get is somehow incorrect. Could you please help me check where the problem is? I first of all stset the data using the following command

          Code:
          stset cderyear, failure(POSTEDU) time0(BirthYear) id(UCIn) exit(time .)
          in which cderyear is the year when survey was conducted, POSTEDU is the outcome variable, BirthYear is the year when the participant was born and UCIn is the participant ID.

          The data come from a health survey that ran from 2006 to 2015; each year, information on each subject, such as IQ, age, and disability, was recorded. My outcome is whether they go to college or not (which is what Carlo pointed out as the failure). I ran univariate analyses on all the categorical and continuous predictors and finalized the model as follows:

          Code:
          stcox age devlvl i.MR i.AUT i.ELL ib5.ethnicrace i.center c.age#c.devlvl c.age#i.AUT, nohr
          And the model result is:

          Code:
          . stcox age devlvl i.MR i.AUT i.ELL ib5.ethnicrace i.center c.age#c.devlvl c.age#i.AUT, nohr
          
                   failure _d:  POSTEDU
             analysis time _t:  cderyear
            exit on or before:  time .
                           id:  UCIn
          
          Iteration 0:   log likelihood = -5330.0512
          Iteration 1:   log likelihood = -4909.5545
          Iteration 2:   log likelihood = -4784.0392
          Iteration 3:   log likelihood = -4771.1325
          Iteration 4:   log likelihood = -4770.7201
          Iteration 5:   log likelihood = -4770.7194
          Refining estimates:
          Iteration 0:   log likelihood = -4770.7194
          
          Cox regression -- Breslow method for ties
          
          No. of subjects =        55380                     Number of obs   =     55380
          No. of failures =          544
          Time at risk    =       946297
                                                             LR chi2(32)     =   1118.66
          Log likelihood  =   -4770.7194                     Prob > chi2     =    0.0000
          
          -----------------------------------------------------------------------------------
                         _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
          ------------------+----------------------------------------------------------------
                        age |   .9485752   .1857542     5.11   0.000     .5845036    1.312647
                     devlvl |   .2165812   .0438219     4.94   0.000     .1306918    .3024705
                       1.MR |  -.7305106   .1049244    -6.96   0.000    -.9361586   -.5248626
                      1.AUT |  -1.608428   .6651524    -2.42   0.016    -2.912103   -.3047537
                      1.ELL |   .0022744   .1291604     0.02   0.986    -.2508754    .2554242
                            |
                 ethnicrace |
          African American  |   .3814536   .1451871     2.63   0.009      .096892    .6660151
            Asian American  |   .1479484   .1688891     0.88   0.381    -.1830682     .478965
                  Latino/a  |   .0166968   .1273573     0.13   0.896    -.2329189    .2663124
                     Other  |   .2885472   .2391377     1.21   0.228    -.1801541    .7572484
                   Unknown  |   .0780732   .2098118     0.37   0.710    -.3331503    .4892967
                            |
                     center |
                         2  |   .0339904   .3303481     0.10   0.918    -.6134799    .6814607
                         3  |   .1803126   .3272488     0.55   0.582    -.4610833    .8217085
                         4  |   .6606672   .2994035     2.21   0.027     .0738471    1.247487
                         5  |  -.1198088   .4243618    -0.28   0.778    -.9515426    .7119249
                         6  |   .5665492     .31731     1.79   0.074    -.0553671    1.188465
                         7  |   .5393665   .2599586     2.07   0.038     .0298569    1.048876
                         8  |  -.3248821   .2740199    -1.19   0.236    -.8619512    .2121869
                         9  |  -.1618291   .4020385    -0.40   0.687    -.9498101    .6261519
                        10  |   .5574623    .321429     1.73   0.083    -.0725271    1.187452
                        11  |   .7748065   .2294648     3.38   0.001     .3250638    1.224549
                        12  |  -.2092461   .2866152    -0.73   0.465    -.7710015    .3525093
                        13  |   1.289953    .225901     5.71   0.000     .8471956    1.732711
                        14  |   .4058454   .4537355     0.89   0.371    -.4834598    1.295151
                        15  |   .6648066    .256777     2.59   0.010     .1615329     1.16808
                        16  |   .3420969   .2899587     1.18   0.238    -.2262116    .9104055
                        17  |   .5705365   .2442936     2.34   0.020     .0917298    1.049343
                        18  |   1.169656   .2552728     4.58   0.000     .6693303    1.669981
                        19  |   .4604306    .293384     1.57   0.117    -.1145914    1.035453
                        20  |   .1228869   .3380036     0.36   0.716    -.5395881    .7853619
                        21  |   .4943846   .3373477     1.47   0.143    -.1668048    1.155574
                            |
             c.age#c.devlvl |  -.0072298   .0022599    -3.20   0.001    -.0116592   -.0028004
                            |
                  AUT#c.age |
                         1  |   .0781749   .0340879     2.29   0.022     .0113638     .144986
          -----------------------------------------------------------------------------------
          Then I followed the example on this website: http://stats.idre.ucla.edu/stata/sem...tata-survival/ to classify a case, and I was able to get the survival graph, but the graph looks odd to me. Does this graph say that when subjects are at age 16, all of them can go to college? I was expecting a curve similar in shape to the one in the link above. Can anyone tell me where I could be wrong? Thanks a lot!!

          [Image: Graph.png]

          • #6
            You didn't show the command you used to produce the graph.

            That said, and considering the explanation is already in the link you shared, it seems you have a survival function for an individual aged 16. Again, the interpretation of this graph will depend on the command.

            I strongly believe the recommendations posted in #2, #3 and #4 will be rewarding to you.
            Best regards,

            Marcos

            • #7
              To give you an idea of what the data look like, below are the first 10 observations. Another question: do I have to have a censor variable in the data? What is the difference between discrete-time and continuous-time survival analysis? What should I consider when deciding between them?

              Code:
              . list UCIn cderyear age MR AUT ethnicrace ELL center devlvl BirthYear in 1/10, nodisplay
              
                   +------------------------------------------------------------------------------------------+
                   |   UCIn   cderyear   age   MR   AUT         ethnicrace   ELL   center   devlvl   BirthY~r |
                   |------------------------------------------------------------------------------------------|
                1. | 315891       2006    16    1     1              White     0       21       63       1990 |
                2. |   7575       2006    16    1     0   African American     0        4       30       1990 |
                3. | 108872       2006    16    1     0           Latino/a     0        1       72       1990 |
                4. | 179904       2006    16    0     1              White     0       13       90       1990 |
                5. | 243264       2006    16    1     0           Latino/a     0        8       76       1990 |
                   |------------------------------------------------------------------------------------------|
                6. | 154608       2006    16    1     0              White     0        2       47       1990 |
                7. | 255895       2006    16    1     0           Latino/a     1       16       60       1990 |
                8. | 345717       2006    16    1     0     Asian American     0       20       64       1990 |
                9. | 154789       2006    16    1     0           Latino/a     1        2       67       1990 |
               10. | 300219       2006    16    1     0              White     1       15       55       1990 |
                   +------------------------------------------------------------------------------------------+

              • #8
                Hi Marcos, below is the command I use for the graph

                Code:
                *Case1: AUT=1, age=25, ethnicrace=2(Asian), devlvl=74, center=7(HRC)
                stcox age devlvl i.MR i.AUT i.ELL ib5.ethnicrace i.center c.age#c.devlvl c.age#i.AUT, nohr basesurv(surv0)
                gen surv1 = surv0^exp((0.9485752*25+0.2165812*74-1.608428*1+0.1479484+0.5393665-0.0072298*25*74+0.0781749*1*25)) 
                *drop if cderyear<2006
                line surv1 _t, sort ylab(0 .1 to 1) xscale(range(2006(1)2015)) xlabel(2006 "16" 2007 "17" 2008 "18" 2009 "19" ///
                2010 "20" 2011 "21" 2012 "22" 2013 "23" 2014 "24" 2015 "25", labsize(small))
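
                For reference, the gen surv1 line above appears to rely on the standard proportional-hazards relation between the baseline survivor function (all covariates equal to zero, saved here as surv0) and the survivor function for a given covariate profile. The symbols S_0, x and \beta are notation introduced here, not taken from the thread; the terms for MR and ELL are absent because that line presumably sets both to 0.

                ```latex
                S(t \mid x) = S_0(t)^{\exp(x'\beta)},
                \qquad
                x'\beta = \beta_{age}\,age + \beta_{devlvl}\,devlvl + \beta_{AUT}\,AUT
                        + \beta_{Asian} + \beta_{center7}
                        + \beta_{age \times devlvl}\,age \cdot devlvl
                        + \beta_{age \times AUT}\,age \cdot AUT
                ```

                With age = 25, devlvl = 74 and AUT = 1, the exponent in the gen line is exactly this linear predictor evaluated at the coefficients reported in #5.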

                • #9
                  It strikes me as odd that you have written such an extensive command while still having difficulty with introductory interpretation.

                  That said, I still fail to understand why the y axis is labelled "autism", since you said the outcome is "going to College".

                  Also, I see that you relabelled the time variable, so that 16 = 2006, 17 = 2007, [...] and 25 = 2015. This also strikes me as odd, to say the least.

                  Arcane as the edits and the command may be, and just hazarding a guess, I'd say the graph shows the survival function from 2006 to 2015 for an individual aged 25, of Asian ethnicity, from center = 7, with autism, etc. As far as I can tell, by 2015 (follow-up = 9 years) around 25% of individuals with such a profile had "yet to go to College". Put the opposite way, around 75% had already reached college.

                  Again, I insist that you follow the previous recommendations. Survival analysis is surely a sophisticated method, demanding a reasonable amount of time to get (minimally) acquainted with.

                  Hopefully that helps.
                  Best regards,

                  Marcos

                  • #10
                    Hi Marcos, to clarify the points you found odd: 1. The "autism" label on the y axis resulted from the naming of the graph, I guess. Sorry it is confusing; the y axis should be whether they go to college. 2. I relabelled the time variable on the x axis because I basically want to understand when people with autism made it to college, and age might be a better way to visualize that. What you explained makes sense to me, and I will read more of the resources that you suggested. Thanks a lot.

                    • #11
                      A couple of things. First, your -stset- command looks wrong. I don't see any role for a -time0()- option here. What you really need here is -origin()-.
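
                      A minimal sketch of what an -origin()- based declaration might look like, using the variable names from #5. The five records below are invented toy data, not rows from the survey, and this is only one plausible setup under the assumption that the subject becomes at risk at birth.

                      ```stata
                      * INVENTED toy records: one row per subject per survey year
                      clear
                      input UCIn cderyear BirthYear POSTEDU
                      1 2006 1990 0
                      1 2007 1990 0
                      1 2008 1990 1
                      2 2006 1990 0
                      2 2007 1990 0
                      end

                      * origin(time BirthYear) makes analysis time _t = cderyear - BirthYear,
                      * i.e. age in years; failure() marks the record where college starts
                      stset cderyear, failure(POSTEDU) origin(time BirthYear) id(UCIn)

                      * Check the resulting time scale record by record
                      list UCIn cderyear _t0 _t _d, sepby(UCIn)
                      ```

                      Whether further options are needed, for example to handle subjects who are first observed at age 16 rather than from birth, depends on the study design and is not settled by this sketch.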

                      Next, the graph you are plotting is based on the predicted hazards following the -stcox- model. But the graph in the link you cite is a Nelson-Aalen (I hope I spelled that correctly) cumulative hazard graph. The two are related, but very different. Even after fixing that, the N-A cumulative hazard is calculated from the data directly, non-parametrically: the results you will get from a Cox model, which adjusts for many variables, will be different in any case.

                      I think that you need to back off from this project long enough to carefully study the basics in the [ST] manual, then return to it when you have a clearer understanding of what kinds of questions you want to ask of your data, and which commands are likely to help you get those answers, and how to use them.

                      • #12
                        Hi Clyde, thanks. Yes, you are right that I may have stset the data wrongly, but the graph I generated is NOT a Nelson-Aalen graph. If you scroll down close to the bottom of the link I cited, you will see they have a curve based on the model. That is how I got mine. But again, you are right that I need to learn more about how to declare the dataset.

                        • #13
                          Ah, yes, I see the curves near the bottom you are talking about now. I was thinking of the first graph at that link. Sorry about that.

