Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • STCOX regression with time-varying dummy variables -- please help!

    Hi, I am at my wit’s end and hoping that somebody here can help with a problem I’m having getting a hazard model to run properly in STATA. The problem involves a time-varying independent dummy 'treatment' variable. Here are some relevant details:

    The study is on the relationship between family visits in prison and prison violence. Essentially, we’re trying to look at whether risk of being violent in prison is lower (or higher) after family visits. We are not that interested in the timing between visits and violence, just changes in risk.

    The data are prisoner-level data taken over a 2-year observation period (2009-20010). We have left and right censoring (some prisoners were already in prison when the observation period began, and some prisoners were still in prison when the observation period ended). The main problem we seem to be having is that many prisoners have multiple visits, some of them actually a lot, sometimes multiple visits in a week.The mean number of visits is 22, with a maximum of 121, minimum 0.

    For every prisoner, we have dates for entry into prison, exit from prison, visits and violent events, plus a bunch of static controls. The data are organized by prisoner-event. So for each prisoner, we have a row for entry data and control variables, start of observation period, visit dates (if any), date of violent event, and exit from observation (either left prison or observation period ended).

    So the data are structured as prisoner-event data, which looks like this (with value labels shown instead of numeric values):

    IDvar timevar event_type age_at_entry crime_type visit_dummy
    Prisoner1 1/1/09 entry 26 violent 0
    Prisoner1 2/1/09 visit 26 violent 1
    Prisoner1 3/1/09 visit 26 violent 1
    Prisoner1 3/15/09 violence 26 violent 0
    Prisoner1 6/1/09 release 26 violent 0

    Prisoner2 2/16/10 entry 35 property 0
    Prisoner2 3/8/10 visit 35 property 1
    Prisoner2 9/1/10 release 35 property 0
    .
    .
    There are 28,463 unique prisoners in the data and 366,477 observations (rows). There were 3440 violent events. Some prisoners have repeated risk for violence, but for now we are only including their first event. If we can get it to work then we might include subsequent events.

    Here is the stset command and output:

    stset timevar, failure(event_type==4) id(IDvar) origin(event_type ==1) enter(event_type ==5) exit(event_type ==6)

    id: IDvar
    failure event: event_type == 4
    obs. time interval: (timevar[_n-1], timevar]
    enter on or after: event_type ==5
    exit on or before: event_type ==6
    t for analysis: (time-origin)
    origin: event_type ==1

    ----------------------------------------------------------------------
    366477 total obs.
    37241 multiple records at same instant PROBABLE ERROR
    (timevar[_n-1]==timevar)
    20475 obs. end on or before enter()
    34 obs. begin on or after exit
    ----------------------------------------------------------------------
    308727 obs. remaining, representing
    28460 subjects
    3440 failures in multiple failure-per-subject data
    7862465 total analysis time at risk, at risk from t = 0
    earliest observed entry t = 0
    last observed exit t = 12710


    Here are two cox regressions. The first one has two time-constant variables, age and offense type (as shown in data above). The second one adds the visit_dummy variable shown in the example data above (=1 when there is a visit, 0 otherwise).


    . stcox age crime_type

    failure _d: event_type== 4
    analysis time _t: (timevar-origin)
    origin: event_type ==1
    enter on or after: event_type ==5
    exit on or before: event_type ==6
    id: IDNT_MAASAR

    Iteration 0: log likelihood = -30421.854
    Iteration 1: log likelihood = -30084.938
    Iteration 2: log likelihood = -30077.562
    Iteration 3: log likelihood = -30077.553
    Refining estimates:
    Iteration 0: log likelihood = -30077.553

    Cox regression -- Breslow method for ties

    No. of subjects = 28429 Number of obs = 308477
    No. of failures = 3440
    Time at risk = 7857475
    LR chi2(2) = 688.60
    Log likelihood = -30077.553 Prob > chi2 = 0.0000

    ----------------------------------------------------------------------
    _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
    -------------------+--------------------------------------------------
    age | .9610433 .0017798 -21.46 0.000 .9575613 .964538
    crime| 1.462947 .051784 10.75 0.000 1.364893 1.568046



    So far, so good. Higher age associated with lower risk of violence, violent crime conviction associated with higher risk of violence.

    Now here is the same model with the time-varying, frequently repeated visit dummy:

    . stcox age crime_type visit_dummy


    failure _d: event_type == 4
    analysis time _t: (timevar-origin)
    origin: event_type ==1
    enter on or after: event_type ==5
    exit on or before: event_type ==6
    id: IDNT_MAASAR

    Iteration 0: log likelihood = -30421.854
    Iteration 1: log likelihood = -25230.114
    Iteration 2: log likelihood = -25092.321
    Iteration 3: log likelihood = -25049.136
    Iteration 4: log likelihood = -25033.56
    Iteration 5: log likelihood = -25027.871
    Iteration 6: log likelihood = -25025.783
    Iteration 7: log likelihood = -25025.015
    Iteration 8: log likelihood = -25024.733
    Iteration 9: log likelihood = -25024.629
    Iteration 10: log likelihood = -25024.591
    Iteration 11: log likelihood = -25024.577
    Iteration 12: log likelihood = -25024.572
    Iteration 13: log likelihood = -25024.57
    Iteration 14: log likelihood = -25024.569
    .
    .
    .
    Iteration 50: log likelihood = -25024.569

    Refining estimates:
    Iteration 0: log likelihood = -25024.569

    Cox regression -- Breslow method for ties

    No. of subjects = 28429 Number of obs = 308477
    No. of failures = 3440
    Time at risk = 7857475
    LR chi2(2) = 10794.57
    Log likelihood = -25024.569 Prob > chi2 = 0.0000

    -----------------------------------------------------------------------------
    _t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
    -------------------+---------------------------------------------------------
    age | .9489049 .0018013 -27.63 0.000 .9453809 .9524419
    crime | 1.910263 .0677006 18.26 0.000 1.782076 2.047671
    visit_dummy | 7.90e-24 . . . . .
    -----------------------------------------------------------------------------


    The hazard ratio coefficient for visit_dummy is tiny, but more problematically it doesn’t compute standard errors. Is it possible that there are simply too many visits for it to deal with? Or too many ties (as indicated by the “PROBABLY ERROR” warning I got in the stset command?).

    I have looked everywhere for resources on how to deal with this. It seems that time-varying dummy variables are not frequently used in survival models. I have seen suggestions that data with time-varying variables should be separated into separate observations for each status change. But having a visit doesn't really represent a lasting change in status, so this doesn't seem relevant in this case. Also given the number of visits, doing so would make the dataset very large. I would essentially have to make the data a prisoner-day dataset (or I might get by with prisoner-week). But I thought that STATA’s flexibility with dealing with stdata would obviate this necessity. Or maybe the modeling strategy is wrong and I don't even need a survival model?.

    I would have thought that this would be a fairly common modeling issue in medical studies, since it is analogous to patients being treated, sometimes repeatedly, with some medication at particular intervals (except in my case it is visits rather than medications). But I couldn’t find anything.

    Any suggestions you wizards of stata have will be extremely appreciated!!

    EDIT: Sorry this got posted twice. The first time, it told me there was an error so I re-posted. Can't see how to delete one of them.
    Last edited by Josh Guetzkow; 02 Mar 2015, 13:59.

  • #2
    Hello, Josh,

    There is an interesting article from the Stata Journal on categorical time-varying covariates (http://www.stata-journal.com/sjpdf.h...iclenum=st0066).

    Hopefully it might be helpful.

    Best,

    Marcos
    Best regards,

    Marcos

    Comment


    • #3
      I can't read your post with its current formatting Repost with data listings, Stata statements, and Stata results inside code delimiters. See FAQ Section 12 http://www.statalist.org/forums/help. In the meantime, you look at: http://hsphsun3.harvard.edu/cgi-bin/...ticle-805.html
      Last edited by Steve Samuels; 02 Mar 2015, 18:33.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2

      Comment


      • #4
        Josh:
        I second all the previous comments, especially Steve's remark concerning the limited readabiity of your Stata session.
        Anyway, have you tried to let Stata calculate the categorical variable for visit (i.e.: i.visit instead of visit_dummy) and run Cox regression with this predictor?
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Hi, thank you for your responses. I am going to re-post my original message with better formatting and then at the bottom I will reply to your messages one-by-one. Apologies for the mess and I appreciate your patience. Oh, and I should note that I am using STATA SE 12.0

          HERE IS THE ORIGINAL POST:

          Hi, I am at my wit’s end and hoping that somebody here can help with a problem I’m having getting a hazard model to run properly in STATA. The problem involves a time-varying independent dummy 'treatment' variable. Here are some relevant details:

          The study is on the relationship between family visits in prison and prison violence. Essentially, we’re trying to look at whether risk of being violent in prison is lower (or higher) after family visits. We are not that interested in the timing between visits and violence, just changes in risk.

          The data are prisoner-level data taken over a 2-year observation period (2009-20010). We have left and right censoring (some prisoners were already in prison when the observation period began, and some prisoners were still in prison when the observation period ended). The main problem we seem to be having is that many prisoners have multiple visits, some of them actually a lot, sometimes multiple visits in a week.The mean number of visits is 22, with a maximum of 121, minimum 0.

          For every prisoner, we have dates for entry into prison, exit from prison, visits and violent events, plus a bunch of static controls. The data are organized by prisoner-event. So for each prisoner, we have a row for entry data and control variables, start of observation period, visit dates (if any), date of violent event, and exit from observation (either left prison or observation period ended).

          So the data are structured as prisoner-event data, which looks like this (with value labels shown instead of numeric values):

          Code:
                  |   IDvar     timevar             event_type     age   crime_type   visit_dummy |
                  |----------------------------------------------------------------------------|
              13. |  727411174523   18 Oct 10         Origin   32.85753          0          0 |
              14. |  727411174523   30 Oct 10          Visit   32.85753          0          1 |
              15. |  727411174523   13 Nov 10          Visit   32.85753          0          1 |
                  |----------------------------------------------------------------------------|
              16. |  727411174523   23 Nov 10          Visit   32.85753          0          1 |
              17. |  727411174523   14 Dec 10          Visit   32.85753          0          1 |
              18. |  727411174523   18 Oct 10          Entry   32.85753          0          0 |
              19. |  727411174523   31 Dec 10        Release   32.85753          0          0 |
              20. |  775827518064   23 Mar 82          Origin  24.58356          1          0 |
              21. |  775827518064   28 Feb 09          Visit   24.58356          1          1 |
              22. |  775827518064   08 Aug 10          Visit   24.58356          1          1 |
              23. |  775827518064   29 Oct 09       Violence   24.58356          1          0 |
                  |----------------------------------------------------------------------------|
              24. |  775827518064   01 Jan 09          Entry   24.58356          1          0 |
              25. |  775827518064   31 Dec 10        Release   24.58356          1          0 |
          There are 28,463 unique prisoners in the data and 366,477 observations (rows). There were 3440 violent events. Some prisoners have repeated risk for violence, but for now we are only including their first event. If we can get it to work then we might include subsequent events.

          Here is the stset command and output:

          Code:
          stset timevar, failure(event_type==4) id(IDvar) origin(event_type ==1) enter(event_type ==5) exit(event_type ==6)
          
          id: IDvar
          failure event: event_type == 4
          obs. time interval: (timevar[_n-1], timevar]
          enter on or after: event_type ==5
          exit on or before: event_type ==6
          t for analysis: (time-origin)
          origin: event_type ==1
          
          ----------------------------------------------------------------------
          366477 total obs.
          37241 multiple records at same instant PROBABLE ERROR
          (timevar[_n-1]==timevar)
          20475 obs. end on or before enter()
          34 obs. begin on or after exit
          ----------------------------------------------------------------------
          308727 obs. remaining, representing
          28460 subjects
          3440 failures in multiple failure-per-subject data
          7862465 total analysis time at risk, at risk from t = 0
          earliest observed entry t = 0
          last observed exit t = 12710

          Here are two cox regressions. The first one has two time-constant variables, age and offense type (as shown in data above). The second one adds the visit_dummy variable shown in the example data above (=1 when there is a visit, 0 otherwise).


          Code:
          . stcox newage newviolent
          
                   failure _d:  fromdataset == 4
             analysis time _t:  (timevar-origin)
                       origin:  fromdataset==1
            enter on or after:  fromdataset==5
            exit on or before:  fromdataset==6
                           id:  IDNT_MAASAR
          
          Iteration 0:   log likelihood = -30421.854
          Iteration 1:   log likelihood = -30084.938
          Iteration 2:   log likelihood = -30077.562
          Iteration 3:   log likelihood = -30077.553
          Refining estimates:
          Iteration 0:   log likelihood = -30077.553
          
          Cox regression -- Breslow method for ties
          
          No. of subjects =        28429                     Number of obs   =    308477
          No. of failures =         3440
          Time at risk    =      7857475
                                                             LR chi2(2)      =    688.60
          Log likelihood  =   -30077.553                     Prob > chi2     =    0.0000
          
          ------------------------------------------------------------------------------------
                          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------------+----------------------------------------------------------------
                      newage |   .9610433   .0017798   -21.46   0.000     .9575613     .964538
          newviolent_offenxe |   1.462947    .051784    10.75   0.000     1.364893    1.568046
          ------------------------------------------------------------------------------------


          So far, so good. Higher age associated with lower risk of violence, violent crime conviction associated with higher risk of violence.

          Now here is the same model with the time-varying, frequently repeated visit dummy:

          Code:
          . stcox newage newviolent newvisit
          
                   failure _d:  fromdataset == 4
             analysis time _t:  (timevar-origin)
                       origin:  fromdataset==1
            enter on or after:  fromdataset==5
            exit on or before:  fromdataset==6
                           id:  IDNT_MAASAR
          
          Iteration 0:   log likelihood = -30421.854
          Iteration 1:   log likelihood = -25230.114
          Iteration 2:   log likelihood = -25092.321
          Iteration 3:   log likelihood = -25049.136
          Iteration 4:   log likelihood =  -25033.56
          Iteration 5:   log likelihood = -25027.871
          Iteration 6:   log likelihood = -25025.783
          Iteration 7:   log likelihood = -25025.015
          Iteration 8:   log likelihood = -25024.733
          Iteration 9:   log likelihood = -25024.629
          Iteration 10:  log likelihood = -25024.591
          Iteration 11:  log likelihood = -25024.577
          Iteration 12:  log likelihood = -25024.572
          Iteration 13:  log likelihood =  -25024.57
          Iteration 14:  log likelihood = -25024.569
          .
          .
          .
          Iteration 50:  log likelihood = -25024.569
          Refining estimates:
          Iteration 0:   log likelihood = -25024.569
          
          Cox regression -- Breslow method for ties
          
          No. of subjects =        28429                     Number of obs   =    308477
          No. of failures =         3440
          Time at risk    =      7857475
                                                             LR chi2(2)      =  10794.57
          Log likelihood  =   -25024.569                     Prob > chi2     =    0.0000
          
          ------------------------------------------------------------------------------------
                          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
          -------------------+----------------------------------------------------------------
                      newage |   .9489049   .0018013   -27.63   0.000     .9453809    .9524419
          newviolent_offenxe |   1.910263   .0677006    18.26   0.000     1.782076    2.047671
                    newvisit |   7.90e-24          .        .       .            .           .
          ------------------------------------------------------------------------------------

          The hazard ratio coefficient for visit_dummy is tiny, but more problematically it doesn’t compute standard errors. Is it possible that there are simply too many visits for it to deal with? Or too many ties (as indicated by the “PROBABLY ERROR” warning I got in the stset command?).

          I have looked everywhere for resources on how to deal with this. It seems that time-varying dummy variables are not frequently used in survival models. I have seen suggestions that data with time-varying variables should be separated into separate observations for each status change. But having a visit doesn't really represent a lasting change in status, so this doesn't seem relevant in this case. Also given the number of visits, doing so would make the dataset very large. I would essentially have to make the data a prisoner-day dataset (or I might get by with prisoner-week). But I thought that STATA’s flexibility with dealing with stdata would obviate this necessity. Or maybe the modeling strategy is wrong and I don't even need a survival model?.

          I would have thought that this would be a fairly common modeling issue in medical studies, since it is analogous to patients being treated, sometimes repeatedly, with some medication at particular intervals (except in my case it is visits rather than medications). But I couldn’t find anything.

          -----------------------------------------------

          HERE ARE MY REPLIES TO YOUR COMMENTS:

          Originally posted by Marcos Almeida View Post
          There is an interesting article from the Stata Journal on categorical time-varying covariates .
          Hi Marcos, thank you. I have read that article and actually alluded to the solution it gives in my query. The example given there is a surgery treatment variable, where you divide the observations by duplicating them with one as a 'pre-treatment' observation and the other as a 'post-treatment' status. This is what I referred to as a 'status change' in my message -- but I should have made myself clearer.

          I am not sure this approach is appropriate for a number of reasons. To begin with, many prisoners have multiple visits. In that case, once they have a visit, then their status is post-visit and any subsequent visits would not be accounted for. I guess I could divide them up into weeks and months and return them to a 'pre-visit' status in weeks or months when they don't have a visit. But this assumes that I have some knowledge of how long the 'treatment' of a visit is supposed to have an affect. But how do I know that they can be returned to pre-visit status after a week or a month? Also, doing it that way would make the dataset much larger, though I'm willing if necessary. It just seems like a cumbersome solution that makes problematic assumptions -- I was hoping for something more elegant.

          In principle I guess I could count each additional visit as a new type of status. Maybe that's how to deal with it? But in that case wouldn't I need a dummy variable for each new visit (which would end up being 121 dummy variables)...?

          Originally posted by Steve Samuels View Post
          Thanks for the code link. I could not open the second link you pasted.

          Originally posted by Carlo Lazzaro View Post
          have you tried to let Stata calculate the categorical variable for visit (i.e.: i.visit instead of visit_dummy) and run Cox regression with this predictor?
          So, given the way the data are structured, I'm not sure how I would do this. The data are not structured horizontally (single-record data), so there is no variable for 'visit' that stata can calculate. Instead, the data are multiple record, so essentially each event has its own observation and there is a variable with dates and a variable denoting what event occurred on that date, and they are tied together by the ID number. This is following examples #6-#10 in the STATA Survival Analysis Manual on pages 474-479, with our data being most parallel to the data given in example #10. In the example in the manual, they have a variable 'date' which is like our 'timevar' and a variable 'code' which is like out 'event_type'.

          So I would need STATA to calculate a categorical variable out of the 'event_type' variable where I guess it would be equal to 1 when the event is a violent event and 0 otherwise. But that's what our violent event dummy variable does. But maybe I'm missing something?

          Thank you everyone for your input. I hope with the nicer formatting it will be easier for you to see what's going on.


          Comment


          • #6
            Josh:
            what follows might well be a late afternoon slip-up, but I would take a look at shared frailty model, Example 10, -stcox- entry, Stata 13.1 .pdf manual.
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #7
              Thanks for reformatting. I apologize for the bad link; I should have checked it after posting. Here it is entered as plain text, without using the forum editor link feature.

              http://hsphsun3.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.1310/Author/article-805.html

              Unfortunately it doesn't address your question, now that I've read it.
              Last edited by Steve Samuels; 03 Mar 2015, 16:41.
              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2

              Comment


              • #8
                Originally posted by Carlo Lazzaro View Post
                what follows might well be a late afternoon slip-up, but I would take a look at shared frailty model Example 10 -stcox- entry Stata 13.1 .pdf manual.

                Carlo, thank you. From what I understand, shared frailty models are akin to fixed and random effects models in standard regression. This from Roberto Guteirrez in the Stata Journal 2002, vol. 2, no. 1, p22: "Frailty models are the survival data analog to regression models, which account for heterogeneity and random effects. A frailty is a latent multiplicative effect on the hazard function and is assumed to have unit mean and variance θ, which is estimated along with the other model parameters. A frailty model is an heterogeneity model where the frailties are assumed to be individual- or spell-specific. A shared frailty model is a random effects model where the frailties are common (or shared) among groups of individuals or spells and are randomly distributed across groups."

                So, I don't think it's relevant in my case.

                In the meantime, I have gotten advice from elsewhere to model this as a discrete-time survival model with days as the unit of time. So in this approach, I would indeed transform the data into prisoner-day observations, with dummy variables for each type of event (visit or violence) that is equal to 1 only on the day of the event and use logistic regression.

                With such data, I could generate different dummy variables corresponding to "had a visit in the last 7 days" or "had a visit in the last 30 days" and see if that affects risk of violence. Or I wonder if I could simply have a count variable that starts at zero and increases by 1 every day until they have a visit, at which point it resets to 0. This would act as a simple continuous "days since last visit (or entry into prison)" variable. Though one problem with that is it would confound exposure time in prison to time since last visit, especially for people who don't have a visit. I will have to give it more thought.

                If you have any other ideas, I'd be happy to hear. Thanks again!!

                Originally posted by Steven Samuels View Post
                Unfortunately it doesn't address your question, now that I've read it.
                Too bad. Thanks for your efforts!! (BTW the link still wouldn't open for me... but never mind).

                Comment


                • #9
                  (Sigh): try this one: http://www.stata.com/statalist/archi.../msg00323.html
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2

                  Comment


                  • #10
                    Josh:
                    what follows may be quite unuseful, but I would take a look at it as a source of further inspiration, at least:
                    Code:
                    use http://www.stata-press.com/data/r13/catheter, clear
                    save "C:\Users\user\Desktop\Josh.dta"
                    g visit=1 in 1/40
                    replace visit=0 in 41/50
                    replace visit=2 if visit==.
                    rename infect crime
                    stset time, failure(crime) scale(1)
                    rename patient inmate
                    stcox age female visit , shared( inmate )
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    Comment


                    • #11
                      Your visit indicator isn't doing the job. It just indicates that a visit took place on a date t. What is crucial is that you must define the time varying covariates at other times of follow-up. Below I create three new "experience" variables at each time: 1) anyv indicates whether there had been any visits up to that time; 2) nvisits, the count of visits up to that time; 3) tfrom_v is the time from the most recent visit, if any. For stcox it's only necessary to define these at all times of failure in the data. This is done with stsplit. In the example, I give visit dates a code of 7.

                      Code:
                      clear
                      input id    time  event
                      1     0        1
                      1     12       7
                      1     30       7
                      1     46       4
                      2     0        1
                      2     5        7
                      2     40       0
                      3     0        1
                      3     30       4
                      4     0        1
                      4     20       7
                      4     29       4
                      5     0        1
                      5     5        4
                      end
                      
                      /* Create indicators */
                      gen visit = event==7
                      gen fail  = event == 4
                      
                      /* Generate visit experience variables at time t */
                      bys id (time): gen nvisits = sum(visit)
                      
                      gen anyv = nvisits>=1
                      
                      gen tv = time if visit
                      bys id (time): gen tfrom_v = time-tv[_n-1]
                      label var tfrom_v "Time from most recent visit"
                      drop tv  // not needed further
                      
                      stset time, fail(fail) id(id)
                      
                      /* Get values for these variables at all failure times */
                      stsplit, at(failures)
                      sort id time
                      list, sepby(id)
                      
                      stcox anyv, hr nolog
                      
                      
                      Log likelihood  =   -3.3620702                     Prob > chi2     =    0.2262
                      
                      ------------------------------------------------------------------------------
                                _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                              anyv |   .2372227   .2931778    -1.16   0.244     .0210463    2.673843
                      ------------------------------------------------------------------------------







                      Last edited by Steve Samuels; 04 Mar 2015, 07:43.
                      Steve Samuels
                      Statistical Consulting
                      [email protected]

                      Stata 14.2

                      Comment


                      • #12
                        I have to apologize. I left out a crucial snapspan statement in my example. The truth is that I started doing Cox analyses long before I took up Stata, and I usually create data sets by hand: generate time-dependent values at all followup times; merge with the failure times; and keep only the merged values. The process sometimes generates millions of observations, but this has never caused problems.

                        I discovered the error by printing out the risk sets generated by stsplit, something I should have done earlier. Risk sets are a crucial concept for Cox models; there is one per failure time and it consists of all individuals at risk at that failure time . stcox compares covariate values for failure(s) and non-failures in each risk set.

                        Here is the corrected do file. I've changed the example a bit and added a covariate "x", which is just the id times 100. The Manual entry for snapspan is a little murky. One must list variables whose values at a given time, apply up through that time. This includes the failure event (which ends the time interval) and baseline variables. All omitted variables apply to times after they occur: these are the time-varying covariates.
                        Code:
                        clear
                        input id    time  event  x
                        1     0       1   100
                        1     12      7   100
                        1     15      7   100
                        1     20      4   100
                        2     0       1   200
                        2     4       7   200
                        2     40      0   200
                        3     0       1   300
                        3     30      6   300
                        4     0       1   400
                        4     20      7   400
                        4     29      4   400
                        5     0       1   400
                        5     5       4   500
                        end
                        list , sepby(id)
                        gen visit = event==7
                        gen fail  = event == 4
                        
                        
                        /* Create indicators */
                        
                        /* Generate visit experience variables at time t */
                        bys id (time): gen nvisits = sum(visit)
                        gen anyv = nvisits>=1
                        
                        gen tv = time if visit
                        bys id (time): gen tfrom_v = time-tv[_n-1]
                        label var tfrom_v "Time from most recent visit"
                        drop tv
                        
                        snapspan id time  fail x, replace gen(t0)
                        drop if t0==.
                        order id t0 time
                        list, sepby(id)
                        
                        
                        stset time, fail(fail) id(id)
                        
                        stsplit, at(failures) riskset(riskset)
                        order riskset id
                        sort riskset id
                        
                        list riskset time id x nvisits anyv tfrom_v  fail if riskset!=., sepby(riskset)
                        stcox anyv
                        stcox nvisits
                        Last edited by Steve Samuels; 05 Mar 2015, 09:32.
                        Steve Samuels
                        Statistical Consulting
                        [email protected]

                        Stata 14.2

                        Comment


                        • #13
                          Originally posted by Steve Samuels View Post
                          I have to apologize.
                          Steve, really, apologize? You've been really extremely super helpful, and I can't thank you enough for taking the time to write this up and spell it all out. It will probably take me some time to absorb and implement. But again, thank you!! Oh, and your link finally worked!

                          Originally posted by Carlo Lazzaro View Post
                          what follows may be quite unuseful, but I would take a look at it as a source of further inspiration, at least
                          Thank you for suggesting this exercise -- I'm sure it will be helpful.

                          Good day to you both!
                          Cheers,
                          Josh

                          Comment

                          Working...
                          X