Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cox proportional hazards model : Time before failure

    Hello,

    I am using Stata 16 (have access to 17 also) in a closed system lab. I am trying to conduct a study to see the effects of living near a "factory" on various health outcomes. I am trying to see if I am using the time variant covariate option correctly on a single observation data. An example of the dataset is given below, which is longitudinal in nature and quite large.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id str30(start_date end_date) float(start_dt end_dt) int(exposure diag) float diag_dt
     1 "1oct1981"  "1oct1986"     7944  9770 0 0     .
     1 "2oct1986"  "2oct1991"     9771 11597 1 1  9060
     2 "15nov1991" "15nov1993"   11641 12372 0 1 15724
     3 "20dec1995" "20dec2001"   13137 15329 0 0     .
     3 "21dec2001" "21dec2006"   15330 17156 0 1 16501
     4 "15mar2005" "15mar2011"   16510 18701 1 0     .
     4 "16mar2011" "16mar2016"   18702 20529 0 0     .
     5 "7feb1995"  "7feb2000"    12821 14647 0 0     .
     5 "8feb2000"  "8feb2005"    14648 16475 0 1 16363
     5 "9feb2005"  "9feb2010"    16476 18302 1 1     .
     6 "2jun2003"  "2jun2005"    15858 16589 0 0     .
     7 "16aug1995" "10march1997" 13011 13583 0 0     .
     8 "2sep2001"  "10aug2005"   15220 16658 0 1 18128
     9 "8jan2014"  "9jul2017"    19731 21009 0 0     .
    10 "14mar2005" "14mar2010"   16509 18335 0 0     .
    10 "15mar2010" "15mar2015"   18336 20162 1 1 17583
    11 "1feb2013"  "15mar2014"   19390 19797 0 0     .
    12 "15may1997" "15may2002"   13649 15475 0 0     .
    13 "25nov2004" "19jan2007"   16400 17185 1 1 18351
    14 "1jan2016"  "1jan2017"    20454 20820 1 0     .
    15 "21feb2012" "21feb2017"   19044 20871 0 0     .
    16 "17jul2001" "17jul2016"   15173 20652 0 1 18938
    17 "8jan2014"  "9jun2016"    19731 20614 1 0     .
    18 "1may2008"  "31jan2011"   17653 18658 0 0     .
    end
    Since the dataset is large, I chose to bring it down to one observation per id using these codes.
    Code:
    format start_dt %td
    format end_dt %td
    format diag_dt %td
    
    
    *Exposure date, if not exposure then earliest start date* 
    gen exp_dt=start_dt if exposure==1
    sort id exp_dt
    bys id: replace exp_dt=exp_dt[_n-1] if exp_dt==.
    format exp_dt %td
    label var exp_dt "Exposure date"
    
    *Flagging exposed individual
    bys id: replace exposure=1 if exp_dt!=. & exposure==0
    label var exposure "Exposed to factories" 
    
    *Diagnosis date
    sort id diag_dt
    bys id: replace diag_dt=diag_dt[_n-1] if diag_dt==.
    label var diag_dt "Diagnosis Date"
    
    *Earliest start date (living in the area)
    sort id start_dt end_dt
    bys id: gen seq=_n
    gen org_dt=start_dt if seq==1
    sort id org_dt
    bys id: replace org_dt=org_dt[_n-1] if org_dt==. 
    format org_dt %td
    label var org_dt "Time since living in the area"
    
    *End of study date
    gen xyz="31mar2019"
    gen study_end=date(xyz, "DMY")
    format study_end %td
    
    keep if seq==1
    drop seq xyz start_date end_date start_dt end_dt
    
    *If exposure date is blank replace with earliest start date
    replace exp_dt=org_dt if exp_dt==.
    The data with single output looks like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id int(exposure diag) float(diag_dt exp_dt org_dt study_end)
    1 1 0  9060  9771  7944 21639
    2 0 1 15724 11641 11641 21639
    3 0 0 16501 13137 13137 21639
    4 1 0     . 16510 16510 21639
    5 1 0 16363 16476 12821 21639
    6 0 0     . 15858 15858 21639
    7 0 0     . 13011 13011 21639
    8 0 1 18128 15220 15220 21639
    9 0 0     . 19731 19731 21639
    10 1 0 17583 18336 16509 21639
    11 0 0     . 19390 19390 21639
    12 0 0     . 13649 13649 21639
    13 1 1 18351 16400 16400 21639
    14 1 0     . 20454 20454 21639
    15 0 0     . 19044 19044 21639
    16 0 1 18938 15173 15173 21639
    17 1 0     . 19731 19731 21639
    18 0 0     . 17653 17653 21639
    end
    format %td diag_dt
    format %td exp_dt
    format %td org_dt
    format %td study_end
    And then I declared the data as survival-time data to conduct Cox proportional hazards model.

    Code:
    **STSET & EVENT**
    gen event=[diag_dt!=.]
    replace diag_dt=study_end if event==0
    *replace event=0 if diag_dt<exp_dt /*censoring*/
    
    stset diag_dt, origin(time org_dt) enter(time exp_dt) id(id) failure(event)
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id int(exposure diag) float(diag_dt exp_dt org_dt study_end event) byte(_st _d) int(_origin _t) byte _t0
     1 1 0  9060  9771  7944 21639 1 0 .  7944    . .
     2 0 1 15724 11641 11641 21639 1 1 1 11641 4083 0
     3 0 0 16501 13137 13137 21639 1 1 1 13137 3364 0
     4 1 0 21639 16510 16510 21639 0 1 0 16510 5129 0
     5 1 0 16363 16476 12821 21639 1 0 . 12821    . .
     6 0 0 21639 15858 15858 21639 0 1 0 15858 5781 0
     7 0 0 21639 13011 13011 21639 0 1 0 13011 8628 0
     8 0 1 18128 15220 15220 21639 1 1 1 15220 2908 0
     9 0 0 21639 19731 19731 21639 0 1 0 19731 1908 0
    10 1 0 17583 18336 16509 21639 1 0 . 16509    . .
    11 0 0 21639 19390 19390 21639 0 1 0 19390 2249 0
    12 0 0 21639 13649 13649 21639 0 1 0 13649 7990 0
    13 1 1 18351 16400 16400 21639 1 1 1 16400 1951 0
    14 1 0 21639 20454 20454 21639 0 1 0 20454 1185 0
    15 0 0 21639 19044 19044 21639 0 1 0 19044 2595 0
    16 0 1 18938 15173 15173 21639 1 1 1 15173 3765 0
    17 1 0 21639 19731 19731 21639 0 1 0 19731 1908 0
    18 0 0 21639 17653 17653 21639 0 1 0 17653 3986 0
    end
    format %td diag_dt
    format %td exp_dt
    format %td org_dt
    format %td study_end
    Initially, I censored events before exposure date (code commented out) e.g., id#1. However, I have been advised to take into account events before factory opens as contribution to unexposed time using time varying option in Stata.

    So, I have used "tvc" and chose "exposure" as time varying variable. In actual dataset the code works and the output changes (I know example has no SEs and CIs). But I am not sure if it is correct. In a single observation dataset would Stata recognize "exposure" as time varying? I tried to use generate pre-exposure time and include that in texp, but quite sure I am using it wrong.

    Code:
    *Cox proportional hazard model
    stcox i.exposure
    
    *Time-varient model*
    stcox i.exposure, tvc(exposure)
    
    *Duration before exposure
    gen pre_exp=exp_dt - org_dt
    
    *Time-varient model with pre-duration*
    stcox i.exposure, tvc(exposure) texp(pre_duration)
    I would appreciate any comments on the methodology and codes. Apologies if the post is too long and if something is unclear.

    Thanks in advance.







  • #2
    Just wanted to add, perhaps I am misunderstanding the tvc function and don't need to add texp, since the exposure variable would by default interact with _t.

    Comment


    • #3
      I would have to check but pending another (presumably better) response, in texp() it seems like you choose the function of time, e.g. natural logarithm, with which you want to interact your tvc.

      2 remarks other than that: have you tested the proportional hazards assumption?

      Also, you seem to have two events, happening at diag_dt and exp_dt. Would you not be better off running a competing risks model as in Thomas (1996) to avoid treating the competing event as censored but precisely as a competing event?

      Hope this helps,
      Maxence

      Comment

      Working...
      X