  • Average hazard rate with Cox proportional hazards model always larger than the hazard rate without any explanatory variables?

    Note: the question is also posted here without the Stata specifics.

    I am estimating a Cox proportional hazards model with and without explanatory variables. Without explanatory variables, the hazard rate is just the proportion of individuals that failed at time t out of all individuals that survived to at least time t.
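
    For concreteness, that proportion can be computed directly; a minimal sketch, assuming the one-observation-per-individual-per-month layout described in the code below (the variable names are the same ones used there):

    Code:
      **** Raw hazard at time t: failures at t over the number still at risk at t
      bysort spell_duration: egen n_at_risk = count(individual_id)
      bysort spell_duration: egen n_failed = total(failure_indicator)
      gen raw_hazard_rate = n_failed / n_at_risk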

    After estimating the model with covariates, I calculate the predicted hazard for each observation in the estimation sample by multiplying the baseline hazard by exp(z_i′β). I then average the predicted hazard at each unit of time across all individuals that have not yet failed at that time.
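
    In symbols, writing \(R(t)\) for the set of individuals still at risk at time \(t\), the quantity I average is
    \[
    \bar{h}(t) = \hat{h}_0(t)\,\frac{1}{|R(t)|}\sum_{i \in R(t)} \exp(z_i'\hat{\beta}).
    \]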

    My problem is that the averaged hazard rate is always larger than the hazard rate from the model without any explanatory variables. I do not understand why including a covariate makes the average hazard rate uniformly larger: it would make sense for the two not to be exactly equal at every t, but I would expect them to agree on average. If it helps, below is the Stata code I am using:

    Code:
      /*
      My data has one observation per individual for each month the
      individual has not yet failed.  So an individual who survives for
      three months and then fails contributes one row for each of those
      months at risk.
      */

      **** Declare the survival-time data
      stset spell_duration, failure(failure_indicator) id(individual_id)

      **** Fit the Cox model with the covariate
      stcox explanatory_variable

      **** Predict the baseline hazard contribution, the same for every person at each time t
      predict baseline_hazard_rate, basehc

      **** Predict the hazard ratio, which varies across individuals
      predict hazard_ratio_for_individual, hr

      **** basehc is filled in only for observations that failed at that time,
      **** so spread the estimate to every observation with the same spell_duration
      bysort spell_duration: egen temp = max(baseline_hazard_rate)
      replace baseline_hazard_rate = temp
      drop temp

      **** The predicted hazard for each individual at time t
      gen individual_hazard_rate = baseline_hazard_rate*hazard_ratio_for_individual

      **** Fit the null model (no covariates)
      /*
      Note: this estimate is the same as taking the proportion of individuals
      that failed at period t out of all individuals that lasted until at
      least time t.  In Stata I could get that proportion with:

      collapse failure_indicator, by(spell_duration)
      */
      stcox, estimate

      **** Predict the null-model hazard, the same for all individuals at time t
      predict simple_hazard_rate_estimate, basehc

      **** Average both hazards within each unit of time
      collapse simple_hazard_rate_estimate individual_hazard_rate, by(spell_duration)

    My problem is that the average of individual_hazard_rate is always larger than simple_hazard_rate_estimate. See the figure below (the hazard from the model with the explanatory variable is always larger, but the difference decreases over time):


    I have also tried this with and without sample weights, made sure that none of the explanatory variables have missing values, and made sure that none of the explanatory variables change over the duration of the survey, and I experience the same problem in every case.
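
    After the final collapse, the comparison in the figure can be drawn directly; a minimal sketch using the variable names above:

    Code:
      **** Plot both averaged hazard series against analysis time
      twoway (line individual_hazard_rate spell_duration) ///
             (line simple_hazard_rate_estimate spell_duration)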

    Thank you for any help you can provide.


  • #2
    The table should look like:

    [Image: avg_hazard_rates.png]


    • #3
      The two averages will be close, but I can think of some reasons they might be systematically different in a single sample.

      1. The estimated model is incorrect. You have only a linear term in your explanatory variable: have you checked the model fit via estat phtest (a test of the proportional-hazards assumption) or linktest (a specification test of the linear predictor)?

      2. Bias of the default Breslow likelihood method in the presence of heavily tied data.

      3. Random variability in the estimated regression coefficient in a single sample.

      4. The estimated hazards with coefficients are the average of non-linear terms in \(\hat{\beta}\), with weights that depend on the estimates. So I'd expect systematic bias even if the true \(\beta\) were substituted, bias that would diminish as the size of the risk sets increased.
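
      To put a sketch behind point 4 (holding the estimated baseline hazard fixed and ignoring the estimate-dependent weights): by Jensen's inequality,
      \[
      \frac{1}{|R(t)|}\sum_{i \in R(t)} \exp(z_i'\hat{\beta}) \;\ge\; \exp\!\big(\bar{z}(t)'\hat{\beta}\big),
      \]
      where \(\bar{z}(t)\) is the mean covariate vector in the risk set at time \(t\). The average of the exponentials exceeds the exponential at the average covariate, so the averaged predicted hazard need not reproduce the no-covariate estimate in any finite sample.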
      Last edited by Steve Samuels; 11 Feb 2015, 07:07.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2
