Discrete survival analysis - Question re: functional form of baseline hazard and out-of-sample-prediction

Guest

27 Mar 2017, 03:09

I am conducting a discrete-time hazard analysis using Stata 14.0. I am working with Prof. Jenkins's material*.

To me, my baseline hazard appears to call for a non-parametric approach. These are my coefficients (odds ratios) for the baseline hazard (one unit is one year); some are empty.

HTML Code:

       event | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           c |
          0  |          1  (empty)
          1  |   1.553132   2.016981     0.34   0.735     .1218407    19.79815
          2  |   1.378796   1.792739     0.25   0.805     .1078326     17.6299
          3  |    .862181   1.121018    -0.11   0.909     .0674305    11.02403
          4  |   .8890559   1.156885    -0.09   0.928     .0693909    11.39084
          5  |   .7423318   .9660042    -0.23   0.819     .0579323    9.512084
          6  |   .7198207   .9404603    -0.25   0.801     .0556048    9.318292
          7  |   .9072992   1.190255    -0.07   0.941     .0693566    11.86897
          8  |   .7818488   1.024988    -0.19   0.851     .0598708    10.21011
          9  |   .5215442   .6845979    -0.50   0.620     .0398082    6.832967
         10  |    .644339   .8533693    -0.33   0.740      .048059    8.638823
         11  |   1.204835   1.592372     0.14   0.888     .0903515    16.06644
         12  |   .8921704   1.195162    -0.09   0.932     .0645904    12.32332
         13  |   .5165349   .6978924    -0.49   0.625     .0365627    7.297289
         14  |   .4082628   .5556115    -0.66   0.510     .0283482    5.879682
         15  |   .4803247   .6430928    -0.55   0.584     .0348246    6.624971
         16  |   1.056018   1.412588     0.04   0.967     .0767462    14.53067
         17  |   .8581945   1.180493    -0.11   0.911     .0579044     12.7192
         18  |   1.024795   1.409555     0.02   0.986     .0691593    15.18529
         19  |   1.203183   1.741118     0.13   0.898     .0705608    20.51633
         20  |   .3300895   .6263513    -0.58   0.559     .0080068    13.60828
         21  |   .1806894   .2850472    -1.08   0.278     .0082057    3.978769
         22  |   1.126723   1.597316     0.08   0.933     .0700002    18.13574
         23  |   .2355325   .3772079    -0.90   0.367     .0102057    5.435749
         24  |          1  (empty)
         25  |   6.051808    10.6546     1.02   0.306     .1919951    190.7569
         26  |          1  (empty)
         27  |   1.522701   2.443217     0.26   0.793     .0655899    35.35021
         28  |          1  (empty)
         29  |          1  (empty)
         30  |          1  (empty)
         33  |          1  (empty)
         35  |          1  (empty)
         36  |          1  (empty)
         41  |          1  (omitted)
         42  |          1  (empty)
         44  |          1  (empty)

Here is the problem: I want to obtain a graph of my hazard function. However, because I have about a dozen covariates and about 45 spells for my baseline hazard variable, I think I need to do out-of-sample prediction. According to Prof. Jenkins, this means that I need to have an underlying parametric functional form for my baseline hazard** (because the -predict- command will extrapolate).
But looking at my baseline hazard coefficients, I do not see how I can establish a parametric functional form without using some higher-order polynomial function.

My question therefore is, what would be a good approach in this case? Could I also use a piecewise-constant functional form as a semi-parametric approach and still use out-of-sample-prediction?

I am very grateful for your input. Thank you in advance!

*https://www.iser.essex.ac.uk/resourc...sis-with-stata
**https://www.iser.essex.ac.uk/files/t...s/ec968st6.pdf (page 14ff)

Guest
#2

28 Mar 2017, 01:34

May I push this up again? I would in particular like to ask Stephen Jenkins again, because I am sure you can help me answer my question. I would just like to understand what to do if I do not see a parametric functional form in my baseline hazard but still need to do out-of-sample prediction, and whether a piecewise-constant form might be a solution.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#3

28 Mar 2017, 02:25

Guest: Sorry, but I have a real job and can't always respond immediately.

Your models are fit within-sample. Out of sample you have to make some (unverifiable) assumption about the variation of the hazard with survival time. Using a parametric model "solves" the problem; the assumption is built-in, so to speak. If you fit a piecewise constant model, you still have to decide how to extrapolate beyond the survival times observed in your data. You might use the hazard rate corresponding to the last "piece", but that's still an assumption. What you think reasonable is likely to depend on context and your specialist knowledge of what counts as plausible. There is no magic solution here.
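
To make the extrapolation assumption explicit, the piecewise-constant case can be written out as a sketch. It assumes the logit link implied by the odds ratios in post #1; the symbols (the piece index k(t), the number of pieces K, the largest observed duration t_max) are notation introduced here for illustration, not taken from the course notes.

```latex
% Within sample: one intercept per duration piece, shared covariate effects
h(t \mid x) = \frac{1}{1 + \exp\{-(\alpha_{k(t)} + x'\beta)\}},
\qquad k(t) \in \{1, \dots, K\}, \quad t \le t_{\max}

% Out of sample: carry the last piece forward (an assumption, not an estimate)
\alpha_{k(t)} = \alpha_K \quad \text{for all } t > t_{\max}
```

The second line is the "last piece" choice mentioned above. Nothing in the data can confirm or reject it; a parametric baseline simply replaces it with a different built-in assumption.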

Last edited by sladmin; 06 Feb 2018, 09:40. Reason: anonymize original poster
Guest
#4

29 Mar 2017, 03:37

Dear Prof. Jenkins, I am so sorry you got the impression that I take your help for granted. I always think that once my post vanishes from the first page of this forum, it will go unnoticed. Thank you very much for answering so quickly and for your very helpful responses!

I understand the assumptions behind an out-of-sample prediction and would generally prefer to predict within my sample. The reason I did not attempt that is that I have many covariates and I cannot find any combination that generates hazards. I tried using the biggest categories, the median of every continuous variable, etc., but my hazard variable remains empty.

This e.g. is the version with the median of my continuous variables:

HTML Code:

. predict h, p
(17506 missing values generated)

. g h0 = h if ls == 0 & agree == 5.333333 & neuro == 3.666667 ///
>         & consc == 5.666667 & extra == 4.666667 & open == 4.333333 & age == 38 & isc_2 == 0 & isc_3 == 1 ///
>         & sex == 1 & chil == 0 & married == 0 & migration == 0 & region == 1 & uerate == 7.7 ///
>         & linc == 9.522861 & workexp == 8
(20,611 missing values generated)

. g h1 = h if ls == 5 & agree == 5.333333 & neuro == 3.666667 ///
>         & consc == 5.666667 & extra == 4.666667 & open == 4.333333 & age == 38 & isc_2 == 0 & isc_3 == 1 ///
>         & sex == 1 & chil == 0 & married == 0 & migration == 0 & region == 1 & uerate == 7.7 ///
>         & linc == 9.522861 & workexp == 8
(20,611 missing values generated)

. g h2 = h if ls == 10 & agree == 5.333333 & neuro == 3.666667 ///
>         & consc == 5.666667 & extra == 4.666667 & open == 4.333333 & age == 38 & isc_2 == 0 & isc_3 == 1 ///
>         & sex == 1 & chil == 0 & married == 0 & migration == 0 & region == 1 & uerate == 7.7 ///
>         & linc == 9.522861 & workexp == 8
(20,611 missing values generated)

. end of do-file

. tab h0
no observations

. tab h1
no observations

. tab h2
no observations
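
One possible contributor to the empty h0, h1 and h2 is the exact equality test on non-integer values. A value such as 5.333333 is the rounded display of what is stored, and a float variable holding 7.7 does not equal the double-precision literal 7.7 either. The sketch below uses toy data and assumes, purely for illustration, that the stored score is 16/3 held as a float; the real storage types in the poster's data are not known.

```stata
* Toy data: a score stored as float whose underlying value is 16/3 (assumed)
clear
set obs 3
generate float agree = 16/3

* Exact comparison with the rounded display value matches no rows
count if agree == 5.333333

* Comparing within a small tolerance matches all three rows
count if abs(agree - 16/3) < 1e-4

* Alternative: round the literal to float precision before comparing
count if agree == float(16/3)
```

Even with tolerant matching, a person-year with all seventeen conditions true at once may simply not occur in the estimation sample, so this check addresses only one of the two possible reasons for the empty variables.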

You say my model is fit within-sample. Would you still agree with this statement with so many covariates? If so, am I overlooking a way to find combinations that exist in my data?

I tried the most basic version, with only age as a covariate and only three categories of "ls". Even this does not really work. Unfortunately I can only attach images, not insert them (or at least I have not managed to do so yet), but the image consisted of one line, one dot and one very short line. Really not a complete image at all. Apparently even this small combination of just "ls" and age does not exist for many time points.

I am still thinking about what assumption I can make about the hazard rate at the missing time points. From a logical standpoint, it would be very plausible to assume that the hazard remains stable after 20+ years, so in that regard I would like to group the durations above 20 years (the unit is years).
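
The grouping described in this paragraph can be sketched as follows. The data are toy values; c mirrors the duration variable name shown in post #1, while cgrp and cgrp_lbl are hypothetical names introduced here.

```stata
* Toy durations 0 to 44 years, plus one row with a missing duration
clear
set obs 46
generate c = _n - 1
replace c = . in 46

* Top-code at 20 so that all longer durations share one baseline category.
* min() ignores missing arguments, so missing durations are excluded explicitly.
generate byte cgrp = min(c, 20) if !missing(c)
label define cgrp_lbl 20 "20+"
label values cgrp cgrp_lbl
tabulate cgrp, missing
```

Entering i.cgrp instead of i.c in the model then gives a single "20+" piece, which is the constant-hazard assumption for long durations stated above.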

Last edited by sladmin; 06 Feb 2018, 09:40. Reason: anonymize poster
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#5

29 Mar 2017, 09:32

The model is fit using the combinations of survival times and covariates that you have in your estimation sample -- hence "within-sample". Suppose that one of the covariates is "age", a positive integer, as in the Cancer data set in my Lessons. In that data set, I don't think there are values of every possible age between the minimum and maximum. So, when you do a post-estimation predict using your estimation sample (a within-sample prediction), you will not find estimates of the hazard (etc.) for each and every possible value of age. But you could derive them with suitable adaptation of the post-estimation prediction ideas that we've been talking about. Put differently, out-of-sample prediction can refer to covariates as well as predictions of survival times beyond the maximum value observed in your estimation data set. As for how to approach the latter, your final paragraph answers your own question ...
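
One way to derive hazards for covariate values that do not occur in the sample is to evaluate the fitted model at chosen values, for example with -margins-. The sketch below is only an illustration: it uses the cancer.dta file shipped with Stata as a stand-in (it may differ from the file used in the Lessons), a log-time baseline chosen for simplicity, and variable names (id, t, event, lnt) introduced here.

```stata
* Assumption: Stata's shipped cancer.dta stands in for the real data
sysuse cancer, clear
generate id = _n

* Person-period format: one row per subject per month at risk
expand studytime
bysort id: generate t = _n
bysort id: generate byte event = (died == 1) & (_n == _N)

* Simple parametric baseline (log of time) plus covariates
generate lnt = ln(t)
logit event lnt age i.drug

* Predicted hazard at t = 1, 5, 10, 20 for a 55-year-old on drug 1,
* whether or not such a person-month exists in the data
margins, at(lnt=(0 1.609438 2.302585 2.995732) age=55 drug=1)

* The same quantity by hand for t = 10 (drug 1 is the base category)
display invlogit(_b[_cons] + _b[lnt]*ln(10) + _b[age]*55)
```

Because the prediction is computed from the coefficients rather than looked up in existing rows, it does not require any observation to match the chosen profile.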
Guest
#6

30 Mar 2017, 02:32

I grouped my duration variable after 20 units, so the values are 1, 2, 3, ..., 20+.

Unfortunately I am still really confused and unsure whether I have understood everything. So here is how I reason in my head:
Because "h0" etc. remain empty, I conclude that my data do not contain enough covariate combinations to do within-sample prediction. From your material I understand that, to predict hazards within-sample, I must actually have cases that meet the specific criteria I choose. That is easier if you only look at, e.g., age, but a lot harder if you condition on age, gender, personality, living circumstances etc. all at once. And because "h0" etc. remain empty, I think I simply cannot find any scenario in my data that works with so many covariates, so I still come to the conclusion that I need out-of-sample prediction to simulate specific scenarios. That is why I am confused when you say my data is "fit" within-sample: if it were, "h0" etc. should not remain empty, unless of course I am doing something wrong when creating those variables.

So, if I do out-of-sample prediction (and the end of your post sounds to me as though you think that, while the data is "fit" within-sample, out-of-sample prediction with grouped duration might be the solution), then I wonder: if I use my grouped duration (grouped after 20 years), do I still have to use a parametric baseline hazard, even though I no longer "need" to extrapolate between my missing baseline hazards? Or do I only need a parametric functional form if I have missing values? Is that correct? I got confused when you talked about age, which I think you meant in place of survival time, because age in your cancer analysis is just a covariate. Or does the analysis also extrapolate for every covariate?

I really hope I was able to express my thoughts and confusion, and that I have not missed something in your posts. I have read them over and over again.
Guest
#7

03 Apr 2017, 02:14

I would just like to correct my own mistake, because I have just been reading about it and now see where I went wrong: of course out-of-sample prediction is always prediction beyond something, so I understand now that I technically do not need to do this, because I do not need to extrapolate; I have all the time values I need.

So the only remaining question is whether I can still use the "simulated data" (as in the out-of-sample prediction) to do within-sample prediction. Or what is the reason that, when I try to predict specific hazard scenarios, they remain empty (as in the example in post #4)?
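
The "simulated data" idea in this question can be sketched by appending hypothetical rows after estimation and calling -predict-, which computes a hazard for any row with complete covariates, not only for rows used in fitting. As before, this is an illustration under assumptions: Stata's shipped cancer.dta is a stand-in, the log-time baseline is chosen for simplicity, and the names id, t, event, lnt, h and n0 are introduced here.

```stata
* Assumption: Stata's shipped cancer.dta stands in for the real data
sysuse cancer, clear
generate id = _n
expand studytime
bysort id: generate t = _n
bysort id: generate byte event = (died == 1) & (_n == _N)
generate lnt = ln(t)
logit event lnt age i.drug

* Append three hypothetical person-months; event stays missing there,
* and the model has already been fit, so they do not affect the estimates
local n0 = _N
set obs `=`n0' + 3'
replace t = 5 * (_n - `n0') if _n > `n0'
replace lnt = ln(t) if _n > `n0'
replace age = 55 if _n > `n0'
replace drug = 1 if _n > `n0'

* predict applies the stored coefficients to every row with complete covariates
predict h, pr
list t age drug h if _n > `n0'
```

In this setup the scenario rows are selected by position (_n > n0) rather than by matching covariate values, so the selection cannot come back empty the way the conditions in post #4 did.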