I'm trying to estimate a set of survival analysis regressions with stcox, looking at the effect of the local unemployment rate on the hazard that an individual experiences an outcome. Because unemployment varies continuously, I have already expanded the data to one observation per individual per failure time until their exit, so that at each failure time the unemployment covariate reflects the current rate (stcox would otherwise use the last observed rate). My data are quite large, with around a million individuals, but since the data are quarterly, I have a relatively small number of failure times. I would now like to add an interaction between unemployment and time at risk. In theory this is what the tvc() option is for, and per the manual, in large data it should be more efficient than using stsplit and creating the interactions manually. However, in part because I already had to fill in observations to build the unemployment covariate, tvc() is in fact drastically slower. If I stsplit my data, no observations are created (which confirms that my manual expansion with expand etc. was done correctly), and simply adding c.unemployment#c._t to the Cox regression runs relatively quickly (maybe an hour with my full data). Using the tvc() option ran for at least 6 days, at which point our server was rebooted for maintenance, so it never finished.
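For concreteness, the two specifications I am comparing look roughly like the sketch below. The variable names (id, t_end, event, unemployment) are placeholders rather than my actual variables, and the stset line assumes the data are already one record per individual per interval at risk.

```stata
* Placeholder names: id, t_end, event, unemployment (not my real variables)
* Data are assumed to be already expanded: one record per person per interval at risk
stset t_end, id(id) failure(event)

* Check that the data are already split at every failure time;
* with my data this step adds no observations
stsplit, at(failures)

* (a) Explicit interaction with analysis time on the split data.
*     This is only valid because every record ends at a failure time,
*     so _t in each record equals the failure time of its risk set.
stcox unemployment c.unemployment#c._t

* (b) The same model requested through tvc(); texp(_t) is the default
*     and is written out here only for clarity
stcox unemployment, tvc(unemployment) texp(_t)
```

Version (a) is the one that finishes in about an hour on the full data; version (b) is the one that did not finish in 6 days.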
My immediate problem is solved (the data are effectively stsplit already, so I can just include whatever interactions I need), but it has me wondering what is going on under the hood with the tvc() option such that it is theoretically faster with un-stsplit data, yet significantly slower with data that are mostly or entirely stsplit already and/or have a relatively small number of failure times. In other words, when can I expect tvc() to be superior to filling in the data, when not, and why?