
  • #16
    Hi Clyde,

    Does this mean that repetition of 2019 in my dataset will not result in my time-varying variables (e.g. Unemployment) being repeated twice for that particular year when running the regression?
    Since Unemployment is the only variable causing problems when I run the Cox regression, do you recommend omitting it, or do you have any other advice on what to do? I am not sure what is wrong with the variable that causes error r(430).

    Thanks



    • #17
      Does this mean that repetition of 2019 in my dataset will not result in my time-varying variables (e.g. Unemployment) being repeated twice for that particular year when running the regression?
      Yes. The survival analysis ignores the Start variable. Once you -stset- the data, the survival analysis uses only the variables mentioned in -stset-, plus any variables mentioned in your -stcox- command. It doesn't even know, let alone care, that Start has a repeated value.

      As for the independent variables in your -stcox- command, while they appear to have repeated values in two observations, you have to look at the time sequence. Stata interprets the values of the independent variables as prevailing from the time variable in the preceding observation to the time variable in the current observation. Of the two observations that have Start = 2019, one has time = 1/1/2019 and the other has time = DOF. Now, the code guarantees that the one with time != DOF has Event = 0. So there are two possibilities. If the time == DOF event is earlier than 1/1/2019, then the whole process stops at time DOF, and the subsequent observation with time = 1/1/2019 is ignored. If the time == DOF event is later than 1/1/2019, then the values of the predictor variables are "extended" to DOF.

      But that is what you need. You cannot leave those variables as missing, because if you do, the DOF observation will be excluded from the analysis, which means that Stata will never see a failure event for that company. So some value must be supplied to cover the period ending at time == DOF. In the absence of actual data for that period, the most reasonable approach is to carry forward the value from the immediately preceding observation, which is what the code does.
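      A minimal sketch of how this looks in practice; the variable names here (company, time, Event) are assumptions, not necessarily those in your dataset:

      * Sketch only -- names are illustrative. After -stset-, each record's
      * covariate values apply over the interval (_t0, _t] built for it.
      stset time, id(company) failure(Event==1)
      list company _t0 _t _d Unemployment, sepby(company) noobs
      stcox Unemployment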

      As for your Unemployment variable, there is nothing obviously wrong with the variable itself. But, for some reason, it breaks the analysis when you add it to your model, creating a singularity in the likelihood function. These likelihood functions are very complicated, and it is really impossible to develop an intuition for how particular variable distributions cause singularities or non-concave regions. You can try experimenting with some transformations of Unemployment to see if they work better, but there is no principled approach I can recommend. It is trial and error at best, and leaving the variable out is the usual end result of the process.
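      For what it's worth, a sketch of that trial-and-error process; the transformations and names (z_unemp, ln_unemp) are just common first attempts, not specific recommendations:

      * Sketch only: try each transformed version in place of the raw variable.
      summarize Unemployment
      generate z_unemp = (Unemployment - r(mean)) / r(sd)
      generate ln_unemp = ln(Unemployment)    // only valid if Unemployment > 0
      stcox z_unemp    // then repeat with ln_unemp, etc.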



      • #18
        Hi there,

        I did not change my data for Unemployment, but I ran -tostring Unemployment, replace- followed by -destring Unemployment, replace-, and for some reason the Cox model now runs for all variables with no issues!
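        In case it helps anyone else, here is exactly what I ran, with -describe- added before and after to see whether the storage type changed (my unconfirmed guess at why this made a difference):

        * The round trip as I ran it; -describe- added to show any change
        * in storage type, which is only a guess at the mechanism.
        describe Unemployment
        tostring Unemployment, replace
        destring Unemployment, replace
        describe Unemployment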

        Clyde Schechter, thank you so much for your ongoing support. I was really stuck with my model and my deadline is approaching soon, so you have helped me out massively! I really appreciate it.



        • #19
          Hi Clyde Schechter,

          Apologies, I have one last question. As mentioned before, my model uses constant variables (for example, ownership type) but also variables that change over time (the macroeconomic variables, which take a different value for each observed year). For this reason, the proportional hazards assumption is violated: when I run the Schoenfeld test (-estat phtest-), I get a p-value of 0.00.
          This of course makes sense, since I am using variables that change over time and are not expected to have a constant hazard over time. However, existing literature on business survival also employs the Cox model with variables that change over time, and it mentions 'modification' or 'augmentation' of the Cox model to accommodate this - for example, Arnab Bhattacharjee, 2005. "Models of Firm Dynamics and the Hazard Rate of Exits: Reconciling Theory and Evidence using Hazard Regression Models," Econometrics 0503021, University Library of Munich, Germany.

          I am not sure how to augment my model. Or can I simply report that the proportional hazards assumption is violated, without this affecting the robustness of my results?



          • #20
            This of course makes sense, since I am using variables that change over time and are not expected to have a constant hazard over time.
            No, "variables that change over time" and variables "that are not expected to have a constant hazard over time" are two different situations. All four combinations of those two properties of a variable in a model are possible. This is, in fact, the same confusion I spoke about earlier in connection with the phrase "time-varying covariates."

            Violation of the proportional hazards assumption means, by definition, that the hazard ratio for the variable changes over time. This can happen equally to variables that are themselves constant over time and to variables that vary over time. You need to identify which variable(s) in your model are contributing to the violation. Run your -stcox- command again and follow it with -estat phtest, detail-. That will give you a separate proportional hazards test for each variable. Then you will have to deal with the ones found to violate PH. Some of them may be variables that vary over time, and some may not.
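            A minimal sketch of that diagnostic step; x1 and x2 are placeholders for your actual covariates:

            * Sketch -- x1, x2 stand in for the model's covariates.
            stcox x1 x2 Unemployment
            estat phtest, detail    // one PH test per covariate, plus a global test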

            Once you have identified those variables, the remedy is typically to make them -tvc()- variables. Then re-check proportional hazards. If they are still in violation, try -stcox- again, adding the -texp(ln(_t))- option. If that doesn't solve the problem, you may have to experiment with transformations of those variables--it can get very complicated. In connection with that, if your sample is large, -estat phtest- can be "too sensitive" to PH violations, picking up minor deviations that are of no practical consequence. So, if you cannot placate -estat phtest- with the simpler approaches based just on -tvc()-, perhaps with -texp(ln(_t))-, consider using the -stphplot- command instead: if the graphs look OK, you can ignore the statistically significant phtest results.
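            Putting those steps together in a sketch; here x is a placeholder for a covariate flagged as violating PH, and x_group is an assumed categorical version of it for the graphical check:

            * Sketch -- x and x_group are placeholders, not actual variables.
            stcox x other_covariates, tvc(x)               // hazard ratio varies with _t
            stcox x other_covariates, tvc(x) texp(ln(_t))  // varies with ln(_t) instead
            * Graphical check: roughly parallel -ln(-ln(survival)) curves
            * suggest the PH assumption is tenable in practice.
            stphplot, by(x_group)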
