  • Cox model with an unbalanced panel

    Hello everyone,

    I'm running a Cox model with an unbalanced panel from 2015-2021 to measure the factors that influence the survival of companies. I have my time variable and the failure variable, and my covariates vary over time: they are GDP, an agglomeration index, the growth rates of economic sectors, and other indexes. These variables differ each year and also vary depending on the region. When the covariates are included in the analysis, they turn out to be significant, but none of them meet the proportional hazards assumption. My question is, is there an error in my database?
    The commands I use to run the model are
    -snapspan idn time died, generate(time0) replace-
    -stset time, id(idn) failure(died)-
    -stcox- (with my covariates)
    -estat phtest-

    I have reviewed some works where tests of the proportional hazards assumption are not included, so I don't know if it is necessary.

    thanks in advance
    Ana

  • #2
    My question is, is there an error in my database?
    Well, I would be the last person to reassure somebody that their data is OK, especially if I haven't seen it. Most real world data sets contain errors. But to answer the thrust of your question, even if your data are perfectly correct, it is entirely possible that the proportional hazards assumption simply isn't true of your data. In fact, in real world situations, it is often not true.

    The failure of the proportional hazards assumption implies that the variables in question do not actually have a single hazard ratio that characterizes their effect on survival. It means that their effects change over time, so that separate hazard ratios are needed for different time periods. This is not hard to understand: different factors may have major effects on the viability of a firm during its startup period, but be less important once it is mature, or vice versa. Sometimes, you can overcome this problem by dividing the timeline up into a small number of eras, creating a discrete era variable, and then expanding the model to add the era variables themselves and their interactions with the variables for which proportional hazards fails.

    I have reviewed some works where tests of the proportional hazards assumption are not included, so I don't know if it is necessary.
    In my experience, tests of the proportional hazards assumption are usually not included. But that doesn't make it OK. It's OK if you can convince yourself that the departure from proportional hazards is actually too small to matter for practical purposes. Remember that using statistical tests of the proportional hazards assumption can be tricky. If your sample is very large, you can get statistically significant rejection of the assumption even when the departures from proportional hazards are too small to be of any practical importance. On the other hand, sometimes failing to report such tests is just sweeping the problem under the rug and hoping nobody will notice.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post

      The failure of the proportional hazards assumption implies that the variables in question do not actually have a single hazard ratio that characterizes their effect on survival. It means that their effects change over time, so that separate hazard ratios are needed for different time periods. This is not hard to understand: different factors may have major effects on the viability of a firm during its startup period, but be less important once it is mature, or vice versa. Sometimes, you can overcome this problem by dividing the timeline up into a small number of eras, creating a discrete era variable, and then expanding the model to add the era variables themselves and their interactions with the variables for which proportional hazards fails.

      Thank you very much for your answer; it has been very helpful.


      I had a question about dividing the timeline into a small number of eras. I don't know whether this refers to using the -recode- command to recode the time variable into, for example, three groups. If that is not what the comment refers to, could you tell me how I could divide my timeline?

      Thanks a lot again for everything
      Best regards
      Ana

      Comment


      • #4
        No, I'm sorry for the misunderstanding; I should have been clearer and less metaphorical.

        Supposing that we settle on three eras, I'm suggesting that you pick two times that define the cutpoints between the consecutive eras. Let's say, for sake of illustration that you decide the three eras are from time 0 to time 50, time 50 to time 70, and from time 70 onward. Then you would define a new variable:
        Code:
        gen byte era = 1 if time < 50
        replace era = 2 if inrange(time, 50, 70)
        replace era = 3 if time > 70 & !missing(time)  // guard so missing time is not coded as era 3
        Then in your -stcox- command you would include i.era##(those covariates that violate the proportional hazards assumption). Don't forget to prefix those covariates with c. or i. as appropriate. Evidently, the success of this plan depends on a good choice of the time cutpoints (and how many--three eras was just an example). You need to have a sense of how the effects of those covariates on failure risk are likely to vary over time to get this right. That depends, in turn, on the depth of your knowledge about how these covariates work.
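        For concreteness, a hedged sketch of what the resulting -stcox- call might look like. The variable names here (inn, gdp, sector_growth) are purely hypothetical placeholders for your own covariates, and the clustering choice is illustrative only:

        Code:
        * hypothetical names: suppose inn and gdp violate proportional hazards
        * and sector_growth does not
        stcox i.era##(c.inn c.gdp) c.sector_growth, vce(cluster idn)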

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          No, I'm sorry for the misunderstanding; I should have been clearer and less metaphorical.

          Supposing that we settle on three eras, I'm suggesting that you pick two times that define the cutpoints between the consecutive eras. Let's say, for sake of illustration that you decide the three eras are from time 0 to time 50, time 50 to time 70, and from time 70 onward. Then you would define a new variable:
          Code:
          gen byte era = 1 if time < 50
          replace era = 2 if inrange(time, 50, 70)
          replace era = 3 if time > 70 & !missing(time)  // guard so missing time is not coded as era 3
          Then in your -stcox- command you would include i.era##(those covariates that violate the proportional hazards assumption). Don't forget to prefix those covariates with c. or i. as appropriate. Evidently, the success of this plan depends on a good choice of the time cutpoints (and how many--three eras was just an example). You need to have a sense of how the effects of those covariates on failure risk are likely to vary over time to get this right. That depends, in turn, on the depth of your knowledge about how these covariates work.

          Thank you very much for the explanation, it was really helpful.

          I just have a few more questions, related to this.
          I tested each one of my variables individually to see if it met the proportional hazards assumption. With the era variable, the variables that previously did not meet the assumption now meet it. However, when I include all the variables that meet the assumption in my general model (using the -stcox- command), the condition is no longer met. My question is, although my individual variables meet this assumption, does the general model that includes all these variables also have to meet the assumption?
          I hope I have expressed my ideas correctly.

          Finally, by any chance, do you know of any document you could recommend on including this type of "era" variable in the Cox model? It would be of great help.

          Thanks a lot again for everything,
          Best regards
          Ana

          Comment


          • #6
            My question is, although my individual variables meet this assumption, does the general model that includes all these variables also have to meet the assumption?
            Strictly speaking, yes. But, as you have already noted, many people do not bother testing proportional hazards at all. If your individual variables are passing the test, but all together they don't, it suggests that the departures from the assumption for each variable are small enough to "get by" the testing process, but totaled up they aren't. One issue we never dealt with is: what is your sample size? In any kind of model, when testing fit (and the PH assumption tests are a kind of fit test), we have to recognize that in real life our simple models are rarely the actual data generating process, and a sufficiently large sample will always find statistically significant deviations from fit, even if those deviations are too small to matter for any practical purpose. If you are working with a large sample, I would be inclined to ignore the overall test at this point.
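            A toy calculation (numbers invented) shows how sample size alone drives these rejections: a test statistic for a fixed departure grows roughly with the square root of n.

            Code:
            * the same small departure (0.01 on the test's scale, sd = 1)
            * at two sample sizes
            display "n = 1,000:     z = " %6.2f .01*sqrt(1000)     // about 0.32, nowhere near significant
            display "n = 4,000,000: z = " %6.2f .01*sqrt(4000000)  // 20, overwhelmingly "significant"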

            I'm sorry but I do not have any reference to offer you about the approach taken. I think if you find a text on survival analysis, where they discuss the Cox proportional hazards model, it will likely say that one solution to the departure from the proportional hazards model is to include a predictor#time interaction, and that is what you have done here.

            Comment


            • #7
              Originally posted by Clyde Schechter View Post
              One issue we never dealt with is: what is your sample size? In any kind of model, when testing fit (and the PH assumption tests are a kind of fit test), we have to recognize that in real life our simple models are rarely the actual data generating process, and a sufficiently large sample will always find statistically significant deviations from fit, even if those deviations are too small to matter for any practical purpose. If you are working with a large sample, I would be inclined to ignore the overall test at this point.


              I see. There are three samples of panel data, and all three exceed one million observations; in two of the samples I have 4 million observations. So I think it would make sense to ignore the test as a whole and only consider the tests of each of my variables, right?


              Thank you very much, again
              Ana

              Comment


              • #8
                Yes. With a sample that size, it is going to be extremely difficult not to have these tests reject pretty much any hypothesis you give them. p-values and hypothesis testing in data sets that large are pretty useless.

                Comment


                • #9
                  Originally posted by Clyde Schechter View Post
                  Yes. With a sample that size, it is going to be extremely difficult not to have these tests reject pretty much any hypothesis you give them. p-values and hypothesis testing in data sets that large are pretty useless.

                  I understand very well, thank you very much, everything has been very helpful.

                  I would like to ask you a few more questions related to the era (time) variable.
                  In the command I used the double ##, which from what I was reading requests a full factorial interaction; however, after seeing some examples and comparing them with my results, I have a couple of doubts.

                  When I run the regression in Stata I only get the hazard ratio for my "era" variable, but the standard error, z, etc. are marked with a "." Is this correct?

                  Then the command returns the results of the variable without the interaction, and finally it returns the results of its interaction with the "era" variable. My question is, when analyzing the hazard ratio, should I analyze the two results (with interaction and without interaction)?

                  Finally, I have seen that the "era" variable can also be used with a single #. Is this also correct, or should the interaction be with ##? And what is the difference between # and ##?
                  I tested with a single # and the results are the same if I also include the era indicators (a dummy for each time group).

                  Attached is a table of the regression results (INN.png).



                  Thank you very much again
                  Ana
                  Last edited by Ana Guerrero; 09 Nov 2022, 12:55.

                  Comment


                  • #10
                    When I run the regression in Stata I only get the hazard ratio for my "era" variable, but the standard error, z, etc. are marked with a "." Is this correct?
                    This is OK. When you get a variable that demarcates a part of the timeline, it is, in effect, built into the baseline hazard function, and you get no standard error for it.

                    when analyzing the hazard ratio, should I analyze the two results (with interaction and without interaction)?
                    For present purposes, only the era1#c.inn results are relevant. The coefficient of the era variable itself is, as I said just above, basically an adjustment to the baseline hazard and has no bearing on the effects of the variable inn. The important thing to understand now, however, is that by using this interaction model, you are stipulating that there is no longer any unique hazard ratio that describes the effect of inn. Instead, there are two hazard ratios, one in each of the two eras. In the base value of era, the hazard ratio for inn is given by the hazard ratio shown for inn in the -stcox- output. For the other era, the hazard ratio for inn is the product of the hazard ratio shown for inn with the "hazard ratio" (it is actually a ratio of hazard ratios) for the interaction coefficient. You can calculate the latter most easily with -lincom inn + 2.era1#c.inn-.

                    Finally, I have seen that the "era" variable can also be used with a single #. Is this also correct, or should the interaction be with ##? And what is the difference between # and ##?
                    I tested with a single # and the results are the same if I also include the era indicators (a dummy for each time group).
                    In any Stata regression model, -a##b- is the same thing as -a b a#b-. That is, in fact, the definition of ##. The a#b term is the interaction itself. If you run the model with just a#b, you get equivalent results to running it with a##b, but the models are parameterized differently and the meaning of the coefficients of what look like the same terms in both models is different. If you are algebraically inclined, it is possible to transform the coefficients in either model to calculate the coefficients in the other.
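                    To illustrate with hypothetical names (era1 as the era indicator, inn as the covariate), the following two specifications fit the same model but parameterize it differently:

                    Code:
                    stcox i.era1##c.inn          // expands to: i.era1 c.inn i.era1#c.inn (base slope + difference)
                    stcox i.era1 i.era1#c.inn    // one inn slope per era; same fit, different coefficients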

                    Comment


                    • #11
                      Originally posted by Clyde Schechter View Post


                      For present purposes, only the era1#c.inn results are relevant. The coefficient of the era variable itself is, as I said just above, basically an adjustment to the baseline hazard and has no bearing on the effects of the variable inn. The important thing to understand now, however, is that by using this interaction model, you are stipulating that there is no longer any unique hazard ratio that describes the effect of inn. Instead, there are two hazard ratios, one in each of the two eras. In the base value of era, the hazard ratio for inn is given by the hazard ratio shown for inn in the -stcox- output. For the other era, the hazard ratio for inn is the product of the hazard ratio shown for inn with the "hazard ratio" (it is actually a ratio of hazard ratios) for the interaction coefficient. You can calculate the latter most easily with -lincom inn + 2.era1#c.inn-.
                      Thank you very much for all your comments. These days I have been working with the database and trying different things, because in some cases the "era" variable did not change my results, which led me to try new approaches, and other doubts arose.

                      My database is made up of companies that belong to different economic sectors. One sector predominates over the others, so one option was to separate that sector from the rest, leaving me with a smaller sample of about 300 thousand observations. With that sample I ran the model and the proportional hazards tests for the variables again, and the result was that some of the variables did pass the tests, when previously none of them had.
                      I did the same with the companies that belonged to the other sector (the one that predominates), where the sample is more than 1 million observations, and in that case the variables do not pass the proportional hazards test. This led me to wonder whether, in my case, where I have a fairly large database, these tests are informative, or whether they remain a necessary condition of these models. I know that you previously told me that the global test was not necessary due to the size of the sample. However, it seems that this also influences the individual tests of the variables.

                      Thank you very much again for everything.
                      Best regards,
                      Ana

                      Comment


                      • #12
                        I know that you previously told me that the global test was not necessary due to the size of the sample. However, it seems that this also influences the individual tests of the variables.
                        I'm sorry I was not clearer in my explanation. The sample size affects all p-value based statistical tests. With very large sample sizes, any p-value test may provide a "significant" verdict when the actual difference being tested is too small to matter for practical purposes. This is a general statistical principle, and does not depend on the particular test. This is one of the reasons why I seldom use p-value based tests to choose among models.

                        Let me also clarify something else. The proportional hazards assumption needs to be true, to a good approximation, for the results of a proportional hazards analysis to be valid. The important part of that sentence is "to a good approximation." I think it is fair to say that in the real world, it is very rare for the proportional hazards assumption to be exactly true. In fact, I doubt it ever happens. But it is often close to true, close enough that the deviation can be ignored for practical purposes. The problem with p-value based tests of the assumption is that in very large samples like the one you are working with, small deviations that are ignorable for practical purposes show up as "statistically significant" deviations. So while the proportional hazards assumption itself is necessary, the tests that are commonly used to check it are sometimes going to give misleading results, especially in very large samples (or in very small samples they can fail in the opposite way.)

                        Looking at your results in #9, I see that the interaction hazard "ratio" (it is actually a ratio of hazard ratios, not a hazard ratio) is 0.97. That is close enough to 1 that I think the departure from proportional hazards is small enough that you need not worry about it, unless an error of 3% in your original results is, in your judgment, unacceptably high.

                        Comment


                        • #13
                          Originally posted by Clyde Schechter View Post
                          I'm sorry I was not clearer in my explanation. The sample size affects all p-value based statistical tests. With very large sample sizes, any p-value test may provide a "significant" verdict when the actual difference being tested is too small to matter for practical purposes. This is a general statistical principle, and does not depend on the particular test. This is one of the reasons why I seldom use p-value based tests to choose among models.
                          Thank you very much for the clarifications. I have been reading about the p-value issue and, as you mention, in very large samples like the one I am using these tests can be misleading.

                          This problem led me to do another analysis of my data: I generated smaller random samples and ran the model again for each of the variables, and the result was that some of the variables now passed the assumption that they did not pass before with the complete sample. Can this serve as a test to see if the variables comply with the assumption?



                          Thank you very much again
                          Best regards,
                          Ana

                          Comment


                          • #14
                            Can this serve as a test to see if the variables comply with the assumption?
                            Well, let me try to elaborate on my remarks in #12. There are few, if any, real world data-generating processes that exactly satisfy the proportional hazards assumption. Any sufficiently sensitive test (especially in large samples) will pick up that departure even if the proportional hazards assumption is approximately true and close enough for practical purposes. At the other end, any sufficiently insensitive test (especially in small samples) will fail to detect even gross departures that are large enough to make the use of the Cox model inappropriate.

                            It seems to me you are seeking some kind of test that you can use to prove your model is correct. But, and I know it's a cliche, it is nevertheless true and important: as Box said, all models are wrong and some models are useful. The question is whether your model is useful.

                            What might "useful" mean in this context? The proportional hazards assumption says that there is a unique ratio that characterizes the hazards of subsets defined by different values of the predictors and covariates in your model. It says, in effect, that if the hazard for, say, the agricultural sector is 60% of that for the financial sector now, it remains 60% at all times, past and future. A shorter way of saying that is that there is no interaction between the sector effect and time. That is the definition, the meaning, of the proportional hazards assumption. It is simply the assertion that there is no omitted interaction term involving time. Now, as I have repeatedly emphasized, this assumption is unlikely to be exactly true in any real-world situation. That is because in the real world things are complex and in nearly all situations you can imagine, there will always be some degree of interaction between the effects of any two variables. Yet, in other contexts, we do not routinely include interaction terms among all variables in all regression models. That is because we are often confident that interactions are small enough that they can be ignored for practical purposes. The same applies here. Given the complexity of the world, there will almost always be interactions between the effects of time period and the effects of any variables in your survival analysis. The important question is whether they are small enough to ignore. That is, if we shoehorn our analysis into the assumption that the hazard ratio really is a single unique number for all time, how far will we go astray?

                            Since you have already looked at the effect of relaxing that assumption and allowing the hazard ratio for variable inn to differ in two eras, you can easily quantify how far astray you go. Looking at the output you show in #9, we can see that in era 1, the hazard ratio is 2.70. In era 2, it is 2.70 * 1.03 = 2.78. (I'm rounding everything to 2 decimal places). So what you have to ask yourself is whether ignoring the difference between a hazard ratio of 2.70 and 2.78 matters. There is no "test" that can answer this question. It is a matter of professional judgment. Is this 3% error important? Are the data themselves even accurate to within 3%? Are there any decisions or interpretations that would be made differently in this situation (2.70 in one era and 2.78 in another) than if we were to just elide the difference and treat them as the same? You will have to bring to bear on these questions your substantive understanding of the variables involved, the mechanisms by which they affect each other (to the extent those are understood), and to what uses the results of your analysis might be put. And then you will have to make a judgment as to whether the proportional hazards model, though wrong, is nevertheless useful. Again, it's a judgment call. There is no test that answers this question. All of the tests that can be applied to this give (approximate) answers to a different question: is the proportional hazards assumption true. But for real world research, that is the wrong question, and applying those tests creates the illusions of objectivity and of the model being definitely wrong or definitely not wrong. But they cannot answer the question that really matters: is the proportional hazards model useful for this data?
                            Last edited by Clyde Schechter; 20 Nov 2022, 16:40.

                            Comment


                            • #15
                              Setup COX PH data

                              Dear Stata Forum,

                              I'm running a Cox model with an unbalanced panel from 2004-2021 to see whether var1 influences the failure of companies.

                              My data looks like this:
                              Code:
                              id  year  var1  var1_stime  failure
                               1  2004     0        2018        0
                               1  2005     0        2018        0
                               1  2006     0        2018        0
                               1  2007     0        2018        0
                               1  2008     0        2018        0
                               1  2009     0        2018        0
                               1  2010     0        2018        0
                               1  2011     0        2018        0
                               1  2012     0        2018        0
                               1  2013     0        2018        0
                               1  2014     0        2018        0
                               1  2015     0        2018        0
                               1  2016     0        2018        0
                               1  2017     0        2018        0
                               1  2018     1        2018        0
                               1  2019     1        2018        0
                               1  2020     0        2018        0
                               2  2005     0        2006        0
                               2  2006     1        2006        0
                               2  2007     1        2006        0
                               2  2008     0        2006        0
                               2  2009     0        2006        0
                               2  2010     0        2006        1
                               3  2008     1        2008        0
                               3  2004     1        2004        0
                               3  2005     1        2004        0
                               3  2006     1        2004        0
                               3  2007     0        2004        1
                              An observation enters the study in the year when var1 first equals 1.
                              An observation leaves the study either (1) in the year when failure=1, or (2) at the end of the sample period if failure=0 throughout.

                              How should I write the code?
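                              One possible starting point, offered only as a sketch and not a vetted solution: take entry at var1_stime (the first year with var1=1), treat exit as the failure year or the last observed year, and note that this ignores the question of whether var1 switching back to 0 should end exposure. Compare the -snapspan-/-stset- setup in #1, and check the timing conventions in -help stset- before relying on it:

                              Code:
                              * sketch only -- timing conventions need checking against your design
                              keep if year >= var1_stime
                              snapspan id year failure, generate(year0) replace
                              stset year, id(id) failure(failure==1) origin(time var1_stime)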







                              Comment
