Dear all,
I have a question about a Cox proportional hazards analysis, specifically about checking the proportional hazards assumption and interpreting the results.
Side note: I am using Stata/MP 15 at work and Stata/SE 18 at home.
I have a large dataset (> 250K firms, in counting-process (CP) format) that tracks firm failure on a quarterly basis (up to 40 quarters). My independent variables are team size, gender and nationality diversity (both measured with the Blau index, with a minimum of 0 and a maximum of 1 for nationality and 0.5 for gender, because gender is a binary variable) and age diversity (measured with the coefficient of variation). These variables can vary over the quarters because of changes in team composition.
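As a small illustration of the two diversity measures described above, here is a Python sketch (not Stata, and not the poster's actual code). The function names and the example teams are hypothetical, and using the population standard deviation in the coefficient of variation is an assumption. With k equally represented categories the Blau index equals 1 - 1/k, so it reaches 0.5 for two categories and approaches 1 as the number of categories grows.

```python
from collections import Counter
from statistics import mean, pstdev


def blau_index(categories):
    """Blau index: 1 minus the sum of squared category shares."""
    counts = Counter(categories)
    n = len(categories)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD is an assumption)."""
    return pstdev(values) / mean(values)


# Hypothetical four-person teams, for illustration only.
print(blau_index(["F", "F", "M", "M"]))          # evenly split, two categories
print(blau_index(["NL", "DE", "FR", "IT"]))      # four equally represented categories
print(blau_index(["F", "F", "F"]))               # homogeneous team
print(coefficient_of_variation([30, 40, 50, 60]))  # ages of a hypothetical team
```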
When checking the proportional hazards assumption, I used the following methods:
Schoenfeld residuals using estat phtest, detail (side note: industry and country are my control variables, entered as dummies).
Here the assumption appears to be violated, based on the significant tests for gender diversity (BlauGender) and team size (TMTsize). However, as I have read in other sources, this may be a consequence of the large sample size.
Using tvc() on these covariates also gives significant coefficients, which further suggests that the assumption is violated.
However, when plotting the variables using estat phtest, plot(VARIABLES) yline(0), the smoothed lines look (quite) horizontal. The results are similar for all covariates, so not all pictures are posted. Changing the bandwidth also has little effect. I am not sure at what point one would say, visually, that the assumption is not met.


But when dividing the sample into subsamples based on quarters (e.g. firms in Y0-2 (table 1) vs Y2-4 (table 2)), I see clear differences in the hazard ratios. They differ even more when comparing years 0-2 with years 8-10.
So now I am confused as to how to interpret these results. Specifically:
1) Can I conclude the assumption is violated or not?
2) Why would one method of checking the assumption be more valid/correct in my case than another?
Any advice would be very welcome!
I have also asked a similar question on Cross Validated (link1; link2), but at the time of writing I have not received an answer on this "issue".
Thanks in advance!
Best regards,
Laura
Output of estat phtest, detail (industry and country are my control variables, entered as dummies):
Code:
----------------------------------------------------------------
            |          rho        chi2       df      Prob>chi2
------------+---------------------------------------------------
1.countryc~2|      0.02250       72.46        1         0.0000
2.countryc~2|     -0.00361        1.92        1         0.1655
4.countryc~2|      0.00286        1.23        1         0.2675
5.countryc~2|      0.01147       17.51        1         0.0000
6.countryc~2|      0.01066       16.62        1         0.0000
7.countryc~2|     -0.00676        6.78        1         0.0092
8b.countr~t2|            .           .        1              .
1.industry~2|     -0.06753      611.13        1         0.0000
2.industry~2|     -0.06685      582.30        1         0.0000
4.industry~2|     -0.07532      727.20        1         0.0000
5b.industr~2|            .           .        1              .
TMTsize     |      0.01306       25.74        1         0.0000
BlauGender  |      0.05834      516.33        1         0.0000
VariationAge|      0.00189        0.55        1         0.4595
BlauNation~y|     -0.00200        0.58        1         0.4454
------------+---------------------------------------------------
global test |                  1693.43       13         0.0000
----------------------------------------------------------------
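One rough way to keep the large-sample caveat in view is to read the size of rho next to each p-value, since rho is the correlation between the scaled Schoenfeld residuals and time. The Python sketch below only re-reads the numbers from the table above. The 0.05 threshold is a conventional assumption, and no cutoff for rho is implied.

```python
# (term, rho, chi2, p) copied from the estat phtest table above; base levels omitted.
rows = [
    ("1.countryc~2", 0.02250, 72.46, 0.0000),
    ("2.countryc~2", -0.00361, 1.92, 0.1655),
    ("4.countryc~2", 0.00286, 1.23, 0.2675),
    ("5.countryc~2", 0.01147, 17.51, 0.0000),
    ("6.countryc~2", 0.01066, 16.62, 0.0000),
    ("7.countryc~2", -0.00676, 6.78, 0.0092),
    ("1.industry~2", -0.06753, 611.13, 0.0000),
    ("2.industry~2", -0.06685, 582.30, 0.0000),
    ("4.industry~2", -0.07532, 727.20, 0.0000),
    ("TMTsize", 0.01306, 25.74, 0.0000),
    ("BlauGender", 0.05834, 516.33, 0.0000),
    ("VariationAge", 0.00189, 0.55, 0.4595),
    ("BlauNation~y", -0.00200, 0.58, 0.4454),
]

ALPHA = 0.05  # conventional threshold, an assumption

# List the terms from the largest to the smallest absolute correlation.
for term, rho, chi2, p in sorted(rows, key=lambda r: abs(r[1]), reverse=True):
    flag = "p < 0.05" if p < ALPHA else "n.s."
    print(f"{term:<14} |rho| = {abs(rho):.5f}  chi2 = {chi2:>7.2f}  {flag}")

# Term with the largest absolute correlation with time.
largest = max(rows, key=lambda r: abs(r[1]))
print("largest |rho|:", largest[0], abs(largest[1]))
```

In this table every correlation is below 0.08 in absolute value, including those with very large chi2 statistics.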
Output of the Cox model with tvc() on these covariates:
Code:
Cox regression -- Breslow method for ties

No. of subjects      =      272,653             Number of obs    =   5,043,835
No. of failures      =      211,759
Time at risk         =      6757574
                                                Wald chi2(13)    =    12296.65
Log pseudolikelihood =   -1783341.2             Prob > chi2      =      0.0000

                        (Std. Err. adjusted for 246,784 clusters in BvdIdNumber)
--------------------------------------------------------------------------------------
                     |               Robust
                  _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
main                 |
         countrycat2 |
             Finland |   .8091496   .0310387    -5.52   0.000     .7505454    .8723297
               Italy |   2.439466    .078066    27.87   0.000     2.291159    2.597373
             Romania |    .289436    .015444   -23.24   0.000     .2606954    .3213451
  Russian Federation |   1.324375   .0570622     6.52   0.000     1.217128    1.441073
         Switzerland |   1.395914   .0454626    10.24   0.000     1.309593    1.487924
      United Kingdom |   2.468952   .0722197    30.90   0.000     2.331386    2.614637
                     |
        industrycat2 |
   C - Manufacturing |   2.200137   .0495874    34.99   0.000     2.105063    2.299505
 M - Professional,.. |   2.222844   .0464667    38.21   0.000     2.133612    2.315809
 J - Information a.. |   2.700483   .0545709    49.16   0.000     2.595616    2.809586
---------------------+----------------------------------------------------------------
tvc                  |
             TMTsize |   .9992306   .0001653    -4.65   0.000     .9989065    .9995547
          BlauGender |   .9817832   .0004862   -37.12   0.000     .9808307    .9827367
        VariationAge |   .9951163   .0010039    -4.85   0.000     .9931507    .9970858
     BlauNationality |   1.019324   .0006307    30.93   0.000     1.018088    1.020561
--------------------------------------------------------------------------------------
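To get a feel for what the tvc coefficients above imply, the Python sketch below converts each per-quarter hazard ratio into an implied hazard ratio at a few analysis times. It rests on assumptions the post does not confirm: that tvc() was used with the default texp(_t), that _t is measured in quarters, and that these four covariates have no main-effect term (the main block of the output lists only country and industry). Under those assumptions the hazard ratio at time t is the reported ratio raised to the power t. The function name implied_hr is hypothetical.

```python
# tvc-equation hazard ratios copied from the output above.
tvc_hr = {
    "TMTsize": 0.9992306,
    "BlauGender": 0.9817832,
    "VariationAge": 0.9951163,
    "BlauNationality": 1.019324,
}


def implied_hr(hr_per_time_unit, t):
    """Hazard ratio at analysis time t when the log hazard ratio is linear in t
    and there is no main-effect term: exp(b * t) = exp(b) ** t."""
    return hr_per_time_unit ** t


# Print the implied hazard ratio after 1, 8, 20 and 40 quarters.
for name, hr in tvc_hr.items():
    path = ", ".join(f"t={t}: {implied_hr(hr, t):.3f}" for t in (1, 8, 20, 40))
    print(f"{name:<16} {path}")
```

A ratio that looks close to 1 per quarter can compound into a sizeable change over 40 quarters, which is one reason the tvc results and the subsample results may point in the same direction.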
Output for the subsamples based on quarters, firms in Y0-2 (table 1) followed by Y2-4 (table 2):
Code:
Cox regression with Breslow method for ties

No. of subjects      =      226,587             Number of obs    =   1,975,546
No. of failures      =       25,492
Time at risk         =    1,975,546
                                                Wald chi2(13)    =     3738.62
Log pseudolikelihood =   -307650.59             Prob > chi2      =      0.0000

                                                          (Std. err. adjusted for 226,587 clusters in BvdIdNumber)
------------------------------------------------------------------------------------------------------------------------
                                                       |               Robust
                                                    _t | Haz. ratio   std. err.      z    P>|z|     [95% conf. interval]
-------------------------------------------------------+----------------------------------------------------------------
                                          industrycat2 |
                                     C - Manufacturing |   10.45771   1.487319    16.50   0.000     7.913645    13.81963
 M - Professional, scientific and technical activities |   9.003804   1.269628    15.59   0.000     6.829642    11.87009
                     J - Information and communication |   12.30777   1.723383    17.93   0.000     9.353859    16.19452
                                                       |
                                           countrycat2 |
                                               Finland |   .6878698   .1086084    -2.37   0.018     .5047883    .9373531
                                                 Italy |   2.819074   .3243733     9.01   0.000     2.249904     3.53223
                                               Romania |   .0988451   .0330192    -6.93   0.000     .0513584    .1902386
                                    Russian Federation |   .4502098   .1378809    -2.61   0.009     .2470168    .8205467
                                           Switzerland |   1.388524   .1677931     2.72   0.007       1.0957    1.759604
                                        United Kingdom |   4.049271   .4326389    13.09   0.000     3.284213    4.992549
                                                       |
                                               TMTsize |   .9584146   .0086527    -4.70   0.000     .9416047    .9755246
                                            BlauGender |   .3379189   .0089028   -41.18   0.000     .3209125    .3558265
                                          VariationAge |   .7572294   .0380637    -5.53   0.000     .6861833    .8356316
                                       BlauNationality |    1.80255   .0552501    19.22   0.000     1.697451    1.914157
------------------------------------------------------------------------------------------------------------------------
Code:
Cox regression with Breslow method for ties

No. of subjects      =      189,827             Number of obs    =   1,481,065
No. of failures      =       60,585
Time at risk         =    1,481,065
                                                Wald chi2(13)    =     5373.72
Log pseudolikelihood =   -718002.85             Prob > chi2      =      0.0000

                                                          (Std. err. adjusted for 189,827 clusters in BvdIdNumber)
------------------------------------------------------------------------------------------------------------------------
                                                       |               Robust
                                                    _t | Haz. ratio   std. err.      z    P>|z|     [95% conf. interval]
-------------------------------------------------------+----------------------------------------------------------------
                                          industrycat2 |
                                     C - Manufacturing |   5.223083   .3122031    27.66   0.000      4.64566    5.872275
 M - Professional, scientific and technical activities |   5.496831   .3200672    29.27   0.000     4.903983    6.161349
                     J - Information and communication |   6.445158   .3710698    32.36   0.000     5.757408    7.215064
                                                       |
                                           countrycat2 |
                                               Finland |   .4826577   .0377151    -9.32   0.000     .4141198    .5625389
                                                 Italy |   2.658978   .1430748    18.17   0.000     2.392837     2.95472
                                               Romania |   .3838614   .0327841   -11.21   0.000     .3246957    .4538082
                                    Russian Federation |   1.025024   .0888774     0.29   0.776     .8648256    1.214898
                                           Switzerland |   1.237009   .0697365     3.77   0.000     1.107609    1.381527
                                        United Kingdom |   2.429062   .1207806    17.85   0.000     2.203506    2.677707
                                                       |
                                               TMTsize |   .9487184   .0059191    -8.44   0.000     .9371878    .9603909
                                            BlauGender |   .5693154   .0096876   -33.10   0.000     .5506411     .588623
                                          VariationAge |   .9751457    .032161    -0.76   0.445     .9141053    1.040262
                                       BlauNationality |   1.507205   .0313781    19.71   0.000     1.446943    1.569977
------------------------------------------------------------------------------------------------------------------------
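To make the difference between the two subsample tables above concrete, the Python sketch below puts the hazard ratios of the four diversity and size covariates side by side and checks whether their 95% confidence intervals overlap. This is only a crude descriptive comparison, not a formal test: the two windows contain many of the same firms, so the estimates are not independent. The names first_table, second_table and intervals_overlap are hypothetical.

```python
# (hazard ratio, CI lower, CI upper) copied from the two subsample tables above.
first_table = {
    "TMTsize": (0.9584146, 0.9416047, 0.9755246),
    "BlauGender": (0.3379189, 0.3209125, 0.3558265),
    "VariationAge": (0.7572294, 0.6861833, 0.8356316),
    "BlauNationality": (1.80255, 1.697451, 1.914157),
}
second_table = {
    "TMTsize": (0.9487184, 0.9371878, 0.9603909),
    "BlauGender": (0.5693154, 0.5506411, 0.588623),
    "VariationAge": (0.9751457, 0.9141053, 1.040262),
    "BlauNationality": (1.507205, 1.446943, 1.569977),
}


def intervals_overlap(a, b):
    """True when two (hr, lower, upper) tuples have overlapping intervals."""
    return a[1] <= b[2] and b[1] <= a[2]


overlap = {}
for name in first_table:
    overlap[name] = intervals_overlap(first_table[name], second_table[name])
    # Ratio of the second-window hazard ratio to the first-window hazard ratio.
    ratio = second_table[name][0] / first_table[name][0]
    print(f"{name:<16} HR ratio (table 2 / table 1) = {ratio:.3f}  "
          f"CIs overlap: {overlap[name]}")
```

On these numbers only the TMTsize intervals overlap. The intervals for BlauGender, VariationAge and BlauNationality do not.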