Hello
I am investigating how to model diagnostic delay in patients using a time-to-event model. For now I am just using a fake dataset that I made up. I have hospital names (just A, B, C, etc.) and hospital type (district, base and tertiary, in order of size and presumed capability), and there is some correlation between these variables, as you would expect (although the correlation coefficient is only about 0.3).
Depending on how I compose the data, Stata will drop 2 of the hospital names because of collinearity. For example, if I have (there are other independent variables, but dataex won't display them all)
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int pid byte age str1(sex hosp_name) str8 hosp_type
100 88 "F" "H" "district"
101 92 "F" "H" "district"
102 83 "M" "H" "district"
103 22 "F" "H" "district"
104 36 "F" "H" "district"
105 23 "F" "H" "district"
106 54 "M" "H" "district"
107 22 "F" "H" "district"
108 24 "F" "H" "district"
109 40 "F" "H" "district"
110 35 "F" "H" "district"
111 54 "M" "I" "tertiary"
112 38 "M" "I" "tertiary"
113 69 "F" "I" "tertiary"
114 44 "F" "I" "tertiary"
115 78 "F" "I" "tertiary"
116 22 "M" "I" "tertiary"
117 18 "F" "I" "tertiary"
118 54 "M" "I" "tertiary"
119 78 "M" "I" "tertiary"
120 82 "M" "J" "base"
121 75 "M" "J" "base"
122 29 "F" "J" "base"
123 33 "F" "J" "base"
124  9 "M" "J" "base"
125 28 "F" "J" "base"
126  5 "F" "J" "base"
127 34 "F" "J" "base"
128 67 "F" "J" "base"
129 82 "M" "J" "base"
130 76 "F" "J" "base"
131 14 "M" "K" "tertiary"
132 52 "F" "L" "district"
end
then I use
Code:
stcox age interventionyn work_diagmade ib1.work_diagnostician ib1.time_pres ib1.hosp_type_cat ib5.hosp_name_cat ib6.class_cat ib1.hosp_dept_cat diag_delay_cat
I get
Code:
failure _d: finaldxmadeevent == 1
analysis time _t: timetofdhrs
note: 10.hosp_name_cat omitted because of collinearity
note: 11.hosp_name_cat omitted because of collinearity
Iteration 0: log likelihood = -508.85853
Iteration 1: log likelihood = -455.5346
Iteration 2: log likelihood = -445.76463
Iteration 3: log likelihood = -445.36341
Iteration 4: log likelihood = -445.36065
Iteration 5: log likelihood = -445.36065
Refining estimates:
Iteration 0: log likelihood = -445.36065
Cox regression -- Breslow method for ties
No. of subjects = 132 Number of obs = 132
No. of failures = 127
Time at risk = 6599
LR chi2(33) = 127.00
Log likelihood = -445.36065 Prob > chi2 = 0.0000
------------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
age | 1.003524 .004595 0.77 0.442 .9945586 1.012571
interventionyn | 16.86043 22.57084 2.11 0.035 1.222859 232.4668
work_diagmade | 1.284512 .537458 0.60 0.550 .5656964 2.916708
|
work_diagnostician |
2 | 2.314053 .8398244 2.31 0.021 1.136193 4.712967
3 | .704744 .3231542 -0.76 0.445 .2868933 1.731181
4 | 9.244435 11.13115 1.85 0.065 .872881 97.9052
|
time_pres |
2 | .5770304 .1574091 -2.02 0.044 .3380632 .9849165
3 | .7103709 .2494742 -0.97 0.330 .3569052 1.413896
|
hosp_type_cat |
district | 86.27767 161.0894 2.39 0.017 2.221346 3351.048
tertiary | 1.685938 2.812441 0.31 0.754 .0641045 44.33994
|
hosp_name_cat |
A | 18.94269 32.88749 1.69 0.090 .6304086 569.1953
B | 2.255655 4.541459 0.40 0.686 .0436006 116.6952
C | 1.240626 1.678445 0.16 0.873 .0875082 17.58867
D | 5.45929 6.9921 1.33 0.185 .4435492 67.19399
F | 17.18861 28.00398 1.75 0.081 .7054219 418.8251
G | 30.2716 54.84421 1.88 0.060 .8687224 1054.848
H | .1174483 .0947504 -2.65 0.008 .0241628 .570882
I | 1.086235 1.284395 0.07 0.944 .1070138 11.02575
J | 1 (omitted)
K | 1 (omitted)
L | .8346748 .9255692 -0.16 0.871 .0949777 7.335218
|
class_cat |
CARD | 1.139817 .8413863 0.18 0.859 .2682239 4.843645
DERM | .089733 .1292684 -1.67 0.094 .0053299 1.510722
ENDO | 5.394387 5.95128 1.53 0.127 .6206777 46.88329
ENT | .7240182 .6892986 -0.34 0.734 .1120383 4.678778
GIT | .5346554 .4234097 -0.79 0.429 .1132353 2.524446
NEURO | .1985368 .1604488 -2.00 0.045 .0407321 .9677098
O&G | 2.595949 2.143298 1.16 0.248 .5146561 13.09408
OPTH | .0492302 .0735976 -2.01 0.044 .0026285 .9220434
ORTH | .4876306 .3422696 -1.02 0.306 .1232054 1.929977
RESP | .4791978 .358318 -0.98 0.325 .1106707 2.074899
RHEU | .2413165 .2718251 -1.26 0.207 .0265321 2.194837
|
hosp_dept_cat |
OP | .0904296 .0515044 -4.22 0.000 .0296147 .2761303
WARD | .3969466 .2390002 -1.53 0.125 .1219626 1.291926
|
diag_delay_cat | .125937 .0572148 -4.56 0.000 .0516942 .3068071
------------------------------------------------------------------------------------
with hospitals J and K omitted.
However, if I use a slightly different dataset (removing the single entries at the end)
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int pid byte age str1(sex hosp_name) str8 hosp_type
100 88 "F" "H" "district"
101 92 "F" "H" "district"
102 83 "M" "H" "district"
103 22 "F" "H" "district"
104 36 "F" "H" "district"
105 23 "F" "H" "district"
106 54 "M" "H" "district"
107 22 "F" "H" "district"
108 24 "F" "H" "district"
109 40 "F" "H" "district"
110 35 "F" "H" "district"
111 54 "M" "I" "tertiary"
112 38 "M" "I" "tertiary"
113 69 "F" "I" "tertiary"
114 44 "F" "I" "tertiary"
115 78 "F" "I" "tertiary"
116 22 "M" "I" "tertiary"
117 18 "F" "I" "tertiary"
118 54 "M" "I" "tertiary"
119 78 "M" "I" "tertiary"
120 82 "M" "J" "base"
121 75 "M" "J" "base"
122 29 "F" "J" "base"
123 33 "F" "J" "base"
124  9 "M" "J" "base"
125 28 "F" "J" "base"
126  5 "F" "J" "base"
127 34 "F" "J" "base"
128 67 "F" "J" "base"
129 82 "M" "J" "base"
130 76 "F" "J" "base"
end
then do the Cox analysis I get
Code:
failure _d: finaldxmadeevent == 1
analysis time _t: timetofdhrs
note: 9.hosp_name_cat omitted because of collinearity
note: 10.hosp_name_cat omitted because of collinearity
Iteration 0: log likelihood = -498.92338
Iteration 1: log likelihood = -446.34793
Iteration 2: log likelihood = -436.90194
Iteration 3: log likelihood = -436.51654
Iteration 4: log likelihood = -436.51395
Iteration 5: log likelihood = -436.51395
Refining estimates:
Iteration 0: log likelihood = -436.51395
Cox regression -- Breslow method for ties
No. of subjects = 130 Number of obs = 130
No. of failures = 125
Time at risk = 6543
LR chi2(31) = 124.82
Log likelihood = -436.51395 Prob > chi2 = 0.0000
------------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------------+----------------------------------------------------------------
age | 1.003466 .0045866 0.76 0.449 .9945163 1.012496
interventionyn | 16.63677 22.2671 2.10 0.036 1.207253 229.2661
work_diagmade | 1.278301 .5334783 0.59 0.556 .5641544 2.896466
|
work_diagnostician |
2 | 2.299247 .8330164 2.30 0.022 1.130305 4.677089
3 | .7099638 .3245912 -0.75 0.454 .2897824 1.739404
4 | 9.025128 10.8631 1.83 0.068 .8529121 95.4998
|
time_pres |
2 | .576285 .1569979 -2.02 0.043 .3378654 .9829489
3 | .7086565 .2483589 -0.98 0.326 .3565496 1.408483
|
hosp_type_cat |
district | 86.06734 160.6045 2.39 0.017 2.22059 3335.865
tertiary | 1.861716 2.431047 0.48 0.634 .1440145 24.06693
|
hosp_name_cat |
A | 17.32781 25.69319 1.92 0.054 .9475577 316.8704
B | 2.090338 3.6887 0.42 0.676 .0657885 66.4176
C | 1.260545 1.703829 0.17 0.864 .0891297 17.82765
D | 4.9162 3.42367 2.29 0.022 1.25559 19.24913
F | 15.52633 20.80612 2.05 0.041 1.123087 214.6468
G | 30.46847 55.1695 1.89 0.059 .8761396 1059.566
H | .1188044 .0956162 -2.65 0.008 .024534 .5753029
I | 1 (omitted)
J | 1 (omitted)
|
class_cat |
CARD | 1.140103 .8399422 0.18 0.859 .269056 4.831094
DERM | .0914465 .1316415 -1.66 0.097 .0054428 1.536424
ENDO | 5.369706 5.922601 1.52 0.128 .618165 46.64409
ENT | .7334168 .697303 -0.33 0.744 .1137792 4.727579
GIT | .5348655 .4225605 -0.79 0.428 .1137022 2.516057
NEURO | .2021763 .1629472 -1.98 0.047 .0416573 .9812273
O&G | 2.583215 2.129421 1.15 0.250 .51344 12.99665
OPTH | .050093 .0748447 -2.00 0.045 .0026791 .9366326
ORTH | .4921213 .3442852 -1.01 0.311 .1249041 1.938955
RESP | .4831219 .3605901 -0.97 0.330 .1118771 2.086279
RHEU | .2495317 .2809191 -1.23 0.218 .0274698 2.266709
|
hosp_dept_cat |
OP | .0925485 .0525888 -4.19 0.000 .0303873 .281869
WARD | .4003935 .2408648 -1.52 0.128 .1231486 1.301801
|
diag_delay_cat | .1281343 .0579422 -4.54 0.000 .0528144 .3108694
------------------------------------------------------------------------------------
with I and J omitted.
I'm curious how/why Stata chooses these different hospital names to omit. I guess this is just the way the algorithms go but it always chooses two categories to omit no matter how I change the data.
If you leave hospital type out of the model, no hospital names are dropped.
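For anyone wanting to reproduce this, here is a minimal sketch of how one could check whether hospital names are nested within hospital types. It assumes the string variables hosp_name and hosp_type from the dataex example; pair_tag and n_types are just made-up helper names. If every hospital falls under exactly one type, each type indicator is a sum of hospital indicators, which might explain why two hospital categories are dropped whenever the two non-base type indicators are in the model.

```stata
* Cross-tabulate hospital name against hospital type.
* If names are nested within types, each row has exactly one non-zero cell.
tabulate hosp_name hosp_type

* Flag one observation per distinct (hospital, type) pair.
* pair_tag is a hypothetical helper variable.
egen byte pair_tag = tag(hosp_name hosp_type)

* Count how many distinct types each hospital appears under.
* n_types is a hypothetical helper variable.
egen n_types = total(pair_tag), by(hosp_name)

* A single column headed 1 means every hospital belongs to exactly one type.
tabulate hosp_name n_types
```

If n_types equals 1 for every hospital, the names are fully nested within the types, and the result in the last sentence (nothing dropped once hospital type is left out) would be what one expects.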
Thanks and regards
Chris
