
  • Categorical variables and univariate regression

    Hi all!

    I want to use univariate regression to look at the relationship between the level of differentiation of a tumor and mortality. I have a variable called 'differentiation' with 3 levels: well, moderate, poor. I have also made 3 separate indicator variables ('well_differentiation', 'moderate_differentiation', 'poor_differentiation'), each storing either yes or no. Which of these approaches should I be using for univariable analysis, and what is the difference between them? They all seem to give different results.
    1. stcox i.differentiation
    2. stcox well_differentiation moderate_differentiation poor_differentiation
    3. Three separate statements: stcox well_differentiation; stcox moderate_differentiation; stcox poor_differentiation
    If I use competing risks regression with stcrreg, does this change which of the approaches above I should use or would it be the same as for stcox?

    Thanks a lot in advance!

  • #2
    Definitely not #3. This will not give you results that enable you to compare the outcomes across the different levels of differentiation. Each of those will only contrast that particular level of differentiation with an unspecified mix of the other two levels. Not what you want, I'm confident.
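
    To see concretely what #3 estimates, here is a minimal sketch (commands only, no output, using the same webuse cancer data as in the log below); the comparison group in each one-indicator model is simply everyone not at that level:
    Code:
    * sketch of approach #3: one hand-made indicator at a time
    webuse cancer, clear
    stset studytime, failure(died)   // cancer.dta may already be stset; redeclaring is harmless
    gen byte drug1 = (drug == 1)
    gen byte drug2 = (drug == 2)
    stcox drug1   // HR for drug 1 vs. drugs 2 and 3 pooled together
    stcox drug2   // HR for drug 2 vs. drugs 1 and 3 pooled together
    Each of those hazard ratios answers a different pooled comparison, which is why they cannot be lined up against the factor-variable results.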

    #1 and #2 should produce the same results. You are probably either doing something wrong in the way you have set these analyses up, or you are misinterpreting the results. Because #1 is much simpler, you are less likely to make mistakes with it, and I would recommend proceeding with that.

    To understand how you may be misinterpreting your results in #1 and #2, look at the following outputs:

    Code:
    . webuse cancer, clear
    (Patient Survival in Drug Trial)
    
    . 
    . forvalues i = 1/3 {
      2.         gen drug`i' = `i'.drug
      3. }
    
    . 
    . stcox i.drug
    
             failure _d:  died
       analysis time _t:  studytime
    
    Iteration 0:   log likelihood = -99.911448
    Iteration 1:   log likelihood = -86.958129
    Iteration 2:   log likelihood = -86.375607
    Iteration 3:   log likelihood = -86.345511
    Iteration 4:   log likelihood = -86.345483
    Refining estimates:
    Iteration 0:   log likelihood = -86.345483
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =           48                  Number of obs    =          48
    No. of failures =           31
    Time at risk    =          744
                                                    LR chi2(2)       =       27.13
    Log likelihood  =   -86.345483                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------
              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |
              2  |   .2307664    .111665    -3.03   0.002     .0893895    .5957423
              3  |   .0707403   .0433131    -4.33   0.000     .0213055    .2348783
    ------------------------------------------------------------------------------
    
    . 
    . stcox drug?
    
             failure _d:  died
       analysis time _t:  studytime
    
    note: drug3 omitted because of collinearity
    Iteration 0:   log likelihood = -99.911448
    Iteration 1:   log likelihood = -86.958129
    Iteration 2:   log likelihood = -86.375607
    Iteration 3:   log likelihood = -86.345511
    Iteration 4:   log likelihood = -86.345483
    Refining estimates:
    Iteration 0:   log likelihood = -86.345483
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =           48                  Number of obs    =          48
    No. of failures =           31
    Time at risk    =          744
                                                    LR chi2(2)       =       27.13
    Log likelihood  =   -86.345483                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------
              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           drug1 |   14.13621   8.655353     4.33   0.000     4.257524    46.93629
           drug2 |   3.262161    2.11951     1.82   0.069     .9129727    11.65609
           drug3 |          1  (omitted)
    ------------------------------------------------------------------------------
    The results of these two regressions look very different. But, in fact, they are the same. Notice that in the first regression, Stata has chosen to break the collinearity among the drug indicators by omitting 1.drug, but in the second one it does so by omitting drug3. The second analysis says that the hazard for drug1 is 14.13621 times that of drug3. You don't see that reflected directly in the first analysis--but it is there if you ferret it out: the hazard ratio for 3.drug there is .0707403, and, of course, with 1.drug being the reference, this means that the hazard ratio for 1.drug:3.drug is 1/.0707403, which is 14.13621--precisely the result shown in the second analysis.
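
    As a quick arithmetic check (the last digits can differ slightly because the displayed hazard ratio is rounded):
    Code:
    display 1/.0707403   // approximately 14.13621, the drug1 hazard ratio in the second table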

    Now, of course, you don't want to have to be doing these kinds of calculations, or even figuring out which calculations to do. All you need to do is make sure the same reference category is omitted in both models. So, for example, if we run
    Code:
    . stcox ib3.drug
    
             failure _d:  died
       analysis time _t:  studytime
    
    Iteration 0:   log likelihood = -99.911448
    Iteration 1:   log likelihood = -86.958129
    Iteration 2:   log likelihood = -86.375607
    Iteration 3:   log likelihood = -86.345511
    Iteration 4:   log likelihood = -86.345483
    Refining estimates:
    Iteration 0:   log likelihood = -86.345483
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =           48                  Number of obs    =          48
    No. of failures =           31
    Time at risk    =          744
                                                    LR chi2(2)       =       27.13
    Log likelihood  =   -86.345483                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------
              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |
              1  |   14.13621   8.655353     4.33   0.000     4.257524    46.93629
              2  |   3.262161    2.11951     1.82   0.069     .9129727    11.65609
    ------------------------------------------------------------------------------
    we can see directly that the results are exactly the same as those using the hand-created indicators.

    So go with method #1 because it's just so much easier and less prone to coding errors.
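
    Applied to your variables, that would look something like the sketch below, assuming differentiation is a numeric variable with value labels (if it is stored as a string, encode it first) and that poor happens to be coded 3; change the ib# prefix to pick whichever baseline you want:
    Code:
    stcox i.differentiation     // level 1 (well) is the reference by default
    stcox ib3.differentiation   // level 3 (poor) as the reference instead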



    • #3
      Thank you so much for explaining this - that was incredibly helpful!

      I assume the same applies to competing risk analysis with stcrreg?
      Last edited by Thomas Weilz; 08 Jun 2019, 12:24.



      • #4
        Yes. It applies to all regression models involving a linear predictor.
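
        For instance, with stcrreg the factor-variable notation works in exactly the same way. A rough sketch, where time and failtype are hypothetical variable names (failtype == 1 marking the event of interest and failtype == 2 the competing event):
        Code:
        * hypothetical variables: time, failtype (1 = event of interest, 2 = competing event)
        stset time, failure(failtype == 1)
        stcrreg ib3.differentiation, compete(failtype == 2)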
