
  • Categorical variables and univariate regression

    Hi all!

    I want to use univariate regression to look at the relationship between the level of differentiation of a tumor and mortality. I have a variable called 'differentiation' with 3 levels: well, moderate, poor. I have also made 3 separate indicator variables ('well_differentiation', 'moderate_differentiation', 'poor_differentiation'), each storing either yes or no. Which of these approaches should I be using for univariable analysis, and what is the difference between them? They all seem to give different results.
    1. stcox i.differentiation
    2. stcox well_differentiation moderate_differentiation poor_differentiation
    3. Three separate statements: stcox well_differentiation; stcox moderate_differentiation; stcox poor_differentiation
    If I use competing risks regression with stcrreg, does this change which of the approaches above I should use or would it be the same as for stcox?

    Thanks a lot in advance!

  • #2
    Definitely not #3. This will not give you results that enable you to compare the outcomes across the different levels of differentiation. Each of those will only contrast that particular level of differentiation with an unspecified mix of the other two levels. Not what you want, I'm confident.
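
    To see concretely what #3 estimates, here is a minimal sketch (commands only, no output, using the same webuse cancer data as in the log below); the comparison group in each one-indicator model is simply everyone not at that level:
    Code:
    * sketch of approach #3: one hand-made indicator at a time
    webuse cancer, clear
    stset studytime, failure(died)   // cancer.dta may already be stset; redeclaring is harmless
    gen byte drug1 = (drug == 1)
    gen byte drug2 = (drug == 2)
    stcox drug1   // HR for drug 1 vs. drugs 2 and 3 pooled together
    stcox drug2   // HR for drug 2 vs. drugs 1 and 3 pooled together
    Each of those hazard ratios answers a different pooled comparison, which is why they cannot be lined up against the factor-variable results.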

    #1 and #2 should produce the same results. You are probably either doing something wrong in the way you have set these analyses up, or you are misinterpreting the results. Because #1 is much simpler, you are less likely to make mistakes with it, and I would recommend proceeding with that.

    To understand how you may be misinterpreting your results in #1 and #2, look at the following outputs:

    Code:
    . webuse cancer, clear
    (Patient Survival in Drug Trial)
    
    . 
    . forvalues i = 1/3 {
      2.         gen drug`i' = `i'.drug
      3. }
    
    . 
    . stcox i.drug
    
             failure _d:  died
       analysis time _t:  studytime
    
    Iteration 0:   log likelihood = -99.911448
    Iteration 1:   log likelihood = -86.958129
    Iteration 2:   log likelihood = -86.375607
    Iteration 3:   log likelihood = -86.345511
    Iteration 4:   log likelihood = -86.345483
    Refining estimates:
    Iteration 0:   log likelihood = -86.345483
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =           48                  Number of obs    =          48
    No. of failures =           31
    Time at risk    =          744
                                                    LR chi2(2)       =       27.13
    Log likelihood  =   -86.345483                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------
              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |
              2  |   .2307664    .111665    -3.03   0.002     .0893895    .5957423
              3  |   .0707403   .0433131    -4.33   0.000     .0213055    .2348783
    ------------------------------------------------------------------------------
    
    . 
    . stcox drug?
    
             failure _d:  died
       analysis time _t:  studytime
    
    note: drug3 omitted because of collinearity
    Iteration 0:   log likelihood = -99.911448
    Iteration 1:   log likelihood = -86.958129
    Iteration 2:   log likelihood = -86.375607
    Iteration 3:   log likelihood = -86.345511
    Iteration 4:   log likelihood = -86.345483
    Refining estimates:
    Iteration 0:   log likelihood = -86.345483
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =           48                  Number of obs    =          48
    No. of failures =           31
    Time at risk    =          744
                                                    LR chi2(2)       =       27.13
    Log likelihood  =   -86.345483                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------
              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           drug1 |   14.13621   8.655353     4.33   0.000     4.257524    46.93629
           drug2 |   3.262161    2.11951     1.82   0.069     .9129727    11.65609
           drug3 |          1  (omitted)
    ------------------------------------------------------------------------------
    The results of these two regressions look very different. But, in fact, they are the same. Notice that in the first regression, Stata has chosen to break the collinearity among the drug indicators by omitting 1.drug, but in the second one it does so by omitting drug3. The second analysis says that the hazard for drug1 is 14.13621 times that of drug3. You don't see that reflected directly in the first analysis--but it is there if you ferret it out: the hazard ratio for 3.drug there is .0707403, and, of course, with 1.drug being the reference, this means that the hazard ratio for 1.drug:3.drug is 1/.0707403, which is 14.13621--precisely the result shown in the second analysis.
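
    As a quick arithmetic check (the last digits can differ slightly because the displayed hazard ratio is rounded):
    Code:
    display 1/.0707403   // approximately 14.13621, the drug1 hazard ratio in the second table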

    Now, of course, you don't want to have to be doing these kinds of calculations, or even figuring out which calculations to do. All you need to do is make sure the same reference category is omitted in both models. So, for example, if we run
    Code:
    . stcox ib3.drug
    
             failure _d:  died
       analysis time _t:  studytime
    
    Iteration 0:   log likelihood = -99.911448
    Iteration 1:   log likelihood = -86.958129
    Iteration 2:   log likelihood = -86.375607
    Iteration 3:   log likelihood = -86.345511
    Iteration 4:   log likelihood = -86.345483
    Refining estimates:
    Iteration 0:   log likelihood = -86.345483
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =           48                  Number of obs    =          48
    No. of failures =           31
    Time at risk    =          744
                                                    LR chi2(2)       =       27.13
    Log likelihood  =   -86.345483                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------
              _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            drug |
              1  |   14.13621   8.655353     4.33   0.000     4.257524    46.93629
              2  |   3.262161    2.11951     1.82   0.069     .9129727    11.65609
    ------------------------------------------------------------------------------
    we can see directly that the results are exactly the same as those using the hand-created indicators.

    So go with method #1 because it's just so much easier and less prone to coding errors.
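
    Applied to your variables, that would look something like the sketch below, assuming differentiation is a numeric variable with value labels (if it is stored as a string, encode it first) and that poor happens to be coded 3; change the ib# prefix to pick whichever baseline you want:
    Code:
    stcox i.differentiation     // level 1 (well) is the reference by default
    stcox ib3.differentiation   // level 3 (poor) as the reference instead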



    • #3
      Thank you so much for explaining this - that was incredibly helpful!

      I assume the same applies to competing risk analysis with stcrreg?
      Last edited by Thomas Weilz; 08 Jun 2019, 12:24.



      • #4
        Yes. It applies to all regression models involving a linear predictor.
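
        For instance, with stcrreg the factor-variable notation works in exactly the same way. A rough sketch, where time and failtype are hypothetical variable names (failtype == 1 marking the event of interest and failtype == 2 the competing event):
        Code:
        * hypothetical variables: time, failtype (1 = event of interest, 2 = competing event)
        stset time, failure(failtype == 1)
        stcrreg ib3.differentiation, compete(failtype == 2)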
