Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Interpretation of interaction between binary and categorical variables (and margins) after Cox regression

    Dear all,

    I have a question on the interpretation of interaction effect between binary and categorical variable after Cox regression. I am studying if having a diagnosis affects the risk of dying differently in different educational levels.

    I have read several posts (and the links included in these) related to this topic:
    https://www.statalist.org/forums/for...interpretation
    https://www.statalist.org/forums/for...ent-categories
    https://www.stata.com/statalist/arch.../msg01122.html
    https://www.statalist.org/forums/for...vival-analysis
    https://www.statalist.org/forums/for...ferent-samples
    https://www.statalist.org/forums/for...ns-after-stcox

    I have also studied the examples of Maarten L. Buis: http://www.maartenbuis.nl/publications/interactions.html.
    I have interpreted my results especially following the "Example of a categorical by continuous interaction in a Cox regression model for survival data” (https://www.stata.com/statalist/arch.../msg01332.html). However, my case is slightly different since I have a binary and categorical variable.

    My question is:
    1. Have I misinterpreted the results on Cox regression's interaction effect between diagnosis and educational level? If so, how?
    2. Or have I misinterpreted the results of margins and marginsplot instead?

    I'm using Stata/MP 15.1. The information on education and diagnosis is measured in 2010. Individuals are followed from 2011 to 2015.

    Results:

    Code:
    stset time, failure(died) id (id)
    
    stcox i.diag ##i.edu
    margins, at(diag=(0 1) edu=(1 2 3))
    marginsplot, scheme(s1mono)
     
    Cox regression -- Breslow method for ties
     
    No. of subjects =       99,760                  Number of obs    =      99,760
    No. of failures =       10,287
    Time at risk    =  401420.2051
                                                    LR chi2(9)       =     2827.01
    Log likelihood  =   -112414.25                  Prob > chi2      =      0.0000
     
    ------------------------------------------------------------------------------------
                    _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
                1.diag |   3.676446   .4114305    11.63   0.000     2.952367    4.578108
                       |
                   edu |
            secondary  |   1.646723   .0612913    13.40   0.000     1.530871    1.771342
                basic  |   2.898637   .0890877    34.63   0.000     2.729183    3.078612
                       |
              diag#edu |
          1#secondary  |   .8658759    .118324    -1.05   0.292     .6624253    1.131812
              1#basic  |    .631256   .0739209    -3.93   0.000     .5017977     .794113
    ------------------------------------------------------------------------------------
     
    . margins, at(diag=(0 1) edu=(1 2 3))
     
    Predictive margins                              Number of obs     =     99,760
    Model VCE    : OIM
     
    Expression   : Predicted hazard ratio, predict()
     
    1._at        : diag            =           0
                   edu             =           1
     
    2._at        : diag            =           0
                   edu             =           2
     
    3._at        : diag            =           0
                   edu             =           3
     
    4._at        : diag            =           1
                   edu             =           1
     
    5._at        : diag            =           1
                   edu             =           2
     
    6._at        : diag            =           1
                   edu             =           3
     
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             _at |
              1  |   1.008644   .0292157    34.52   0.000     .9513822    1.065906
              2  |   1.660957   .0789297    21.04   0.000     1.506258    1.815657
              3  |   2.923693   .1268029    23.06   0.000     2.675164    3.172222
              4  |   3.708225   .4283425     8.66   0.000     2.868689    4.547761
              5  |   5.287402   .4463702    11.85   0.000     4.412533    6.162272
              6  |   6.785243   .3498909    19.39   0.000      6.09947    7.471017
    ------------------------------------------------------------------------------
     
    . marginsplot
     
    Variables that uniquely identify margins: diag edu
    Click image for larger version

Name:	interaction_diag_edu.png
Views:	1
Size:	50.6 KB
ID:	1453969




    diag = have diagnosis (1=yes, 0=no), edu = education level (1=tertiary, 2=secondary, 3=basic), event = died between 2010-2017 (yes/no).

    Interpretation:
    Having a diagnosis increases the hazard by 3.67 times among tertiary educated.

    Secondary educated have a 1.64 times higher risk for mortality when compared to highly educated. Those with only basic education are 2.89 times more likely to die than the highly educated.

    Those with secondary education and a diagnosis, have a 14% (1-0.86) smaller risk of dying compared to highly educated with a diagnosis - However, the difference is not statistically significant. Those with basic education and a diagnosis have a 37% smaller risk of dying than tertiary educated (statistically significant). In other words, wouldn't these results suggest that having a diagnosis is more "harmful" for tertiary educated than for those with only secondary or basic education?

    However, looking at the results after margins and marginsplot: here the results do not suggest that the diagnosis would have different effect in different educational levels. These results are more similar to what is produced after running a Cox regression without the main effects:

    Code:
    stcox i.diag#i.edu
    
    ------------------------------------------------------------------------------------
                    _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
         diag#edu |
        0#secondary  |   1.646723   .0612913    13.40   0.000     1.530871    1.771342
             0#basic  |   2.898637   .0890877    34.63   0.000     2.729183    3.078612
         1#tertiary  |   3.676446   .4114305    11.63   0.000     2.952367    4.578108
        1#secondary  |    5.24209   .4157252    20.89   0.000     4.487451    6.123634
             1#basic  |   6.727095   .2851373    44.97   0.000      6.19082    7.309824
    ------------------------------------------------------------------------------------
    How could I produce marginsplot including also the main effects?

    Please let me know if I have missed something and that I my problem could be solved by revisiting some links that I have listed above.

    Best,
    Inge

  • #2
    Your interpretation of the results is almost right. Suggested improvements in italics:

    Code:
    Having a diagnosis increases the hazard by 3.67 times among tertiary educated.
    
    Secondary educated have a 1.64 times higher hazard for mortality when compared to highly educated, when there is no diagnosis. Those with only basic education have a hazard 2.89 times as high as the highly educated, when there is no diagnosis..
    
    Those with secondary education and a diagnosis, have a 14% (1-0.86) lower hazard of dying compared to highly educated with a diagnosis - However, the difference is not statistically significant. Those with basic education and a diagnosis have a 37% smaller hazard  of dying than tertiary educated with a diagnosis (statistically significant).
    Note that a hazard is not a risk and the two terms should not be used interchangeably. A risk refers to the probability of dying over some defined interval of time. By contrast, the hazard refers to an instantaneous rate of mortality. The hazard is like the reading of the speedometer at any moment during a drive, whereas the risk is analogous to the odometer reading at the end of the trip. More to the point, if a long enough time period is contemplated, everybody eventually dies so that the risk ratio converges to 1 as time goes to infinity, whereas the hazard ratio remains constant (if, as is assumed in Cox regression, the proportional hazards assumption is correct.)

    In other words, wouldn't these results suggest that having a diagnosis is more "harmful" for tertiary educated than for those with only secondary or basic education?
    Yes.

    Code:
    stcox i.diag#i.edu
    is a mis-specified model. The results shown near the end of your post are meaningless and should be ignored and discarded.

    Any model with an interaction must include the constituent ("main") effects as well. (Well, it must include the information provided by the constituent effects. It is OK to omit a constituent effect if it is colinear with something else in the model--in that case the necessary information is there under a different name. This happens, for example, in fixed-effects regression models, where one of the constituent effects is defined at the group level. Even so, when writing Stata code for such a model, it is best to use the ## notation, or to explicitly mention the constituent effects. Stata will recognize the colinearity and drop something for you. And, as a bonus, if there is something wrong with the data so that the expected colinearity does not hold, Stata's failure to omit that variable is an early warning sign that you have a problem..

    Comment


    • #3
      Thank you, Clyde, for your deliberate answer and clarifications!

      Comment


      • #4
        Hi Inge. Another way you can make sense of the output from -stcox- is to construct the coefficients from the -margins- results. Note that there may be a bit of rounding error.

        Code:
        . /*
        > Predictive margins                              Number of obs     =     99,760
        > Model VCE    : OIM
        >  
        > Expression   : Predicted hazard ratio, predict()
        > 1._at        : diag            =           0
        >                edu             =           1 <-- Tertiary (ref)
        > 2._at        : diag            =           0
        >                edu             =           2 <-- Secondary
        > 3._at        : diag            =           0
        >                edu             =           3 <-- Basic
        > 4._at        : diag            =           1
        >                edu             =           1 <-- Tertiary (ref)
        > 5._at        : diag            =           1
        >                edu             =           2 <-- Secondary
        > 6._at        : diag            =           1
        >                edu             =           3 <-- Basic
        >  ------------------------------------------------------------------------------
        >              |            Delta-method
        >              |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
        > -------------+----------------------------------------------------------------
        >          _at |
        >           1  |   1.008644   .0292157    34.52   0.000     .9513822    1.065906
        >           2  |   1.660957   .0789297    21.04   0.000     1.506258    1.815657
        >           3  |   2.923693   .1268029    23.06   0.000     2.675164    3.172222
        >           4  |   3.708225   .4283425     8.66   0.000     2.868689    4.547761
        >           5  |   5.287402   .4463702    11.85   0.000     4.412533    6.162272
        >           6  |   6.785243   .3498909    19.39   0.000      6.09947    7.471017
        > ------------------------------------------------------------------------------
        > */
        .
        . * Compute coefficients from -margins- results
        . * NOTE:  di is short for display
        . * Coefficient for 1.diag
        . di 3.708225 / 1.008644
        3.6764458
        
        . * Coefficients for secondary and basic
        . di 1.660957 / 1.008644
        1.6467227
        
        . di 2.923693 / 1.008644
        2.8986372
        
        . * Coefficients for 1#secondary and 1#basic
        . di (5.287402 / 3.708225) / (1.660957 / 1.008644)
        .86587614
        
        . di (6.785243 / 3.708225) / (2.923693 / 1.008644)
        .63125593
        
        .
        . /*
        > ------------------------------------------------------------------------------------
        >                 _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
        > -------------------+----------------------------------------------------------------
        >             1.diag |   3.676446   .4114305    11.63   0.000     2.952367    4.578108
        >                    |
        >                edu |
        >         secondary  |   1.646723   .0612913    13.40   0.000     1.530871    1.771342
        >             basic  |   2.898637   .0890877    34.63   0.000     2.729183    3.078612
        >                    |
        >           diag#edu |
        >       1#secondary  |   .8658759    .118324    -1.05   0.292     .6624253    1.131812
        >           1#basic  |    .631256   .0739209    -3.93   0.000     .5017977     .794113
        > ------------------------------------------------------------------------------------
        > */
        Here's the basic code without output:
        Code:
        * Compute coefficients from -margins- results
        * NOTE:  di is short for display
        * Coefficient for 1.diag
        di 3.708225 / 1.008644
        * Coefficients for secondary and basic
        di 1.660957 / 1.008644
        di 2.923693 / 1.008644
        * Coefficients for 1#secondary and 1#basic
        di (5.287402 / 3.708225) / (1.660957 / 1.008644)
        di (6.785243 / 3.708225) / (2.923693 / 1.008644)
        HTH.
        --
        Bruce Weaver
        Email: [email protected]
        Version: Stata/MP 19.5 (Windows)

        Comment


        • #5
          Thank you Bruce, that was very helpful!

          Comment


          • #6
            Hi all, thank you so much for this detailed and thorough post. I have a very similar dataset and I have run the code exactly as above but the main covariates always get omitted from colinearity, but based on the margins plot I do believe there is a significant interaction, but are the margins invalid since there aren't main covariates and interaction covariates?

            I am looking at hypoxia (binary 0 == no Pa02 <70, 1 == Pa02 <70) ever and then the number of days with hypoxia (categorical 1 day, 2 days, or >-3 days with Pa02 <70) and wondering if there is an interaction with the burden (more days) and ever having hypoxia on the outcome of recovery of consciousness.

            Code:
            stset outcome_time , id(person_id) failure(recovered == 1)
            stcox ever70##i.day_sev_cat_roc
            estimates store j
            margins ever70#i.day_sev_cat_roc

            Output:

            . stcox ever70##i.day_sev_cat_roc

            failure _d: recovered == 1
            analysis time _t: outcome_time
            id: person_id

            note: 3.day_sev_cat_roc omitted because of collinearity
            note: 0.ever70#1.day_sev_cat_roc identifies no observations in the sample
            note: 0.ever70#2.day_sev_cat_roc identifies no observations in the sample
            note: 0.ever70#3.day_sev_cat_roc identifies no observations in the sample
            note: 1.ever70#0.day_sev_cat_roc identifies no observations in the sample
            note: 1.ever70#1.day_sev_cat_roc omitted because of collinearity
            note: 1.ever70#2.day_sev_cat_roc omitted because of collinearity
            note: 1.ever70#3.day_sev_cat_roc omitted because of collinearity
            Iteration 0: log likelihood = -1001.2145
            Iteration 1: log likelihood = -999.11113
            Iteration 2: log likelihood = -954.42654
            Iteration 3: log likelihood = -953.85581
            Iteration 4: log likelihood = -953.8531
            Refining estimates:
            Iteration 0: log likelihood = -953.8531

            Cox regression -- Breslow method for ties

            No. of subjects = 401 Number of obs = 401
            No. of failures = 214
            Time at risk = 10006.72959
            LR chi2(3) = 94.72
            Log likelihood = -953.8531 Prob > chi2 = 0.0000


            _t Haz. Ratio Std. Err. z P>z [95% Conf. Interval]

            1.ever70 .130783 .02673 -9.95 0.000 .0876152 .1952196

            day_sev_cat_roc
            1 Day 6.513637 1.742334 7.01 0.000 3.855986 11.00301
            2 Days 1.947421 .5185962 2.50 0.012 1.155542 3.281965
            3 Days or Greater 1 (omitted)

            ever70#day_sev_cat_roc
            0#1 Day 1 (empty)
            0#2 Days 1 (empty)
            0#3 Days or Greater 1 (empty)
            1#0 Days 1 (empty)
            1#1 Day 1 (omitted)
            1#2 Days 1 (omitted)
            1#3 Days or Greater 1 (omitted)


            . estimates store j

            . margins ever70#i.day_sev_cat_roc

            Adjusted predictions Number of obs = 401
            Model VCE : OIM

            Expression : Predicted hazard ratio, predict()


            Delta-method
            Margin Std. Err. z P>z [95% Conf. Interval]

            ever70#day_sev_cat_roc
            0#0 Days 1 . . . . .
            0#1 Day . (not estimable)
            0#2 Days . (not estimable)
            0#3 Days or Greater . (not estimable)
            1#0 Days . (not estimable)
            1#1 Day .8518732 .2434289 3.50 0.000 .3747613 1.328985
            1#2 Days .2546896 .0778952 3.27 0.001 .1020178 .4073615
            1#3 Days or Greater .130783 .02673 4.89 0.000 .0783933 .1831728


            Comment


            • #7
              This is badly specified data and is not suitable for the type of analysis you want. It's not clear how you coded your variable day_sev_cat_roc in the case where ever70 is 0, but whether you made it missing value or 0, either way you create a situation where you get a cascade of empty categories and colinear variables being dropped so that you have essentially nothing left in your model. The truth is, given the meaning of day_sev_cat_roc and ever70 it doesn't even make sense to speak of an interaction between them: they are two versions, both inadequate, of the same variable. The dichotomous version is inadequate because it does not distinguish different frequencies of sever hypoxia, and the other variable is inadequate because it does not deal with the possibility of no such days (or, if it is coded 0 in that instance, the attempt to reckon with this case fails because it is then redundant of ever70.)

              The solution to your problem is to put the two pieces together. Get rid of ever 70, and have a single variable that gives the number of days of severe hypoxia (including 0 as a possible value). If there is some good reason to believe that there is no difference between 3 and numbers greater than three in terms of impact on your outcome, then capping it at three as you did originally is fine. But that seems arbitrary to me, and I doubt that is true. It might make sense to just leave it as the actual number of days of severe hypoxia. Either way, you can run your analysis with just that one variable and things should work out fine.

              Comment

              Working...
              X