How can we set a single base level for an interaction between two variables?

Chris Boulis

Join Date: Feb 2019

Posts: 368
#1

How can we set a single base level for an interaction between two variables?

14 Oct 2020, 20:46

Hi Statalist.

I want to set a single base level for an interaction between two categorical variables marital status (de facto or married) and education level (edlev1 ... edlev5) - as a de facto couple with the lowest level of education (i.e. base: de facto # edlev1) - with which to compare all other categories as I want to understand the effect of both increasing levels of education and marital status, not just one or the other. At present, if I run the interaction:

Code:

i.marstat#i.edlev

Stata will create multiple base levels (de facto"# edlev1 ... edlev5), e.g. "married # edlev1" compared with base "de facto # edlev1", then "married # edlev2" with the base of "de facto # edlev2", "married # edlev3" with the base of "de facto # edlev3", etc. Is it possible to compare all options back to a single base level of "de facto # edlev1"? If so, how can I code that such that I can apply it to categorical variables with more than two categories.

Thank you in advance.

N.B. I did not provide sample data as I thought this is a fairly generic coding issue, though I can do so if needed.
Stata v.15.1. I'm not sure it matters, but I am using Cox Regression analysis.
Tags: None
Chris Boulis

Join Date: Feb 2019

Posts: 368
#2

14 Oct 2020, 23:33

A copy of the output if that helps: (educm = edlev)
Comment

Paul Dickman

Join Date: Apr 2014
Posts: 294

15 Oct 2020, 14:28

I have Stata 16, but I don't believe the syntax for factor variables has changed from 15 to 16.

Could you show the command you used to get the output you showed in #2. That output looks like what one would get from marstat##educm rather than marstat#educm.

The way I understand "compare all options back to a single base level" can be achieved by marstat#educm, so I don't understand what you want to achieve. Here's a simple example:

Code:

clear
webuse brcancer
stset rectime, f(censrec==1) scale(365.24)
stcox i.hormon#i.x4
stcox i.hormon##i.x4

Here's the output from the first model:

Code:

. stcox i.hormon#i.x4

Cox regression -- Breslow method for ties

No. of subjects =          686                  Number of obs    =         686
No. of failures =          299
Time at risk    =  2112.035922
                                                LR chi2(5)       =       32.78
Log likelihood  =   -1771.7816                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   hormon#x4 |
        0 1  |          1  (base)
        0 2  |   2.430562   .7320229     2.95   0.003     1.346935    4.385982
        0 3  |   2.886813   .9192041     3.33   0.001      1.54662    5.388326
        1 1  |   .6954293   .3479263    -0.73   0.468     .2608514    1.854013
        1 2  |    1.61053   .5062316     1.52   0.129     .8697888     2.98211
        1 3  |   2.533564    .902651     2.61   0.009     1.260283    5.093256
------------------------------------------------------------------------------

This, to me, has hormon==0 and x4(grade)==1 as the joint reference category with all other combinations of hormone therapy and grade compared to this joint reference category.

Here's the output from the second model, which looks similar to the output you showed:

Code:

. stcox i.hormon##i.x4

Cox regression -- Breslow method for ties

No. of subjects =          686                  Number of obs    =         686
No. of failures =          299
Time at risk    =  2112.035922
                                                LR chi2(5)       =       32.78
Log likelihood  =   -1771.7816                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hormon |
          0  |          1  (base)
          1  |   .6954293   .3479263    -0.73   0.468     .2608514    1.854013
             |
          x4 |
          1  |          1  (base)
          2  |   2.430562   .7320229     2.95   0.003     1.346935    4.385982
          3  |   2.886813   .9192041     3.33   0.001      1.54662    5.388326
             |
   hormon#x4 |
        1 2  |   .9528158    .497765    -0.09   0.926     .3422384    2.652706
        1 3  |   1.262002   .7043855     0.42   0.677     .4226322    3.768405
------------------------------------------------------------------------------

Could you possible explain what it is you want using this example.

Comment

Chris Boulis

Join Date: Feb 2019
Posts: 368

15 Oct 2020, 17:48

Hi Paul Dickman. Thank you for your reply. Yes you are correct, I did use i.marstat##i.educm. I wanted the output the same as the first output you provided in #3, where all cases are compared back to "patients that do not receive hormonal therapy and have a grade 1 tumor".

While I now understand how to obtain a single base level when regressing a single interaction term, how do I obtain a single base level when I either code:

Code:

i.hormon##i.x4

or include other explanatory variables? Can you explain why multiple base levels are created?

Code:

. stcox x1 i.x4 i.hormon#i.x4, allbaselevels

Cox regression -- Breslow method for ties

No. of subjects =          686               Number of obs    =    686
No. of failures =          299
Time at risk    =  2112.035922               LR chi2(6)       =    32.78
Log likelihood  =   -1771.7815               Prob > chi2      =    0.0000
--------------------------------------------------------------------------------------        
_t          Haz. Ratio   Std. Err.     z     P>z       [95% Conf.    Interval]
--------------------------------------------------------------------------------------        
x1          1.000057    .0060456     0.01    0.992     .9882781    1.011977

x4
1           1  (base)
2           2.430624    .7320697     2.95    0.003      1.346939    4.386192
3           2.887143    .9199652     3.33    0.001      1.546108    5.391343

hormon#x4
0 1         1  (base)
0 2         1  (base)
0 3         1  (base)
1 1         6953394     .3480102    -0.73    0.468     .2607229    1.854447
1 2         6624327     .1018999    -2.68    0.007     .4900091    .8955284
1 3         8773638     .2193149    -0.52    0.601     .5375324    1.432039

Where we see that "1 1" is compared to a base of "0 1" and "1 2" is compared to "0 2", ... I want the output to be based on a single base level of "0 1" and all other variations compared back to that. I appreciate your assistance.

Last edited by Chris Boulis; 15 Oct 2020, 17:59.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30191
#5

15 Oct 2020, 18:17

It's a matter of getting the degrees of freedom correct. If you have an interaction between two variables x and y with m and n levels, respectively, the total number of combinations of values is mn, so the number of degrees of freedom to represent that in a model is mn-1. When you use x#y, Stata treats it as a single entity and you get mn-1 included levels and 1 base level.

But when you use x##y, Stata treats that as x y x#y. x and y each have their own base levels, but they then contribute m-1 and n-1 degrees of freedom to the model. So you no longer have mn-1 degrees of freedom left for the x#y term, as m-1 + n-1 have been taken up by x and y separately. So Stata omits from the x#y interaction terms any combination of x and y which involves the base level of either x or y. That way the degrees of freedom add up just right. If you attempted to force inclusion of any of those terms in the model, Stata would still omit them, because they would now be colinear with x, y, and the remaining levels of x#y.

By the way, these are just two different ways of parameterizing an interaction model. If you use -margins- or -predict- you will see that the predictions of both models are exactly the same (except possibly trivial rounding errors). So it doesn't really matter which way you do it. But there are some technical reasons, in many situations, to prefer the ## method.

Added: Consequently, you cannot get a single base category in x#y when you use the x##y notation.

Last edited by Clyde Schechter; 15 Oct 2020, 18:25.
2 likes
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#6

16 Oct 2020, 06:01

Thank you for your explanation Clyde Schechter. That helps. In terms of the results in #4, that means I cannot compare the hazard for "1 3" (a patient with a grade 3 tumor receiving the hormonal therapy) with respect to "0 1" (a patient with a grade 1 tumor not receiving this treatment). I can only compare it (1 3) with "0 3" (a patient with a grade 3 tumor not receiving the hormonal therapy) as this is the base level for that coefficient (1 3). That is a shame.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30191
#7

16 Oct 2020, 14:38

In terms of the results in #4, that means I cannot compare the hazard for "1 3" (a patient with a grade 3 tumor receiving the hormonal therapy) with respect to "0 1" (a patient with a grade 1 tumor not receiving this treatment)....That is a shame.

Not true. You can make that comparison. As I said, the two models are entirely equivalent: any result you can get from one you can also get from the other. You just have to write the commands appropriate for the particular model. For this particular comparison:

Code:

lincom _b[0.hormon] + _b[1.x4] + _b[0.hormon#1.x4] - (_b[1.hormon] + _b[3.x4] +_b[1.hormon#3.x4]), eform
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#8

16 Oct 2020, 16:55

Thank you Clyde Schechter. I loaded the data as in #3 then ran your code in #7 and not sure why but Stata provided:

Code:

[0.hormon] not found r(111);

Stata v15.1

Last edited by Chris Boulis; 16 Oct 2020, 17:24.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30191
#9

16 Oct 2020, 18:48

Oh, sorry, I read the output too quickly and didn't realize you have yet a different model going in #4. There you are using i.x4 and i.hormon#i.x4 but you omitted i.hormon by itself. So that requires a different -lincom-:

Code:

lincom _b[1.x4] + _b[0.hormon#1.x4] - (_b[3.x4] + _b[1.hormon#3.x4]), eform

That said, I'm also noticing now that the output you show in #4 looks really wrong. The hazard ratios of the non-base interaction terms are huge and completely out of line with the other results. Also the hazard ratios shown are not even remotely in the range of their confidence intervals, nor are the coefficients and their standard errors consistent with the z-statistics--not even having the right sign of the z-statistics. Something went horribly wrong here, but I can't discern what.
1 like
Comment

Paul Dickman

Join Date: Apr 2014
Posts: 294

#10

17 Oct 2020, 01:55

Originally posted by Chris Boulis View Post

In terms of the results in #4, that means I cannot compare the hazard for "1 3" (a patient with a grade 3 tumor receiving the hormonal therapy) with respect to "0 1" (a patient with a grade 1 tumor not receiving this treatment). I can only compare it (1 3) with "0 3" (a patient with a grade 3 tumor not receiving the hormonal therapy) as this is the base level for that coefficient (1 3). That is a shame.

No, you can compare any levels you like. You can parameterise the model so the parameter estimates have the interpretation you want or you can use -lincom- to get the estimates you want. Taking the example you gave, where we want to "compare the hazard for "1 3" (a patient with a grade 3 tumor receiving the hormonal therapy) with respect to "0 1" (a patient with a grade 1 tumor not receiving this treatment)". We'll also adjust for the age (x1) as you did in #4.

Code:

. stcox x1 i.hormon#i.x4, noshow nolog allbaselevels

Cox regression -- Breslow method for ties

No. of subjects =          686                  Number of obs    =         686
No. of failures =          299
Time at risk    =  2112.035922
                                                LR chi2(6)       =       32.78
Log likelihood  =   -1771.7815                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.000057   .0060456     0.01   0.992     .9882781    1.011977
             |
   hormon#x4 |
        0 1  |          1  (base)
        0 2  |   2.430624   .7320697     2.95   0.003     1.346939    4.386192
        0 3  |   2.887143   .9199652     3.33   0.001     1.546108    5.391343
        1 1  |   .6953394   .3480102    -0.73   0.468     .2607229    1.854447
        1 2  |   1.610125   .5078996     1.51   0.131     .8676717    2.987882
        1 3  |   2.533075   .9039433     2.60   0.009     1.258611    5.098055
------------------------------------------------------------------------------

This, I believe, gives what you want.

There are 6 combinations of hormone therapy (2 levels) and grade (3 levels) so as Clyde wrote we estimate 5 parameters. In the above parameterisation, the baseline is "0 1" (a patient with a grade 1 tumor not receiving this treatment)" and the 5 parameters represent the contrasts (expressed as hazard ratios) between the other 5 categories and the reference. The HR estimate we want for comparing "1 3" to "0 1" is 2.533.

We can also fit the same model but where the parameters have an alternative interpretation (reparameterising the model). There are many ways to do that; here is one (the same as you used in #4).

Code:

. stcox x1 i.x4 i.hormon#i.x4, noshow nolog allbaselevels

Cox regression -- Breslow method for ties

No. of subjects =          686                  Number of obs    =         686
No. of failures =          299
Time at risk    =  2112.035922
                                                LR chi2(6)       =       32.78
Log likelihood  =   -1771.7815                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.000057   .0060456     0.01   0.992     .9882781    1.011977
             |
          x4 |
          1  |          1  (base)
          2  |   2.430624   .7320697     2.95   0.003     1.346939    4.386192
          3  |   2.887143   .9199652     3.33   0.001     1.546108    5.391343
             |
   hormon#x4 |
        0 1  |          1  (base)
        0 2  |          1  (base)
        0 3  |          1  (base)
        1 1  |   .6953394   .3480102    -0.73   0.468     .2607229    1.854447
        1 2  |   .6624327   .1018999    -2.68   0.007     .4900091    .8955284
        1 3  |   .8773638   .2193149    -0.52   0.601     .5375324    1.432039
------------------------------------------------------------------------------

As Clyde said in #5, this is the same model. This parameterisation might be useful if we were interested in the effect of hormone therapy and if the effect of hormone therapy was modified by grade. The final three HRs are the estimated hazard ratios for the effect of hormone therapy estimated for each of the three levels of grade. We can see that the effect of hormone therapy for grade 1 (i.e., "1 1" versus "0 1") is also estimated in the first model but not the other two.

The contrast "1 3" to "0 1" is not represented by a single parameter in this model, but we can get it using -lincom-.

Code:

. lincom _b[3.x4] + _b[1.hormon#3.x4] - (_b[1.x4] + _b[0.hormon#1.x4]), eform

 ( 1)  - 1b.x4 + 3.x4 - 0b.hormon#1b.x4 + 1.hormon#3.x4 = 0

------------------------------------------------------------------------------
          _t |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   2.533075   .9039433     2.60   0.009     1.258611    5.098055
------------------------------------------------------------------------------

The estimated HR is 2.533, same as we saw previously when this contrast was represented by a single parameter.

The code for -lincom- can be simplified because "0 1" is the reference level.

Code:

. lincom _b[3.x4] + _b[1.hormon#3.x4], eform

 ( 1)  3.x4 + 1.hormon#3.x4 = 0

------------------------------------------------------------------------------
          _t |     exp(b)   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   2.533075   .9039433     2.60   0.009     1.258611    5.098055
------------------------------------------------------------------------------

To get the HR we want we just have to multiply two HRs (2.887 * .877 = 2.533). -lincom- works on the scale of the parameters, so we add the two parameter estimates and exponentiate.

In short, you can reparameterise the model so the contrasts you want are represented by a single parameters or you can use -lincom- (or similar commands) if the contrast you want is represented by a combination of parameters.

Comment

Paul Dickman

Join Date: Apr 2014

Posts: 294
#11

21 Oct 2020, 07:13

I previously made a short video lecture (with bad production quality) on this topic with examples in R; I took the slides and redid the example with Stata. It illustrates the same concepts and Stata syntax above but I also make an attempt to explain the underlying parameterisation and why the parameter estimates are interpreted as described in the post above. Here are the slides:

http://pauldickman.com/video/interac...ions_stata.pdf
2 likes
Comment

Chris Boulis

Join Date: Feb 2019
Posts: 368

#12

24 Oct 2020, 02:52

Thank you Clyde Schechter. I really appreciate your explanation about base levels - that really helped. And thank you for introducing me to -lincom- I will definitely use that. As you have probably noticed by Paul's output in #10 that my output in #4 was missing the "." point before each of the last three HRs - a copying error - my apologies.

Further to your comment in #5

So it doesn't really matter which way you do it. But there are some technical reasons, in many situations, to prefer the ## method

Do you mind explaining(or redirect me to a document) why is ## preferred to #?

If we compare

Code:

. stcox i.hormon#i.x4, allbaselevels

Code:

Cox regression -- Breslow method for ties

No. of subjects =          686                  Number of obs    =         686
No. of failures =          299
Time at risk    =  2112.035922
                                                LR chi2(5)       =       32.78
Log likelihood  =   -1771.7816                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   hormon#x4 |
        0 1  |          1  (base)
        0 2  |   2.430562   .7320229     2.95   0.003     1.346935    4.385982
        0 3  |   2.886813   .9192041     3.33   0.001      1.54662    5.388326
        1 1  |   .6954293   .3479263    -0.73   0.468     .2608514    1.854013
        1 2  |    1.61053   .5062316     1.52   0.129     .8697888     2.98211
        1 3  |   2.533564    .902651     2.61   0.009     1.260283    5.093256
------------------------------------------------------------------------------

with

Code:

. stcox i.hormon##i.x4, allbaselevels

Code:

Cox regression -- Breslow method for ties

No. of subjects =          686                  Number of obs    =         686
No. of failures =          299
Time at risk    =  2112.035922
                                                LR chi2(5)       =       32.78
Log likelihood  =   -1771.7816                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hormon |
          0  |          1  (base)
          1  |   .6954293   .3479263    -0.73   0.468     .2608514    1.854013
             |
          x4 |
          1  |          1  (base)
          2  |   2.430562   .7320229     2.95   0.003     1.346935    4.385982
          3  |   2.886813   .9192041     3.33   0.001      1.54662    5.388326
             |
   hormon#x4 |
        0 1  |          1  (base)
        0 2  |          1  (base)
        0 3  |          1  (base)
        1 1  |          1  (base)
        1 2  |   .9528158    .497765    -0.09   0.926     .3422384    2.652706
        1 3  |   1.262002   .7043855     0.42   0.677     .4226322    3.768405
------------------------------------------------------------------------------

I notice that (0 2) and (0 3) in the first table are the same as 2.x4 and 3.x4 in the second (that makes sense to me). Also that (1 1) in first table is the same as 1.hormon in the second (is that the case as 1.x4 is a base level?). My last observation is that the results for (1 2) and (1 3) in the first table are different to those in the second and I'm not sure why. Which are correct / more accurate - those using # or ##? Should I therefore always include ## as in i.hormon##i.x4 (or "i.hormon i.x4 i.hormon#i.x4 - this is the same correct?),

Stata v.15.1.

Comment

Paul Dickman

Join Date: Apr 2014

Posts: 294
#13

24 Oct 2020, 11:11

My last observation is that the results for (1 2) and (1 3) in the first table are different to those in the second and I'm not sure why. Which are correct / more accurate

They're different because they represent different contrast.

The (1 3) parameter in the first table represents the contrast between (1 3) and (0 1) if that's the contrast you are interested in then to get it from the second table you need to multiply 3 parameters. 0.6954293 * 2.886813 * 1.262002 = 2.533564 (same as in first table)

Both parameterisations are equally correct and equally accurate. You use the one that is most useful.
3 likes
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30191
#14

24 Oct 2020, 12:14

Following up on Paul Dickman's excellent response, I will note that the preference for using the ## notation in interaction models stems from the fact that you do not encounter the difficulties you got in #12, explained in #13, because with the ## notation, every output from -margins- means exactly what it looks like. You don't have to think about whether you need to do any additions or multiplications: you just read everything straight off the -margins- output. It's the simplest.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 368
#15

24 Oct 2020, 18:40

Thank you Paul Dickman. I really appreciate your explanations in #10, and #13 - I have a much better understanding now.

One reason I asked about using # or ## is because my supervisor previously said I should use ## - and at the time I didn't understand why. I understand now that he may prefer ## as it provides results for both the main effect and the interaction effect (as opposed to just the total effect in table 1 when using #). As you said, we can calculate the total effect by multiplying the relevant HRs and we get the same result - but with more work.

N.B. Thanks for the slides in #11 - I'm still working through them.
Comment

Announcement