
  • Losing categories when using interaction term

    Good day,

    I am trying to use an interaction term in my logistic regression. For this I simply put '##' between the specific variables. A problem arises when I try to read the output: I notice that some categories of the variables are not shown in the interaction output. For example, 'Proxy_gezondheid' and 'HIL_abovebelow' should lead to 3 coefficients (one per combination of the different categories), but my output shows only 1 coefficient instead of 3. Also, when I use the '#' operator instead of '##', I DO get all the categories, BUT this leaves out the coefficients of the original variables.

    [Attached image: forum.png — regression output for the two models]
    As you can see here, in the second model I can clearly observe the categories within the interaction term; in the first model this is not possible. Is there any way to make this possible, while also including the 2 coefficients of the original variables?
    Thanks in advance! Help is very much appreciated. P.S. I am new to Stata...

  • #2
    Danny:
    your relief with your second code is not expected to last.
    Your first code is the correct one, as it includes the main (conditional) terms of your interaction (via -##-).
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thanks, Carlo, for your reply! How am I supposed to compare the different groups when I only get one coefficient for the interaction term?



      • #4
        Danny:
        omissions are the result of the structure of your data plus the way the regression estimator works.
        That said, I do hope that you can tweak the following toy example according to your research goals:
        Code:
        . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
        (1978 automobile data)
        
        . logistic foreign i.rep78##c.headroom, allbase
        note: 1.rep78 != 0 predicts failure perfectly;
              1.rep78 omitted and 2 obs not used.
        
        note: 2.rep78 != 0 predicts failure perfectly;
              2.rep78 omitted and 8 obs not used.
        
        note: 5.rep78 omitted because of collinearity.
        note: 2.rep78#c.headroom omitted because of collinearity.
        note: 5.rep78#c.headroom omitted because of collinearity.
        
        Logistic regression                                     Number of obs =     59
                                                                LR chi2(5)    =  31.56
                                                                Prob > chi2   = 0.0000
        Log likelihood = -22.632503                             Pseudo R2     = 0.4108
        
        ----------------------------------------------------------------------------------
                 foreign | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
        -----------------+----------------------------------------------------------------
                   rep78 |
                      1  |          1  (empty)
                      2  |          1  (empty)
                      3  |   257.6621   1583.313     0.90   0.366     .0015153    4.38e+07
                      4  |   27401.63   169810.9     1.65   0.099     .1454749    5.16e+09
                      5  |          1  (omitted)
                         |
                headroom |   13.08764   31.30503     1.08   0.282     .1204556    1421.987
                         |
        rep78#c.headroom |
                      1  |          1  (empty)
                      2  |          1  (empty)
                      3  |   .0272784   .0697827    -1.41   0.159     .0001813     4.10518
                      4  |   .0121483   .0309255    -1.73   0.083     .0000827    1.783961
                      5  |          1  (omitted)
                         |
                   _cons |   .0088579    .049539    -0.85   0.398     1.54e-07    510.2431
        ----------------------------------------------------------------------------------
        Note: _cons estimates baseline odds.
        
        . mat list e(b)
        
        e(b)[1,12]
                foreign:     foreign:     foreign:     foreign:     foreign:     foreign:     foreign:     foreign:     foreign:     foreign:     foreign:     foreign:
                     1b.          2o.           3.           4.          5o.                 1b.rep78#    2o.rep78#     3.rep78#     4.rep78#    5o.rep78#             
                  rep78        rep78        rep78        rep78        rep78     headroom  co.headroom  co.headroom   c.headroom   c.headroom  co.headroom        _cons
        y1            0            0     5.551649    10.218358            0    2.5716683            0            0   -3.6016588   -4.4105625            0   -4.7264434
        
        . test 1b.rep78=3.rep78#co.headroom
        
         ( 1)  [foreign]1b.rep78 - [foreign]3.rep78#c.headroom = 0
        
                   chi2(  1) =    1.98
                 Prob > chi2 =    0.1592
        
        .
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          How am I supposed to compare the different groups when I only get one coefficient from the interaction term?
          In fact, the two models are equivalent. If you run -margins- or -predict- after both models you will get the same results (allowing for possible very small rounding errors). You just need to understand the way to "translate" the results of one model to the results of the other.

          First, think about counting degrees of freedom. No matter how you slice it, you have two dichotomous variables, so a total of four combinations of their values. With four combinations you should have exactly 3 degrees of freedom for them in the analysis. And, either way, you do. With the # representation you get three # terms. With the ## representation you get two "main" terms plus one # term. That still adds to three.
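
          The counting argument can be sketched mechanically. Here is a toy check in Python (not Stata), where the two tuples of 0/1 values stand in for the combinations of two hypothetical dichotomous variables:

          ```python
          from itertools import product

          # Two dichotomous variables give 2 x 2 = 4 combinations of values.
          combos = list(product([0, 1], repeat=2))
          print(len(combos))  # 4

          # "#"-style coding: one indicator per non-reference combination -> 3 terms.
          sharp_terms = [c for c in combos if c != (0, 0)]
          print(len(sharp_terms))  # 3

          # "##"-style coding: 2 "main" indicators + 1 interaction indicator -> also 3.
          main_terms, interaction_terms = 2, 1
          print(main_terms + interaction_terms)  # 3
          ```

          Either way, the four cells are described by a constant plus three degrees of freedom; the two codings just slice those three degrees of freedom differently.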

          So you need to understand what each term represents in each model--and, be warned, the same terms mean different things in the two models.

          In both models, the reference categories of the two variables are "Below Average" and "Low", respectively. In the # model, the reference category for the interaction as a whole is the combination of the reference categories of the two variables, i.e. Below Average#Low, which is, indeed, the omitted category in the # model. So the expected odds of VERjaofnee when abovebelow == "Below Average" and Gezondheid == "Low" are given by the constant term. The odds ratios for the other combinations of abovebelow and Gezondheid are given explicitly by the corresponding # terms in the output.

          The ## model is a bit more complicated. The term for "Above Average" in this model must be understood as the outcome odds ratio when abovebelow == "Above Average" and Gezondheid = its reference category, so it corresponds to the combination Above Average#Low. You will notice that, in fact, the Above Average odds ratio in the ## model is exactly the same as the Above Average#Low odds ratio in the # model. Similarly, the term for High in the ## model represents the outcome odds ratio when Gezondheid = "High" and abovebelow = its reference value, "Below Average." And again, you can see that the odds ratio for High in the ## model is exactly the same as the odds ratio for Below Average#High in the # model. The hardest part is understanding the Above Average#High term in the ## model. This term does not directly correspond to any term in the # model, and is not, in fact, the outcome odds ratio for any combination of the predictor variable values. Instead, the Above Average#High term is the factor by which the product of the Above Average and High odds ratios must be multiplied to get the odds ratio for having both abovebelow = Above Average (non-reference category) and Gezondheid = High (also non-reference category). And, indeed, if you do that calculation:
          Code:
          . display 2.450397 * .708061 * 2.346188
          4.0707079
          you see that the result agrees to 6 decimal places with the Above Average#High odds ratio in the # model.

          Which model to use? Well, since they are just algebraic transforms of each other, you can use either one if you don't mind doing some algebra. But if, like most of us, you would prefer not to do algebra to interpret your results, you choose the one that gives the statistics you most need to answer your research question. If your research question hinges on finding the odds in each combination of categories of abovebelow and Gezondheid, you would use the # model, which gives those results directly. If, however, your question asks whether these variables modify each other's effects, and, if so, how much and in what direction, then the ## model gives you the answer by looking just at the abovebelow#Gezondheid odds ratio.
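
          To make the "algebraic transform" concrete, here is a small numeric check in Python (not Stata), using the three ## odds ratios quoted earlier in this thread; the variable names are mine:

          ```python
          # Odds ratios from the "##" model, as quoted in this thread:
          or_above = 2.450397  # Above Average (at Gezondheid = Low)
          or_high = 0.708061   # High (at abovebelow = Below Average)
          or_inter = 2.346188  # Above Average#High interaction factor

          # In the "#" model, Above Average#Low and Below Average#High equal the two
          # "main" ORs above; the remaining cell is the product of all three terms.
          or_above_high = or_above * or_high * or_inter
          print(round(or_above_high, 7))  # 4.0707079
          ```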



          • #6
            Thank you very much, Clyde! However, I am still not certain how one OR for the interaction term can be used to interpret the ORs of all the different combinations of High/Low gezondheid (Proxy_gezondheid) and above/below HIL. 4 different combinations must be possible, and the output only shows one OR. I hope you can clarify it a bit more for me, as I am a beginner with Stata. I want to find out whether the effect of HIL (high/low) is different for the 2 levels of health (Proxy_gezondheid).



            • #7
              I am still not certain how one OR for the interaction term can be used to interpret the ORs of all the different combinations of High/Low gezondheid (Proxy_gezondheid) and above/below HIL. 4 different combinations must be possible, and the output only shows one OR.
              First, when there are 4 different combinations, you can expect only 3 ORs, one of the combinations being the base category to which the ORs are relative.

              Next, the ## model does output 3 ORs: they just don't have # appearing in their names. You are being misled by the names in the output into thinking you only have one combination. But in the ## model, the "main effects" are actually themselves interactions--they are just "mislabeled." The 3 ORs are called High, Above Average, and High#Above Average. These three, when interpreted and combined in the way I described in #5, give the three odds ratios relative to the reference category of Low#Below Average. I suggest you re-read the 5th paragraph of my response in #5 to see how those 3 ORs work out in your example. I also suggest that you refresh your knowledge of the basics of models with interaction terms. The clearest explanation I know of is the excellent one by Richard Williams: https://www3.nd.edu/~rwilliam/stats2/l53.pdf. Unfortunately, your model is logistic and produces odds ratios, which interact multiplicatively, so it may take a bit of extra effort to translate that to the framework in the PDF I just recommended, where the model is linear and produces coefficients, which interact additively. But the principle is the same in both cases.

              I should add that if you want to see the probability of VERjaofnee in each of the four combinations, you can do that by re-running the ## model and then following it with -margins ProxyGezondheid#HIL_abovebelow- (N.B. #, not ## in the -margins- command, but ##, not #, in the -logistic- command that precedes it.)

              I want to obtain information whether the effect of HIL (high/low) is different for the 2 levels of Health (Proxy_gezondheid).
              To accomplish this goal you do not need to see the four combinations, or even three of them. Using the ## model, this particular question is completely answered by examining the High#Above Average row of the logistic regression output--you don't need anything else for this. Of course, I think seeing the combinations is useful for context, and that requires looking at the full logistic output or the -margins- output. But just to achieve the above quoted goal, all you need is that one row of output from the ## model.



              • #8
                Thank you very much, Clyde, for the simple explanation! Could you confirm the ORs I found for the different categories (I used the ## model and included several confounders):

                Above_average_HIL/Low_gezondheid = 0.6859182
                Below_average_HIL/High_gezondheid = 2.327303
                Below_average_HIL/Low_gezondheid = 0.1721115 (Which is, to my understanding, the constant term?)
                Above_average_HIL/High_gezondheid = 0.6859182*2.327303*2.284481 = 3.646807

                Instead, the Above Average#High term is the factor by which the product of the Above Average and High odds ratios must be multiplied to get the odds ratio for having both abovebelow = Above Average (non-reference category) and Gezondheid = High (also non-reference category).
                The Above_average_HIL/High_gezondheid OR must be calculated, as you said in your previous response. Could you maybe explain why this is the case? Why would I multiply by 2 ORs I am not particularly interested in?

                Thank you very much; you are being very helpful, which is appreciated.

                [Attached image: output2.png — logistic regression output with confounders]



                • #9
                  I would like to ask you if you could confirm the ORs I found for the different categories (I used the ## model and included several confounders):
                  Your calculations are correct.
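
                  For completeness, the product can be verified numerically. A one-line check in Python (not Stata) of the figures quoted in #8; the variable names are mine:

                  ```python
                  # "##" model ORs quoted in #8 (model with confounders):
                  or_above_low = 0.6859182   # Above_average_HIL (at Low gezondheid)
                  or_below_high = 2.327303   # High gezondheid (at Below_average_HIL)
                  or_interaction = 2.284481  # Above_average_HIL#High_gezondheid factor

                  # The Above_average_HIL/High_gezondheid OR is the product of all three:
                  or_above_high_adj = or_above_low * or_below_high * or_interaction
                  print(round(or_above_high_adj, 6))  # 3.646807
                  ```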

                  Why would I multiply by 2 ORs I am not particularly interested in?
                  The picture is somewhat muddied by the use of logistic regression and odds ratios. It is simpler to think about these things in terms of logistic regression coefficients. The coefficients are the natural logarithms of the odds ratios. We can think of a logistic regression in the coefficient metric as an equation:
                  Code:
                  log odds = constant + b1*x1 + b2*x2 + b12*x1*x2
                  (If you exponentiate both sides of that equation you get the same model in the odds ratio metric.) For brevity I am using x1 and x2 to refer to your Gezondheid and abovebelow variables. Both x1 and x2 are dichotomous variables taking on the values 0 and 1.

                  What happens when x1 and x2 are both 0? We get log odds = constant + b1*0 + b2*0 + b12*0*0 = constant. So the constant term is the log odds for that case.
                  What happens if x1 = 0 and x2 = 1? We get log odds = constant + b1*0 + b2*1 + b12*0*1 = constant + b2. Now the odds ratio we are interested in for this case is odds(x2 = 1, x1 = 0) / odds(x2 = 0, x1 = 0). But the log of a ratio is the difference of the logs. So the log of this odds ratio is the difference of the two log odds: constant + b2 - constant, or just b2. And it then follows that the odds ratio for x2 is exp(b2).
                  Exactly the same reasoning applies to the case where x1 = 1 and x2 = 0. The log odds ratio is b1, and the odds ratio itself is exp(b1).

                  Finally, what happens when x1 and x2 are both 1? Now we have log odds = constant + b1*1 + b2*1 + b12*1*1 = constant + b1 + b2 + b12. Again, to get the log odds ratio, we subtract the log odds of the reference case. So the log odds ratio for this case is constant + b1 + b2 + b12 - constant, which is b1 + b2 + b12. And if the log odds ratio is b1 + b2 + b12, then the odds ratio will be exp(b1 + b2 + b12) = exp(b1) * exp(b2) * exp(b12) = odds ratio for x1 * odds ratio for x2 * odds ratio for x1*x2.
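
                  The derivation above can be replayed numerically. A sketch in Python (not Stata) with made-up illustrative coefficients, just to confirm the identity exp(b1 + b2 + b12) = exp(b1) * exp(b2) * exp(b12):

                  ```python
                  import math

                  # Made-up illustrative coefficients on the log-odds scale:
                  constant, b1, b2, b12 = -1.5, 0.4, -0.3, 0.9

                  def log_odds(x1, x2):
                      """Linear predictor: constant + b1*x1 + b2*x2 + b12*x1*x2."""
                      return constant + b1 * x1 + b2 * x2 + b12 * x1 * x2

                  # Log odds ratio of the (x1 = 1, x2 = 1) cell vs the (0, 0) reference:
                  lor_11 = log_odds(1, 1) - log_odds(0, 0)  # equals b1 + b2 + b12

                  # On the OR scale, that cell's OR is the product of the three ORs:
                  or_11 = math.exp(lor_11)
                  product = math.exp(b1) * math.exp(b2) * math.exp(b12)
                  print(math.isclose(or_11, product))  # True
                  ```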



                  • #10
                    Thank you so much! You are a lifesaver, Clyde. One more question. In the example above, the group Below_HIL/Low_gezondheid was the reference category. I can now read from the table: Above_HIL/Low_gezondheid -> the influence of High or Low HIL does not influence one's choice to take a voluntary deductible (p = 0.324). Say I want to compare the influence of above and below HIL on the High_gezondheid group; can I just put, for example, Below_average_HIL#High_gezondheid as the reference group, and then compare Above_average_HIL#High_gezondheid to that reference?

                    In the end I want to know if the effect of Above_ or Below_average_HIL is different for the 2 gezondheid groups.
                    Last edited by Danny Wellens; 29 May 2023, 09:36.



                    • #11
                      Say I want to compare the influence of above and below HIL on the High_gezondheid group; can I just put, for example, Below_average_HIL#High_gezondheid as the reference group, and then compare Above_average_HIL#High_gezondheid to that reference?
                      You could do that. But I think it is just as easy, and a bit more transparent, to go back to the # model and use -lincom- to get the contrast. See -help lincom- for details.

                      I can now read from the table: Above_HIL/Low_gezondheid -> the influence of High or Low HIL does not influence one's choice to take a voluntary deductible (p = 0.324).
                      You are relying here on a fallacious, but widely taught, interpretation of statistical significance. It is not true that a non-statistically significant result implies no effect. Look at your confidence interval. It extends from 0.32 to 1.45. That means that the data are compatible with those values and everything in between. Now if the actual OR is 0.32, that is a very strong negative effect. And if it's 1.45, that is a moderately strong positive effect. So you are nowhere near able to conclude that there is no effect. The appropriate conclusion is that your analysis is inconclusive about the effect: the data are consistent with strong effects in either direction and cannot distinguish between them.



                      • #12
                        Thanks, Clyde. Can I assume that the wide 95% CI is due to the sample size being too small? For example: in the Above_average_HIL/High_gezondheid group, 55 people opted for a voluntary deductible. In the Below_average_HIL/High_gezondheid group, 60 people opted for a voluntary deductible. The difference is only 5 individuals, so maybe too small a number to conclude that the difference is due to the HIL being above or below average?

                        In other words: people with high health (gezondheid in Dutch) are not significantly influenced by their level of HIL when choosing to opt for a voluntary deductible? (I know that significance does not tell the whole story, thanks to your previous post.)
                        Last edited by Danny Wellens; 29 May 2023, 10:05.



                        • #13
                          Can I assume that the wide 95% CI is due to the sample size being too small?
                          Yes.

                          In other words: people with high health (gezondheid in Dutch) are not significantly influenced by their level of HIL when choosing to opt for a voluntary deductible?
                          No, you can't say that. You can say only that your data sample is too small to determine whether they are influenced by their level of HIL, or even in which direction they might be influenced. Inconclusive means no conclusion can be drawn.

