  • Specifying and interpreting models with interactions between a continuous variable and a factor variable

    Hi Statalist community,

    I am trying to specify and interpret models with interactions between a continuous variable and a factor variable. Below is a toy dataset. The outcome of interest is a binary variable for being employed, with 1 being employed and 0 otherwise. I have a 7-category factor variable describing a person's prior education. I also have a continuous variable for age. I am trying to create an interaction between the education and age variables. I have never seen a continuous variable interacted with a factor variable that has more than two levels. Is this the correct way to do it?
    Also, would it be correct to interpret the interaction terms as "for a particular education level, a one year increase in age is associated with a ______ unit increase in being employed." Thank you so much.

    Code:
    clear all
    set obs 1000
    
    *Dummy for whether someone is employed
    generate employ = runiformint(0, 1)
    
    *Education level
    generate education = runiformint(1, 7)
    label define education 1 "less than hs" 2 "hs only" 3 "associates degree" 4 "ba/bs" 5 "masters" 6 "doctorate" 7 "postdoc training"
    label val education education
    
    *Age
    generate age = runiformint(18, 64)
    
    *Run logit model estimating employment with education and age interaction
    logit employ i.education##c.age

  • #2
    Is this the correct way to do it?
    Yes.

    Also, would it be correct to interpret the interaction terms as "for a particular education level, a one year increase in age is associated with a ______ unit increase in being employed."
    No. First, because this is a logistic model, not a linear model, the coefficient has no interpretation as a marginal effect on the outcome probability. At best you could say that a coefficient is the expected change in the log odds of being employed associated with a one year increase in age.
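    To see concretely why a fixed change in the log odds does not correspond to any fixed change in probability, here is a quick illustration (written in Python, since it is just arithmetic; the baseline log-odds values are arbitrary, chosen only for demonstration):

```python
import math

def invlogit(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# The same +0.1 change in log odds...
for base in (-2.0, 0.0, 2.0):
    print(round(invlogit(base + 0.1) - invlogit(base), 4))
# ...implies a different probability change at each baseline:
# roughly 0.0109, 0.0250, and 0.0101 respectively
```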

    But even that isn't correct because the interaction coefficients do not work that way. Let's use your example. I ran it, and it produces this result:
    Code:
    . logit employ i.education##c.age
    
    Iteration 0:  Log likelihood = -693.09718  
    Iteration 1:  Log likelihood = -686.68863  
    Iteration 2:  Log likelihood = -686.68781  
    Iteration 3:  Log likelihood = -686.68781  
    
    Logistic regression                                     Number of obs =  1,000
                                                            LR chi2(13)   =  12.82
                                                            Prob > chi2   = 0.4619
    Log likelihood = -686.68781                             Pseudo R2     = 0.0092
    
    ------------------------------------------------------------------------------------
                employ | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
             education |
              hs only  |    .975355   .8192212     1.19   0.234    -.6302891    2.580999
    associates degree  |   .3083571   .7926751     0.39   0.697    -1.245257    1.861972
                ba/bs  |    .018488   .7381733     0.03   0.980    -1.428305    1.465281
              masters  |   .7828401   .7654795     1.02   0.306    -.7174722    2.283152
            doctorate  |   .2678314   .8101819     0.33   0.741    -1.320096    1.855759
     postdoc training  |  -.1113451   .7922442    -0.14   0.888    -1.664115    1.441425
                       |
                   age |  -.0018536   .0128959    -0.14   0.886    -.0271291     .023422
                       |
       education#c.age |
              hs only  |  -.0197494   .0188845    -1.05   0.296    -.0567623    .0172634
    associates degree  |  -.0060461   .0181128    -0.33   0.739    -.0415464    .0294543
                ba/bs  |  -.0021015   .0171611    -0.12   0.903    -.0357367    .0315337
              masters  |  -.0192974   .0178177    -1.08   0.279    -.0542195    .0156246
            doctorate  |   .0016788   .0187431     0.09   0.929     -.035057    .0384147
     postdoc training  |  -.0024043   .0179687    -0.13   0.894    -.0376223    .0328138
                       |
                 _cons |   .0188641   .5671397     0.03   0.973    -1.092709    1.130438
    ------------------------------------------------------------------------------------
    For less than hs education (the base value for education), there is no interaction term, and a 1 year increase in age will be associated with a decrease of .0018536 (see the coefficient of age above) in the log odds of employ. Since most people have at least some difficulty wrapping their minds around log odds, we can simplify a bit by going to the somewhat more familiar odds metric by exponentiating. A one year age increase is associated with the odds of employ changing by a factor of exp(-.0018536) = 0.998 (to 3 decimal places). One might make this even easier to understand by calling it a 0.2% decrease.
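    A quick check of that exponentiation (shown in Python, since it is plain arithmetic rather than anything Stata-specific):

```python
import math

b_age = -0.0018536            # age coefficient from the logit output above
odds_ratio = math.exp(b_age)  # odds ratio for a 1 year age increase
print(round(odds_ratio, 3))   # 0.998, i.e. about a 0.2% decrease in the odds
```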

    For the other education categories, it gets a little more complicated. Let's use ba/bs as the example. Now we have to look not just at the coefficient of age, but also at the ba/bs#c.age coefficient. The "effective coefficient" of age for education = ba/bs is the sum of the age coefficient and the ba/bs#c.age coefficient. That is, -.0018536 + (-.0021015), or -0.0039551. Again, this number represents the decrease in the log odds of employ associated with a 1 year increase in age when education == ba/bs. Exponentiating that gives us, to three decimal places, 0.996. So among those with education == ba/bs, a 1 year age increase is associated with a decrease in the odds of employ by a factor of 0.996, or, equivalently, a 0.4% decrease in the odds of employ.
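    The same arithmetic check for the ba/bs "effective coefficient" (again just arithmetic, written in Python):

```python
import math

b_age = -0.0018536    # coefficient of age (applies to the base category)
b_int = -0.0021015    # ba/bs#c.age interaction coefficient
eff = b_age + b_int   # effective age coefficient when education == ba/bs
print(round(eff, 7))            # -0.0039551
print(round(math.exp(eff), 3))  # 0.996: about a 0.4% decrease in the odds
```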

    The other education categories work just like ba/bs--less than hs is different only because it is the base category and has no associated interaction coefficient.

    Now, your initial reaction might be that this is pretty unsatisfactory. First of all, this is a lot of work--and tedious, error-prone work at that. But you can automate that:
    Code:
    levelsof education if e(sample), local(edlevels)
    foreach e of local edlevels {
        display "Education level: `:label (education) `e''"
        lincom age + `e'.education#c.age, or
    }
    But even this is a bit unsatisfactory because you are still getting your answers in the odds of employ. While that is comprehensible, it is not as natural as asking how much difference there is in the probability of employ. And that is much more complicated to do by hand, because a given odds ratio does not correspond to any specific change in probability: the corresponding change in probability depends on what the probability is before the 1 year increase in age. Fortunately, the -margins- command can handle this complexity. But it is not reduced to a no-brainer: you have to think about what starting values of the model's predictor variables to use. There are many choices for this, of different usefulness for different purposes. One of the commonest is the average marginal effect. That is, we calculate the effect of a 1 year age increase for each person in the data set, using that person's particular values of all the variables, compute the resulting change in the probability of employ, and average those changes. (In your example, age and education are the only variables, but I'm thinking of the more general setting here.) The code for this would be:
    Code:
    margins education, dydx(age)
    This will give you an output table with 7 rows, one for each level of education. In each row you will find the associated change in the average probability of employ associated with a year age increase. In your example, it looks like this:

    Code:
    ------------------------------------------------------------------------------------
                       |            Delta-method
                       |      dy/dx   std. err.      z    P>|z|     [95% conf. interval]
    -------------------+----------------------------------------------------------------
    age                |
             education |
         less than hs  |  -.0043342   .0029927    -1.45   0.148    -.0101998    .0015314
              hs only  |  -.0007664   .0035431    -0.22   0.829    -.0077107    .0061779
    associates degree  |   .0023469    .003219     0.73   0.466    -.0039624    .0086561
                ba/bs  |   -.001671   .0028021    -0.60   0.551     -.007163     .003821
              masters  |   .0018659   .0032108     0.58   0.561    -.0044271    .0081589
            doctorate  |  -.0023542   .0027454    -0.86   0.391    -.0077351    .0030266
     postdoc training  |  -.0054248    .002958    -1.83   0.067    -.0112224    .0003727
    ------------------------------------------------------------------------------------
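    To make concrete what -margins, dydx(age)- is averaging, here is a minimal sketch in Python of the average marginal effect for a one-predictor logistic model. The coefficients and ages here are hypothetical, chosen purely for illustration (not the estimates above); the key fact used is that for a logistic model the marginal effect of age at each observation is p*(1-p)*b_age:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only
b_cons, b_age = 0.02, -0.002

ages = range(18, 65)
# Marginal effect of age at each observation: dP/d(age) = p*(1-p)*b_age
effects = [sigmoid(b_cons + b_age * a) * (1 - sigmoid(b_cons + b_age * a)) * b_age
           for a in ages]
ame = sum(effects) / len(effects)  # the average marginal effect
print(ame)                         # small and negative, since b_age < 0
```

In the real model, -margins education, dydx(age)- performs this averaging separately within each education level, using each observation's actual covariate values.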
    Note: You may get different results for the regression and -margins- when you run the same code. That's because you did not set the random number seed in your code before generating the random data. My random number generator may not be in sync with yours, so I may have random data that differs from your random data. But the principles are all the same.

    Added: For further information about interactions and the -margins- command, I highly recommend Richard Williams's excellent https://www3.nd.edu/~rwilliam/stats/Margins01.pdf and https://www3.nd.edu/~rwilliam/stats2/l53.pdf.
    Last edited by Clyde Schechter; 22 Sep 2023, 16:09.



    • #3
      @Clyde Schechter
      Thank you so much for being so thorough and walking me through your process. Thank you for confirming how to specify the model correctly, and thank you for providing a detailed explanation of estimating the average marginal effects.
