  • Using interaction terms in a cross-sectional regression

    Hi there! I have a question regarding the use of interaction terms in a cross-sectional regression model. I hope you can bear with my somewhat dodgy statistical explanation:

    Currently I am working on a study with a sample of roughly 500 observations, which can be divided into type 1 and type 2 (similar to, e.g., male and female). My interest is in whether the outcome in the dependent variable differs significantly between the groups. A difference-in-means t-test showed a significant difference, but I want to check whether this difference holds after adding 8 independent variables in total. I ran an OLS regression including all these variables and a dummy coded 1 for type 1 and 0 for type 2, but I was told to include interaction terms. My model would then look something like:

    y = a + b1x1 + b2x2 + b3x3 + ... + b9x9 + b10x1x2 + b11x1x3 + ... + b17x1x9 + e (x1 is the dummy; I basically multiplied every predictor variable by the dummy to obtain the interaction terms)


    I am wondering:

    a) if I add a dummy to a regression that already includes multiple predictor variables, why can I not interpret the coefficient on the dummy as already being controlled for these variables?

    b) whether this is legitimate? I have found models specified as e.g. x1 + x2 + x1x2, but I could find hardly any good reading on adding interaction terms for multiple independent variables.

    c) to me it seems highly problematic to use this many interaction terms, as my intention is to find out what the effect of the dummy variable (type 1) is, and whether it is significant. I am aware that adding interaction terms means I can no longer directly interpret the dummy's coefficient and its significance, but I can hardly imagine how I can still draw any valid inferences with this many interaction terms.

    d) Just as a check of my knowledge: if an interaction term, let's say dummy1 x education, comes out significant in this full interaction model, can I then correctly conclude that its coefficient shows the difference in the effect education has on the dependent variable for type 1 compared with type 2?

    I hope I have explained my problem clearly; if any clarification is needed, I will gladly provide it. Thanks in advance for any help.

    Best, Robert

  • #2
    a) if I add a dummy to a regression that already includes multiple predictor variables, why can I not interpret the coefficient on the dummy as already being controlled for these variables?
    First let me be a little pedantic. Unless those other variables are the result of deliberate experimental assignment, you cannot consider your analysis to have "controlled" for them no matter what the regression equation includes. Adjusted for them, yes, but not controlled.

    That said, adjustment for covariates is something of an art. It is often done casually, and badly, by just throwing variables into a regression equation. But remember that the purpose of adjusting for covariates is either to reduce confounding (aka omitted variable) bias, or to reduce variance that is extraneous to the effects of interest in the study. To accomplish either of those goals, the representation of the covariates in the model needs to reflect the actual functional form of the relationship between the covariate and the outcome. Determining this is a modeling exercise in its own right--which is probably why it is often skipped. So, yes, you can consider the analysis to be adjusted for the covariates, but it may be "maladjusted!"

    One of the ways in which improper adjustment arises is by presuming that the relationship between the principal variable of interest (the dummy, in your case) and the outcome is the same regardless of the levels of the other variables. That presumption may be true, but often is not. The use of interaction terms covers the case where the presumption fails.

    Now, you might think from this that I would advocate routinely including all interaction terms among all variables, to be sure that we don't miss anything. The problem is that the number of interaction terms shows a combinatorial explosion with the number of covariates, so you quickly exhaust your degrees of freedom and also end up modeling all the noise in the data. So it becomes a judgment call as to which interactions are likely to be important enough that omitting them risks a serious misspecification of the model, and which ones can safely be ignored.

    That is not so much a statistical question as a judgment call based on expertise in the subject matter. I think you should ask the people who told you to include interaction terms to be specific about which interaction terms they think need to be included, and to explain the reasons for those choices.

    b) whether this is legitimate? I have found models specified as e.g. x1 + x2 + x1x2, but I could find hardly any good reading on adding interaction terms for multiple independent variables.
    In principle there is no upper limit on the number of covariates or interactions. In any given data set, however, you can quickly exhaust your degrees of freedom if you use too many. And, as noted above, the question of which interactions (and how many) to include is really more a question of the underlying science than a statistical one per se.
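    To put a rough number on the degrees-of-freedom point, here is a small Python sketch counting parameters for a model like the one described (one group dummy plus 8 covariates; the counts are illustrative, not from the original post):

```python
from math import comb

def n_params(k_covariates, all_two_way=False):
    """Parameters in an OLS model with an intercept, a 0/1 group dummy,
    k covariates, and two-way interaction terms."""
    p = 1 + 1 + k_covariates            # intercept + dummy + covariate main effects
    if all_two_way:
        # every pairwise interaction among the dummy and the covariates
        p += comb(k_covariates + 1, 2)
    else:
        # only dummy-by-covariate interactions, as in the original post
        p += k_covariates
    return p

print(n_params(8))                      # → 18
print(n_params(8, all_two_way=True))    # → 46
```

    So interacting the dummy with all 8 covariates already takes the model from 10 to 18 parameters, and allowing every two-way interaction takes it to 46 — still estimable with roughly 500 observations, but illustrating how quickly the combinatorial explosion eats into the degrees of freedom.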

    c) to me it seems highly problematic to use these many interaction terms, as my intention is to find out what the effect is for the dummy variable (type 1), and whether it is significant. I am aware that adding interaction terms gives me no more chance to directly interpret the coefficient and the significance, but I could hardly imagine how I can still make any valid assumptions with this amount of interaction terms?
    Again, the issue is not too many or too few interactions, but whether they are the right ones, and whether your data set is large enough to accommodate them. As for determining whether your outcome is then associated with your predictor of interest (dummy) in the presence of all of these, I would first look substantively at the results of the appropriate -margins- commands, and then, if you must perform a null hypothesis significance test, a joint test of the dummy itself along with all of the interaction terms it participates in would be in order.
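    The joint test referred to above would be done in Stata with -test- or -testparm- after -regress-. As a language-agnostic illustration of what that joint test computes, here is a hedged numpy sketch on simulated data (all variable names and effect sizes are made up): it compares the full model against a restricted model that drops the dummy and every interaction it participates in, via the usual F-statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
d = rng.integers(0, 2, n)              # hypothetical 0/1 type dummy
X = rng.normal(size=(n, 2))            # two illustrative covariates
# simulate an outcome with a dummy effect and one dummy-by-covariate interaction
y = 1 + 0.5 * d + X @ [1.0, -0.5] + 0.3 * d * X[:, 0] + rng.normal(size=n)

def rss(M, y):
    """Residual sum of squares from an OLS fit of y on the columns of M."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return np.sum((y - M @ beta) ** 2)

ones = np.ones(n)
full = np.column_stack([ones, d, X, d[:, None] * X])   # dummy + its interactions
restricted = np.column_stack([ones, X])                # dummy and interactions dropped

q = full.shape[1] - restricted.shape[1]                # number of restrictions tested
F = ((rss(restricted, y) - rss(full, y)) / q) / (rss(full, y) / (n - full.shape[1]))
print(F)   # large F => jointly reject "no dummy effect at any covariate level"
```

    This is the same logic as a joint Wald test of the dummy coefficient together with all of its interaction coefficients: under the null, the dummy has no effect on the outcome at any level of the covariates.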

    d) Just as a check of my knowledge: if an interaction term, let's say dummy1 x education, comes out significant in this full interaction model, can I then correctly conclude that its coefficient shows the difference in the effect education has on the dependent variable for type 1 compared with type 2?
    That is correct.
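    As a quick check of that interpretation, here is a numpy sketch on simulated data (variable names like "edu" are hypothetical): in a model where the single covariate is fully interacted with the dummy, the interaction coefficient equals exactly the difference between the within-group slopes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
d = rng.integers(0, 2, n)              # hypothetical 0/1 type dummy
edu = rng.normal(size=n)               # hypothetical "education" covariate
y = 2 + 1.0 * d + 0.8 * edu + 0.4 * d * edu + rng.normal(size=n)

def ols(M, y):
    """OLS coefficients of y on the columns of M."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return beta

ones = np.ones(n)
# full interaction model: intercept, dummy, edu, dummy x edu
b = ols(np.column_stack([ones, d, edu, d * edu]), y)

# slope of edu within each group, from separate per-group regressions
slope0 = ols(np.column_stack([ones[d == 0], edu[d == 0]]), y[d == 0])[1]
slope1 = ols(np.column_stack([ones[d == 1], edu[d == 1]]), y[d == 1])[1]

print(np.isclose(b[3], slope1 - slope0))   # → True
```

    The fully interacted model is algebraically equivalent to fitting the two groups separately, which is why the interaction coefficient recovers the slope difference exactly (up to floating-point error).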
