
  • interaction terms in stata (without main effect)

    Hi,

    I am a little confused about the way Stata handles and designates interaction variables. To me (and in every textbook I know), an interaction variable is X*W. In Stata's notation, interaction variables are written X#W, but they are not equivalent to X*W in a regression framework.

    As an example, I would have expected that:
    Code:
    sysuse nlsw88.dta
    gen south_married=south*married
    reg wage  south south_married
    gives the same results as
    Code:
    reg wage  i.south i.south#i.married
    But they do not. The coefficients on the interaction term are the same, but the coefficients on the main effect (south) are not. How should the two south coefficients be interpreted, then?

    Thanks for your help


  • #2
    These are not equivalent models. Before you even look at the coefficients, notice that in the first model you get 2 numerator degrees of freedom, but in the second you get 3.

    What it boils down to is that in both models you have included "main" effects for south, but not for married. Your south*married variable has only 1 degree of freedom: it is a single 0/1 variable. But the variable i.south#i.married begins with four levels: 0.south*0.married, 1.south*0.married, 0.south*1.married and 1.south*1.married. Now, one of these levels, 0.south*0.married, gets removed automatically as the reference category, no matter what. If you had included the "main" effect for married, then the various relationships among these would lead to everything going away except 1.south*1.married. But you did not, so 0.south*1.married and 1.south*1.married both survive. Consequently we see a model with 1 more numerator degree of freedom.
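
    To see where the extra degree of freedom comes from, here is a minimal sketch that builds the surviving terms by hand. The variable nonsouth_married is a name made up for illustration; the assumption is that it plays the role of 0.south#1.married, so the last two regressions should report the same fit.
    Code:
    sysuse nlsw88, clear
    * product variable from the original post: 1 only for married women in the South
    generate byte south_married = south * married
    * hypothetical helper: 1 only for married women outside the South
    generate byte nonsouth_married = (1 - south) * married
    * first model: south plus a single product term
    regress wage south south_married
    * hand-built version of the second model: one extra term
    regress wage south nonsouth_married south_married
    * factor-variable version from the original post, for comparison
    regress wage i.south i.south#i.married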

    I should also point out that omitting one of the two "main" effects leads to a badly specified model unless the omitted main effect is collinear with something else in the model. The misspecification shows up as the model failing to behave properly under linear transformations of the variables involved. For this reason, it is safer to specify interactions using the ## operator, which automatically generates the main effects. (If one of the main effects is collinear with something else, Stata will drop either it or the something else, so no problem here.)
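
    One way to check the point about linear transformations, as a rough sketch: reverse the coding of south and refit. Here nonsouth = 1 - south is a variable invented for the illustration. Without the married main effect the two codings span different sets of regressors, so their R-squared values will generally differ; with ## they should agree.
    Code:
    sysuse nlsw88, clear
    * hypothetical recoding: a linear transformation of south
    generate byte nonsouth = 1 - south
    * interaction without the married main effect, under each coding
    quietly regress wage south c.south#c.married
    display "R2, south coding, no married main effect:    " e(r2)
    quietly regress wage nonsouth c.nonsouth#c.married
    display "R2, nonsouth coding, no married main effect: " e(r2)
    * same comparison with both main effects included via ##
    quietly regress wage c.south##c.married
    display "R2, south coding, full model:    " e(r2)
    quietly regress wage c.nonsouth##c.married
    display "R2, nonsouth coding, full model: " e(r2)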

    Sometimes, it is more convenient to use the # operator for the interaction and omit both "main" effects. That is perfectly OK and results in a model that is equivalent to the ## version, but differently parameterized. For some purposes, the output of the # operator without any "main" effects is easier to work with.
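
    A quick way to convince yourself of this equivalence, sketched with the same data set: fit both versions and compare the fitted values. The names yhat_full, yhat_cells and gap are arbitrary names chosen for the illustration. If the two parameterizations describe the same model, gap should be zero up to rounding error.
    Code:
    sysuse nlsw88, clear
    * ## version: both main effects plus the interaction
    quietly regress wage i.south##i.married
    predict double yhat_full
    * # version with no main effects at all
    quietly regress wage i.south#i.married
    predict double yhat_cells
    * largest absolute gap between the two sets of fitted values
    generate double gap = abs(yhat_full - yhat_cells)
    summarize gap
    * predicted wage in each south-by-married cell from the last model
    margins south#married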



    • #3
      It's a bad idea to include an interaction term without including both of the main effects. You do not include the married main effect in either of your models, and the results from them differ. In particular, in the factor-notation model Stata includes three interaction effects (not_south & married, south & married, and south & single), while the other model has only the south & married interaction. I think the difference lies in how a poorly specified model is treated, but someone else may be able to explain it more fully.

      A more appropriate pair of models is
      Code:
      reg wage  south south_married married
      reg wage  i.south##i.married
      which produce identical results.



      • #4
        Hi, thanks for your answers. It is clearer now.

        I know that a model without the main effects is generally not identified. I was envisioning using such a model in an RCT setting with nested treatments. Actually, there is no cost to including both main effects even in a nested design, so ## is the way to go.

        What surprised me more is the vocabulary Stata uses for interactions. In the help file (help varlist), Stata calls south#married an interaction term, whereas we generally call south*married the interaction term. Hence my confusion.




        • #5
          If south and married are both dichotomous 0/1 variables (as they are in that particular data set) then south#married is in fact a single variable whose value is equal to south * married.

          But if we were looking at an interaction between race and occupation, there are 3 levels of race in that data set and 13 of occupation. So we would have (3-1)*(13-1) = 24 interaction variables. Each of those would be the product of a dichotomous 0/1 variable indicating one particular level of race and another dichotomous 0/1 variable indicating one particular level of occupation. So it would not make sense to denote this as race * occupation. There is, in fact, no universally used notation for a set of product variables like this. So Stata chose to denote it as race#occupation. When you are just dealing with single 0/1 variables like south and married, then south#married denotes the set which consists of the single variable whose value is the product of south and married.
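
          As a rough illustration of what one member of that set looks like, the sketch below builds a single product variable by hand and then lists the terms that the factor-variable notation expands to. The name r2_o3 and the choice of levels 2 and 3 are arbitrary, picked only for the example. Some race-by-occupation cells may be empty in this data set, in which case Stata would estimate fewer terms than the count above.
          Code:
          sysuse nlsw88, clear
          * hypothetical product variable: indicator for race level 2 times indicator for occupation level 3
          generate byte r2_o3 = (race == 2) * (occupation == 3) if !missing(race, occupation)
          * list every term that the # notation stands for
          fvexpand i.race#i.occupation
          display "`r(varlist)'"
          * full model with both main effects and all the product terms
          regress wage i.race##i.occupation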

