
  • Factor variables vs. dummy variables with interactions

    Hi all,

    I have a dummy variable x1 in the dataset (no missing values) that takes the values 0 and 1, and a variable x2 that takes the values 1, 2, or missing.

    I would like to estimate the effect of x1, x2, and the interaction on the outcome y.

    Code:
    reg y x1 i.x2 i.x1#i.x2
    produces different estimates for the coefficient on x1 than

    Code:
    reg y i.x1 i.x2 i.x1#i.x2
    A simple regression without the interaction produces the same coefficient for x1 whether I use factor notation or not. Does anyone know why this happens or what is actually being estimated in either case?


    Thanks for any advice or help you can provide!
    Last edited by Sarah Thorne; 23 Apr 2019, 00:42.

  • #2
    Sarah:
    I would go with your second code, which can be written more efficiently (by the way, the regressor -x1- appears twice in your second code; I assume that's a typo):
    Code:
    reg y i.x1##i.x2
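
    For readers less familiar with factor-variable operators, a quick sketch (using the variable names from this thread): ## is shorthand for the main effects plus the interaction, so the following two commands fit the same model:

    Code:
    * i.x1##i.x2 expands to i.x1 i.x2 i.x1#i.x2
    reg y i.x1##i.x2
    reg y i.x1 i.x2 i.x1#i.x2    // identical fit and coefficients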

    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Thank you! Typo fixed. I'm still puzzled that the two syntaxes produce a different result for the x1 coefficient, though. The results report a coefficient for 1.x1, meaning 0 is being used as the reference category, just as with a plain dummy variable, so what else could cause a difference between the two?



      • #4
        Without seeing your results, it is difficult to say why, or exactly what you are referring to.



        • #5
          As Eric says, it is hard to respond without knowing exactly what you are referring to. Here, however, is a guess: the interaction terms are also different (possibly only in sign), and that points to a difference in how "x1" is treated in the two regressions. For further information, please show your results inside CODE delimiters (see the FAQ for an explanation if this is not clear).

          Added: you might also use the "allbaselevels" option to help make things clearer.
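
          A minimal sketch of that suggestion (allbaselevels is a reporting option that displays the omitted base categories explicitly, which makes the two parameterizations easier to compare):

          Code:
          * allbaselevels shows the base (reference) levels in the output
          reg y x1 i.x2 i.x1#i.x2, allbaselevels
          reg y i.x1 i.x2 i.x1#i.x2, allbaselevels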



          • #6
            Sarah:
            Can you please provide an example/excerpt of your data via -dataex- so that we can replicate the problem you're reporting? Thanks.
            Kind regards,
            Carlo
            (Stata 18.0 SE)



            • #7
              Here's an example, which leads to a collinearity. This problem is not unique to this example, and I'd presume (?) it also exists in Sarah's data.
              Code:
              sysuse auto, clear
              gen byte x1 = foreign                 // binary 0/1 regressor
              rename price y
              recode weight (min/2500 = 1) (2501/4300 = 2) (else = .), gen(x2)   // 1/2/missing
              //
              tab x1 x2
              reg y x1 i.x2 i.x1#i.x2               // x1 as a plain regressor
              reg y i.x1 i.x2 i.x1#i.x2             // x1 as a factor variable



              • #8
                Hi everyone,

                Thank you for your help. Here is a subset of my data (~140 observations) that replicates what happens in the larger dataset described above.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float(y x1 x2)
                13.469865 0 1
                13.476255 0 1
                13.434385 1 1
                13.391302 0 1
                13.391302 0 1
                13.454865 0 1
                13.165752 0 1
                13.325862 1 1
                13.251364 1 1
                 13.37181 1 1
                 13.24672 0 1
                  13.1683 0 1
                13.240578 1 1
                13.471485 0 1
                13.482866 1 1
                13.381432 1 1
                13.115342 0 1
                13.432265 0 1
                 13.56408 1 1
                13.448138 0 1
                 13.33188 0 1
                13.414117 0 1
                13.463763 0 1
                  13.4081 1 1
                13.498776 0 1
                 13.41776 0 1
                 13.11481 0 1
                 13.20126 0 1
                13.285892 0 1
                13.454078 1 1
                13.296874 1 1
                 13.12894 0 1
                 13.27304 0 1
                 13.27508 1 1
                 13.20126 0 1
                13.490165 1 1
                13.179234 1 1
                 13.20937 0 1
                13.317464 0 1
                13.397864 1 1
                13.189385 1 1
                13.285966 0 1
                 13.19907 0 1
                13.208874 1 1
                13.384131 1 1
                13.179234 1 1
                13.170043 1 1
                 13.24091 1 1
                13.399294 0 1
                13.304432 0 1
                13.347064 1 1
                13.212533 1 1
                13.210518 0 1
                 13.07501 0 1
                13.244032 1 1
                 13.32245 0 1
                13.336968 0 1
                 13.34015 0 1
                13.303796 1 1
                13.078695 1 2
                12.763182 1 2
                 12.88847 1 2
                12.815368 1 2
                 12.62965 1 2
                13.118662 0 2
                 12.69856 0 2
                13.101826 1 2
                12.771944 1 2
                 13.04766 1 2
                12.825054 1 2
                13.081288 1 2
                12.689074 1 2
                 12.62624 1 2
                12.708122 1 2
                12.992208 0 2
                12.637317 0 2
                 12.78088 1 2
                 13.09819 1 2
                12.763182 1 2
                 13.02829 1 2
                12.902164 1 2
                13.101536 1 2
                12.984118 1 2
                 13.07701 0 2
                 13.08649 1 2
                  12.9588 1 2
                 12.74688 1 2
                12.851295 0 2
                 13.12109 0 2
                12.911754 0 2
                 13.13781 0 2
                12.828372 0 2
                 13.04766 0 2
                12.768653 0 2
                12.880617 1 2
                 12.62965 1 2
                13.083558 1 2
                 13.00051 1 2
                 13.11481 1 2
                13.015163 1 2
                 13.27508 1 2
                12.798273 1 2
                12.750602 1 2
                12.742422 1 2
                13.009192 1 2
                12.676634 1 2
                 12.67856 0 2
                 13.01153 0 2
                   12.925 0 2
                12.772264 1 2
                12.896713 1 2
                13.062822 1 2
                12.970338 1 2
                13.146394 1 2
                12.912217 1 2
                13.055318 1 2
                13.139783 1 2
                12.791352 1 2
                13.083974 1 2
                  14.5669 1 2
                12.665727 1 2
                14.683336 1 2
                 13.13228 1 2
                13.058916 1 2
                13.004144 1 2
                  13.0644 1 2
                12.848704 1 2
                13.139783 1 2
                 12.91436 1 2
                 13.13793 1 2
                12.748793 1 2
                12.754213 1 2
                12.941167 0 2
                12.813773 0 2
                   13.022 1 2
                12.649853 1 2
                 12.92445 1 2
                12.987398 0 2
                12.923494 1 2
                end

                And the output is below:

                Code:
                reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                
                      Source |       SS           df       MS      Number of obs   =       139
                -------------+----------------------------------   F(3, 135)       =     22.46
                       Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                    Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                -------------+----------------------------------   Adj R-squared   =    0.3181
                       Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                        1.x1 |   .0106515    .066165     0.16   0.872    -.1202025    .1415055
                        2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                             |
                       x1#x2 |
                        1 2  |   .0462199   .0943343     0.49   0.625    -.1403443    .2327841
                             |
                       _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                ------------------------------------------------------------------------------

                Code:
                reg y x1 i.x2 i.x2#i.x1 if ID<100|(ID>=11578&ID<11700)
                note: 2.x2#1.x1 omitted because of collinearity
                
                      Source |       SS           df       MS      Number of obs   =       139
                -------------+----------------------------------   F(3, 135)       =     22.46
                       Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                    Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                -------------+----------------------------------   Adj R-squared   =    0.3181
                       Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                          x1 |   .0568714   .0672395     0.85   0.399    -.0761077    .1898505
                        2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                             |
                       x2#x1 |
                        1 1  |  -.0462199   .0943343    -0.49   0.625    -.2327841    .1403443
                        2 1  |          0  (omitted)
                             |
                       _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                ------------------------------------------------------------------------------



                • #9
                  Originally posted by Mike Lacy View Post
                  Here's an example, which leads to a collinearity. This problem is not unique to this example, and I'd presume (?) it also exists in Sarah's data.
                  Mike, would you mind explaining how you came up with the example? I don't quite see how the recoding of weight leads to collinearity with foreign.

                  Thanks!



                  • #10
                    I think that's the point: it wasn't the recode that caused the problem. Rather, there's something about using a factor-variable interaction term involving x1 alongside a main effect coded simply as x1 that leads to an oddity in the design matrix the regression command sees, and this happens in both of our examples. (Note that your example has the collinearity, too.) I don't know enough about how factor variables map into the actual coding to say what is happening; I came up with the example with exactly the code I showed, no more, no less. I feel confident that someone else will be able to explain this.
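
                    One way to peek at that coding, as a sketch (-fvexpand- lists the terms a factor-variable specification expands to; this assumes the auto-based example from #7 is in memory):

                    Code:
                    * plain x1 enters as a single continuous column, while i.x1
                    * expands to a base level (0b.x1) plus an indicator (1.x1)
                    fvexpand x1 i.x2 i.x1#i.x2
                    display "`r(varlist)'"
                    fvexpand i.x1 i.x2 i.x1#i.x2
                    display "`r(varlist)'"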



                    • #11
                      I can't fully explain the discrepancy, but in any event I would use factor-variable notation throughout. Otherwise a command like -margins- may get confused because x1 is being treated both as a continuous variable and as a categorical variable.
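
                      As a sketch of that point (variable names as in this thread):

                      Code:
                      * with factor-variable notation, -margins- knows x1 is categorical
                      reg y i.x1##i.x2
                      margins x1#x2        // predicted means for each of the four cells
                      margins, dydx(x1)    // discrete change in y as x1 goes 0 -> 1
                      * had x1 entered as a plain regressor, -margins x1- would not be available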
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      Stata Version: 17.0 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam



                      • #12
                        Originally posted by Sarah Thorne View Post
                        I would like to estimate the effect of x1, x2, and the interaction on the outcome y.

                        Code:
                        reg y x1 i.x2 i.x1#i.x2
                        produces different estimates for the coefficient on x1 than

                        Code:
                        reg y i.x1 i.x2 i.x1#i.x2
                        Your data example and output in #8 are helpful. The question back to you would be: why do you expect the coefficients to be the same when the interaction terms differ between the models? What would be surprising is if you found that the predicted values of the outcome across groups differed across the regressions. In the presence of interactions involving a variable, you cannot interpret the coefficient of that variable independently of the interaction term. Note that this has nothing to do with collinearity: with two binary variables, you can include at most one interaction combination, as the other three will be collinear. Let us look at your results and see if they are inconsistent.

                        Code:
                        reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                        
                              Source |       SS           df       MS      Number of obs   =       139
                        -------------+----------------------------------   F(3, 135)       =     22.46
                               Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                            Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                        -------------+----------------------------------   Adj R-squared   =    0.3181
                               Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                        
                        ------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                1.x1 |   .0106515    .066165     0.16   0.872    -.1202025    .1415055
                                2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                                     |
                               x1#x2 |
                                1 2  |   .0462199   .0943343     0.49   0.625    -.1403443    .2327841
                                     |
                               _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                        ------------------------------------------------------------------------------
                        Here, \(\hat{y}\) = 13.3107 if x1=0 and x2=1 (the intercept is the average predicted y at the base levels). If x1=1 and x2=1, we add the coefficient of x1 to the intercept: \(\hat{y}\) = 13.3107 + .0106515 = 13.3213515. If x1=0 and x2=2, \(\hat{y}\) = 13.3107 - .3913761 = 12.9193239. Finally, if x1=1 and x2=2, \(\hat{y}\) = 13.3107 + .0106515 - .3913761 + .0462199 = 12.9761953.
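
                        These cell means can also be reproduced directly, e.g. (a sketch assuming the same data and ID restriction as in #8):

                        Code:
                        reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                        margins i.x1#i.x2    // 13.3107, 13.3214, 12.9193, 12.9762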

                        On the other hand, with the second specification,

                        Code:
                        reg y x1 i.x2 i.x2#i.x1 if ID<100|(ID>=11578&ID<11700)
                        note: 2.x2#1.x1 omitted because of collinearity
                        
                              Source |       SS           df       MS      Number of obs   =       139
                        -------------+----------------------------------   F(3, 135)       =     22.46
                               Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                            Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                        -------------+----------------------------------   Adj R-squared   =    0.3181
                               Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                        
                        ------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                  x1 |   .0568714   .0672395     0.85   0.399    -.0761077    .1898505
                                2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                                     |
                               x2#x1 |
                                1 1  |  -.0462199   .0943343    -0.49   0.625    -.2327841    .1403443
                                2 1  |          0  (omitted)
                                     |
                               _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                        ------------------------------------------------------------------------------
                        the intercept is the same, as the base levels in both models are identical, and therefore \(\hat{y}\) = 13.3107 if x1=0 and x2=1. If x1=1 and x2=1, \(\hat{y}\) = 13.3107 + .0568714 - .0462199 = 13.3213515 (note that the included interaction cell is x1=1, x2=1). If x1=0 and x2=2, \(\hat{y}\) = 13.3107 - .3913761 = 12.9193239. Finally, if x1=1 and x2=2, \(\hat{y}\) = 13.3107 + .0568714 - .3913761 = 12.9761953. So the predictions stay intact. Also note that we can change the coefficient of x2 by choosing a different interaction combination, e.g.,


                        Code:
                        . reg y 1.x1 i.x2 i.x2#i.x1
                        note: 2.x2#1.x1 omitted because of collinearity
                        
                              Source |       SS           df       MS      Number of obs   =       139
                        -------------+----------------------------------   F(3, 135)       =     22.46
                               Model |  4.24969941         3  1.41656647   Prob > F        =    0.0000
                            Residual |  8.51446284       135  .063070095   R-squared       =    0.3329
                        -------------+----------------------------------   Adj R-squared   =    0.3181
                               Total |  12.7641623       138  .092493929   Root MSE        =    .25114
                        
                        ------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                  x1 |
                                  1  |   .0106515    .066165     0.16   0.872    -.1202025    .1415055
                                     |
                                2.x2 |  -.3451562   .0594984    -5.80   0.000    -.4628258   -.2274866
                                     |
                               x2#x1 |
                                2 0  |    -.04622   .0943343    -0.49   0.625    -.2327842    .1403442
                                2 1  |          0  (omitted)
                                     |
                               _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                        ------------------------------------------------------------------------------
                        .
                        This does not change our conclusions. So in summary, do not simply look at estimates of variables in isolation if they are part of an interaction.
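
                        One more sketch of the same point: the effect of x1 within each level of x2 is identical under either parameterization, which -lincom- makes explicit after the first regression above:

                        Code:
                        * after: reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                        lincom 1.x1                // effect of x1 when x2==1: .0106515
                        lincom 1.x1 + 1.x1#2.x2    // effect of x1 when x2==2: .0568714,
                                                   // i.e. the "x1" coefficient in the second model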

                        P.S. You don't have to calculate the predictions manually as I did; simply run -margins i.x1#i.x2- after the regression.
                        Last edited by Andrew Musau; 24 Apr 2019, 13:45.



                        • #13
                          Originally posted by Andrew Musau View Post
                          So in summary, do not simply look at estimates of variables in isolation if they are part of an interaction.

                          ps. You don't have to calculate the predictions manually as I do. Simply run margins i.x1#i.x2 after the regression.
                          Thank you!!

