
  • Factor variables vs. dummy variables with interactions

    Hi all,

    I have a dummy variable x1 in the dataset (no missing values) that takes the values 0 and 1, and a variable x2 that takes the values 1, 2, or missing.

    I would like to estimate the effect of x1, x2, and the interaction on the outcome y.

    Code:
    reg y x1 i.x2 i.x1#i.x2
    produces different estimates for the coefficient on x1 than

    Code:
    reg y i.x1 i.x2 i.x1#i.x2
    A simple regression without the interaction produces the same coefficient for x1 whether I use factor notation or not. Does anyone know why this happens or what is actually being estimated in either case?


    Thanks for any advice or help you can provide!
    Last edited by Sarah Thorne; 23 Apr 2019, 00:42.

  • #2
    Sarah:
    I would go with your second code, which can be written more efficiently (by the way, the regressor -x1- appears twice in your second code; I assume that's a typo):
    Code:
    reg y i.x1##i.x2
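
    For readers less familiar with factor-variable operators, a quick sketch (using the variable names from this thread): ## is shorthand for the main effects plus the interaction, so the following two commands fit the same model:

    Code:
    * i.x1##i.x2 expands to i.x1 i.x2 i.x1#i.x2
    reg y i.x1##i.x2
    reg y i.x1 i.x2 i.x1#i.x2    // identical fit and coefficients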

    Kind regards,
    Carlo
    (Stata 18.0 SE)



    • #3
      Thank you! Typo fixed. I'm still puzzled that the two syntaxes produce a different result for the x1 coefficient, though. The results report a coefficient for 1.x1, meaning 0 is being used as the reference category, just as with a plain dummy variable, so what else could cause a difference between the two?



      • #4
        Without seeing your results, it is difficult to say why, or exactly what you are referring to.



        • #5
          As Eric says, it is hard to respond without knowing exactly what you are referring to. Here, however, is a guess: the interaction terms are also different (possibly only in sign), and that points to a difference in how "x1" is treated in the two regressions. For further information, please show your results inside CODE delimiters (see the FAQ for an explanation if this is not clear).

          Added: you might also use the "allbaselevels" option to help make things clearer.
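
          A minimal sketch of that suggestion (allbaselevels is a reporting option that displays the omitted base categories explicitly, which makes the two parameterizations easier to compare):

          Code:
          * allbaselevels shows the base (reference) levels in the output
          reg y x1 i.x2 i.x1#i.x2, allbaselevels
          reg y i.x1 i.x2 i.x1#i.x2, allbaselevels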



          • #6
            Sarah:
            Can you please provide an example/excerpt of your data via -dataex- so that we can replicate the problem you're reporting? Thanks.
            Kind regards,
            Carlo
            (Stata 18.0 SE)



            • #7
              Here's an example, which leads to a collinearity. This problem is not unique to this example, and I'd presume (?) it also exists in Sarah's data.
              Code:
              sysuse auto, clear
              gen byte x1 = foreign                 // binary 0/1 regressor
              rename price y
              recode weight (min/2500 = 1) (2501/4300 = 2) (else = .), gen(x2)   // 1/2/missing
              //
              tab x1 x2
              reg y x1 i.x2 i.x1#i.x2               // x1 as a plain regressor
              reg y i.x1 i.x2 i.x1#i.x2             // x1 as a factor variable



              • #8
                Hi everyone,

                Thank you for your help. Here is a subset of my data (~140 observations) that replicates what happens in the larger dataset described above.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float(y x1 x2)
                13.469865 0 1
                13.476255 0 1
                13.434385 1 1
                13.391302 0 1
                13.391302 0 1
                13.454865 0 1
                13.165752 0 1
                13.325862 1 1
                13.251364 1 1
                 13.37181 1 1
                 13.24672 0 1
                  13.1683 0 1
                13.240578 1 1
                13.471485 0 1
                13.482866 1 1
                13.381432 1 1
                13.115342 0 1
                13.432265 0 1
                 13.56408 1 1
                13.448138 0 1
                 13.33188 0 1
                13.414117 0 1
                13.463763 0 1
                  13.4081 1 1
                13.498776 0 1
                 13.41776 0 1
                 13.11481 0 1
                 13.20126 0 1
                13.285892 0 1
                13.454078 1 1
                13.296874 1 1
                 13.12894 0 1
                 13.27304 0 1
                 13.27508 1 1
                 13.20126 0 1
                13.490165 1 1
                13.179234 1 1
                 13.20937 0 1
                13.317464 0 1
                13.397864 1 1
                13.189385 1 1
                13.285966 0 1
                 13.19907 0 1
                13.208874 1 1
                13.384131 1 1
                13.179234 1 1
                13.170043 1 1
                 13.24091 1 1
                13.399294 0 1
                13.304432 0 1
                13.347064 1 1
                13.212533 1 1
                13.210518 0 1
                 13.07501 0 1
                13.244032 1 1
                 13.32245 0 1
                13.336968 0 1
                 13.34015 0 1
                13.303796 1 1
                13.078695 1 2
                12.763182 1 2
                 12.88847 1 2
                12.815368 1 2
                 12.62965 1 2
                13.118662 0 2
                 12.69856 0 2
                13.101826 1 2
                12.771944 1 2
                 13.04766 1 2
                12.825054 1 2
                13.081288 1 2
                12.689074 1 2
                 12.62624 1 2
                12.708122 1 2
                12.992208 0 2
                12.637317 0 2
                 12.78088 1 2
                 13.09819 1 2
                12.763182 1 2
                 13.02829 1 2
                12.902164 1 2
                13.101536 1 2
                12.984118 1 2
                 13.07701 0 2
                 13.08649 1 2
                  12.9588 1 2
                 12.74688 1 2
                12.851295 0 2
                 13.12109 0 2
                12.911754 0 2
                 13.13781 0 2
                12.828372 0 2
                 13.04766 0 2
                12.768653 0 2
                12.880617 1 2
                 12.62965 1 2
                13.083558 1 2
                 13.00051 1 2
                 13.11481 1 2
                13.015163 1 2
                 13.27508 1 2
                12.798273 1 2
                12.750602 1 2
                12.742422 1 2
                13.009192 1 2
                12.676634 1 2
                 12.67856 0 2
                 13.01153 0 2
                   12.925 0 2
                12.772264 1 2
                12.896713 1 2
                13.062822 1 2
                12.970338 1 2
                13.146394 1 2
                12.912217 1 2
                13.055318 1 2
                13.139783 1 2
                12.791352 1 2
                13.083974 1 2
                  14.5669 1 2
                12.665727 1 2
                14.683336 1 2
                 13.13228 1 2
                13.058916 1 2
                13.004144 1 2
                  13.0644 1 2
                12.848704 1 2
                13.139783 1 2
                 12.91436 1 2
                 13.13793 1 2
                12.748793 1 2
                12.754213 1 2
                12.941167 0 2
                12.813773 0 2
                   13.022 1 2
                12.649853 1 2
                 12.92445 1 2
                12.987398 0 2
                12.923494 1 2
                end

                And the output is below:

                Code:
                reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                
                      Source |       SS           df       MS      Number of obs   =       139
                -------------+----------------------------------   F(3, 135)       =     22.46
                       Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                    Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                -------------+----------------------------------   Adj R-squared   =    0.3181
                       Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                        1.x1 |   .0106515    .066165     0.16   0.872    -.1202025    .1415055
                        2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                             |
                       x1#x2 |
                        1 2  |   .0462199   .0943343     0.49   0.625    -.1403443    .2327841
                             |
                       _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                ------------------------------------------------------------------------------

                Code:
                reg y x1 i.x2 i.x2#i.x1 if ID<100|(ID>=11578&ID<11700)
                note: 2.x2#1.x1 omitted because of collinearity
                
                      Source |       SS           df       MS      Number of obs   =       139
                -------------+----------------------------------   F(3, 135)       =     22.46
                       Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                    Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                -------------+----------------------------------   Adj R-squared   =    0.3181
                       Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                          x1 |   .0568714   .0672395     0.85   0.399    -.0761077    .1898505
                        2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                             |
                       x2#x1 |
                        1 1  |  -.0462199   .0943343    -0.49   0.625    -.2327841    .1403443
                        2 1  |          0  (omitted)
                             |
                       _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                ------------------------------------------------------------------------------



                • #9
                  Originally posted by Mike Lacy View Post
                  Here's an example, which leads to a collinearity. This problem is not unique to this example, and I'd presume (?) it also exists in Sarah's data.
                  Mike, would you mind explaining how you came up with the example? I don't quite see how the recoding of weight leads to collinearity with foreign.

                  Thanks!



                  • #10
                    I think that's the point: it wasn't the recode that caused the problem. Rather, there's something about using a factor-variable interaction term involving x1 alongside a main effect coded simply as x1 that leads to an oddity in the design matrix the regression command sees, and this happens in both of our examples. (Note that your example has the collinearity, too.) I don't know enough about how factor variables map into the actual coding to say what is happening; I came up with the example with exactly the code I showed, no more, no less. I feel confident that someone else will be able to explain this.
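
                    One way to peek at that coding, as a sketch (-fvexpand- lists the terms a factor-variable specification expands to; this assumes the auto-based example from #7 is in memory):

                    Code:
                    * plain x1 enters as a single continuous column, while i.x1
                    * expands to a base level (0b.x1) plus an indicator (1.x1)
                    fvexpand x1 i.x2 i.x1#i.x2
                    display "`r(varlist)'"
                    fvexpand i.x1 i.x2 i.x1#i.x2
                    display "`r(varlist)'"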



                    • #11
                      I can't fully explain the discrepancy, but in any event I would use factor-variable notation throughout. Otherwise a command like -margins- may get confused because x1 is being treated both as a continuous variable and as a categorical variable.
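
                      As a sketch of that point (variable names as in this thread):

                      Code:
                      * with factor-variable notation, -margins- knows x1 is categorical
                      reg y i.x1##i.x2
                      margins x1#x2        // predicted means for each of the four cells
                      margins, dydx(x1)    // discrete change in y as x1 goes 0 -> 1
                      * had x1 entered as a plain regressor, -margins x1- would not be available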
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      Stata Version: 17.0 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam



                      • #12
                        Originally posted by Sarah Thorne View Post
                        I would like to estimate the effect of x1, x2, and the interaction on the outcome y.

                        Code:
                        reg y x1 i.x2 i.x1#i.x2
                        produces different estimates for the coefficient on x1 than

                        Code:
                        reg y i.x1 i.x2 i.x1#i.x2
                        Your data example and output in #8 are helpful. The question back to you would be: why do you expect the coefficients to be the same when the interaction terms differ between the models? What would be surprising is if you found that the predicted values of the outcome across groups differed across the regressions. In the presence of interactions involving a variable, you cannot interpret the coefficient of that variable independently of the interaction term. Note that this has nothing to do with collinearity: with two binary variables, you can include at most one interaction combination, as the other three will be collinear. Let us look at your results and see if they are inconsistent.

                        Code:
                        reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                        
                              Source |       SS           df       MS      Number of obs   =       139
                        -------------+----------------------------------   F(3, 135)       =     22.46
                               Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                            Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                        -------------+----------------------------------   Adj R-squared   =    0.3181
                               Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                        
                        ------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                1.x1 |   .0106515    .066165     0.16   0.872    -.1202025    .1415055
                                2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                                     |
                               x1#x2 |
                                1 2  |   .0462199   .0943343     0.49   0.625    -.1403443    .2327841
                                     |
                               _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                        ------------------------------------------------------------------------------
                        Here, \(\hat{y}\) = 13.3107 if x1=0 and x2=1 (the intercept is the average predicted y at the base levels). If x1=1 and x2=1, we add the coefficient of x1 to the intercept: \(\hat{y}\) = 13.3107 + .0106515 = 13.3213515. If x1=0 and x2=2, \(\hat{y}\) = 13.3107 - .3913761 = 12.9193239. Finally, if x1=1 and x2=2, \(\hat{y}\) = 13.3107 + .0106515 - .3913761 + .0462199 = 12.9761953.
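
                        These cell means can also be reproduced directly, e.g. (a sketch assuming the same data and ID restriction as in #8):

                        Code:
                        reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                        margins i.x1#i.x2    // 13.3107, 13.3214, 12.9193, 12.9762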

                        On the other hand, with the second specification,

                        Code:
                        reg y x1 i.x2 i.x2#i.x1 if ID<100|(ID>=11578&ID<11700)
                        note: 2.x2#1.x1 omitted because of collinearity
                        
                              Source |       SS           df       MS      Number of obs   =       139
                        -------------+----------------------------------   F(3, 135)       =     22.46
                               Model |  4.24969982         3  1.41656661   Prob > F        =    0.0000
                            Residual |  8.51445871       135  .063070065   R-squared       =    0.3329
                        -------------+----------------------------------   Adj R-squared   =    0.3181
                               Total |  12.7641585       138  .092493902   Root MSE        =    .25114
                        
                        ------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                  x1 |   .0568714   .0672395     0.85   0.399    -.0761077    .1898505
                                2.x2 |  -.3913761   .0732045    -5.35   0.000    -.5361521   -.2466002
                                     |
                               x2#x1 |
                                1 1  |  -.0462199   .0943343    -0.49   0.625    -.2327841    .1403443
                                2 1  |          0  (omitted)
                                     |
                               _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                        ------------------------------------------------------------------------------
                        the intercept is the same, as the base levels in both models are identical, and therefore \(\hat{y}\) = 13.3107 if x1=0 and x2=1. If x1=1 and x2=1, \(\hat{y}\) = 13.3107 + .0568714 - .0462199 = 13.3213515 (note that the included interaction cell is x1=1, x2=1). If x1=0 and x2=2, \(\hat{y}\) = 13.3107 - .3913761 = 12.9193239. Finally, if x1=1 and x2=2, \(\hat{y}\) = 13.3107 + .0568714 - .3913761 = 12.9761953. So the predictions stay intact. Also note that we can change the coefficient of x2 by choosing a different interaction combination, e.g.,


                        Code:
                        . reg y 1.x1 i.x2 i.x2#i.x1
                        note: 2.x2#1.x1 omitted because of collinearity
                        
                              Source |       SS           df       MS      Number of obs   =       139
                        -------------+----------------------------------   F(3, 135)       =     22.46
                               Model |  4.24969941         3  1.41656647   Prob > F        =    0.0000
                            Residual |  8.51446284       135  .063070095   R-squared       =    0.3329
                        -------------+----------------------------------   Adj R-squared   =    0.3181
                               Total |  12.7641623       138  .092493929   Root MSE        =    .25114
                        
                        ------------------------------------------------------------------------------
                                   y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                  x1 |
                                  1  |   .0106515    .066165     0.16   0.872    -.1202025    .1415055
                                     |
                                2.x2 |  -.3451562   .0594984    -5.80   0.000    -.4628258   -.2274866
                                     |
                               x2#x1 |
                                2 0  |    -.04622   .0943343    -0.49   0.625    -.2327842    .1403442
                                2 1  |          0  (omitted)
                                     |
                               _cons |    13.3107   .0430697   309.05   0.000     13.22552    13.39588
                        ------------------------------------------------------------------------------
                        .
                        This does not change our conclusions. So in summary, do not simply look at estimates of variables in isolation if they are part of an interaction.
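
                        One more sketch of the same point: the effect of x1 within each level of x2 is identical under either parameterization, which -lincom- makes explicit after the first regression above:

                        Code:
                        * after: reg y i.x1##i.x2 if ID<100|(ID>=11578&ID<11700)
                        lincom 1.x1                // effect of x1 when x2==1: .0106515
                        lincom 1.x1 + 1.x1#2.x2    // effect of x1 when x2==2: .0568714,
                                                   // i.e. the "x1" coefficient in the second model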

                        P.S. You don't have to calculate the predictions manually as I did; simply run -margins i.x1#i.x2- after the regression.
                        Last edited by Andrew Musau; 24 Apr 2019, 13:45.



                        • #13
                          Originally posted by Andrew Musau View Post
                          So in summary, do not simply look at estimates of variables in isolation if they are part of an interaction.

                          ps. You don't have to calculate the predictions manually as I do. Simply run margins i.x1#i.x2 after the regression.
                          Thank you!!

