  • Interaction terms gone when using factor variable notation

    Dear all,

    I am running a linear probability model in Stata 15 using a mix of firm-level and country-level data. My dependent variable is a dummy (innov) that varies by firm, country, and survey year. Some of my regressors are firm-level variables; others are country-level variables (varying only by country and survey year). Below is an excerpt of my dataset.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 code float id int year byte(innov RD) double trade float REGION byte(REGION0 REGION1) float(tourXREGION1 tourXREGION2)
    "ALB" 3 2019 0 .                 . 2 0 0 . .
    "ALB" 3 2019 0 .                 . 2 0 0 . .
    "ALB" 3 2019 0 .                 . 2 0 0 . .
    "ARG" 4 2006 1 0 40.43347987191512 4 0 0 0 0
    "ARG" 4 2006 1 1 40.43347987191512 4 0 0 0 0
    "ARG" 4 2006 1 0 40.43347987191512 4 0 0 0 0
    "ARG" 4 2006 1 0 40.43347987191512 4 0 0 0 0
    "ARG" 4 2006 1 1 40.43347987191512 4 0 0 0 0
    "ARG" 4 2006 . . 40.43347987191512 4 0 0 0 0
    end
    label values innov H1
    label def H1 1 "Yes", modify
    label values RD H8
    label def H8 1 "Yes", modify
    I have included regional dummies (REGION) and their interactions with my variable of interest (tour) in the model. Below is the frequency distribution of REGION.
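         REGION |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |     15,323        9.65        9.65
              1 |     40,397       25.44       35.09
              2 |     17,531       11.04       46.13
              3 |      3,782        2.38       48.52
              4 |     32,331       20.36       68.88
              5 |      4,603        2.90       71.78
              6 |        835        0.53       72.30
              7 |     43,979       27.70      100.00
    ------------+-----------------------------------
          Total |    158,781      100.00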

    I am running the following model using factor variables, also adding country dummies (i.id) and year dummies (i.year):
    Code:
    reg innov i.REGION##c.tour RD size_num cert l_gdp_gr fdi_s_gdp trade i.year i.id, r
    When I do so, almost all interaction terms are omitted, as shown below, and the reason Stata gives is collinearity.
    [Image: out.png]


    However, when I generate the interactions manually and run the same model as follows, the interaction terms are not omitted (see below).
    Code:
    reg innov REGION1 REGION2 REGION3 REGION4 REGION5 REGION6 REGION7 tour tourXREGION1 tourXREGION2 tourXREGION3 tourXREGION4 tourXREGION5 tourXREGION6 tourXREGION7 RD size_num cert l_gdp_gr fdi_s_gdp  trade  i.year i.id, r
    [Image: out2.png]


    I was therefore wondering whether regressions using factor-variable notation can fail to yield results under some circumstances, and whether I can trust the results obtained from generating the interactions manually.

    I thank you for your clarifications,

    Best,

    Assi

  • #2
    I do not think you explained how Region relates to Country. Are countries part of regions? Or are regions part of countries?

    You should also show the code you used to generate the region interactions manually.


    • #3
      Countries are part of regions.
      This is the code I used to generate the interactions:
      Code:
      forvalues i = 1/7 {
          gen tourXREGION`i' = tour*REGION`i'
      }


      • #4
        Assi:
        Stata does not recognize interactions created by hand as interactions: it treats them as ordinary variables with no relationship to their component variables (you can easily check this by invoking -margins- after -regress- with hand-made interactions).
        This might be the reason for the different results when you compare your two models.
        My advice is to (always) rely on -fvvarlist- notation when you create interactions and/or categorical variables.
        Only a handful of community-contributed Stata programs, being a bit old-fashioned, do not support -fvvarlist- notation.
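The -margins- check mentioned above can be sketched as follows. This is only an illustration on the auto dataset, not the poster's data; the variable name forXmpg and the scalar b_fv are made up for the example. The point it is meant to show: the coefficients of the two fits agree, but -margins- only knows about the interaction in the factor-variable version.

```stata
sysuse auto, clear

* Factor-variable notation: -margins- knows that mpg enters via the interaction
regress price i.foreign##c.mpg
scalar b_fv = _b[1.foreign#c.mpg]    // keep the interaction coefficient for comparison
margins foreign, dydx(mpg)           // slope of mpg reported separately for each group

* Hand-made interaction (forXmpg is a hypothetical name)
generate double forXmpg = foreign*mpg
regress price i.foreign mpg forXmpg
display "interaction coefficient: " _b[forXmpg] " (factor-variable fit: " b_fv ")"
margins foreign, dydx(mpg)           // forXmpg is held fixed, so both groups show _b[mpg]
```

In the second -margins- call Stata has no way of knowing that forXmpg depends on mpg, which is why the reported slope does not vary by group there.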
        Kind regards,
        Carlo
        (Stata 19.0)


        • #5
          Carlo:
          I thank you for your answer. I understand your point about how Stata reads interactions created by hand. However, I still expected the same results for both models. When I started using fvvarlist notation, I used to also run the "by hand" model to check that both results match and make sure I made no mistake. For example, running both reg Y i.sex##c.age and reg Y sex age sex*age. And it always worked. This is the first time I have obtained different results.

          Does your advice imply that the "by hand" model results are not reliable? If so, I really want to follow the advice, but this means that I must not use my original specification as the interactions are omitted in the regression output. In addition, I doubt the collinearity issue (the reason given by Stata for omitting the interactions) is that severe in my dataset given the large number of observations used in the estimations (74,571).

          Kind regards,
          Assi


          • #6
            Assi:
            I've never delved into this issue, but my gut feeling is that, in your case, the problem comes from interacting two categorical variables by hand.
            I would not say that all regression models containing hand-made interactions are unreliable, but I do not see any gain in relying on -fvvarlist- and then creating interactions by hand just to compare the results of the two approaches (especially when, as in your case, the interaction includes two categorical variables).
            Last edited by Carlo Lazzaro; 01 Sep 2020, 10:21.
            Kind regards,
            Carlo
            (Stata 19.0)


            • #7
              Assi:
              in the following toy example I investigated the issue and found that, when the interaction is created by hand, Stata reads the second term in the interaction as if it were continuous instead of categorical:
              Code:
              . sysuse auto
              (1978 Automobile Data)
              
              . regress price i.foreign##i.rep78, allbase ///this is the approach I would use and recommend///
              note: 1.foreign#1b.rep78 identifies no observations in the sample
              note: 1.foreign#2.rep78 identifies no observations in the sample
              note: 1.foreign#5.rep78 omitted because of collinearity
              
                    Source |       SS           df       MS      Number of obs   =        69
              -------------+----------------------------------   F(7, 61)        =      0.39
                     Model |    24684607         7  3526372.43   Prob > F        =    0.9049
                  Residual |   552112352        61  9051022.16   R-squared       =    0.0428
              -------------+----------------------------------   Adj R-squared   =   -0.0670
                     Total |   576796959        68  8482308.22   Root MSE        =    3008.5
              
              -------------------------------------------------------------------------------
                      price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              --------------+----------------------------------------------------------------
                    foreign |
                  Domestic  |          0  (base)
                   Foreign  |   2088.167   2351.846     0.89   0.378     -2614.64    6790.974
                            |
                      rep78 |
                         1  |          0  (base)
                         2  |   1403.125   2378.422     0.59   0.557    -3352.823    6159.073
                         3  |   2042.574   2204.707     0.93   0.358    -2366.011    6451.159
                         4  |   1317.056   2351.846     0.56   0.578    -3385.751    6019.863
                         5  |       -360   3008.492    -0.12   0.905    -6375.851    5655.851
                            |
              foreign#rep78 |
                Domestic#1  |          0  (base)
                Domestic#2  |          0  (base)
                Domestic#3  |          0  (base)
                Domestic#4  |          0  (base)
                Domestic#5  |          0  (base)
                 Foreign#1  |          0  (empty)
                 Foreign#2  |          0  (empty)
                 Foreign#3  |  -3866.574   2980.505    -1.30   0.199    -9826.462    2093.314
                 Foreign#4  |  -1708.278   2746.365    -0.62   0.536    -7199.973    3783.418
                 Foreign#5  |          0  (omitted)
                            |
                      _cons |     4564.5   2127.325     2.15   0.036      310.651    8818.349
              -------------------------------------------------------------------------------
              
              . regress price i.foreign##c.rep78, allbase ///let's pretend that still using -fvvarlist- -rep78- is continuous///
              
                    Source |       SS           df       MS      Number of obs   =        69
              -------------+----------------------------------   F(3, 65)        =      0.13
                     Model |  3541309.46         3  1180436.49   Prob > F        =    0.9395
                  Residual |   573255649        65  8819317.68   R-squared       =    0.0061
              -------------+----------------------------------   Adj R-squared   =   -0.0397
                     Total |   576796959        68  8482308.22   Root MSE        =    2969.7
              
              ---------------------------------------------------------------------------------
                        price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              ----------------+----------------------------------------------------------------
                      foreign |
                    Domestic  |          0  (base)
                     Foreign  |  -2717.657   4335.068    -0.63   0.533    -11375.39    5940.072
                              |
                        rep78 |  -73.56917   517.1275    -0.14   0.887    -1106.344    959.2058
                              |
              foreign#c.rep78 |
                    Domestic  |          0  (base)
                     Foreign  |   630.3747   1060.592     0.59   0.554    -1487.773    2748.522
                              |
                        _cons |    6401.49   1619.897     3.95   0.000     3166.332    9636.649
              ---------------------------------------------------------------------------------
              
              . g foreign_rep78_interacted=foreign * rep78 if foreign==0
              (26 missing values generated)
              
              . replace foreign_rep78_interacted=foreign * rep78 if foreign==1 & foreign_rep78_interacted==.
              (21 real changes made)
              
              . regress price foreign rep78 foreign_rep78_interacted , allbase ///let's create interaction by hand and see that the results are the same as in the previous (wrong) specification///
              
                    Source |       SS           df       MS      Number of obs   =        69
              -------------+----------------------------------   F(3, 65)        =      0.13
                     Model |  3541309.46         3  1180436.49   Prob > F        =    0.9395
                  Residual |   573255649        65  8819317.68   R-squared       =    0.0061
              -------------+----------------------------------   Adj R-squared   =   -0.0397
                     Total |   576796959        68  8482308.22   Root MSE        =    2969.7
              
              ------------------------------------------------------------------------------------------
                                 price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------------------+----------------------------------------------------------------
                               foreign |  -2717.657   4335.068    -0.63   0.533    -11375.39    5940.072
                                 rep78 |  -73.56917   517.1275    -0.14   0.887    -1106.344    959.2058
              foreign_rep78_interacted |   630.3747   1060.592     0.59   0.554    -1487.773    2748.522
                                 _cons |    6401.49   1619.897     3.95   0.000     3166.332    9636.649
              ------------------------------------------------------------------------------------------
              
              .
              Kind regards,
              Carlo
              (Stata 19.0)


              • #8
                Dear Carlo,
                Thank you for investigating the issue more deeply. I might not have been clear enough about my interaction variables, but the variable tour is continuous. So this is an interaction between a categorical variable and a continuous variable, not between two categorical variables. As your example showed, I should therefore have obtained the same results for both models (the by-hand model and the fvvarlist model).
                Kind regards,
                Assi


                • #9
                  Probably the best way to troubleshoot is to begin with a minimal model, without the robust option, containing only
                  reg innov i.REGION##c.tour
                  and then to add one variable at a time until the problem arises.
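The incremental procedure described above could be scripted roughly as follows. This is a sketch that assumes the original poster's dataset is in memory and uses the variable names from post #1; the local macro names rhs and v are arbitrary.

```stata
* Start from the minimal model and add one regressor (or set of dummies) at a time
local rhs "i.REGION##c.tour"
regress innov `rhs'

foreach v in RD size_num cert l_gdp_gr fdi_s_gdp trade i.year i.id {
    local rhs "`rhs' `v'"
    display as text _newline "--- added: `v' ---"
    regress innov `rhs'      // look for "omitted because of collinearity" notes
}
```

Note that the estimation sample can shrink as regressors with missing values enter, so it is worth watching the number of observations as well as the omitted terms.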


                  • #10
                    Dear Eric,
                    Thanks for joining in. I had already tried this incremental procedure, and it showed that the problem arises when I include the country dummies (i.id). My concern is that, whatever variables I include, I get coefficients for the interaction terms when I generate the interactions by hand; the problem occurs only when I use fvvarlist notation for the interactions. I was therefore wondering whether Stata's underlying computation with fvvarlist notation can fail to yield results in some cases, and whether I can rely on the results from the model with hand-made interactions.

                    Kind regards,
                    Assi


                    • #11
                      Assi:
                      it may well be that the issue arises when the interaction includes a continuous variable and a categorical variable with 3 or more levels (as in your case and in the following toy-example):
                      Code:
                      . use "C:\Program Files\Stata16\ado\base\a\auto.dta"
                      (1978 Automobile Data)
                      
                      . regress price c.mpg##i.rep78
                      
                            Source |       SS           df       MS      Number of obs   =        69
                      -------------+----------------------------------   F(9, 59)        =      3.65
                             Model |   206362465         9  22929162.7   Prob > F        =    0.0011
                          Residual |   370434494        59  6278550.75   R-squared       =    0.3578
                      -------------+----------------------------------   Adj R-squared   =    0.2598
                             Total |   576796959        68  8482308.22   Root MSE        =    2505.7
                      
                      ------------------------------------------------------------------------------
                             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                               mpg |  -123.1667      590.6    -0.21   0.836    -1304.955    1058.621
                                   |
                             rep78 |
                                2  |   10881.12   13452.68     0.81   0.422    -16037.64    37799.87
                                3  |   8973.281   12725.58     0.71   0.483    -16490.55    34437.11
                                4  |   651.7399    12823.1     0.05   0.960    -25007.23    26310.71
                                5  |   4363.191   12794.52     0.34   0.734    -21238.58    29964.96
                                   |
                       rep78#c.mpg |
                                2  |  -507.6563   642.1123    -0.79   0.432     -1792.52    777.2075
                                3  |  -375.7209   601.1921    -0.62   0.534    -1578.704    827.2618
                                4  |   43.26329   603.3025     0.07   0.943    -1163.942    1250.469
                                5  |  -81.52802     597.53    -0.14   0.892    -1277.183    1114.127
                                   |
                             _cons |       7151   12528.52     0.57   0.570    -17918.51    32220.51
                      ------------------------------------------------------------------------------
                      
                      . g rep78_mpg= rep78*mpg
                      (5 missing values generated)
                      
                      . regress price rep78 mpg rep78_mpg
                      
                            Source |       SS           df       MS      Number of obs   =        69
                      -------------+----------------------------------   F(3, 65)        =      9.59
                             Model |   177021707         3  59007235.6   Prob > F        =    0.0000
                          Residual |   399775252        65   6150388.5   R-squared       =    0.3069
                      -------------+----------------------------------   Adj R-squared   =    0.2749
                             Total |   576796959        68  8482308.22   Root MSE        =      2480
                      
                      ------------------------------------------------------------------------------
                             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      -------------+----------------------------------------------------------------
                             rep78 |  -2015.191   1217.096    -1.66   0.103    -4445.899     415.517
                               mpg |  -779.9662   228.8818    -3.41   0.001    -1237.075   -322.8578
                         rep78_mpg |   124.4424   54.32953     2.29   0.025     15.93881     232.946
                             _cons |   20305.01   4828.185     4.21   0.000     10662.46    29947.56
                      ------------------------------------------------------------------------------
                      
                      . regress price c.mpg##c.rep78
                      
                            Source |       SS           df       MS      Number of obs   =        69
                      -------------+----------------------------------   F(3, 65)        =      9.59
                             Model |   177021707         3  59007235.6   Prob > F        =    0.0000
                          Residual |   399775252        65   6150388.5   R-squared       =    0.3069
                      -------------+----------------------------------   Adj R-squared   =    0.2749
                             Total |   576796959        68  8482308.22   Root MSE        =      2480
                      
                      -------------------------------------------------------------------------------
                              price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                      --------------+----------------------------------------------------------------
                                mpg |  -779.9662   228.8818    -3.41   0.001    -1237.075   -322.8578
                              rep78 |  -2015.191   1217.096    -1.66   0.103    -4445.899     415.517
                                    |
                      c.mpg#c.rep78 |   124.4424   54.32953     2.29   0.025     15.93881     232.946
                                    |
                              _cons |   20305.01   4828.185     4.21   0.000     10662.46    29947.56
                      -------------------------------------------------------------------------------
                      
                      .
                      Kind regards,
                      Carlo
                      (Stata 19.0)


                      • #12
                        Carlo,
                        I don't think the issue arises when the interaction includes a continuous variable and a categorical variable with 3 or more levels. When I rerun the first regression of your example after generating the interactions by hand, I find exactly the same results you got using fvvarlist notation.
                        Code:
                        . sysuse auto, clear
                        (1978 Automobile Data)
                        
                        . forval i=2/5 {
                          2.
                        . g rep78_mpg`i'=mpg if rep78==`i'
                          3.
                        . replace rep78_mpg`i'=0 if rep78!=`i' & rep78!=.
                          4.
                        . }
                        (66 missing values generated)
                        (61 real changes made)
                        (44 missing values generated)
                        (39 real changes made)
                        (56 missing values generated)
                        (51 real changes made)
                        (63 missing values generated)
                        (58 real changes made)
                        
                        . forval i=2/5 {
                          2.
                        . g rep78_c`i'=1 if rep78==`i'
                          3.
                        . replace rep78_c`i'=0 if rep78!=`i' & rep78!=.
                          4.
                        . }
                        (66 missing values generated)
                        (61 real changes made)
                        (44 missing values generated)
                        (39 real changes made)
                        (56 missing values generated)
                        (51 real changes made)
                        (63 missing values generated)
                        (58 real changes made)
                        
                        . regress price mpg rep78_c* rep78_mpg*
                        
                              Source |       SS           df       MS      Number of obs   =        69
                        -------------+----------------------------------   F(9, 59)        =      3.65
                               Model |   206362465         9  22929162.7   Prob > F        =    0.0011
                            Residual |   370434494        59  6278550.75   R-squared       =    0.3578
                        -------------+----------------------------------   Adj R-squared   =    0.2598
                               Total |   576796959        68  8482308.22   Root MSE        =    2505.7
                        
                        ------------------------------------------------------------------------------
                               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                        -------------+----------------------------------------------------------------
                                 mpg |  -123.1667      590.6    -0.21   0.836    -1304.955    1058.621
                            rep78_c2 |   10881.12   13452.68     0.81   0.422    -16037.64    37799.87
                            rep78_c3 |   8973.281   12725.58     0.71   0.483    -16490.55    34437.11
                            rep78_c4 |   651.7399    12823.1     0.05   0.960    -25007.23    26310.71
                            rep78_c5 |   4363.191   12794.52     0.34   0.734    -21238.58    29964.96
                          rep78_mpg2 |  -507.6563   642.1123    -0.79   0.432     -1792.52    777.2075
                          rep78_mpg3 |  -375.7209   601.1921    -0.62   0.534    -1578.704    827.2618
                          rep78_mpg4 |   43.26329   603.3025     0.07   0.943    -1163.942    1250.469
                          rep78_mpg5 |  -81.52802     597.53    -0.14   0.892    -1277.183    1114.127
                               _cons |       7151   12528.52     0.57   0.570    -17918.51    32220.51
                        ------------------------------------------------------------------------------
                        
                        .


                        • #13
                          Carlo:
                          I have got the solution from Stata technical support. The results from my by-hand model are correct. Another way to obtain the same results with factor-variable notation is to use the 'o.' operator to omit the same covariates that -regress- drops because of collinearity when I include the hand-made interactions.

                          For example: regress innov i.REGION##c.tour i.year o(24 86 101 105 107 111 118 119 132 136 137 141 143 144).id
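One way to check whether two specifications differ only in which collinear terms get dropped is to compare the sample size, the rank, and the residual sum of squares of the two fits. The sketch below assumes the poster's dataset is in memory, that REGION1 to REGION7 and tourXREGION1 to tourXREGION7 already exist, and that it is run from a do-file (because of the /// line continuations). The scalar names N_fv, rank_fv, and rss_fv are made up for the example.

```stata
* Factor-variable specification from post #1
regress innov i.REGION##c.tour RD size_num cert l_gdp_gr fdi_s_gdp trade ///
    i.year i.id, vce(robust)
scalar N_fv    = e(N)
scalar rank_fv = e(rank)
scalar rss_fv  = e(rss)

* Specification with hand-made interactions from post #1
regress innov REGION1 REGION2 REGION3 REGION4 REGION5 REGION6 REGION7 tour ///
    tourXREGION1 tourXREGION2 tourXREGION3 tourXREGION4 tourXREGION5       ///
    tourXREGION6 tourXREGION7 RD size_num cert l_gdp_gr fdi_s_gdp trade    ///
    i.year i.id, vce(robust)

* If both fits use the same sample and span the same column space,
* these should agree even though different terms are flagged as omitted
display "N:    " e(N)    " vs " N_fv
display "rank: " e(rank) " vs " rank_fv
display "relative difference in RSS: " reldif(e(rss), rss_fv)
```

If the three quantities match, the two models are the same fit under different parameterizations, and the choice of which terms to omit (for instance with the o. operator, as above) only changes which coefficients are reported.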

                          Massive thanks for bearing with me in finding a solution.

                          Best,
                          Assi


                          • #14
                            Assi:
                            very interesting thread indeed.
                            Thanks for opening it and sharing the fix provided by Stata Technical Support.
                            Kind regards,
                            Carlo
                            (Stata 19.0)


                            • #15
                              Thanks for reporting back. You did not tell us that other variables were also dropped because of collinearity. This is why reporting the full output is so necessary.
