  • Regression analyses including interaction between dummy variables

    Dear all,

    I have some doubts about my regression model. I am exploring the effect of two categorical variables ("fk1n", with 3 levels (1-2-3), and "ctq_total_80p", with 2 levels (1-2)) on the continuous outcome "CAPE_NEG". I also want to explore the effect of the interaction between the two variables on the outcome. In all analyses I include Age, Sex and Sample as covariates.

    For the main effects I have had no problem:
    xi: regress CAPE_NEG i.fk1n Sex Age Sample, beta
    i.fk1n _Ifk1n_1-3 (naturally coded; _Ifk1n_1 omitted)

    Source | SS df MS Number of obs = 773
    -------------+------------------------------ F( 5, 767) = 1.67
    Model | 235.003861 5 47.0007722 Prob > F = 0.1386
    Residual | 21541.9172 767 28.0859416 R-squared = 0.0108
    -------------+------------------------------ Adj R-squared = 0.0043
    Total | 21776.9211 772 28.208447 Root MSE = 5.2996

    ------------------------------------------------------------------------------
    CAPE_NEG | Coef. Std. Err. t P>|t| Beta
    -------------+----------------------------------------------------------------
    _Ifk1n_2 | .2063414 .4034761 0.51 0.609 .0192905
    _Ifk1n_3 | .1708133 .6607577 0.26 0.796 .0097483
    Sex | -.3892154 .4649464 -0.84 0.403 -.0308724
    Age | -.1064209 .0464872 -2.29 0.022 -.0826285
    Sample | .6120988 .4159441 1.47 0.142 .0542749
    _cons | 10.80678 1.237075 8.74 0.000 .
    ------------------------------------------------------------------------------

    I used the xi prefix so that I could later compare fk1n levels 2 and 3.

    To explore the interaction effect of fk1n and ctq_total_80p on CAPE_NEG, I first ran the regression with the interaction term specified as "i.ctq_total_80p##i.fk1n". However, that output did not show an effect for every combination of the two factors, so I created a new variable for the interaction:

    tab fk1n ctq_total_80p, nolabel

    FKBP5_1_rs | CTQ_TOTAL_80p
    3800373 | 1 2 | Total
    -----------+----------------------+----------
    1 | 289 68 | 357
    2 | 265 77 | 342
    3 | 62 16 | 78
    -----------+----------------------+----------
    Total | 616 161 | 777

    . tab totalxfk1n
    totalxfk1n | Freq. Percent Cum.
    ------------+-----------------------------------
    11 | 289 37.19 37.19
    12 | 265 34.11 71.30
    13 | 62 7.98 79.28
    21 | 68 8.75 88.03
    22 | 77 9.91 97.94
    23 | 16 2.06 100.00
    ------------+-----------------------------------
    Total | 777 100.00



    and then added it to the regression model. However, when I include this interaction variable in the model (keeping the main effects), Stata omits some terms because of collinearity, as shown below. How can I do this? I also need to include the main effects... Should I do it in a different way?

    xi: regress CAPE_NEG i.ctq_total_80p i.fk1n i.totalxfk1n Sex Age Sample, beta
    i.ctq_total_80p _Ictq_total_1-2 (naturally coded; _Ictq_total_2 omitted)
    i.fk1n _Ifk1n_1-3 (naturally coded; _Ifk1n_3 omitted)
    i.totalxfk1n _Itotalxfk1_11-23 (naturally coded; _Itotalxfk1_11 omitted)
    note: _Ifk1n_2 omitted because of collinearity
    note: _Itotalxfk1_13 omitted because of collinearity
    note: _Itotalxfk1_21 omitted because of collinearity

    Source | SS df MS Number of obs = 772
    -------------+------------------------------ F( 8, 763) = 4.98
    Model | 1078.82213 8 134.852766 Prob > F = 0.0000
    Residual | 20680.3009 763 27.1039331 R-squared = 0.0496
    -------------+------------------------------ Adj R-squared = 0.0396
    Total | 21759.1231 771 28.2219495 Root MSE = 5.2061

    --------------------------------------------------------------------------------
    CAPE_NEG | Coef. Std. Err. t P>|t| Beta
    ---------------+----------------------------------------------------------------
    _Ictq_total_1 | -3.43665 .7063009 -4.87 0.000 -.2629902
    _Ifk1n_1 | -.6494848 .7304092 -0.89 0.374 -.0609708
    _Ifk1n_2 | 0 (omitted) 0
    _Itotalxfk1_12 | -.2175991 .7355693 -0.30 0.767 -.0194072
    _Itotalxfk1_13 | 0 (omitted) 0
    _Itotalxfk1_21 | 0 (omitted) 0
    _Itotalxfk1_22 | -1.832638 1.13418 -1.62 0.107 -.1034392
    _Itotalxfk1_23 | -2.420919 1.620644 -1.49 0.136 -.0649639
    Sex | -.3537958 .4569318 -0.77 0.439 -.028069
    Age | -.1329175 .0459178 -2.89 0.004 -.103241
    Sample | .4442104 .4101547 1.08 0.279 .0393917
    _cons | 14.99289 1.567061 9.57 0.000 .
    --------------------------------------------------------------------------------



    Hope I have explained myself clearly.
    Thanks in advance.

    With kind regards,

    Marta.

  • #2
    Hello Marta,

    Welcome to the Stata Forum.

    It is difficult to read the output this way. Please provide the commands and results under CODE delimiters, as recommended in the FAQ.

    That said, you don't need the "xi" prefix, even if you wish to test differences between arms.

    For the interaction terms, you can simply use the factor-variable prefixes "i." (categorical) and/or "c." (continuous) together with the "#" operator.
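
    A minimal sketch of both points, assuming the dataset from #1 is in memory with the same variable names (not run against the actual data):

    ```stata
    * Factor-variable notation replaces the xi: prefix; level 1 of fk1n is the default base
    regress CAPE_NEG i.fk1n Sex Age Sample, beta

    * Compare level 2 with level 3 directly
    test 2.fk1n = 3.fk1n

    * Or request all pairwise comparisons among the three levels
    pwcompare fk1n, effects
    ```

    The coefficients on 2.fk1n and 3.fk1n should match _Ifk1n_2 and _Ifk1n_3 from the xi run, since both use level 1 as the base.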

    Finally, I'm not sure if I got it right, but it seems you created new variables for the second model, some of them being a total, some of them being a fraction of the same variable.

    I believe this was the reason for the collinearity, and Stata dutifully omitted such variables.
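
    One way to check this, sketched under the assumption that totalxfk1n is non-missing for exactly the observations where both fk1n and ctq_total_80p are non-missing: the two factors define 2 x 3 = 6 cells, so i.totalxfk1n already contributes 5 indicators, which span the same space as the 1 + 2 main-effect indicators plus the 2 interaction indicators. Adding the main effects on top leaves 3 redundant terms, matching the three "omitted because of collinearity" notes. If that reasoning holds, the two commands below should report the same R-squared and residual sum of squares:

    ```stata
    * Six-category cell variable alone: 5 indicators plus covariates
    regress CAPE_NEG i.totalxfk1n Sex Age Sample

    * Full factorial specification: 3 main-effect + 2 interaction indicators
    regress CAPE_NEG i.ctq_total_80p##i.fk1n Sex Age Sample
    ```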

    Best,

    Marcos


    • #3
      Thanks Marcos. Sorry, here are the commands again (I hope I have done it right this time):

      Main effects:
      Code:
      xi: regress CAPE_NEG i.fk1n Sex Age Sample, beta
      i.fk1n            _Ifk1n_1-3          (naturally coded; _Ifk1n_1 omitted)
      
            Source |       SS       df       MS              Number of obs =     773
      -------------+------------------------------           F(  5,   767) =    1.67
             Model |  235.003861     5  47.0007722           Prob > F      =  0.1386
          Residual |  21541.9172   767  28.0859416           R-squared     =  0.0108
      -------------+------------------------------           Adj R-squared =  0.0043
             Total |  21776.9211   772   28.208447           Root MSE      =  5.2996
      
      ------------------------------------------------------------------------------
          CAPE_NEG |      Coef.   Std. Err.      t    P>|t|                     Beta
      -------------+----------------------------------------------------------------
          _Ifk1n_2 |   .2063414   .4034761     0.51   0.609                 .0192905
          _Ifk1n_3 |   .1708133   .6607577     0.26   0.796                 .0097483
               Sex |  -.3892154   .4649464    -0.84   0.403                -.0308724
               Age |  -.1064209   .0464872    -2.29   0.022                -.0826285
            Sample |   .6120988   .4159441     1.47   0.142                 .0542749
             _cons |   10.80678   1.237075     8.74   0.000                        .
      ------------------------------------------------------------------------------


      New variable (interaction) calculation: it is a completely new variable coding each combination of the two variables. The first digit is ctq_total and the second is fk1n, so subjects with ctq_total 1 and fk1n 1 are coded 11, subjects with ctq_total 1 and fk1n 2 are coded 12, and so on.
      Code:
       tab fk1n ctq_total_80p, nolabel
      
      FKBP5_1_rs |     CTQ_TOTAL_80p
         3800373 |         1          2 |     Total
      -----------+----------------------+----------
               1 |       289         68 |       357 
               2 |       265         77 |       342 
               3 |        62         16 |        78 
      -----------+----------------------+----------
           Total |       616        161 |       777 
      
      
      . tab totalxfk1n
      
       totalxfk1n |      Freq.     Percent        Cum.
      ------------+-----------------------------------
               11 |        289       37.19       37.19
               12 |        265       34.11       71.30
               13 |         62        7.98       79.28
               21 |         68        8.75       88.03
               22 |         77        9.91       97.94
               23 |         16        2.06      100.00
      ------------+-----------------------------------
            Total |        777      100.00

      Regression with interaction:

      Code:
      xi: regress CAPE_NEG i.ctq_total_80p i.fk1n i.totalxfk1n Sex Age Sample, beta
      i.ctq_total_80p   _Ictq_total_1-2     (naturally coded; _Ictq_total_2 omitted)
      i.fk1n            _Ifk1n_1-3          (naturally coded; _Ifk1n_1 omitted)
      i.totalxfk1n      _Itotalxfk1_11-23   (naturally coded; _Itotalxfk1_11 omitted)
      note: _Itotalxfk1_12 omitted because of collinearity
      note: _Itotalxfk1_13 omitted because of collinearity
      note: _Itotalxfk1_21 omitted because of collinearity
      
            Source |       SS       df       MS              Number of obs =     772
      -------------+------------------------------           F(  8,   763) =    4.98
             Model |  1078.82213     8  134.852766           Prob > F      =  0.0000
          Residual |  20680.3009   763  27.1039331           R-squared     =  0.0496
      -------------+------------------------------           Adj R-squared =  0.0396
             Total |  21759.1231   771  28.2219495           Root MSE      =  5.2061
      
      --------------------------------------------------------------------------------
            CAPE_NEG |      Coef.   Std. Err.      t    P>|t|                     Beta
      ---------------+----------------------------------------------------------------
       _Ictq_total_1 |   -3.43665   .7063009    -4.87   0.000                -.2629902
            _Ifk1n_2 |   .4318857   .4458453     0.97   0.333                 .0403723
            _Ifk1n_3 |   .6494848   .7304092     0.89   0.374                 .0368695
      _Itotalxfk1_12 |          0  (omitted)                                         0
      _Itotalxfk1_13 |          0  (omitted)                                         0
      _Itotalxfk1_21 |          0  (omitted)                                         0
      _Itotalxfk1_22 |  -1.615039   .9751123    -1.66   0.098                -.0911573
      _Itotalxfk1_23 |  -2.420919   1.620644    -1.49   0.136                -.0649639
                 Sex |  -.3537958   .4569318    -0.77   0.439                 -.028069
                 Age |  -.1329175   .0459178    -2.89   0.004                 -.103241
              Sample |   .4442104   .4101547     1.08   0.279                 .0393917
               _cons |   14.34341   1.397474    10.26   0.000                        .
      --------------------------------------------------------------------------------

      For the interaction terms, if I use "#", the main effects of both variables are not included in the model unless I add them manually, right?


      Thanks,




      • #4
        Hello Marta, to include main effect plus interaction, you may wish to use "##" instead of "#". That said, you will still need to cope with collinearity.
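
        A sketch of the equivalence, using the variable names from #1 and assuming that dataset is in memory. With a 2-level and a 3-level factor there are only (2-1) x (3-1) = 2 interaction coefficients, which is why not every combination gets its own row in the output:

        ```stata
        * A##B is shorthand for A B A#B
        regress CAPE_NEG i.ctq_total_80p##i.fk1n Sex Age Sample, beta

        * The same model with the main effects written out by hand
        regress CAPE_NEG i.ctq_total_80p i.fk1n i.ctq_total_80p#i.fk1n Sex Age Sample, beta

        * Joint test of the two interaction coefficients
        testparm i.ctq_total_80p#i.fk1n
        ```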

        Generally speaking, I'd suggest checking correlations, perhaps excluding a collinear variable, or using continuous variables instead of categorizing them, etc.

        Maybe you should first reflect on the rationale for the basic model itself, because it gave a non-significant p-value in the omnibus F test and the residuals are huge.

        Best,

        Marcos


        • #5
          Marta:
          as Marcos has already pointed out, your model basically fails to explain the variation in the dependent variable given the chosen set of predictors.
          Actually, despite a quite large sample and a limited number of independent variables, your R-sq is dramatically low.
          I would recommend that you skim the literature in your research field and see whether better specifications are reported.
          That said, unless you're using a quite old Stata release (in which case you should tell the list since, by default, we are all assumed to use the latest Stata version, 14.2 as of today), please note that factor-variable notation (-fvvarlist-) has largely superseded the -xi- prefix as far as categorical variables and interactions are concerned.
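
          For instance, once the model is fitted with factor-variable notation, -margins- can display quantities that the coefficient table does not show directly. A hedged sketch, assuming the dataset and variable names from #1:

          ```stata
          regress CAPE_NEG i.ctq_total_80p##i.fk1n Sex Age Sample

          * Adjusted mean of CAPE_NEG in each of the six ctq_total_80p x fk1n cells
          margins ctq_total_80p#fk1n

          * Difference between ctq_total_80p levels 2 and 1 within each fk1n level
          margins fk1n, dydx(ctq_total_80p)
          ```
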
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Dear Marcos and Carlo,

            Thank you for the answers.

            I have now checked the correlations between the variables and, as expected, ctq_total is highly correlated with the interaction variable (0.99). Given that one variable is calculated from the other, wouldn't that be expected? And how would excluding ctq_total, for example, affect my model? Would the main effect of this variable still be accounted for through the inclusion of totalxfk1n? Maybe these are very basic questions... my apologies for my ignorance.

            My regression model tries to explain the relation of a biological variable (fk1n) and an environmental variable (ctq_total) to a psychological trait (CAPE_NEG). I have read that low R-squared values are to be expected, given that human behaviors are difficult to predict. In any case, I do not expect my interaction to predict the outcome strongly, as I believe many more biological variables affect this trait. So, despite this low R-sq, I can still draw conclusions from my results, is this right? There is a study similar to mine, but using continuous ctq_total, which also reports very low R-sq: "In the DS [discovery sample], the interaction accounted for 1.9% of the variance of positive PEs [positive PEs is a similar measure as CAPE_NEG] (...). Similarly, in the RS [replication sample], the interaction accounted for 6.6% of the variance (...)".



            Kind regards,


            Marta.



            • #7
              Marta:
              there are some points in your answer that I'm not clear about:
              - in regression jargon R-sq and prediction are different beasts;
              - you should have substantive (i.e., theory-based) reasons to include an interaction in your regression model;
              - if you consider that other predictors can contribute to explain the variation in the dependent variable and you do not have them in your dataset, your regression estimates might be biased due to omitted variable bias and endogeneity.

              As an aside, please consider that posting what you typed and what Stata gave you back (see FAQ #12 on how to do it properly) is worth more than tons of words describing what you did and what you got after you did it. Thanks.
              Kind regards,
              Carlo
              (Stata 19.0)



              • #8
                Dear Marta,

                Apart from Carlo's insightful remarks, I feel that your comments in #6 touch on the theory underlying regression analysis. What is more, they corroborate what has been said in #2 and #4, namely: the use of fractions and totals will lead to collinearity; relying on categorization instead of using continuous variables may also harm the model; and checking correlations helps to identify the variables most involved in the collinearity.

                But the most important aspect in my opinion is the very fact that the basic model seems to be "problematic", so to speak, not only due to the tiny R-squared, but also as a consequence of the non-significant "omnibus" F test and, on top of that, the huge residuals.

                In short, it means that there is too much left to be explained "outside" the model. What is more, "inside" the model, there is apparently not much to justify the use of the selected predictors.

                Sorry to say that, but I sincerely fear that, on such grounds, maybe adding interaction terms to the model won't help to improve it significantly.

                Best,

                Marcos