  • Main Effects and Interaction Effects in Count Data Model

    Dear all,

    I am estimating a count data model that predicts traffic crashes as a function of explanatory variables. While controlling for other factors, my key focus is the effect of "variation in a covariate" (which happens to be an interaction term) on the response. I would like expert opinion on whether it is reasonable to exclude the "main effects" and keep only the "interaction effect" in the model. I have followed the discussion at http://www.statalist.org/forums/forum/general-stata-discussion/general/1374798-how-does-the-interpretation-change-if-i-drop-the-linear-terms,
    but I want to check my understanding against the data I have.

    Consider the data description:
    Code:
    avecrash     // response outcome
    meanspeed    // average speed (main effect)
    sdspeed      // standard deviation of speed (main effect)
    covspeed     // interaction term: coefficient of variation of the two variables above, i.e. sdspeed/meanspeed
    The two models are:
    Code:
    nbreg avecrash meanspeed sdspeed covspeed // Model 1
    nbreg avecrash covspeed // Model 2
    Following Professor Phil's comment in the thread linked above, it is seldom desirable to include an interaction without its main effects, because the second specification (Model 2 above) forces the influence of x2 (say sdspeed in my case) to be 0 when x1 (say meanspeed) equals 0, while the first specification (Model 1 above) imposes no such restriction.

    Here is my question: in my case, the interaction term is built from two variables that both derive from the same underlying variable, "overall speed". Mean speed and the standard deviation of speed are calculated from it and then combined into the interaction. So, in my data, there is no case where meanspeed is zero while sdspeed is non-zero, and such a case is not possible conceptually either. In other words, if I understand the concept correctly, excluding the main effects does not force the coefficient of meanspeed to be zero when sdspeed equals zero (and vice versa), because no such case exists in my data. Any guidance would be highly appreciated.
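
    For reference, the two specifications can be written out as below. This is only a sketch: the beta symbols are notation introduced here, not taken from any output, and the log link is assumed because that is what nbreg uses for the mean. Note that covspeed is a ratio rather than a product of the two speed variables.

```latex
\begin{align*}
\text{Model 1:}\quad \log E[\text{avecrash}] &= \beta_0 + \beta_1\,\text{meanspeed} + \beta_2\,\text{sdspeed}
      + \beta_3\,\frac{\text{sdspeed}}{\text{meanspeed}} \\
\text{Model 2:}\quad \log E[\text{avecrash}] &= \beta_0 + \beta_3\,\frac{\text{sdspeed}}{\text{meanspeed}}
\end{align*}
% Model 2 is Model 1 with \beta_1 = \beta_2 = 0 imposed. Under Model 2:
\begin{align*}
\frac{\partial \log E[\text{avecrash}]}{\partial\,\text{sdspeed}} &= \frac{\beta_3}{\text{meanspeed}}, &
\frac{\partial \log E[\text{avecrash}]}{\partial\,\text{meanspeed}} &= -\,\beta_3\,\frac{\text{sdspeed}}{\text{meanspeed}^2}
\end{align*}
```

    Written this way, dropping the main effects ties the effects of both speed variables to the single parameter beta_3 at every point in the data, not only at zero.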

    Below are the descriptive statistics for clarity.

    [Image: sdsds.JPG, table of descriptive statistics (not reproduced)]

    -Behram

  • #2
    Stata does not know what your variables mean, nor what is possible or impossible on substantive grounds. So it will impose that constraint, even though that combination of values does not occur in your data.

    The basic idea is easier to visualize if we look at whether or not to include the constant. Consider the example below, where we look at the association between hourly wage and hours worked. We may safely assume that if someone works 0 hours she (the example data is for women only) will get a wage of 0. So why not include that assumption in our model by excluding the constant? As you can see in the graph below, forcing the regression line through the point (0,0) substantially worsens the fit of the model, even though that point is not present in the data. The same thing will happen when you exclude the main effects in your model: it will have a big influence on your results, even though the point 0 is not present, and cannot occur, in your data.

    Code:
    // open some example data
    sysuse nlsw88, clear
    
    // nobody works 0 hours
    sum hours
    
    // with constant
    reg wage hours
    predict xb1
    
    // without constant
    reg wage hours, nocons
    predict xb2
    
    // which line fits better?
    scatter wage hours, msymbol(oh) mcolor(gs8) || ///
    line xb1 xb2 hours, sort                       ///
        legend(order(2 "with" "constant"           ///
                     3 "without" "constant"))      ///
        ytitle(hourly wage)
    [Image: Graph.png, scatter plot of hourly wage against hours with the two fitted lines, with and without constant (not reproduced)]
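
    Since the graph is not reproduced here, a numerical counterpart may help. The sketch below is an addition to the example above, not part of it: it compares the root mean squared error that regress stores in e(rmse) for the two fits. RMSE is used rather than R-squared because R-squared is computed differently when the constant is suppressed.

```stata
// sketch: compare the fit with and without a constant numerically
sysuse nlsw88, clear

// with constant
quietly regress wage hours
display "RMSE with constant:    " e(rmse)

// without constant (line forced through the origin)
quietly regress wage hours, noconstant
display "RMSE without constant: " e(rmse)
```

    A larger RMSE for the second model would reflect the same loss of fit that the graph shows.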

    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Dear Maarten,

      Thank you for your detailed and helpful response. I understood your valid point.

      However, if I decide to include the "conditional effects" in addition to the interaction effect, I have difficulty interpreting the results intuitively.

      For example, model with only interaction effect is:
      Code:
      poisson avecrash covspeed lnadtmaj lnadtmin x4leg totnumleft rangespeed if sigornot == 1
      [Image: inteffect.JPG, estimation output for the model with only the interaction effect (not reproduced)]

      The model with both conditional and interaction effects is:
      Code:
      poisson avecrash meanspeed sdspeed covspeed lnadtmaj lnadtmin x4leg totnumleft rangespeed if sigornot == 1
      [Image: intmaineffects.JPG, estimation output for the model with conditional and interaction effects (not reproduced)]


      The interaction effect is significant in both of the above models. Now, how can I interpret the conditional effect of "sdspeed" (or the conditional effect of meanspeed) in the second model? For example: for "sdspeed" equal to zero, does an increase of 1 unit in meanspeed change the outcome by x units? Although the conditional effects are statistically significant in this case, I find them conceptually difficult to interpret and to relate intuitively to the response.

      Also, including the conditional effects does not affect the statistical significance of the interaction effect in this unpooled model, but it does in other unpooled models. Should I still keep them?

      Thank you for your guidance, -Behram



      • #4
        It may not influence the significance, but it does strongly affect the coefficient, which is what you ultimately care about. So yes, include the main effects.

        You did not include an extract from your data, so I will use the nlsw88 data instead. Wage is not a count variable, but with the vce(robust) option Poisson regression is an attractive model for this type of variable ( http://blog.stata.com/2011/08/22/use...tell-a-friend/ ). Here we have a similar "impossible" zero value: nobody works 0 hours. To interpret the coefficients it is very helpful to include the irr option. So in this case non-union members can expect their hourly wage to increase by a factor 1.004, or (1.004-1)*100%=0.4%, for every additional hour per week they work. Becoming a union member will increase the hourly wage by 89% if they work 0 hours. That is not very helpful, and that is the problem you are referring to.

        Code:
        . // open some example data
        . sysuse nlsw88, clear
        (NLSW, 1988 extract)
        
        .
        . // estimate the model
        . poisson wage c.hours##i.union ttl_exp grade , irr vce(robust)
        note: you are responsible for interpretation of noncount dep. variable
        
        Iteration 0:   log pseudolikelihood = -4766.5829  
        Iteration 1:   log pseudolikelihood = -4766.5828  
        
        Poisson regression                              Number of obs     =      1,875
                                                        Wald chi2(5)      =    1052.40
                                                        Prob > chi2       =     0.0000
        Log pseudolikelihood = -4766.5828               Pseudo R2         =     0.1199
        
        -------------------------------------------------------------------------------
                      |               Robust
                 wage |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
        --------------+----------------------------------------------------------------
                hours |   1.004248   .0017532     2.43   0.015     1.000818     1.00769
                      |
                union |
               union  |   1.886415   .2393017     5.00   0.000     1.471153    2.418893
                      |
        union#c.hours |
               union  |   .9866461   .0030773    -4.31   0.000     .9806331     .992696
                      |
              ttl_exp |   1.038123   .0024918    15.59   0.000      1.03325    1.043018
                grade |   1.085762   .0048586    18.39   0.000     1.076281    1.095327
                _cons |   1.254774   .1029197     2.77   0.006     1.068435    1.473613
        -------------------------------------------------------------------------------
        I typically solve that by centering my variable so that 0 is a meaningful value within the range of the data. In the case of hours worked per week, it makes sense to choose 40, as that is the standard for full-time employment. You do that by creating a new variable that contains hours - 40. The effect of hours remains unchanged. Only the constant and the main effect of union change. So becoming a union member increases the wage by 10% if one is full-time employed, and this effect of becoming a union member decreases by 1.3% ((0.987 - 1)*100%=-1.3%) for every additional hour one works. Notice that this change in effect is a change in percentages and not percentage points. Also see: http://maartenbuis.nl/publications/interactions.html

        Code:
        . // center variable
        . gen hours_c = hours - 40
        (4 missing values generated)
        
        . label var hours_c "usual hours worked, centered at 40"
        
        .
        . // reestimate the model
        . poisson wage c.hours_c##i.union ttl_exp grade , irr vce(robust)
        note: you are responsible for interpretation of noncount dep. variable
        
        Iteration 0:   log pseudolikelihood = -4766.5829  
        Iteration 1:   log pseudolikelihood = -4766.5828  
        
        Poisson regression                              Number of obs     =      1,875
                                                        Wald chi2(5)      =    1052.40
                                                        Prob > chi2       =     0.0000
        Log pseudolikelihood = -4766.5828               Pseudo R2         =     0.1199
        
        ---------------------------------------------------------------------------------
                        |               Robust
                   wage |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ----------------+----------------------------------------------------------------
                hours_c |   1.004248   .0017532     2.43   0.015     1.000818     1.00769
                        |
                  union |
                 union  |   1.101777    .026129     4.09   0.000     1.051738    1.154198
                        |
        union#c.hours_c |
                 union  |   .9866461   .0030773    -4.31   0.000     .9806331     .992696
                        |
                ttl_exp |   1.038123   .0024918    15.59   0.000      1.03325    1.043018
                  grade |   1.085762   .0048586    18.39   0.000     1.076281    1.095327
                  _cons |   1.486633   .0893659     6.60   0.000     1.321404    1.672523
        ---------------------------------------------------------------------------------
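
        As a cross-check, the centered results can also be recovered from the uncentered model without creating a new variable. This is a sketch added for illustration: it relies on the standard lincom command with its irr option. On the log scale, the union effect at 40 hours is the union coefficient plus 40 times the interaction coefficient, and the constant at 40 hours is _cons plus 40 times the hours coefficient.

```stata
// sketch: recover the centered estimates from the uncentered model
sysuse nlsw88, clear
quietly poisson wage c.hours##i.union ttl_exp grade, vce(robust)

// union IRR at hours = 40: exp(b[1.union] + 40*b[1.union#c.hours])
lincom 1.union + 40*1.union#c.hours, irr

// baseline (constant) at hours = 40: exp(b[_cons] + 40*b[hours])
lincom _cons + 40*hours, irr
```

        The two results should agree with the union and _cons rows of the centered output above (roughly 1.10 and 1.49), since 1.886415 * .9866461^40 and 1.254774 * 1.004248^40 give those values.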
        ---------------------------------
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz
        Germany
        http://www.maartenbuis.nl
        ---------------------------------



        • #5
          Hi Maarten,

          This thread is very informative! I want to run a Poisson model with an interaction term to predict count of psychiatric symptoms, but to complicate matters further, the predictors are count of other (earlier) symptoms, a binary grouping variable (sex), plus an interaction of the two (to ask if earlier psychiatric symptoms have differential predictive effects in males vs. females). Is this possible with Poisson regression? And I assume I shouldn't be specifying robust standard errors if it's true count data (and there aren't any issues with impossible zero values)?

          Many thanks,

          Virginia Carter Leno
