
  • Time-Series Analysis Interpretation with a Binary-by-Continuous Interaction

    Hi All,

    I am trying to determine whether a two-stage intervention at a medical centre affected the number of appointments it was able to book. Specifically, I am interested in whether the regression line predicting the number of appointments changes as a function of which stage of the intervention the clinic was in. I have read through similar threads on Statalist and I *think* I understand the logic of using interaction terms, but I would like to make sure that I'm not misunderstanding my model (or have made some grievous error).

    The variables in my model are:

    avg_apps - the average number of new appointments per month
    month - runs from January 2016 to April 2018 (28 entries)
    int3 - the stage of intervention (0 = pre-intervention (Jan 2016/Jan 2017); 1 = stage 1 of intervention (Feb 2017/Jul 2017); 2 = stage 2 of intervention (Aug 2017/Apr 2018))

    I also include two lagged versions of avg_apps (lagged one and two months, respectively) because they have large zero-order correlations with avg_apps.

    regress avg_apps c.month##i.int3 L1.avg_apps L2.avg_apps
          Source |       SS           df       MS      Number of obs   =        26
    -------------+----------------------------------   F(7, 18)        =      8.99
           Model |  23.0524598         7  3.29320854   Prob > F        =    0.0001
        Residual |   6.5918093        18  .366211628   R-squared       =    0.7776
    -------------+----------------------------------   Adj R-squared   =    0.6912
           Total |  29.6442691        25  1.18577076   Root MSE        =    .60515
    ------------------------------------------------------------------------------
               avg_apps |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -----------------------+------------------------------------------------------
                  month |    -.04377   .0579524    -0.76   0.460    -.1655235    .0779836
                   int3 |
    Feb 2017/July 2017  |   3.109226   2.460196     1.26   0.222    -2.059455    8.277907
     Aug 2017/May 2018  |   6.354542   2.091319     3.04   0.007     1.960845    10.74824
           int3#c.month |
    Feb 2017/July 2017  |  -.1820653    .156301    -1.16   0.259    -.5104416     .146311
     Aug 2017/May 2018  |  -.2822999   .1062098    -2.66   0.016    -.5054384   -.0591614
               avg_apps |
                    L1. |    .400261   .2264945     1.77   0.094    -.0755862    .8761081
                    L2. |  -.1321656   .2767065    -0.48   0.639    -.7135045    .4491732
                  _cons |   3.914906   1.795642     2.18   0.043     .1424022     7.68741
    I used the interaction term to investigate whether the predicted number of appointments per month changed as a function of which stage of the intervention the clinic was in.

    From this model I am concluding that:

    1. There was no difference in the slope of the predicted DV between pre-intervention and Stage 1 of the intervention (B = -.18, p = .259).
    2. There was a difference in the slope of the predicted DV between pre-intervention and Stage 2 of the intervention (B = -.28, p = .016).
    3. Once in Stage 2 of the intervention, moving from one month to an adjacent month is associated with a decline of about .33 appointments (-.044 + -.282 ≈ -.33).
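    In case it helps, the Stage 2 slope in point 3 (and a standard error for it) can be obtained directly with -lincom- rather than by hand, a sketch assuming the model above has just been fit:

    * Stage 2 slope = baseline trend in month + Stage 2 interaction term
    lincom c.month + 2.int3#c.month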

    Is this a reasonable interpretation of the findings?

    Thanks everyone!


  • #2
    By including the lag terms in the model, you are folding the lagged error into the estimation of the current error, so your observations are no longer independent. Consequently, the standard errors (and confidence intervals and p-values) you calculate are not valid. There are ways of handling this kind of situation, but as someone who is really quite unfamiliar with them, I can't advise you how to fix this problem.

    Even apart from the standard error problem, adding in those lags muddies the waters for interpretation. Your coefficients no longer represent estimated differences in the outcome; they are now estimated changes with an adjustment for the two preceding months' observations. I understand that there is large serial correlation in this data, but it may be that when you just regress on c.month alone, the correlation of the residuals is low enough to be unproblematic. That is, perhaps the correlation is simply an artifact of the time trend alone. If that is the case, the inclusion of the lags just introduces complications and is not helpful in any way. I would look into that.

    If after partialing out month you still have a lot of serial correlation, then I suppose you will have to look into one of those special time-series analysis commands that deals with this kind of situation.
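    One concrete way to check this, a sketch assuming the data have been -tsset- on month:

    * Fit the model without the lagged DVs, then test the residuals
    * for serial correlation (both tests require tsset data)
    tsset month
    regress avg_apps c.month##i.int3
    estat dwatson      // Durbin-Watson statistic
    estat bgodfrey     // Breusch-Godfrey LM test for autocorrelation

    If these tests show little residual autocorrelation once the time trend and intervention terms are in the model, the lags can probably be dropped.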

    Leaving that issue aside, your interpretation of the coefficients is generally right, although you are inappropriately reifying statistical significance (a common error since most of us were taught to do that). But since the American Statistical Association has finally caught up with the misinterpretation of p-values we should get with the program and start using them appropriately and interpreting them correctly. In particular, a non-statistically significant result does not mean there is no effect. It is better to state the estimated effect and its confidence interval, and comment that the data and analysis did not permit a sufficiently precise determination of the effect to exclude the possibility that the effect is even in the opposite direction.

    The next thing is that you have modeled this in such a way as to allow for both a jump in the level of appointments at each phase of the intervention and a change in the rate at which the number of appointments per month is trending over time. You can't really look at just one effect and not the others. It could really be quite complicated. I suggest that you run

    local first = tm(2016m1)
    local last = tm(2018m4)
    margins int3, at(month = (`first' (1) `last'))
    so that you can literally see what your model is telling you here.
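    To see the predictions graphically rather than as a table, -marginsplot- can be run immediately after -margins-:

    * Plot the predicted values produced by the preceding -margins- call
    marginsplot, xdimension(month)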

    By the way, you say you have 28 months of data but there are only 26 observations in your regression output. Why is that? Some missing values in some variables perhaps?

    It may be that you did not really intend to include the possibility of jumps at the onset of each phase of the intervention but just expected that the time trend would bend. If you don't want jumps, then the way to go here is with -mkspline- to set up new variables to replace month, and interact int3 with those in your regression.
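    A sketch of that no-jump version; the knot values are an assumption that month is coded as a Stata %tm monthly date, so that 685 = tm(2017m2) and 691 = tm(2017m8) mark the onset of each phase:

    * Linear splines in month with knots at the start of each phase
    mkspline mon1 685 mon2 691 mon3 = month
    regress avg_apps mon1 mon2 mon3

    This fits a continuous piecewise-linear trend whose slope can change at each phase onset but whose level cannot jump there.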


    • #3
      Hi Clyde,

      Thank-you for your detailed reply!

      RE: Independence: Would using <vce(robust)> fix the issue with the loss of independence of the errors? I know that Stata automatically incorporates robust errors with pweights b/c of the presumed effect of clustering on independence.

      RE: Interpretation with lagged variables and serial correlation: Autocorrelation is significant. Would using <vce(robust)> address the issue with autocorrelation as well? You seemed to indicate that it's detectable through the correlation of residuals, which is a specific form of heteroskedasticity.

      RE: Reification: Ha - yes, I was reifying, but it was meant as shorthand, not as a declarative statement about the absence of an effect.

      RE: Looking at jumps: Yes, I would be including differences in intercepts as well as slopes. This was less intuitive from the data b/c the intercept differences were being compared at month = 0; I'll be using <margins> to investigate differences at different points. My major concern was that using lagged data changed how one would typically interpret coefficients (which I am relieved to see isn't the case).

      RE: 28 vs. 26 observations: I'm guessing it has to do with how <regress> handles missing data. B/c I used L. prefixes, the first two observations were eliminated (b/c there wasn't sufficient prior data to compute the lags).




      • #4
        I'm not sure, but I don't think the use of robust standard errors is enough here. Robust standard errors are robust to heteroskedasticity, but the problem we are dealing with here is autocorrelation. There are cluster-robust standard errors that are also robust to within-cluster correlation of errors, but in this case the correlation of errors is not within groups but is across all observations. I don't think that robust standard errors correct for that.
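        For what it's worth, one standard option for serially correlated errors in a single time series is Newey-West (HAC) standard errors, a sketch assuming the data are -tsset- on month; the lag(2) choice is an assumption, not a recommendation:

        * Newey-West standard errors, robust to heteroskedasticity and to
        * autocorrelation up to two months (requires tsset data)
        tsset month
        newey avg_apps c.month##i.int3, lag(2)

        The point estimates are identical to -regress-; only the standard errors change.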

        You're right about the 28 vs 26 observations: the lagged predictors will be missing in the first two observations, and so they will be excluded from the estimation sample. Sorry, I should have seen that from the start.