
  • Interpreting a difference-in-differences model

    Hi everyone,

    I have questions about how to interpret the results of a difference-in-differences model. Specifically, I’m having difficulty understanding what is being captured by the time fixed effect, the group fixed effect, and the DD estimator. I’m also wondering about the implications of including or excluding the year dummies (i.e., time fixed effect).

    I have scanned quite a lot of lecture notes and posts on online forums, and I’m running across sometimes conflicting instructions on how to specify the model.


    I have data on the prices of roughly 150 products between 2011 and 2016; there is one observation per product per year (approx. 900 observations).

    Without going into much detail, a new pricing policy was introduced in 2016 which was meant to put pressure on companies to drop their prices; this particular policy only applied to 33 of the products (~10% of total sample). Thus, I have 5 pre-treatment observations and 1 post-treatment observation for each product.

    (I understand that the lack of post-intervention data is a limitation: I’m only able to estimate the impact of the policy change on the shift in the mean price levels between the treatment and control groups, not on any change in the price trend.)

    I ran the following model

    xtset product_id year
    xtreg price_ln i.treatment i.treatment##i.post i.year i.generic, vce(cluster product_id) fe
    where “price_ln” is the natural logarithm of the inflation-adjusted prices (outcome variable), “treatment” is an indicator variable for those products subjected to the new policy (0 in control, 1 in treatment), “post” is an indicator variable for the post-treatment period (0 in pre-intervention years, 1 in post-intervention year), and “year” is a dummy variable for each of the years (i.e., time fixed effect); the fe is the product fixed effect for each of the “groups” (i.e., medicines) in the analysis.
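    For reference, the specification described above can be written as a two-way fixed-effects equation. This is only a sketch: the symbols (alpha_i, lambda_t, delta, gamma) are notation introduced here for illustration and do not appear in the Stata output.

```latex
\ln(\text{price}_{it}) = \alpha_i + \lambda_t
  + \delta\,(\text{treatment}_i \times \text{post}_t)
  + \gamma\,\text{generic}_{it} + \varepsilon_{it}
```

    Here alpha_i is the product fixed effect (it absorbs the time-invariant treatment indicator), lambda_t is the year effect (it absorbs post, which equals the 2016 dummy), and delta is the DD coefficient reported as treatment#post.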

    I also added the covariate “generic” to control for compositional changes in the two groups. Because medicines can go off-patent at various times, you can end up with an unbalanced sample if you do not control for the off-patent status of a drug (i.e., whether it is still patent-protected or available in generic form). I think it is important to control for any relevant characteristic that can change over time at different rates between the treatment and control groups and is likely to affect prices. So, in my model I controlled for the generic status of medicines (variable “generic”).

    As a precaution, I clustered the standard errors to account for serial autocorrelation. Positive serial correlation is a potential issue with repeated observations of drug prices: if the price of a product is well above (or below) average in one year then it’s plausible that the price of that product will again be higher (or lower) than average in the next year.

    Finally, because I needed to control for the generic status of a drug, I was not able to plot the mean (or median) prices over time to visually inspect the parallel trends assumption. Instead, I ran the following model to test the assumption

    xtreg price_ln i.treatment##ib2015.year i.generic, vce(cluster product_id) fe
    in which I interacted the treatment indicator with the year dummies (with the last pre-intervention year, 2015, as the base). As expected, none of the interaction coefficients for the years preceding the intervention was statistically significant (so the pre-period differences are not distinguishable from 0), while the 2016 interaction was highly significant.
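    Written out, the check above corresponds to the following specification and null hypothesis. This is a sketch in notation introduced here (beta_k is the treatment-by-year interaction for year k, with 2015 as the omitted base); it is not taken from the output.

```latex
\ln(\text{price}_{it}) = \alpha_i + \lambda_t
  + \sum_{k \neq 2015} \beta_k\,(\text{treatment}_i \times \mathbf{1}[t = k])
  + \gamma\,\text{generic}_{it} + \varepsilon_{it},
\qquad
H_0:\ \beta_{2011} = \beta_{2012} = \beta_{2013} = \beta_{2014} = 0
```

    Failing to reject H_0 is consistent with parallel pre-trends but does not prove them; a joint test of the four pre-period terms is usually more informative than four separate t-tests.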

    • 1) Is it okay that the diff-in-diff model drops both the 2016 year dummy and the “treatment” variable, along with one omitted base year?

    It drops the treatment variable because it remains constant throughout the six years for each medicine, so I think it gets absorbed into the fixed effect by product_id. And the 2016 dummy is perfectly collinear with the “post” indicator.

    I am still able to interpret the "treatment*post" coefficient, but I am worried about one of the main effects ("treatment") dropping from the model.
    • 2) What is captured by the different fixed effects and variables?

    I’m having trouble separating what is being captured by the “generic”, “year”, and “treatment*post” coefficients. Here is my current understanding:

    I think the diff-in-diff estimator ("treatment*post") shows the size of the drop in price due to the policy (if all assumptions of the model hold). For example, a DD ("treatment*post") coefficient of -0.232 would indicate that the policy was associated with an estimated 21% (1-exp(-0.232)) reduction in price, controlling for the other variables in the model (year and product fixed effects, generic status).
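    The conversion used in this example can be stated as a general formula for a log outcome; delta-hat is notation introduced here for the estimated treatment*post coefficient.

```latex
\%\Delta\,\text{price} = 100\,\bigl(e^{\hat\delta} - 1\bigr),
\qquad
\hat\delta = -0.232 \;\Rightarrow\; 100\,\bigl(e^{-0.232} - 1\bigr) \approx -20.7\%
```

    The sign comes out of the formula directly, so the same expression works for positive and negative coefficients.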

    The "generic" variable captures the effect of a drug going off patent during the study period; if a product was available as a generic from the start of the study, or remained patent-protected the whole time, then the variable remains constant (0 or 1) throughout the period and is absorbed by the product fixed effect. The product fixed effects (the fe option), in turn, control for any characteristics of individual products, in both treatment and control arms, that affect price levels but do not vary over time. This includes characteristics such as the therapeutic value of a medicine, strength, etc. I think this is important to get more precise diff-in-diff estimates. Is the product fixed effect accounted for by the intercept?

    I believe the year dummies control for any shocks in a given year that affect prices in both the control and treatment arms. But since the 2016 dummy gets dropped in the intervention year, I don’t see how they help in the estimation of the DD coefficient. Is it just to make the estimate more precise (lower the standard error)? I don't have the statistical concepts fully clear in my head here, so it would be really helpful if someone could explain this, perhaps with a hypothetical example.
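    One way to explore what the year dummies contribute is a small simulation. The sketch below uses made-up data, not the dataset from this post: the sample sizes mirror the description above, while the seed, the size of the yearly shock, the noise level, and the assumed true effect of -0.25 are all arbitrary choices for illustration.

```stata
* Simulated panel (hypothetical, not the poster's data): 150 products, 2011-2016
clear
set seed 2018
set obs 150
generate product_id = _n
generate byte treatment = product_id <= 33        // 33 treated products
generate alpha = rnormal(3, 1.5)                  // time-invariant product effect
expand 6                                          // six yearly rows per product
bysort product_id: generate year = 2010 + _n
generate byte post = year == 2016
* Year shock common to both arms, plus an assumed true policy effect of -0.25
generate shock = -0.10*(year - 2011)
generate price_ln = alpha + shock - 0.25*treatment*post + rnormal(0, 0.3)
xtset product_id year

* Product fixed effects only: pre-period year shocks are left in the error term
xtreg price_ln i.treatment##i.post, fe vce(cluster product_id)
scalar b_noyear  = _b[1.treatment#1.post]
scalar se_noyear = _se[1.treatment#1.post]

* Product and year fixed effects: common shocks go to the year dummies
xtreg price_ln i.treatment##i.post i.year, fe vce(cluster product_id)
scalar b_year  = _b[1.treatment#1.post]
scalar se_year = _se[1.treatment#1.post]

display "DD, no year dummies:   " b_noyear " (SE " se_noyear ")"
display "DD, with year dummies: " b_year " (SE " se_year ")"
```

    In a balanced panel like this one the two DD coefficients coincide: once post is in the model, treatment#post is orthogonal to any shock shared by both arms, so the year dummies change only what is left in the residual, and therefore the standard error, not the point estimate.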

    Indeed, the following three models all give me the same DD coefficient, but the standard error increases with each model. (Note that I dropped the "generic" variable because I was experimenting with the fixed effects just to try to unpack the model.) All models are significant at the 1% level. Clustering the standard errors of course results in a higher standard error, but I am a bit confused about what the year fixed effect and the product fixed effect (fe) are doing here.

    xtreg price_ln i.treatment##i.post
    xtreg price_ln i.treatment##i.post i.year
    xtreg price_ln i.treatment##i.post i.year, vce(cluster id_new) fe
    Note that if I run the following models with the "generic" variable, the DD coefficient varies a bit (ranging from -.412 to -.447) depending on whether or not I include the "year" dummies and fixed effects. All models are significant at the 1% level.

    xtreg price_ln i.treatment##i.post i.generic, vce(cluster product_id) fe
    xtreg price_ln i.treatment##i.post i.generic i.year, vce(cluster product_id)
    xtreg price_ln i.treatment##i.post i.generic i.year, vce(cluster product_id) fe
    Finally, how does the DD estimator differ from running the following model and interpreting the “treatment*year” coefficient for the 2016 dummy? This is essentially what I did to check the common trends assumption.

    xtreg price_ln i.treatment##ib2015.year i.generic, vce(cluster product_id) fe
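    The two estimators can be compared through group means. The sketch below assumes a balanced panel with no covariates, and uses notation introduced here: y-bar_{g,k} is the mean log price of group g (T for treated, C for control) in year k, and beta-hat_k is the treatment-by-year coefficient with 2015 as the base.

```latex
\hat\beta_k = (\bar y_{T,k} - \bar y_{T,2015}) - (\bar y_{C,k} - \bar y_{C,2015}),
\qquad
\hat\delta_{DD} = \hat\beta_{2016} - \frac{1}{5}\sum_{k=2011}^{2015}\hat\beta_k
\quad (\hat\beta_{2015} \equiv 0)
```

    So the 2016 interaction compares 2016 with 2015 only, whereas the DD coefficient compares 2016 with the average of all five pre-policy years. With the "generic" covariate in the model the relationship holds only approximately.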
    • 3) How should I present the log-transformed coefficients?

    I just want to confirm that my interpretation of the coefficient is correct: I should exponentiate the coefficient (since the outcome variable is expressed as a natural logarithm) and then subtract the result from one (since all the coefficients were negative) to calculate the associated % reduction in price. (See the example above under question 2.)
    Last edited by Jonathan Vasilieri; 09 Nov 2018, 09:31.

  • #2
    I have included sample output --- perhaps it's easier to describe the different effects using actual figures.

    . eststo: xtreg price_ln i.treatment i.treatment##i.post i.generic i.year, vce(cluster product_id) fe
    note: 1.treatment omitted because of collinearity
    note: 2016.year omitted because of collinearity
    Fixed-effects (within) regression               Number of obs     =        912
    Group variable: id_new                          Number of groups  =        152

    R-sq:                                           Obs per group:
         within  = 0.4822                                         min =          6
         between = 0.2434                                         avg =        6.0
         overall = 0.2213                                         max =          6

                                                    F(7,134)          =      24.53
    corr(u_i, Xb)  = 0.2038                         Prob > F          =     0.0000

                                 (Std. Err. adjusted for 152 clusters in product_id)
    --------------------------------------------------------------------------------
                   |               Robust
          price_ln |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------+----------------------------------------------------------------
       1.treatment |          0  (omitted)
            1.post |  -.5281149   .0619012    -8.53   0.000    -.6505447   -.4056852
                   |
    treatment#post |
              1 1  |  -.4514453   .1516088    -2.98   0.003     -.751301   -.1515896
                   |
         1.generic |   .9848219   .1694969     5.81   0.000     .6495866    1.320057
                   |
              year |
             2012  |  -.0592672   .0496664    -1.19   0.235    -.1574986    .0389642
             2013  |  -.2449903   .0468741    -5.23   0.000    -.3376992   -.1522814
             2014  |  -.3440336   .0474112    -7.26   0.000    -.4378046   -.2502625
             2015  |  -.4732512   .0535405    -8.84   0.000    -.5791449   -.3673574
             2016  |          0  (omitted)
                   |
             _cons |   3.617452   .0548612    65.94   0.000     3.508946    3.725958
    ---------------+----------------------------------------------------------------
           sigma_u |  1.4486758
           sigma_e |  .43274924
               rho |   .9180764   (fraction of variance due to u_i)
    --------------------------------------------------------------------------------
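    As a worked version of the exponentiation question, the treatment#post coefficient and its confidence limits from the output above can be converted to percent changes. This is a sketch: the scalar names are made up for illustration, and transforming the interval endpoints is one common convention rather than the only option.

```stata
* Convert the reported treatment#post coefficient and its 95% CI to percent changes
scalar b_dd  = -.4514453
scalar ci_lo = -.751301
scalar ci_hi = -.1515896
scalar pct    = 100*(exp(b_dd)  - 1)
scalar pct_lo = 100*(exp(ci_lo) - 1)
scalar pct_hi = 100*(exp(ci_hi) - 1)
display "Estimated change: " %5.1f pct "%  (95% CI " %5.1f pct_lo "% to " %5.1f pct_hi "%)"
```

    This gives a price change of roughly -36%, with an interval of roughly -53% to -14%.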


    • #3
      Hello, sorry to bump this thread. Just wondering if anyone had any thoughts? I'm especially confused about the difference between running these three models

      * model 1
      xtreg price_ln i.treatment i.treatment##i.post i.year i.generic, vce(cluster product_id) fe
      * model 2
      xtreg price_ln i.treatment i.treatment##i.post i.generic, vce(cluster product_id) fe
      * model 3
      xtreg price_ln i.treatment##ib2015.year i.generic, vce(cluster product_id) fe
      Thank you very much.


      • #4
        I don't recall seeing this thread before--it is a type I usually respond to, and I was active on the list on that date. Somehow I missed it.

        Well, you've asked a lot of questions, and I'm not sure where to begin. Let me answer the questions you raise in #3. Perhaps you have already resolved those in #1 to your satisfaction.

        Both models 1 and 2 are redundantly specified. The separate mention of i.treatment as the first predictor is unnecessary because the ## operator also generates it. (## is not the same as #. Read -help fvvarlist-.) That also explains why i.treatment gets dropped in the output from these two models: it is not just collinear with the fixed effect, it is also implicitly mentioned twice in the command, so even without fixed effects, one of the two mentions would have to disappear. Now, even if it were not redundant, it would still be dropped in a fixed-effects model due to collinearity with the fixed effects. And that is not a problem because the information carried by the i.treatment variable is also carried in the fixed effects, so nothing is lost.
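        To see the expansion described above, -fvexpand- lists the terms that a factor-variable expression generates. The sketch below uses a made-up four-row dataset, entered only so that the two indicator variables exist; it is not the data from this thread.

```stata
* Toy data (hypothetical) with the two indicators, just so the terms can be expanded
clear
input treatment post
0 0
0 1
1 0
1 1
end
* ## expands to both main effects plus the interaction
fvexpand i.treatment##i.post
local terms `r(varlist)'
display "`terms'"
```

        The list includes the treatment main effect, the post main effect, and the treatment#post interaction, which is why a separate i.treatment in front of i.treatment##i.post adds nothing.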

        The difference between models 1 and 2 is that you have added year effects in model 1 that do not appear in model 2. This enables you to re-allocate any effects that are constant across all products in any given year to the i.year variables and remove them from the error term. This generally results in sharper estimates of the effects of interest.

        There are two differences between model 1 and model 3, one of which is illusory, and the other is real. The omission of the i.treatment term has no actual effect, because the interaction term i.treatment##ib2015.year causes Stata to create the i.treatment term anyway. (In model 1, as already noted, it was redundant.) The real difference between models 1 and 3 is the way in which time is modeled. In model 1, prices bounce up and down from one year to the next, and this is factored out of the model, but the difference in price changes between products affected and unaffected by the policy change is assumed to be the same in all years of the pre-policy period, and again the same in all post-policy years (though presumably different pre-policy from post-policy). In model 3, the difference (actually, ratio) in price between policy-affected and policy-unaffected products is modeled as being different in every single year.

        I hope this helps.