Hi everyone,
I have questions about how to interpret the results of a difference-in-differences model. Specifically, I’m having difficulty understanding what is being captured by the time fixed effect, the group fixed effect, and the DD estimator. I’m also wondering about the implications of including or excluding the year dummies (i.e., time fixed effect).
I have read through quite a lot of lecture notes and forum posts, and I keep running across conflicting instructions on how to specify the model.
Background
I have data on the prices of roughly 150 products between 2011 and 2016; there is one observation per product per year (approx. 900 observations).
Without going into much detail, a new pricing policy was introduced in 2016 to put pressure on companies to drop their prices; the policy applied to only 33 of the products (~22% of the roughly 150 products in the sample). Thus, I have 5 pre-treatment observations and 1 post-treatment observation for each product.
(I understand that the lack of post-intervention data is a limitation: I’m only able to estimate the impact of the policy change on the shift in the mean price levels between the treatment and control groups, not on any change in the price trend.)
I ran the following model:
Code:
* declare the panel: product_id is the unit, year is the time variable
xtset product_id year
* product fixed effects (fe), year dummies, standard errors clustered by product
xtreg price_ln i.treatment i.treatment##i.post i.year i.generic, vce(cluster product_id) fe
where “price_ln” is the natural logarithm of the inflation-adjusted price (the outcome variable), “treatment” is an indicator for products subject to the new policy (0 in control, 1 in treatment), “post” is an indicator for the post-treatment period (0 in the pre-intervention years, 1 in the post-intervention year), and “year” is a set of dummy variables, one per year (i.e., the time fixed effect); the fe option adds a fixed effect for each “group” (i.e., each medicine) in the analysis.
I also added the covariate “generic” to control for compositional changes in the two groups. Because medicines can go off-patent at various times, you can end up with an unbalanced sample if you do not control for the off-patent status of a drug (i.e., whether it is still patent-protected or available in generic form). I think it is important to control for any relevant characteristic that can change over time at different rates in the treatment and control groups and is likely to affect prices. So, in my model I controlled for the generic status of medicines (variable “generic”).
As a precaution, I clustered the standard errors by product to account for serial correlation. Positive serial correlation is a potential issue with repeated observations of drug prices: if the price of a product is well above (or below) average in one year, then it is plausible that its price will again be above (or below) average in the next year.
Finally, because I needed to control for the generic status of a drug, I was not able to plot the mean (or median) prices over time to visually inspect the parallel trends assumption. Instead, I ran the following model to test the assumption:
Code:
xtreg price_ln i.treatment##ib2015.year i.generic, vce(cluster product_id) fe
in which I interacted the treatment indicator with the year dummies (with the last pre-intervention year, 2015, as the base). As expected, none of the interaction coefficients for the pre-intervention years was statistically significant (i.e., none was distinguishable from zero), whereas the 2016 interaction was highly significant.
Questions
- 1) Is it okay that the diff-in-diff model drops both the 2016 year dummy and the “treatment” variable, along with one omitted base year?
Stata drops the treatment variable because it is constant across the six years for each medicine, so I think it gets absorbed into the product_id fixed effect. And the 2016 dummy is perfectly collinear with the “post” indicator.
I am still able to interpret the "treatment*post" coefficient, but I am worried about one of the main effects ("treatment") dropping from the model.
- 2) What is captured by the different fixed effects and variables?
I’m having trouble separating what is being captured by the “generic”, “year”, and “treatment*post” coefficients. Here is my current understanding:
I think the diff-in-diff estimator ("treatment*post") shows the size of the drop in price due to the policy (if all assumptions of the model hold). For example, a DD ("treatment*post") coefficient of -0.232 would indicate that the policy was associated with an estimated 21% (1-exp(-0.232)) reduction in price, controlling for the other variables in the model (year and product fixed effects, generic status).
The "generic" variable captures the effect of a drug going off patent during the study period; if a product was available as a generic from the start of the study, or remained patent-protected the whole time, then the variable remains constant (0 or 1) throughout the period and would be absorbed by the product fixed effect. The product fixed effects (the fe option), in turn, control for any characteristics of individual units, in both the treatment and control arms, that affect price levels but do not vary over time. This includes characteristics such as the therapeutic value of a medicine, its strength, etc. I think this is important for getting more precise diff-in-diff estimates. Is the product fixed effect accounted for by the intercept?
I believe the year dummies control for any shocks in a given year that affect prices in both the control and treatment arms. But since the year dummy gets dropped in the intervention year, I don’t see how it helps in the estimation of the DD coefficient. Is it just there to make the estimate more precise (i.e., to lower the standard error)? If so, I don't fully understand why that would be the case. It would be really helpful if someone could explain this, perhaps with a hypothetical example.
Indeed, the following three models all give me the same DD coefficient, but the standard error increases with each model. (Note that I dropped the "generic" variable because I was experimenting with the fixed effects to try to unpack the model.) All models are significant at the 1% level. Clustering the standard errors of course results in a higher standard error, but I am a bit confused about what the year fixed effect and the product fixed effect (fe) are doing here.
Code:
xtreg price_ln i.treatment##i.post
xtreg price_ln i.treatment##i.post i.year
xtreg price_ln i.treatment##i.post i.year, vce(cluster id_new) fe
Note that if I run the following models with the "generic" variable, the DD coefficient varies a bit (from -0.412 to -0.447) depending on whether or not I include the "year" dummies and the fixed effects. All models are significant at the 1% level.
Code:
xtreg price_ln i.treatment##i.post i.generic, vce(cluster product_id) fe
xtreg price_ln i.treatment##i.post i.generic i.year, vce(cluster product_id)
xtreg price_ln i.treatment##i.post i.generic i.year, vce(cluster product_id) fe
Finally, how does the DD estimator differ from running the following model and interpreting the “treatment*year” coefficient for the 2016 dummy? This is essentially what I did to check the common trends assumption.
Code:
xtreg price_ln i.treatment##ib2015.year i.generic, vce(cluster product_id) fe
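To make the comparison concrete, here is how I would write the two specifications. The notation is my own (\alpha_i for the product fixed effect, \lambda_t for the year dummies, \delta and \beta_k for the interaction coefficients, \gamma for the generic-status coefficient), so please treat it as a sketch of my understanding rather than a definitive statement of what xtreg estimates:

```latex
% DD specification (treatment x post); the treatment main effect is absorbed by \alpha_i
\[
\ln(\text{price}_{it}) = \alpha_i + \lambda_t
  + \delta\,(\text{treatment}_i \times \text{post}_t)
  + \gamma\,\text{generic}_{it} + \varepsilon_{it}
\]

% Year-by-year specification (treatment x year), with 2015 as the omitted base year
\[
\ln(\text{price}_{it}) = \alpha_i + \lambda_t
  + \sum_{k \neq 2015} \beta_k\,\bigl(\text{treatment}_i \times \mathbf{1}[t = k]\bigr)
  + \gamma\,\text{generic}_{it} + \varepsilon_{it}
\]
```

If I have this right, \beta_{2016} compares the 2016 treatment-control gap with the gap in 2015 only, whereas \delta compares it with the gap over the whole 2011-2015 period, so the two need not coincide. Is that the correct way to think about it?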
- 3) How should I present the log-transformed coefficients?
I just want to confirm that my interpretation of the coefficient is correct: I should exponentiate the coefficient (since the outcome variable is expressed as a natural logarithm) and then subtract the result from one (since all the coefficients were negative) to calculate the associated % reduction in price. (See the example above under question 2.)
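For concreteness, this is the arithmetic I have in mind, using the example coefficient of -0.232 mentioned earlier (the symbol \hat{\delta} for the estimated "treatment*post" coefficient is my own notation, and the intermediate value is rounded to three decimals):

```latex
\[
\%\Delta\,\text{price}
  = 100 \times \bigl(e^{\hat{\delta}} - 1\bigr)
  = 100 \times \bigl(e^{-0.232} - 1\bigr)
  \approx 100 \times (0.793 - 1)
  \approx -20.7\%
\]
```

I would report this as a reduction of roughly 21%. Reading the coefficient directly as a 23.2% drop would slightly overstate the effect, because 100 times the coefficient only approximates the percentage change when the coefficient is small.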