  • How to Implement Difference-in-Differences With 3 Treatment Groups in Stata?

    Dear Sirs and Madams,

    I am struggling with something for which it is surprisingly hard to find information. For my thesis, I would like to perform a diff-in-diff methodology with 3 treatment groups and 1 control group. We were taught the 'traditional' setup where you would have 1 treatment and 1 control group. The regression equation would look like this:

    Y = β0 + β1 Treatment + β2 Time + β3 (Treatment * Time) + e_it

    Here, the intercept would be β0, the main effects would be β1 and β2, and the regressor of interest would be β3. My question to you would be how I could extend this to incorporate 3 treatment groups and 1 control group in Stata. I have an unbalanced panel dataset with annual frequency data, in which I have firms of 4 sizes: Micro, Small, Medium and Large. The 3 treatment groups are Micro, Small and Medium, which I want to compare to 1 control group; Large. Perhaps the following extension would work, where I include 3 treatment main effects and 3 interactions:

    Y = β0 + β1 Micro + β2 Small + β3 Medium + β4 Time + β5 (Micro * Time) + β6 (Small * Time) + β7 (Medium * Time) + e_it

    Unfortunately I cannot find information anywhere verifying this intuition. As such, I have come to you. I have come to this forum earlier as a visitor with econometric questions for prior courses, and I know that the people here are very knowledgeable. Thanks for reading and I look forward to your answers.

  • #2
    Your intuition is good. The only thing I would suggest is that there is a better way to code this in Stata. Instead of having indicator ("dummy") variables for Micro, Small, and Medium, create a single variable with 4 levels: 0 = Large, 1 = Medium, 2 = Small, and 3 = Micro. Let's call this variable size. Then do it this way:
    Code:
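    * Declare the panel: firm_id identifies firms, time is the period variable
    xtset firm_id time
    * Fixed-effects model; ## gives main effects plus the size-by-time interaction
    xtreg Y i.size##i.time, fe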
    You don't describe the data design enough for me to infer how you are representing time. I'm assuming that Time is a dichotomous variable, 0 = pre-intervention, 1 = post-intervention.

    The DID estimator of the treatment effects of Medium, Small, and Micro (relative to Large = control) will be given by the coefficients of 1.size#1.time, 2.size#1.time, and 3.size#1.time, respectively.

    If you are not familiar with the notation here involving i. and # and ##, do read up on Stata's factor-variable notation: -help fvvarlist-.
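A minimal sketch of the recoding and estimation steps described above, under several assumptions that are not in the thread: the existing indicators are named micro, small, and medium; the 0/1 pre/post indicator is named post rather than time; the calendar variable is named year; and standard errors are clustered by firm, which is a common choice rather than something the thread prescribes.

```stata
* Sketch only: micro, small, medium, post, year are assumed variable names.
generate byte size = 0                  // 0 = Large (control)
replace size = 1 if medium == 1
replace size = 2 if small == 1
replace size = 3 if micro == 1
label define sizelbl 0 "Large" 1 "Medium" 2 "Small" 3 "Micro"
label values size sizelbl

xtset firm_id year
* vce(cluster firm_id) is an optional assumption, not part of the original code
xtreg Y i.size##i.post, fe vce(cluster firm_id)

* DID estimates relative to Large
lincom 1.size#1.post                    // Medium
lincom 2.size#1.post                    // Small
lincom 3.size#1.post                    // Micro

* Joint test that all three DID effects are zero
testparm i.size#i.post
```

If a firm's size class never changes, the main effects of size are collinear with the firm fixed effects and Stata will omit them; the interaction coefficients are still estimated.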


    • #3
      Dear Clyde,

      Thanks for the quick and clear reply. I have further developed my thesis in the last few days, and I have stumbled upon 3 additional questions which I would like to ask. Before I do so, however, I will provide a bit more information about my study. I am comparing sales growth of the 3 treatment groups (Micro, Small and Medium) to the single control group (Large). I do so before economic crises (pre-period) and after economic crises (post-period), because I am interested in examining whether small firms are more sensitive to economic cycles than large firms.

      In the selection of pre- and post-periods surrounding the economic crises, I can opt for a very wide range (such as 5 years before and after) or a very narrow range (e.g., -1 / +1). Initially I thought that I should use as many years as possible because working with more observations increases the statistical significance of my results.
      1. However, I did some research and in the diff-in-diff model with more than 2 time periods, you are making the following comparison: (Y_treatment,after - Y_treatment,before) - (Y_control,after - Y_control,before). Is it true that I am comparing the average sales growth rates across the whole pre- and post-periods? So for example, if the COVID-19 crisis occurs in 2020, and I use 3 years before and after, do I compare the average scores from 2017-2019 to those from 2020-2022?
      2. Is it true that if I use a narrower range as opposed to a wider range, that I lose observations? I probably lose observations for the treatment and time variables, but what about control variables?
      3. Suppose that you have a 30-year sample period, but that the cumulative pre- and post-lengths are only 10 years. What would the purpose be of the remaining 20 years?
      These were my 3 questions. I hope they are reasonably clear. It isn't easy to describe this concisely yet clearly. Thanks for reading and I am looking forward to your answer.


      • #4
        I can only give you vague, general answers to your questions. Ultimately there are two conflicting forces directing how many years of data to include in each period. On the one hand, you do lose observations, and therefore decrease the precision of the estimates you get, when you use fewer years of data. (The fact that you have data on some variables for those years doesn't matter, because only complete cases get included in the analysis.) On the other hand, if you include too many years of data, you run into the possibility that other external factors will alter the outcome trends either disrupting parallelism in the pre-"treatment" period, or even introducing non-linearity into the outcome:time relationship. So you have to make a tradeoff. (Sometimes it will turn out that there is no acceptable tradeoff, in which case your research question cannot be answered from that data set.)
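One way to see the observation side of this tradeoff is to re-run the model over several window widths and compare the estimation sample sizes. The following is only a sketch: the variable names year, firm_id, Y, and size, the crisis year of 2020, and the half-widths 1, 3, and 5 are all illustrative assumptions.

```stata
* Sketch only: variable names, crisis year, and half-widths are assumed.
local crisis_year = 2020
foreach w of numlist 1 3 5 {
    preserve
    * Keep w years before the crisis and w years from the crisis year onward
    keep if inrange(year, `crisis_year' - `w', `crisis_year' + `w' - 1)
    generate byte post = year >= `crisis_year'
    xtset firm_id year
    quietly xtreg Y i.size##i.post, fe
    * e(N) counts complete cases actually used; e(N_g) counts firms
    display "half-width `w': N = " e(N) ", firms = " e(N_g)
    restore
}
```

With a half-width of 3 this keeps 2017-2019 as the pre-period and 2020-2022 as the post-period. Because e(N) reflects only complete cases, it also shows how missing values in any model variable shrink the sample.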

        But I can't advise you how many years gives you the "sweet spot." At best, you might seek advice from somebody who is an expert in your field. (I'm an epidemiologist, so I have only lay knowledge in this area.) Or you may just have to explore the data before proceeding with your analysis to determine the time windows over which you have parallel trends prior to treatment and linearity in each period and then see whether those windows are large enough to sustain a sufficiently precise analysis.
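The exploratory check on pre-treatment trends could look roughly like the sketch below. It assumes the data are already restricted to a candidate window and that year, post, size, firm_id, and Y exist under those names, none of which is confirmed in the thread. A non-significant joint test is consistent with parallel pre-trends but does not establish them.

```stata
* Sketch only: all variable names are assumptions.
xtset firm_id year

* Pre-period only: do the size groups follow different year-to-year paths?
xtreg Y i.size##i.year if post == 0, fe
testparm i.size#i.year

* Visual check: mean outcome by size group and year
preserve
collapse (mean) Y, by(size year)
twoway (line Y year if size == 0) ///
       (line Y year if size == 1) ///
       (line Y year if size == 2) ///
       (line Y year if size == 3), ///
       legend(order(1 "Large" 2 "Medium" 3 "Small" 4 "Micro"))
restore
```

Repeating this for wider and narrower windows gives a rough picture of how far back the groups move together before the crisis.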
