Staggered Difference in Difference: How to properly regress

matthew green

Join Date: May 2019
Posts: 3

11 May 2019, 23:14

Hey everyone, I'm completely new to Stata and have no clue how to run my staggered difference-in-differences regression here. My goal is to find out whether crime rates increase in a city after a sports stadium is built there. I gathered my data and came up with something like this:

City with Treat	Y1	Period	CR1	Treatment1	City without Treat	Y2	Period2	CR2	Treatment2
Denver	2000	0	4692.5	0	Sacramento	2000	0	4636.8	0
Denver	2001	1	4273.5	1	Sacramento	2001	1	4210	0
Milwaukee	2000	0	4626.7	0	Norfolk	2000	0	4535.2	0
Milwaukee	2001	1	4539.3	1	Norfolk	2001	1	3737	0
Pittsburgh	2000	0	2751.4	0	San Jose	2000	0	2776.8	0
Pittsburgh	2001	1	2598.5	1	San Jose	2001	1	2628.6	0
Detroit	2001	0	4686.5	0	Chicago	2001	0	5046.3	0
Detroit	2002	1	4297.8	1	Chicago	2002	1	5132	0
Foxborough	2001	0	760	0	Weymouth	2001	0	900	0
Foxborough	2002	1	1267	1	Weymouth	2002	1	1211	0
Houston	2001	0	5046.3	0	Boston	2001	0	5072	0
Houston	2002	1	5505.4	1	Boston	2002	1	5361	0
Seattle	2001	0	5221.1	0	Baltimore	2001	0	5565.9	0
Seattle	2002	1	5219.4	1	Baltimore	2002	1	5124.3	0

As the control group I am using cities with similar population and crime levels that did not build a stadium; the treatment group is the cities that built a stadium. The time period is 2000 - 2016, and I have 69 control and treatment variables. Period 0 is the year before a stadium is built and period 1 is the year it is built.

Can someone help me out with how to write the code to run this or give me some pointers on what to do? Anything would be appreciated.

Thanks!

Tags: data, regression, Suggestion, syntax, Time Series

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

12 May 2019, 13:39

Please explain your data organization. Is each row of the tableau you show a matched pair? If not, why do you have data on two cities in each observation? You speak of only a single treatment, building a stadium, but you have two different treatment variables. Why? Y1 and Y2 are always equal in the example you show. Why do you have both variables? (Or is that not true in your data set as a whole?)

In addition to explaining these when posting back, please provide a usable example from your actual Stata data set, using the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
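As a rough illustration of what such a -dataex- call might look like, here is a minimal sketch. The variable names are placeholders for whatever the variables are called in the real data set, and the observation range is arbitrary.

```stata
* Placeholder variable names; substitute the names in your own data set.
* Show the first 20 observations of the variables relevant to the question.
dataex city year period crime treatment post in 1/20

* Or show a larger extract: count() raises the default limit of 100 observations.
dataex city year period crime treatment post, count(200)
```

The output can be pasted directly into a post; readers can then run it to recreate the example.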

Finally, explain what you mean by a "staggered" difference in difference model. What exactly do you wish to see staggered and in what way?

matthew green

Join Date: May 2019
Posts: 3

14 May 2019, 01:17

Hi Clyde,

Thanks for the response! My overall goal is to see the effect that adding a stadium to a city has on that city's crime rate. My problem is that stadiums were added in many different years (there is no single uniform treatment year for all cities). I did some research and found the staggered difference-in-differences model, which essentially looks at the year before a stadium is built, the year it is built (treated), and the year after it is built, to see the effect of adding the stadium in comparison to a city that never added one. With this approach, I would be assigning all the cities a common period in which the treatment occurs. My logic was to assign a period value (0, 1, 2) to each of the years so that all cities share a common point of treatment (1 being the treatment period). The idea was to include cities that added a stadium between 2001 and 2017 and an equal number of cities that are similar in population and crime rate at period 0. The cities that never built a stadium are my control group, while the ones that added a stadium are the treatment group.

The table I posted earlier was a little odd looking (I apologize for that), but here is the -dataex- output for my data. I have more observations, but this is basically what I came up with.
I also grouped the cities by the year in which they built a stadium and added the respective control cities that are similar to them to the same group. I don't know if that is needed or not. For example, Denver and Milwaukee added a stadium in 2001, so they are assigned group 1; Detroit and Houston did in 2002, so they are in group 2.

My data below shows:
number - the groups (1, 2, 3, 4, 5, etc.)
city - the city
year - years observed (2000, 2001, 2002, etc.)
period (0, 1, 2) - 0 = year before the stadium opened, 1 = year the stadium was opened (or not), 2 = year after the stadium was opened
crime - the city's number of crimes for that period
treatment - 1 in periods 0, 1, 2 for the treatment group and 0 for the control group
post - 1 for the year a stadium was built and after; 0 if a stadium was never built or not yet built
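
The coding rules listed above can be checked against the data with a few lines. This is only a sketch: it assumes the variables are named number, city, year, period, treatment and post, and the two _chk variables are hypothetical helpers introduced just for the check.

```stata
* Assumes variables named number, city, year, period, treatment and post.

* Event time within each group-city block: 0, 1, 2 counted from its first year
bysort number city (year): gen byte period_chk = year - year[1]

* post should be 1 only for treated cities from the opening year onward
gen byte post_chk = (treatment == 1 & period >= 1)

* Both asserts pass silently if the coding matches the description;
* otherwise Stata stops and reports how many observations contradict it
assert period_chk == period
assert post_chk == post
```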

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte number str13 city int year byte period float crime byte(treatment post)
1 "Denver"        2000 0 4692.5 1 0
1 "Denver"        2001 1 4273.5 1 1
1 "Denver"        2002 2   4821 1 1
1 "Milwaukee"     2000 0 4626.7 1 0
1 "Milwaukee"     2001 1 4539.3 1 1
1 "Milwaukee"     2002 2   4690 1 1
1 "Pittsburgh"    2000 0 2751.4 1 0
1 "Pittsburgh"    2001 1 2598.5 1 1
1 "Pittsburgh"    2002 2   2772 1 1
1 "Sacramento "   2000 0 4636.8 0 0
1 "Sacramento "   2001 1   4210 0 0
1 "Sacramento "   2002 2 4830.2 0 0
1 "Norfolk, VA"   2000 0 4535.2 0 0
1 "Norfolk, VA"   2001 1   3737 0 0
1 "Norfolk, VA"   2002 2 4478.8 0 0
1 "San Jose"      2000 0 2776.8 0 0
1 "San Jose"      2001 1 2628.6 0 0
1 "San Jose"      2002 2 2645.4 0 0
2 "Detroit"       2001 0 4686.5 1 0
2 "Detroit"       2002 1 4297.8 1 1
2 "Detroit"       2003 2 3360.3 1 1
2 "Foxborough"    2001 0    760 1 0
2 "Foxborough"    2002 1   1267 1 1
2 "Foxborough"    2003 2   3245 1 1
2 "Houston"       2001 0 5046.3 1 0
2 "Houston"       2002 1 5505.4 1 1
2 "Houston"       2003 2 5097.1 1 1
2 "Seattle"       2001 0 5221.1 1 0
2 "Seattle"       2002 1 5219.4 1 1
2 "Seattle"       2003 2 5458.4 1 1
2 "Chicago"       2001 0 5046.3 0 0
2 "Chicago"       2002 1   5132 0 0
2 "Chicago"       2003 2 5239.7 0 0
2 "Weymouth"      2001 0    900 0 0
2 "Weymouth"      2002 1   1211 0 0
2 "Weymouth"      2003 2   1433 0 0
2 "Boston"        2001 0   5072 0 0
2 "Boston"        2002 1   5361 0 0
2 "Boston"        2003 2 2830.5 0 0
2 "Baltimore"     2001 0 5565.9 0 0
2 "Baltimore"     2002 1 5124.3 0 0
2 "Baltimore"     2003 2 4701.2 0 0
3 "Los Angeles"   2002 0 3998.3 1 0
3 "Los Angeles"   2003 1 3675.5 1 1
3 "Los Angeles"   2004 2 3518.9 1 1
3 "Chicago"       2002 0 6637.4 1 0
3 "Chicago"       2003 1 6698.1 1 1
3 "Chicago"       2004 2   7000 1 1
3 "Cincinnati"    2002 0 4541.5 1 0
3 "Cincinnati"    2003 1 4517.8 1 1
3 "Cincinnati"    2004 2 4032.1 1 1
3 "Philadelphia"  2002 0 3389.6 1 0
3 "Philadelphia"  2003 1 3446.1 1 1
3 "Philadelphia"  2004 2   3851 1 1
3 "New York"      2002 0 3998.3 0 0
3 "New York"      2003 1   2659 0 0
3 "New York"      2004 2 2535.1 0 0
3 "Washington DC" 2002 0 4047.1 0 0
3 "Washington DC" 2003 1 3862.3 0 0
3 "Washington DC" 2004 2   2909 0 0
3 "Norfolk, VA"   2002 0 4478.8 0 0
3 "Norfolk, VA"   2003 1 3558.3 0 0
3 "Norfolk, VA"   2004 2 4066.8 0 0
3 "Detroit"       2002 0 4297.8 0 0
3 "Detroit"       2003 1 3360.3 0 0
3 "Detroit"       2004 2 3070.1 0 0
4 "Philadelphia"  2003 0 5508.8 1 0
4 "Philadelphia"  2004 1   3851 1 1
4 "Philadelphia"  2005 2 3360.8 1 1
4 "San Diego"     2003 0 4187.8 1 0
4 "San Diego"     2004 1 4111.2 1 1
4 "San Diego"     2005 2 3777.1 1 1
4 "Detroit"       2003 0 3360.3 0 0
4 "Detroit"       2004 1 3070.1 0 0
4 "Detroit"       2005 2 3292.1 0 0
4 "Minneapolis"   2003 0 3766.9 0 0
4 "Minneapolis"   2004 1 3728.3 0 0
4 "Minneapolis"   2005 2 3983.5 0 0
5 "Frisco"        2004 0 5198.7 1 0
5 "Frisco"        2005 1 3318.2 1 1
5 "Frisco"        2006 2 3720.3 1 1
5 "Orlando"       2004 0 4992.2 0 0
5 "Orlando"       2005 1 5135.4 0 0
5 "Orlando"       2006 2 5200.1 0 0
6 "Bridgeview"    2005 0 2261.7 1 0
6 "Bridgeview"    2006 1 2261.7 1 1
6 "Bridgeview"    2007 2 2147.2 1 1
6 "Glendale"      2005 0 2215.1 1 0
6 "Glendale"      2006 1 2132.4 1 1
6 "Glendale"      2007 2 2097.1 1 1
6 "St. Louis"     2005 0 3937.3 1 0
6 "St. Louis"     2006 1 4297.3 1 1
6 "St. Louis"     2007 2 4037.7 1 1
6 "Burbank, IL"   2005 0   4001 0 0
6 "Burbank, IL"   2006 1   3761 0 0
6 "Burbank, IL"   2007 2 3872.3 0 0
7 "Denver"        2006 0 4130.1 1 0
7 "Denver"        2007 1 2496.6 1 1
7 "Denver"        2008 2 3236.7 1 1
7 "Albuquerque"   2005 0 5753.2 0 0
end

Thanks for the help. I am very new to Stata and have been going through posts left and right to figure out a solution, with no luck. I am also toying around with doing an FE model of this instead, but I figured I'd get some input first.

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

14 May 2019, 21:21

OK, so your "staggered DID" is what I generally refer to as generalized DID. There's an additional wrinkle here: you have two different treatment periods: year of building the stadium, and subsequent years, contrasting with pre-stadium building years. I can't tell from your example data, but I will assume that your control cities have period coded as 0 in all observations.

I'm a little bit confused by one aspect of your data. There are instances where the same city has two or more observations for the same year, and in one instance, Detroit, there are three observations for year 2003. It appears that in these instances, the city actually changes state during the course of that year, although I don't get how Detroit could have gone from pre-stadium through building the stadium to being post-stadium all in the space of one year. Is that one a data error? This aspect of your data is also confusing because in some cases you have different values for the crime rates, and in others you have the same value for both observations in the year. What's going on here? Do I have this general understanding right?
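
One way to locate the repeated city-year observations described above is sketched here. It assumes the variable names from the -dataex- example; dup is a hypothetical name for the flag variable.

```stata
* Count city-year combinations that occur more than once
duplicates report city year

* Flag them: dup holds the number of extra copies of each city-year
duplicates tag city year, generate(dup)

* Inspect the flagged rows side by side
sort city year number
list number city year period crime treatment post if dup > 0, sepby(city year)
```

Listing the group identifier alongside each row shows whether the repeats come from the same city serving in more than one group.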

With those assumptions, your variable period is, in fact, the treatment X time interaction variable that is central to any kind of DID analysis. Now, I imagine you will need to adjust your analysis for some additional variables. I won't belabor the point: it's a science question that you and your colleagues know more about than I would.
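
If period is not already 0 throughout for the control cities, a variable with that property can be built from the existing ones. A minimal sketch, assuming treatment is 1 for stadium cities and 0 for controls; did_period is a hypothetical name:

```stata
* did_period is a hypothetical name for the DID term:
* 0 for every control row, 0/1/2 for treated cities
* before / in / after the opening year.
gen byte did_period = cond(treatment == 1, period, 0)

* Cross-check: the control column should contain only did_period == 0
tabulate did_period treatment
```

With this variable, i.did_period would take the place of i.period in the regression.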

So here's how I would set up the bare bones analysis:

Code:

encode city, gen(ncity)
xtset ncity
xtreg crime i.period i.year, fe // CONSIDER vce(cluster ncity)

The output of the -xtreg- command will include two rows for period, each of which represents the difference in expected crime rates between the corresponding period and the pre-construction period.
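
As a rough sketch of how those two rows could be examined after estimation, the following assumes ncity has already been created and -xtset- as above, and it adopts the cluster-robust option that was only suggested for consideration:

```stata
* Assumes ncity exists and the data are already -xtset- by ncity.
xtreg crime i.period i.year, fe vce(cluster ncity)

* Opening-year and following-year effects, each relative to period 0
lincom 1.period
lincom 2.period

* Test whether the two effects differ from each other
test 1.period = 2.period
```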

You definitely should use an FE model for this. First, you cannot just use -regress- because you have repeated observations on the same cities, so the observations cannot be considered independent. So you must use a panel data analysis. Since the effect you are interested in is a purely within-city effect, you are best off with the fixed effects estimator.
matthew green

Join Date: May 2019

Posts: 3
#5

15 May 2019, 23:00

Thanks for the clarification Clyde. And thank you for pointing out the discrepancies in my data. I guess it didn't transfer properly. Quick question about the code.

Code:

encode city, gen(ncity)
xtset ncity
xtreg crime i.period i.year, fe

How is this equation interpreted? Are i.period and i.year just dummies, and is this still a diff-in-diff equation? I'm just trying to visualize what is being applied here, since it doesn't look like they are being interacted, and obviously I'm still not used to Stata.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

16 May 2019, 22:37

The -encode- command is just to create a numeric version of the city variable, since -xtset- will not accept a string as the panel identifier.

As for the -xtreg-, the variable period, as you created it, is the interaction term. It's not written in the usual form of an interaction term, but it has exactly the right values. For a city in the treatment group it takes on the value 0 before the stadium was built, 1 in the year the stadium is built, and 2 in the years afterwards. And for the control group it is everywhere 0. That is exactly the behavior you want from an interaction term. Then, since this is a generalized (or, as you call it, staggered), not a classical, DID, you need panel fixed effects (which the fe option provides) and year fixed effects (which i.year provides). Voila!
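
To help visualize what the fixed-effects model contains, the same specification can be written with the city effects as explicit dummies. This is a sketch that assumes ncity was created by the -encode- step; the choice of clustered standard errors is an assumption, not part of the original code.

```stata
* Same model with the city fixed effects written out as dummies (LSDV).
* The period coefficients equal those from -xtreg, fe-; standard errors
* can differ slightly because of degrees-of-freedom adjustments.
regress crime i.period i.year i.ncity, vce(cluster ncity)

* Equivalent, but absorbing the city dummies instead of displaying them
areg crime i.period i.year, absorb(ncity) vce(cluster ncity)
```

Seen this way, the model is a set of city dummies, a set of year dummies, and the two period indicators that carry the DID effects.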
Anand Kumar Finance

Join Date: Dec 2024

Posts: 3
#7

11 Dec 2024, 04:35

Hello Stata Community,

I am a PhD scholar in finance and would greatly appreciate your help in implementing a staggered Difference-in-Differences (DID) analysis for my dataset.

Data Context
Panel Structure:
My dataset is an unbalanced panel covering multiple firms over several years. Firms are observed for different time periods, meaning the panel is not balanced.

Treatment and Control Assignment:
My research examines the impact of a law (captured by a binary variable year_dummy) on a dependent variable.

Treatment is determined at the country level:
In a given year, if a law is enacted, all firms within that country are treated (year_dummy = 1) for that year.

If no law is enacted, all firms in that country are in the control group (year_dummy = 0) for that year.

Note: Firms can transition between treatment and control groups depending on the enactment of laws in different years.

No "Always Untreated" Cohorts:
All countries in the sample experience law enactment at some point, meaning there are no firms or countries that remain permanently untreated throughout the panel.

Objective:
I aim to estimate the causal effect of these laws on the outcome variable while accounting for the staggered timing of treatment across countries and years.

Challenges
Repeated Transitions Between Treatment and Control:
Unlike standard DID setups, firms can switch back and forth between treatment (year_dummy = 1) and control (year_dummy = 0) depending on whether a law is enacted in a particular year.

No Permanently Untreated Cohorts:
Since all countries experience law enactment at some point, there are no "always untreated" cohorts for comparison. The analysis needs to use "not-yet-treated" firms in a given year as the control group.

Unbalanced Panel:
My dataset is unbalanced, with firms observed for varying time periods. This adds complexity to ensuring proper identification of treatment effects.

Dynamic Effects:
I am particularly interested in estimating dynamic treatment effects (e.g., pre- and post-treatment effects) to understand how the impact of the laws evolves over time.

Question:
1. Given that firms can switch between treatment and control, is eventstudyinteract or csdid the best tool for this type of staggered DID analysis? If not, what alternative approach or package would you recommend?
2. How can I account for the unbalanced panel structure in my dataset?
3. Are there specific steps or adjustments I should make to estimate dynamic treatment effects more effectively?

Any suggestions, corrections, or alternative approaches would be greatly appreciated. Thank you for your time and help!
Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

11 Dec 2024, 06:48

#7 is optimistic (usually good) but also possibly quite unrealistic (not usually good) in its hopes of what people here are likely to do. People here react best to single specific questions, not portmanteau questions about the entirety of a project.
Anand Kumar Finance

Join Date: Dec 2024

Posts: 3
#9

11 Dec 2024, 10:46

Thank you for your feedback on my query regarding staggered Difference-in-Differences (DID) analysis. I understand your point about the question being too broad and appreciate the guidance.

I would like to focus on the following specific questions for now:
Dynamic Treatment Effects Estimation:
Given the staggered adoption of laws in my dataset, do you recommend using eventstudyinteract or csdid for estimating dynamic effects?

Accounting for Unbalanced Panels:
Any best practices for handling unbalanced panel data in staggered DID models, particularly when firms are observed for varying periods? Thank you in advance for your reply.
FernandoRios

Join Date: Apr 2014

Posts: 2469
#10

13 Dec 2024, 05:44

1) Any of the new methods would allow you to estimate dynamic effects: csdid, jwdid, did_imputation, etc.
2) jwdid and csdid also handle unbalanced panels, but under different assumptions. You may need to see what each one does and decide whether you are OK moving forward with a particular approach.
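
For orientation only, here is a minimal sketch of how csdid and jwdid might be called. Both are community-contributed commands (available from SSC; csdid also needs drdid). The names firm_id, year, y and first_treat are hypothetical; year_dummy is the indicator described in #7. The sketch treats the first enactment as the start of a treatment that then stays on, which may not suit data in which units switch back to untreated.

```stata
* Hypothetical names: firm_id (panel id), year, y (outcome), first_treat.
* year_dummy is the treatment indicator described in #7.

* Cohort variable: first year in which the firm is observed as treated
bysort firm_id: egen first_treat = min(cond(year_dummy == 1, year, .))

* Firms never observed under treatment: csdid documents 0 as the code
* for never-treated units
replace first_treat = 0 if missing(first_treat)

* csdid with not-yet-treated units as the comparison group
csdid y, ivar(firm_id) time(year) gvar(first_treat) notyet
estat event    // dynamic (event-study) effects

* jwdid; see -help jwdid- for the comparison group it uses
jwdid y, ivar(firm_id) tvar(year) gvar(first_treat)
estat event
```

Each command's help file describes how it treats unbalanced panels and which assumptions it relies on.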
