Hello everyone,
I am estimating an event-study regression with a dependent variable y, an event-time indicator t_idx that takes values from 1 to 16, and a cohort variable. The goal is to estimate cohort-specific event-time effects, using the period where t_idx == 5 as the baseline within each cohort.
In a smaller setting (with only a few cohorts), I implemented it as follows:
Code:
* define the cohort variable and the name stubs xi derives from it
local cohort languageregion
local coh3 = substr("`cohort'", 1, 3)
local coh9 = substr("`cohort'", 1, 9)
* create interaction dummies
xi i.t_idx*i.`cohort', noomit
* keep only the t_idx x cohort interactions: drop the t_idx == 5 baseline
* interactions and the main-effect dummies for t_idx and for the cohort
drop _It_iX`coh3'_5_* _It_idx* _I`coh9'_*
* drop empty interactions (dummies that are zero for every observation)
local vardrop
foreach var of varlist _I* {
    quietly summarize `var', meanonly
    if r(mean) == 0 {
        di "`var' --> delete"
        local vardrop `vardrop' `var'
    }
}
capture drop `vardrop'
* regression
reg y _It_iX`coh3'_* i.age i.statyear, r
Now I would like to replace the cohort variable with municipality, which has more than 2,000 unique values. With this approach, Stata would have to generate roughly 15 × 2,000 = 30,000 interaction dummies as new variables, which is clearly not efficient.
I was wondering if there is a more efficient way to estimate this event-study specification. For example, I thought about using Stata’s factor-variable notation (# or ##), but I haven’t figured out how to replicate the structure above, specifically how to ensure that t_idx == 5 serves as the baseline within each cohort.
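One possible direction, given only as a minimal sketch on simulated data: build a single id for each (municipality, t_idx) cell, pool all t_idx == 5 observations into one reference cell, and let factor-variable notation pick that cell as the base. This is meant to mirror the structure of the xi code above (interaction dummies only, no main effects, all t_idx == 5 cells omitted) without storing any dummy variables. The toy data, the variable name cell, and the choice of three municipalities are assumptions for illustration. I am not certain that ib5.t_idx#i.municipality on its own would omit the t_idx == 5 cell for every municipality, so the sketch does not rely on it.

```stata
* hypothetical toy data: 3 municipalities, event time 1..16, 100 obs per cell
clear
set seed 12345
set obs 4800
generate municipality = ceil(_n / 1600)
generate t_idx = mod(_n, 16) + 1
generate y = rnormal() + 0.1 * t_idx * municipality

* one id per (municipality, t_idx) combination that occurs in the data,
* so empty cells never get an id
egen long cell = group(municipality t_idx)

* pool every t_idx == 5 observation into a single reference cell
replace cell = 0 if t_idx == 5

* ib0. makes cell 0 the omitted base; the indicators are virtual, not stored
regress y ib0.cell, vce(robust)
```

With the real data, i.age and i.statyear would be added to the regress line as in the original code. The coefficients are labelled by cell id, so they need to be mapped back to (municipality, t_idx) pairs from the data. Whether roughly 30,000 coefficients can be estimated at all depends on the limits of the Stata edition in use, which this sketch does not address.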
Any suggestions or references to more efficient approaches would be very welcome.
Thank you for your time and help,
Heike Waechter
