Dear Stata community,
I am currently working with an unbalanced panel dataset and would greatly appreciate your guidance on implementing a Difference-in-Differences (DiD) approach using the Callaway & Sant’Anna (2020) methodology (via csdid).
My dataset consists of daily observations from environmental monitoring stations, which measure air pollution indicators such as PM2.5 and PM10. The panel is unbalanced because not all stations report data continuously over time, and in some cases, entire periods are missing for certain pollutants.
The unit of observation is the monitoring station, and each station is located within a municipality. A station is treated once the municipality in which it is located has implemented an environmental policy.
I define the treatment timing variable (gvar) as the date of the policy implementation in the municipality. In practice, I construct this variable at the unit level (station) so that it remains constant over time within each unit. Specifically, I use the following Stata code:
bysort id_polution_station: egen gvar_d = min(date_policy_approval)
replace gvar_d = 0 if missing(gvar_d)
format gvar_d %td
This ensures that:
- All observations for a treated unit share the same treatment adoption date.
- Units that are never treated are assigned gvar = 0.
My questions are:
- Are there any recommended best practices when dealing with irregular reporting frequency or missing periods in high-frequency environmental data?
- In this context, would it be preferable to rely on never-treated units as controls, or should I use the notyet option to include not-yet-treated units as well?
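To make the setup concrete, here is a minimal sketch on simulated data (not my actual dataset) of the gvar construction described above and the two control-group specifications I am choosing between. The variable names date and pm25, the number of stations, the policy dates, and the effect size are all placeholders invented for illustration. The sketch assumes csdid and its dependency drdid are installed from SSC.

```stata
* Simulated panel: 30 stations x 8 days (all values are made up)
clear
set seed 12345
set obs 30
generate id_polution_station = _n
expand 8
bysort id_polution_station: generate date = td(01jan2020) + _n - 1
format date %td

* Hypothetical policy dates: stations 1-10 on 04jan2020, 11-20 on 06jan2020,
* stations 21-30 never treated (policy date stays missing)
generate date_policy_approval = .
replace date_policy_approval = td(04jan2020) if id_polution_station <= 10
replace date_policy_approval = td(06jan2020) if inrange(id_polution_station, 11, 20)

* Drop one station-day so the panel is unbalanced, as in the real data
drop if id_polution_station == 2 & date == td(02jan2020)

* Same construction as in the post
bysort id_polution_station: egen gvar_d = min(date_policy_approval)
replace gvar_d = 0 if missing(gvar_d)
format gvar_d %td

* Check 1: gvar_d is constant within station
bysort id_polution_station (date): assert gvar_d == gvar_d[1]
* Check 2: never-treated stations carry gvar_d == 0
assert gvar_d == 0 if id_polution_station > 20

* Placeholder outcome with an assumed effect of -3 after adoption
generate pm25 = 20 + rnormal(0, 5) - 3 * (gvar_d > 0 & date >= gvar_d)

* Specification A: never-treated stations as the comparison group (csdid default)
csdid pm25, ivar(id_polution_station) time(date) gvar(gvar_d)

* Specification B: not-yet-treated stations are also used as controls
csdid pm25, ivar(id_polution_station) time(date) gvar(gvar_d) notyet
```

The two assert lines only confirm the two properties listed above. They say nothing about whether daily dates are the right time scale for csdid when reporting is irregular, which is part of what I am asking.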
Thank you very much in advance for your help.
Best regards,
