Diff in Diff: DRDID and CSDID

Mahdi Tavalaei

Join Date: Dec 2020

Posts: 3
#61

11 Jan 2022, 11:50

Thank you very much Fernando. All very clear!
And regarding the clustered se, yes you did mention it in the helpfile. I was looking at your webpage and forgot to check the helpfile (my bad!)
All the best
Comment
Tiyo Ardiyono

Join Date: Mar 2021

Posts: 8
#62

17 Jan 2022, 23:48

Dear Fernando,

I used your csdid on a small sample (around 60 units), where the treated and control groups contain up to 12 and 50 units respectively, depending on the first treatment year. The package works well, and what I need to do next is capture the number of observations in total, in the control group and in the treated group. So I use the syntax below:

Code:

matrix list e(gtt)

The result is below:

Code:

e(gtt)[9,7]
    cohort    t0    t1  error     N  N_trt  N_cntr
r1    2013  2012  2013      0    54     50       4
r2    2013  2012  2014      0    54     50       4
r3    2013  2012  2015      0    54     50       4
r4    2014  2012  2013      0    56     50       6
r5    2014  2013  2014      0    56     50       6
r6    2014  2013  2015      0    56     50       6
r7    2015  2012  2013      0    62     50      12
r8    2015  2013  2014      0    62     50      12
r9    2015  2014  2015      0    62     50      12

My questions:
1. Why are the numbers of the control and the treated group reversed?
2. How do I obtain the chi2 and p-value of the parallel trends test after "estat pretrend"?
3. Does csdid work well for a small sample like in my case?

Thank you very much.

Last edited by Tiyo Ardiyono; 18 Jan 2022, 00:26.
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2466
#63

18 Jan 2022, 06:46

Hi Tiyo
Here are some answers:
1) That was a typo. I have to submit the latest version, which will fix it.
2) You should be able to get those by typing return list.
3) I think the sample may be too small, although that would be the case regardless of the methodology.
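For answer 2), a minimal sketch of the sequence, assuming csdid has just been run so that its results are still in memory. The names of the individual stored results are not asserted here; return list prints whatever the command left behind.

```stata
* Assumes csdid has just been run and its estimation results are in memory
estat pretrend

* List everything estat pretrend stored in r(), including the test
* statistic and its p-value
return list
```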
Best wishes
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#64

23 Jan 2022, 05:06

Hi FernandoRios
Thanks for creating these packages.
I have industry-level unbalanced panel data from 1990 to 2015. I have a treatment dummy (the announcement of some legislation), where different states adopted the legislation at different times, and I am looking at its effect on the industry outcome. For now, I am assuming that parallel trends hold unconditionally (no covariates in the model). My data also include never-adopted states (states that never adopted the legislation). The legislation was announced in 1994 and some states adopted it that same year (I call this first group Group₁₉₉₄). My last group is Group₂₀₁₅. I have four clarifying questions.
Since the panel is unbalanced, to run the Callaway and Sant'Anna (2020) estimator I did the following:

Code:

set seed 1
gen sample = runiform() < .9
csdid y if sample == 1, ivar(state) time(year) gvar(first_legislated) method(dripw)

In this case, the output displays the warning -
Panel is not balanced
will use the observations with pairs balanced (observed at t0 and t1)

My understanding is that in this case the command coerces the data into a balanced panel by dropping units with observations missing in any time period. However, I am not entirely clear about this.
For example, suppose we have just three states (A, B and C) and years t1, t2 and t3. Data on the dependent variable are available for all three states in t1 and t2, while for t3 they are available for C only and missing for A and B.

1) Does this code drop t3 from the estimation sample? Is that what the warning means?

I also ran the following:

Code:

set seed 1
gen sample = runiform() < .9
csdid y if sample == 1, cluster(state) time(year) gvar(first_legislated) method(dripw)

Both commands run fine, with observations dropped in the first case, ivar(state), but not in the second case, cluster(state).

2) Which one should I use in my case?

3) This code generates a new variable "sample" with entries of 0 and 1. What do these zeros and ones represent? Please shed some light on this.

Since I also have never-adopted states in my data and I want only not-yet-adopted states to be used as the comparison group, I tried running the following:

Code:

csdid y if sample== 1, cluster(state) time(year) gvar(first_legislated) notyet method(dripw)

The number of observations in this case is 2202, whereas in the earlier case, using never-adopted states as the comparison group, it was slightly lower at 2200.

4) My understanding is that specifying "notyet" in the code above uses both not-yet-complied and never-complied units as the comparison group, not just the not-yet-complied ones as it might seem. Is that true?
If so, please guide me on how to specify the condition in the code so that it uses only "not-yet-complied" units as the comparison group.

Thanks
Apologies for such a long query
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2466
#65

23 Jan 2022, 09:37

Hi Ridwan
OK, so some answers.
1) Treatment of unbalanced panels:
There are three options:
a) csdid y x1 x2, cluster(id) [other]
This runs the repeated cross-section estimator using ALL the data (balanced or unbalanced).
b)
bysort id:gen nt=_N
csdid y x1 x2 if nt==Max number of periods, ivar(id) [other]
This option first counts how many periods each unit is "seen" in the data, where "Max number of periods" stands for the maximum number of periods available. Keeping only the units that are observed in ALL periods gives you the fully balanced panel.

c) csdid y x1 x2 , ivar(id) [other]

This is the option you were using. It is a middle ground between fully balanced and unbalanced data.
Rather than constraining the sample to units observed in ALL periods, it constrains the sample to units that are observed in both periods (t0 and t1) used for a particular ATTGT.
For example, say that you have 4 units observed for up to 4 periods (T0 to T3):

unit 0: observed at T0, T1, T2, T3
unit 1: observed at T0, T1, T3
unit 2: observed at T0, T2, T3
unit 3: observed at T0, T1, T2, T3

For simplicity, assume unit 0 is your control and T0 is the last pretreatment period (the base period).
If you use option b), fully balanced, you will drop units 1 and 2.
However, if you use csdid with the ivar() option, you can use units 1 and 2 whenever possible.

If you are estimating the ATTGT at T1, N=2 treated units (units 1 and 3). At T2 you also have N=2 (units 2 and 3), but at T3 you have 3 treated observations.
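The counts in this example can be reproduced with a small toy dataset. The sketch below only illustrates the pairing logic (a unit is usable for a period if it is observed at both T0 and that period). It is not csdid's internal code, and the variable names unit, t and has_base are made up for the illustration.

```stata
* Toy version of the four-unit example: unit 0 is the control, T0 is the base period
clear
input unit t
0 0
0 1
0 2
0 3
1 0
1 1
1 3
2 0
2 2
2 3
3 0
3 1
3 2
3 3
end

* Flag units that are observed in the base period T0
bysort unit: egen byte has_base = max(t == 0)

* For each later period, count treated units observed at both T0 and that period
forvalues p = 1/3 {
    quietly count if t == `p' & has_base == 1 & unit != 0
    display "T`p': " r(N) " treated units observed at both T0 and T`p'"
}
```

With these data the loop reports 2 treated units for T1, 2 for T2 and 3 for T3, matching the counts above. A fully balanced sample would keep only unit 3 among the treated.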

2) As for which code to use, it's hard to say. It will depend on how unbalanced your data are, and on the nature of the imbalance.
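One way to see how unbalanced a panel is before choosing is to tabulate how many periods each unit is observed for. This is a rough sketch rather than a recommendation. It assumes a numeric unit identifier called state, a time variable called year and an outcome called y; all three names are placeholders.

```stata
* Placeholder names: state (numeric unit id), year (time), y (outcome)

* Number of periods with a non-missing outcome, per state
bysort state: egen nobs_y = count(y)

* Tag one row per state so that each state is counted once
egen byte first = tag(state)

* Distribution of the number of observed periods across states
tabulate nobs_y if first

* Participation patterns over time
xtset state year
xtdescribe
```

If most units are observed in nearly every period, the fully balanced and pair-balanced samples will be close. If many units have long gaps, the choice matters more.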

3) The variable sample was created only as an example. It doesn't need to be used in your case. I used it to illustrate what happens when one uses unbalanced data.

4) If you want to use ONLY not-yet-treated units, you need to do something like
csdid y x1 x2 if gvar !=0 , [other options, including notyet]
That way you EXCLUDE observations that were never treated.

HTH
Fernando
2 likes
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#66

23 Jan 2022, 12:19

Hello Fernando, I have a more conceptual question that I figured you'd be able to answer.

As I understand it, these estimators and the newer DD event-study estimators estimate the ATT in relative event time.

Is the reasoning for this relative-time estimation related to Goodman-Bacon's paper about later-treated units receiving less/more weight compared to previously treated units? I guess my real question is: why is balancing one's sample in event time important in counterfactual estimation? I ask because I'm writing a synthetic estimator that I have implemented in settings with staggered implementation, and I was wondering if anyone could explain why it's necessary or desirable. FernandoRios
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2466
#67

23 Jan 2022, 18:58

Hi Jared
I think the questions you pose are not necessarily related.
1. I don't think the relative base periods and the Goodman-Bacon critique are related to each other.
As I understand Goodman-Bacon, the problem is the incorrect use of treated units as controls, which produces this awkward set of weights (negative weights on treated units).
2. I don't think balancing samples is necessarily important, but it depends on the nature of the sample and of the missing values.
For example, if you use, say, the Current Population Survey, you could use either panel estimators (for those followed) or repeated cross-section estimators, because the sample is replenished and the population is still represented in the sample, even though not everyone is followed.
Now, if you have unbalanced data because of attrition (fewer observations in later years), I would say it makes more sense to use panel data estimators, because you want your pre-period and post-period units to represent the same population.
3. In staggered implementation, I think the same logic applies. You want similar populations to be observed across time in order to calculate treatment effects. Otherwise, how can you rule out that the effect you estimate is just due to the sample changing?
HTH
Fernando
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#68

24 Jan 2022, 08:53

Thanks very much FernandoRios. This helped me understand it better.
Having run these estimations, I have a few more questions (sorry).
My data run from 1990 to 2015 and my last two groups are Group₂₀₁₂ and Group₂₀₁₅ (say). When I use only NOT-YET-TREATED units as the comparison group by executing:

Code:

csdid y if first_legislated !=0 , cluster(state) time(year) gvar(first_legislated) notyet method(dripw)

The code uses only NOT-YET-TREATED units as the comparison group (as specified). However, it produces ATT(g,t)'s for Group₂₀₁₅ (the last group) at t = 1996, 1997, ..., 2011 (no ATT(g,t)'s before 1996 or after 2011).
1) My understanding was that, since 2015 is the last year in the sample, there is no valid comparison group under "notyet" for the states adopting the legislation in that year. How can the command produce these ATT(g,t)'s for this group?

When I use option b), coercing the sample to be fully balanced, by executing:

Code:

bysort states: gen nt = _N
sum nt

After summarizing the variable "nt", I put its maximum value (which is 26) in the code below:

Code:

csdid y if nt==26, ivar(states) time(year) gvar(first_legislated) method(dripw)

This coerces the data to be fully balanced by keeping the units that are observed in ALL periods and then runs csdid on them, which makes sense. I hope I have written the code correctly.
2) However, it produces ATT(g,t)'s for some groups but not for others, and for certain groups it gives only the ATT(g,t) coefficients, without standard errors, z-scores or 95% confidence intervals. Why is that?

Thanks,

(Ridwan)
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2466
#69

24 Jan 2022, 10:24

Thank you for noticing that. I need to review the code and check what is happening there, and I will get back to you.
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#70

24 Jan 2022, 10:40

Thank you very much FernandoRios. You have been a great help in this forum.
I will wait for your response.
Best,
(Ridwan)
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#71

27 Jan 2022, 09:38

Hello FernandoRios. I have a very quick question about "csdid".
I was experimenting with my data.
Case 1: My data contain both never-treated and not-yet-treated units. I run "csdid" with the cluster() option, using not-yet-treated units as the comparison group, and store all the group-time, event-study and calendar-time coefficients.
Case 2: I then alter this data file by adding more never-treated units (I do not change the units that get treated at some point; that group is fixed). I run "csdid" on the new data file with the same clustering option, again using not-yet-treated units as the comparison group, and again store the group-time, event-study and calendar-time coefficients.

The event-study and calendar-time coefficients differ between case 1 and case 2, but the group-time coefficients are the same. Why is this happening?

My understanding was that, since the estimation in both cases uses not-yet-treated units as the comparison group, that group is the same in both cases, and I only change the never-treated group in case 2 (which the estimation does not use in either case), the coefficients should all be the same. But they are not, except for the group-time ATT(g,t)'s. Is this because of clustering? I have an unbalanced panel.

Thanks
(Ridwan)
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2466
#72

27 Jan 2022, 18:57

Hi Ridwan
First of all, thank you. What you found in your previous post was a bug that would affect pretreatment estimation.
I'll be pushing an update for it to SSC soon, because I'm trying to incorporate some group, calendar and treatment averages into the current outputs.
In the meantime, please use the file I'm attaching to replace csdid.ado.
This should fix the problem you reported earlier.

Regarding the point you make now, I'm not sure why that would be happening.
So a) please replace the file with the one I'm sending you and try your experiments again. If you still see the odd results, then
b) send me an email with some data so that I can take a closer look.

Clustering should have no effect on the estimates either.

Attached Files

csdid.ado (35.0 KB, 1 view)
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#73

28 Jan 2022, 12:34

Thanks Fernando, the first case is resolved by replacing "csdid.ado" in the C-folder: no ATT(g,t)'s are reported for the last group (G-2015); they are omitted because there is no comparison group when using not-yet-treated units as the comparison.

Thank you, it helped.

However, when coercing the data to be fully balanced in the second case, I get all the ATT(g,t) coefficients with z-scores, standard errors and 95% CIs for group-1993, group-1995, group-1996, group-1997, group-2000, group-2001, group-2002, group-2003 and group-2004, but only coefficients for group-1994, group-1998, group-1999 and group-2005 (no z-scores, standard errors or 95% CIs). Also, other groups such as group-2007, group-2008, group-2012 and group-2015 are missing entirely in this fully balanced case.
Is that because of coercing the data to be balanced?
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2466
#74

28 Jan 2022, 12:59

Yes, that is possible.
Can you check the following:
matrix list e(gtt)
I'm guessing the main problem is that the number of observations may not be enough when you use the balanced version of the data.
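When e(gtt) has many rows, it can be easier to scan as a dataset than as a printed matrix. A possible sketch, assuming csdid has just been run and e(gtt) carries the column names shown earlier in this thread (cohort, t0, t1, error, N, N_trt, N_cntr). The cutoff of 5 is arbitrary and only for illustration.

```stata
* Assumes csdid has just been run, so e(gtt) is available
matrix G = e(gtt)

* Work on a temporary copy so the estimation data are left untouched
preserve
clear
svmat G, names(col)

* Show the cohort/period pairs where either group is very small.
* min() is used because the labels of the two count columns were reported
* as swapped in the version discussed earlier in the thread.
list cohort t0 t1 N N_trt N_cntr if min(N_trt, N_cntr) < 5
restore
```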
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#75

29 Jan 2022, 03:46

Thanks. Here is what I found:

Code:

bysort states: gen nt = _N
sum nt
csdid y if nt == 26, ivar(states) time(year) gvar(first_legislated) method(dripw) saverif(B1)
matrix list e(gtt)

I got e(gtt) as a [200,7] matrix, with 200 rows and 7 columns. The cohorts are 1994, 1995, 1996, 1997, 1998, 1999, 2000 and 2005.

To obtain the aggregate group-specific effects across all post-treatment periods (equation 3.7 in Callaway and Sant'Anna):

Code:

use B1, clear
csdid_stats group, estore(group)
esttab group, se

It therefore produces group-specific aggregate effects for these cohorts/groups only, which makes sense (aggregation 3.7 in Callaway and Sant'Anna).
However, what is not clear to me is why I am not getting z-scores, CIs and SEs for the pre- and post-treatment estimates within each group, for example Att(1994,t1), Att(1994,t2), Att(1994,t3), Att(1994,t4), Att(1994,t5), ..., Att(1994,t14). For this same group, the treatment effect aggregated across all its post-treatment periods (t1 to t14) is nevertheless reported with z-scores, CIs and SEs.
Comment
