Diff in Diff: DRDID and CSDID

FernandoRios

Join Date: Apr 2014

Posts: 2487
#46

05 Dec 2021, 20:46

Hi Isabella
The equivalent of R's unbalanced-panel option is to request the repeated cross-section estimator.
Instead of
csdid pea nao_emp_Chefe, ivar(id) time(time_calendar) gvar(first_treat)
write
csdid pea nao_emp_Chefe, cluster(id) time(time_calendar) gvar(first_treat)
HTH
Comment
Isabella Helter

Join Date: Dec 2021

Posts: 7
#47

06 Dec 2021, 05:13

Thanks a lot, Fernando! I still have a doubt, though: my panel is at the individual level, so can I still cluster? I saw this option in the csdid help, but I did not use cluster precisely because I want the effect at the individual level. How do I interpret the effect using cluster if my data is at the individual level?

Also, what is the difference between removing the "ivar" option and using "cluster" to deal with unbalanced panels? From the help, I understood that removing "ivar" would also be one kind of solution when my data is unbalanced. Can I use this alternative in my case?

Thanks again!
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2487
#48

06 Dec 2021, 06:55

So, something that may not be well understood: when you use panel estimators with csdid, standard errors are obtained by clustering at the individual level.
If your panel is fully balanced, for example, and all variables are time-constant, using ivar(id) or cluster(id) should produce the same result.
Differences would arise if the data is unbalanced or if characteristics change across time.
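
A quick way to check this on your own machine is a small simulation. What follows is only a sketch under assumptions: the data are simulated, the names id, period, first_treat and y are made up for the illustration, and it assumes drdid and csdid are already installed from SSC. Run both commands and compare the two output tables.

```stata
* Sketch only: simulated, fully balanced panel with no covariates.
* Assumes drdid and csdid are installed (ssc install drdid, ssc install csdid).
clear
set seed 12345
set obs 400
gen id = _n
gen alpha = rnormal()                        // unit-specific effect
gen first_treat = cond(id <= 200, 0, 4)      // 0 = never treated, 4 = treated from period 4
expand 6
bysort id: gen period = _n                   // periods 1..6 for every unit
gen y = alpha + 0.3*period + 1.5*(first_treat > 0 & period >= first_treat) + rnormal()

* Panel estimators: standard errors implicitly clustered on id
csdid y, ivar(id) time(period) gvar(first_treat)

* Repeated cross-section estimators with explicit clustering on id
csdid y, cluster(id) time(period) gvar(first_treat)
```

If you drop some observations at random before running the two commands, the panel becomes unbalanced and that is where the two sets of results can start to differ.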

Now, when you say your data is at the individual level, do you mean you do not observe the same individuals across time? In that case, cluster will not be useful.
In any case, cluster only modifies how standard errors are estimated.

Now, using ivar forces csdid to use panel data estimators. In the first step, these estimators always take the within-unit change across time: Dy = y - l.y. This is the reason for the message.
When you do not use ivar, you request repeated cross-section estimators. In this case the DID estimator basically focuses on estimating E(Dy) = E(Y|t) - E(Y|t-1). So rather than taking the first difference, you first get the conditional means and then estimate the changes across time.
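
To make the two routes concrete, here is a minimal two-period, two-group toy example done by hand with regress rather than with csdid. It is only an illustration under assumptions: the data are simulated, there are no covariates, and the names id, t, treated and y are hypothetical.

```stata
* Sketch only: simulated 2x2 design, balanced panel, no covariates.
clear
set seed 2021
set obs 200
gen id = _n
gen treated = id > 100
gen alpha = rnormal()                 // unit-specific effect
expand 2
bysort id: gen t = _n - 1             // t = 0 (pre) and t = 1 (post)
gen y = alpha + 0.5*t + 2*treated*t + rnormal()

* Panel logic: first take Dy = y - l.y within unit, then compare groups
xtset id t
regress D.y i.treated, vce(robust)

* Repeated cross-section logic: get the group-by-period means first,
* then the change across time (the interaction term is the DID)
regress y i.treated##i.t, vce(cluster id)
```

In a balanced panel like this one, the coefficient on 1.treated in the first regression and the interaction coefficient in the second are the same number. They are just computed in a different order.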

If you are interested in the exact formulas, you can check Pedro's original paper, or take a look at my reinterpretation of those values here: https://friosavila.github.io/playing...did_csdid.html
This is what R's did package does when you allow for an unbalanced panel, except that in Stata you need to be explicit about the clustering too.
HTH
Comment
Isabella Helter

Join Date: Dec 2021

Posts: 7
#49

07 Dec 2021, 12:21

thanks again, now I understand!

I have one more doubt, this time about the speed of the package in Stata. My model has been running for more than a day and has not finished. Is this normal? Here is the pdf image...
Attached Files

image.pdf (85.4 KB, 1 view)
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2487
#50

07 Dec 2021, 13:03

From what I see, you have a lot of data.
And a lot of it is not even being used. As long as you see "." or "x" appearing, there is progress.
I wonder, however, whether there is a problem with your variable selection that is causing trouble for the model estimation.

So, here is one way to figure that out.
First, run tab tempo_calendario primeiro_choque, and show me how much data you have there.
Then you can select 1 cohort plus the never treated, and 2 time periods (right before treatment and after treatment, based on your cohort variable).

Finally, run a logit model to determine the chances of belonging to the cohort.
If that gives you any problems, it means that you may be overfitting your model.
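
Roughly, those steps could look like the sketch below. It assumes your data is in memory with the variable names from your command, that never-treated units are coded 0 in primeiro_choque, and it picks cohort 3 purely as an example. x1 and x2 are hypothetical placeholders for whatever covariates you use in csdid.

```stata
* Sketch only: x1 and x2 are placeholders for the actual covariates.
tab tempo_calendario primeiro_choque

preserve
* One cohort (3, as an example) plus the never treated (0), and two periods:
* period 2 (right before treatment for cohort 3) and period 3 (first treated period)
keep if inlist(primeiro_choque, 0, 3) & inlist(tempo_calendario, 2, 3)
gen byte in_cohort = primeiro_choque == 3    // 1 = belongs to cohort 3
logit in_cohort x1 x2
restore
```

If logit reports that some variable predicts the outcome perfectly, drops observations, or fails to converge, that would be a sign of the overfitting I am worried about.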
HTH
Comment

Isabella Helter

Join Date: Dec 2021
Posts: 7

#51

07 Dec 2021, 13:46

Ok, that's the result of:

Code:

tab tempo_calendario primeiro_choque

Total: 751,537.

Just to explain: this dataset comes from the largest household survey in Brazil. It has a quarterly frequency, starting in 2012 and running until 2019. People are followed for a maximum of 5 quarters. To create the variable tempo_calendario I did:

Code:

egen tempo_calendario = group(year quarter)

The treatment happens when the head of the household loses his job in one of the quarters, and the idea is to see the effect of this shock over the quarters on the likelihood that the child will start working to compensate for the loss of income.

group(year quarter)		primeiro_choque
tempo_calendario	0	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	Total
1	13,865	932	418	190	82	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	15,487
2	21,590	932	1,075	455	238	93	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	24,383
3	22,225	654	1,075	1,012	478	263	65	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	25,772
4	21,871	390	736	1,012	1,050	556	195	56	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	25,866
5	21,780	179	428	691	1,050	1,110	408	180	60	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	25,886
6	22,017	0	184	444	737	1,110	886	368	205	61	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	26,012
7	21,205	0	0	169	419	675	886	863	400	189	63	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	24,869
8	21,171	0	0	0	204	401	547	863	901	378	159	63	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	24,687
9	21,338	0	0	0	0	178	350	574	901	888	365	184	94	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	24,872
10	21,574	0	0	0	0	0	158	357	598	888	817	388	244	87	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	25,111
11	21,916	0	0	0	0	0	0	170	398	618	817	833	481	218	79	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	25,530
12	21,938	0	0	0	0	0	0	0	159	415	545	833	995	448	186	94	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	25,613
13	21,452	0	0	0	0	0	0	0	0	198	342	543	995	929	367	249	90	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	25,165
14	21,005	0	0	0	0	0	0	0	0	0	149	335	638	929	804	484	269	64	0	0	0	0	0	0	0	0	0	0	0	0	0	0	24,677
15	20,551	0	0	0	0	0	0	0	0	0	0	136	404	591	804	1,083	565	284	137	0	0	0	0	0	0	0	0	0	0	0	0	0	24,555
16	20,308	0	0	0	0	0	0	0	0	0	0	0	167	364	554	1,083	1,146	604	368	125	0	0	0	0	0	0	0	0	0	0	0	0	24,719
17	20,078	0	0	0	0	0	0	0	0	0	0	0	0	178	362	739	1,146	1,264	683	305	114	0	0	0	0	0	0	0	0	0	0	0	24,869
18	19,869	0	0	0	0	0	0	0	0	0	0	0	0	0	175	477	816	1,264	1,370	590	289	113	0	0	0	0	0	0	0	0	0	0	24,963
19	19,740	0	0	0	0	0	0	0	0	0	0	0	0	0	0	237	511	987	1,370	1,335	648	297	80	0	0	0	0	0	0	0	0	0	25,205
20	19,160	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	220	624	957	1,335	1,403	549	222	80	0	0	0	0	0	0	0	0	24,550
21	18,528	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	277	580	912	1,403	1,165	455	200	98	0	0	0	0	0	0	0	23,618
22	18,349	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	268	588	1,007	1,165	1,053	452	289	86	0	0	0	0	0	0	23,257
23	18,109	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	279	658	835	1,053	1,162	626	261	80	0	0	0	0	0	23,063
24	17,759	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	322	523	782	1,162	1,305	562	210	76	0	0	0	0	22,701
25	17,437	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	248	509	812	1,305	1,137	437	207	88	0	0	0	22,180
26	17,246	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	233	530	906	1,137	965	433	268	107	0	0	21,825
27	17,196	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	249	599	806	965	1,027	553	282	88	0	21,765
28	16,846	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	251	497	702	1,027	1,154	526	243	61	21,307
29	16,466	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	224	422	746	1,154	1,085	491	197	20,785
30	16,644	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	185	482	798	1,085	1,076	416	20,686
31	17,235	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	232	511	769	1,076	1,059	20,882
32	14,186	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	228	449	755	1,059	16,677
Total	620,654	3,087	3,916	3,973	4,258	4,386	3,495	3,431	3,622	3,635	3,257	3,315	4,018	3,744	3,331	4,446	4,763	5,368	5,733	5,469	5,844	4,895	4,387	4,647	5,379	4,710	3,966	4,230	4,754	4,303	3,729	2,792	751,537

Last edited by Isabella Helter; 07 Dec 2021, 14:08.

Comment

Iryna Hayduk

Join Date: Dec 2021

Posts: 7
#52

07 Dec 2021, 17:58

Dear Fernando,

csdid is a great command. I have been using it for several weeks now, and it worked well. However, I have received the following error message today:

csdid outcome, cluster(state) time(cohort) gvar(staggered_cohort) method(drimp) agg(simple)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxestimates post: matrix has missing values
r(504);

Any help with resolving this issue would be greatly appreciated.

Thank you!

Best,
Iryna
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2487
#53

08 Dec 2021, 06:44

Dear Isabella
So there are two reasons why you have so many x's and why the model is taking a while to estimate.
At every step, you are using the full dataset to estimate 5 models: 4 for the outcome and 1 for the propensity score. In all cases, you are using a large dataset (751k observations). Even when using "if" conditions (as csdid does when calling drdid), Stata starts by using the full dataset. And that takes time.
The other reason, as I suggested, could be overfitting, especially for those cases where you have fewer than 100 observations.
For example, try running the following (here gvar stands for a 0/1 indicator of belonging to cohort 3):
drdid pea $x1list [w=peso] if inlist(tempo_calendario,2,6) & inlist(primeiro_choque,0,3), time(tempo_calendario) tr(gvar)

And see what happens. If it takes too long, I would also run a logit model using the same sample:
logit gvar $x1list [w=peso] if inlist(tempo_calendario,2,6) & inlist(primeiro_choque,0,3)

That will give you a better idea of whether or not you have overfitting.
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2487
#54

08 Dec 2021, 06:48

Hi Iryna
My first reaction was going to be to ask whether you have -drdid- installed.
What the output is telling you is that nothing was estimated. That is why you have only "x" for every iteration.
It is possible that your gvar/year setup is not correct, so can you show me what happens when you do
tab year gvar

The other alternative is that you have very few observations per cohort and year. This makes drimp very difficult to estimate. In this case I would try method(reg). Without covariates, they will give you the same results.

Finally, it's better if you do not request agg(simple). Just let it provide you all the ATTGT's and then request the simple average using "estat simple".
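
As a rough sketch of those last two suggestions, reusing the variable names from your command in #52 (and assuming the data is in memory and the gvar setup turns out to be fine), the two steps could look like this:

```stata
* Sketch only: same variables as in #52, with method(reg) instead of method(drimp)
* and without agg(simple), so that all ATT(g,t) estimates are reported first.
csdid outcome, cluster(state) time(cohort) gvar(staggered_cohort) method(reg)

* Then request the simple average as a post-estimation step
estat simple
```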

HTH
Fernando
Comment

Iryna Hayduk

Join Date: Dec 2021
Posts: 7

#55

13 Dec 2021, 11:27

Dear Fernando,

Thank you so much for your response!

You are right, my gvar is not coded correctly, but I am not sure how to fix this issue.

Here is

tab year gvar

           |                          gvar
      year |       0      23      24      25      26      27      28 |   Total
-----------+---------------------------------------------------------+--------
         1 |     616       0       0       0       0       0       0 |     616
         2 |     617       0       0       0       0       0       0 |     617
         3 |     622       0       0       0       0       0       0 |     622
         4 |     775       0       0       0       0       0       0 |     775
         5 |     694       0       0       0       0       0       0 |     694
         6 |     756       0       0       0       0       0       0 |     756
         7 |   1,142       0       0       0       0       0       0 |   1,142
         8 |   1,178       0       0       0       0       0       0 |   1,178
         9 |   1,027       0       0       0       0       0       0 |   1,027
        10 |   1,048       0       0       0       0       0       0 |   1,048
        11 |     961       0       0       0       0       0       0 |     961
        12 |   1,081       0       0       0       0       0       0 |   1,081
        13 |   1,132       0       0       0       0       0       0 |   1,132
        14 |   1,165       0       0       0       0       0       0 |   1,165
        15 |   1,482       0       0       0       0       0       0 |   1,482
        16 |   1,255       0       0       0       0       0       0 |   1,255
        17 |   1,598       0       0       0       0       0       0 |   1,598
        18 |   1,168       0       0       0       0       0       0 |   1,168
        19 |   1,584       0       0       0       0       0       0 |   1,584
        20 |   1,251       0       0       0       0       0       0 |   1,251
        21 |   1,362       0       0       0       0       0       0 |   1,362
        22 |   1,150       0       0       0       0       0       0 |   1,150
        23 |     974     288       0       0       0       0       0 |   1,262
        24 |     697     198     173       0       0       0       0 |   1,068
        25 |     876     216      72     125       0       0       0 |   1,289
        26 |     822     360     100      90     143       0       0 |   1,515
        27 |     666     270     108     180     126      54       0 |   1,404
        28 |     646     198      90     158      54      54      72 |   1,272
        29 |     612     192     144     125      88      54      35 |   1,250
        30 |     562     270      86      78     155      36       0 |   1,187
        31 |     816     144     112     104     129      72       0 |   1,377
        32 |     538     198      72     178     201      36       0 |   1,223
        33 |     405     210      54     115     152      53       0 |     989
        34 |     391     180      64      67      52      34      42 |     830
        35 |     481      72     132      67     109      15      18 |     894
        36 |     607     102      53      28     102      15      46 |     953
        37 |     338     190     114      58     100      27       0 |     827
        38 |     273     204      78      79     113      26      16 |     789
        39 |     280     132      68      38      35      23       0 |     576
-----------+---------------------------------------------------------+--------
     Total |  33,648   3,424   1,520   1,490   1,559     499     229 |  42,369

If I add any constant to the positive values of gvar, the csdid command works (but, of course, this solution doesn't make any sense), but the p-value for the pretrend test is zero and Stata drops about 60% of the observations. The p-value is also 0 for any subsample, which is very strange.

. estat all
Pretrend Test. H0 All Pre-treatment are equal to 0
chi2(25) = 20934.78777012576
p-value = 0

I would greatly appreciate any help with resolving these issues.

Thank you,
Iryna

Last edited by Iryna Hayduk; 13 Dec 2021, 11:34.

Comment

FernandoRios

Join Date: Apr 2014

Posts: 2487
#56

13 Dec 2021, 13:11

Can you contact me via email? I think I'll need more information than what you have here to help you with the problem.
F
Comment
Isabella Helter

Join Date: Dec 2021

Posts: 7
#57

22 Dec 2021, 13:12

Hello Fernando! I have a question about csdid in Stata and R: can I add a time fixed effect? I saw you talking about it in one of the topics and I was confused.
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2487
#58

22 Dec 2021, 13:40

No, you can't, and there is no need.
Time fixed effects are used to "take care of differences across time". But with Callaway and Sant'Anna, you use the same years (pre and post) for the treated and control groups to obtain a given estimate, so there is no need to control for that.
Also, keep in mind that every time drdid is used (behind the scenes), you are only using 2 periods of time, thus using trends would make little sense.
HTH
Fernando
Comment
Mahdi Tavalaei

Join Date: Dec 2020

Posts: 3
#59

11 Jan 2022, 05:20

Hello Fernando
Thank you for the great package. I have three simple questions:
Suppose a simple DiD model with fixed effects (with only one treatment shock) in Stata with clustered standard errors: reghdfe Y treatXpost, absorb(panel_id period) vce(cluster panel_id)

1- Using csdid with ivar already provides clustered standard errors, right (similar to the code above)? I am asking because it does not allow both cluster and ivar in the same command [id may not be both target and by()].
On the other hand, when using ivar, the output table does not explicitly mention that the standard errors are adjusted for clustering, or for how many clusters.

2- Regardless of the standard errors, when we have multiple periods but only one group (i.e., all treated units are treated at the same time vs. never-treated units), should the ATT from the csdid code (first line below) be exactly the same as the coefficient from the second line below? I ask because the coefficients are not the same at all (tested in a few datasets, even in a balanced panel).

csdid Y, time(period) gvar(first_treated) method(reg) ivar(panel_id)
reghdfe Y treatXpost, absorb(panel_id period)

where first_treated above is zero for all never-treated units, and for all treated units it is the treatment shock period (i.e., the start of the post-treatment periods, which is the same for all; let's say t5).

3- I wonder why z-tests and z-scores are used and reported in the tables rather than t-tests and t-scores.

Thank you!
Comment
FernandoRios

Join Date: Apr 2014

Posts: 2487
#60

11 Jan 2022, 09:10

Hi Mahdi
1. That is a good point. I do describe in the paper (still being edited) and in the helpfile that when you request panel estimators (using ivar), the standard errors are implicitly clustered at the panel id level.
One way of seeing this: if you have only time-constant covariates and a fully balanced panel, using -ivar- (panel) or cluster (repeated cross-section) will give you the same results.
I'll add that to the output next time the program is updated!

2. No, they will not be the same, but they should be very similar.
The reason for this is that when using the regression approach, you are forcing all effects "treatxpost" to be constant.
csdid, however, assumes the effects vary across time and groups. Then, when requesting the "simple" aggregation, it takes the average of the individual effects across time. They should in principle be similar to the regression approach, but not the same.
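
If you want to see this for yourself, the sketch below simulates a single treated cohort whose effect grows over time and then runs both approaches. It is only an illustration under assumptions: the data are simulated, the variable names follow your post, it assumes drdid and csdid are installed, and I use the built-in xtreg, fe with period dummies as a stand-in for your reghdfe line, since the two specifications absorb the same fixed effects.

```stata
* Sketch only: one treated cohort (treated from period 5) vs. never treated.
* Assumes drdid and csdid are installed (ssc install drdid, ssc install csdid).
clear
set seed 2022
set obs 400
gen panel_id = _n
gen alpha = rnormal()                              // unit-specific effect
gen first_treated = cond(panel_id <= 200, 0, 5)    // 0 = never treated
expand 8
bysort panel_id: gen period = _n                   // periods 1..8, balanced
gen treatXpost = first_treated > 0 & period >= first_treated
* The effect grows with time since treatment, so it is not constant
gen Y = alpha + 0.2*period + treatXpost*(1 + 0.5*(period - first_treated)) + rnormal()

* csdid: one ATT(g,t) per post-treatment period, then the simple average
csdid Y, ivar(panel_id) time(period) gvar(first_treated) method(reg)
estat simple

* Two-way fixed effects regression with a single constant effect
xtset panel_id period
xtreg Y treatXpost i.period, fe vce(cluster panel_id)
```

Compare the "estat simple" estimate with the treatXpost coefficient, and look at how the individual ATT(g,t) estimates from csdid change across periods.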

3. It reports z-stats because, by default, csdid uses GMM to estimate standard errors, which are valid asymptotically. Thus, like with -ml- estimators, all reported statistics are really z-stats, not t-stats.
Best wishes
Fernando
Comment
