Difference-in-Differences (DiD) Analysis

Fanetti Mazakura

Join Date: Feb 2018

Posts: 48
#1

30 Mar 2018, 14:34

Hello everyone,

I need to conduct a DiD test for my research. I am using panel data with 150,000 monthly observations over 4 years. I have assigned a dummy treated = 1 for the treated group and zero otherwise. There are around 1,000 companies in the treatment sample, and around 2,000 in the control sample. I have also created a dummy during for the period after the change for the treated group took place.

I have several dependent variables that I will be testing. Some are dummy variables, and the rest are other variables dependent on the dummies, and therefore not represented in the whole dataset. I also have around 10 control variables for each company that I will be using in the regression.

For the Difference-in-Differences analysis I will be using the following equation:

Dependent Var = L0 + L1*(Treated x During) + L2*Treated + L3*During + Controls + E

I think I should include firm (and perhaps month/year) fixed effects in the regression. Also, what about the standard errors? Should I cluster them by firmID (and perhaps by year)? Is xtreg the right command in this case? Furthermore, is there a right / wrong way on how to present the results? I've seen many papers on DiD and they all seem to have different opinions on that.

Also, how useful is the Chow test on the TreatedDuring coefficient in this type of setting?

Lastly, any advice on how to winsorize some of my control variables?

I would really appreciate any help!

Cheers!
Tags: cluster, difference-in-differences, fixed effects, panel data, regression
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

30 Mar 2018, 16:07

I have several dependent variables that I will be testing. Some are dummy variables, and the rest are other variables dependent on the dummies, and therefore not represented in the whole dataset.

I don't understand what this means. Why would a variable that depends on other variables not be represented in the whole dataset?

I also have around 10 control variables for each company that I will be using in the regression

If these are time-invariant attributes of the companies, you will not be able to include them in a fixed-effects regression, as they will be collinear with the fixed effects.

I think I should include firm (and perhaps month/year) fixed effects in the regression.

Yes, I would recommend definitely including firm fixed effects. As for time indicators, that depends on whether the outcome(s) of interest is subject to time-specific shocks that apply equally across all firms. If so, adjusting for them by including them in the model is usually a good idea.

Also, what about the standard errors? Should I cluster them by firmID (and perhaps by year)?

Probably you should. In this kind of data, heteroscedasticity and within-firm error correlation are usually present, and the cluster-robust variance estimator is a good way to correct the standard errors for them. The only exception is when the number of clusters (firms, in this case) is small, but with roughly 3,000 firms (about 1,000 treated and 2,000 control) there is no worry on that end.

Is xtreg the right command in this case?

Well, you state that several of your outcome variables are dichotomous. In that case, -xtreg- would be fitting a linear probability model. There is nothing inherently wrong with that, but if there are predicted probabilities close to 0 or close to 1, the results can get rather strange (e.g. predicted values or confidence limits that are negative or greater than 1). That is why it is more conventional to use a logistic model for dichotomous outcomes, but if your outcome probabilities are not pushing towards 0 or 1, a linear probability model will be well behaved.

Furthermore, is there a right / wrong way on how to present the results? I've seen many papers on DiD and they all seem to have different opinions on that.

You should tailor your presentation to the background and needs and expectations of your intended audience. There is no one right way.

Also, how useful is the Chow test on the TreatedDuring coefficient in this type of setting?

It isn't. The Chow test was invented decades back when there were no other computationally feasible ways of estimating the difference between coefficients of the same variable in regression equations fit to different population groups. The more modern way is to use a regression with interaction terms. So your regression should resemble:

Code:

some_regression_command i.treated##i.during other_variables, options

The coefficient of 1.treated#1.during will be your DID estimator of the treatment effect. There really isn't anything more to say about the treatment effect. If you did it the old fashioned way, with two separate equations for the treatment and control groups and applied a Chow test you would get the same result.
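
As a concrete illustration of the template above, a minimal sketch might look like the following. The variable names firm, month, depvar, x1 and x2 are placeholders assumed here, not names from the original data, and the choice of -xtreg, fe- with firm-clustered standard errors simply follows the recommendations earlier in this post.

```stata
* Sketch only: firm, month, depvar, x1 and x2 are hypothetical variable names.
* Declare the panel structure: firm identifier and monthly time variable.
xtset firm month

* Firm fixed effects, standard errors clustered by firm.
* i.treated##i.during expands to both main effects plus their interaction.
xtreg depvar i.treated##i.during x1 x2, fe vce(cluster firm)
```

In the output, the row labelled 1.treated#1.during is the DID estimate. Because treated does not vary within a firm, Stata will omit 1.treated as collinear with the firm fixed effects; that omission is expected.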

Lastly, any advice on how to winsorize some of my control variables?

Opinions differ on this. My position on it is one of strong opposition. If your model does not fit the real data, the solution is not to amputate the data to fit into your Procrustean bed of a model. The solution is to modify the model so that it accommodates the real data. I could write a very long rant on all of the reasons I think Winsorising is a bad idea, but I'm tired, so you are spared that.
Fanetti Mazakura

Join Date: Feb 2018

Posts: 48
#3

30 Mar 2018, 17:34

Thank you for your thorough answer, Clyde!

I am analysing the trading behaviour of insiders. The data is aggregated monthly for each company. One of my dependent variables is a dummy and indicates whether or not a company's insiders trade its stock in a particular month. You are suggesting a logit model, but is it possible to put FE, and what command do you propose to do that?

I also test additional variables that are conditional on the trade dummy variable (for instance the size of the trade). Hence, I don't have observations of them for every month. Will that cause problems when I try to xtset the panel data and would xtreg do the job for taking into account fixed effects in this case? If not, what other alternatives do I have? Heckman regression?

I don't expect time-specific shocks that apply equally across time, only the shock that I am testing in the DiD. Does that mean that I should use only firm FE, or also use year FE as well? Also, should I cluster the SEs by year and firm?

The controls that I use are time-variant (monthly observations). Therefore, I can include them in the regression?

With regards to the winsorizing, I was advised to do it on 1% and 99% level. Which command does the best job for that task?
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

30 Mar 2018, 17:55

I am analysing the trading behaviour of insiders. The data is aggregated monthly for each company. One of my dependent variables is a dummy and indicates whether or not a company's insiders trade its stock in a particular month. You are suggesting a logit model, but is it possible to put FE, and what command do you propose to do that?

You can do this with the -xtlogit- command; specify the -fe- option.
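
A minimal sketch of that command, assuming a hypothetical 0/1 outcome named traded and placeholder controls x1 and x2 (none of these names come from the original data):

```stata
* Sketch only: firm, month, traded, x1 and x2 are hypothetical variable names.
xtset firm month

* Conditional (fixed-effects) logit with the DID interaction.
xtlogit traded i.treated##i.during x1 x2, fe
```

Note that the fixed-effects logit uses only within-firm variation in the outcome, so firms whose outcome is always 0 or always 1 are dropped from the estimation sample. The coefficients are on the log-odds scale; adding the -or- option reports odds ratios instead.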

I also test additional variables that are conditional on the trade dummy variable (for instance the size of the trade). Hence, I don't have observations of them for every month. Will that cause problems when I try to xtset the panel data and would xtreg do the job for taking into account fixed effects in this case? If not, what other alternatives do I have? Heckman regression?

I really don't know. My first reaction is that the amount of insider trading is only defined when an insider trade actually happens. I don't see it as analogous to a Heckman selection model, where, for example, there really would potentially be an income for a person who is not in the work force. But maybe it is; I'm certainly no expert in this area. There are a number of economists and finance specialists on the Forum, and I hope one of them will chime in. This is more of a content-area question than a statistical one, as I see it.

I don't expect time-specific shocks that apply equally across time, only the shock that I am testing in the DiD. Does that mean that I should use only firm FE, or also use year FE as well? Also, should I cluster the SEs by year and firm?

Well, if you don't expect variation in your outcome over time, then there is no reason to include time covariates. As for clustering by year and firm, you will not be able to do that if you -xtset- your data with firm as the panel variable. The clusters you specify in -vce(cluster )- must be constant within all of the panels, so you can't subdivide the panels into separate clusters for this purpose.

The controls that I use are time-variant (monthly observations). Therefore, I can include them in the regression?

Yes.

With regards to the winsorizing, I was advised to do it on 1% and 99% level. Which command does the best job for that task?

Well, as I indicated, I really regard winsorizing as beyond the pale. I never use it at all, and am not familiar with commands for doing it. Try using Stata's -search- or -findit- commands; I'm sure you'll find something there.
Fanetti Mazakura

Join Date: Feb 2018

Posts: 48
#5

04 Apr 2018, 10:14

Thanks for your thorough reply, Clyde!
However, I did some tests and I got even more confused about the whole concept of DiD and fixed effects.

From what I've read, the DiD is a type of fixed effects model because the differencing gets rid of the individual fixed effects. I first did it without fixed effects using:

Code:

reg DepVar treatedxduring treated during Controls, cluster(firm)

Referring to another post on the forum ( https://www.statalist.org/forums/for...ferences-model ), I also tried the second approach of including fixed effects, but dropping the treated and during variables from the regression:

Code:

xtreg DepVar treatedxduring Controls, fe cluster(firm)

The results of the two approaches are not similar at all. The significance of treatedxduring, as well as the magnitude of the coefficients, is different. For some dependent variables even the signs of the coefficients are different. I must be doing something wrong...

Any advice, please?
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#6

04 Apr 2018, 10:48

In the -xtreg- version you should not have dropped during.

In any case, don't calculate your own interaction terms. Use the ## factor-variable notation and let Stata decide what, if anything, to drop.

See -help fvvarlist-
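
To make the notation concrete, here is a short sketch of what ## stands for. The names depvar, x1 and x2 are placeholders, and the data are assumed to be already -xtset- with firm as the panel variable.

```stata
* Sketch only: depvar, x1 and x2 are hypothetical variable names.
* These two commands fit the same model: A##B is shorthand for A B A#B.
xtreg depvar i.treated##i.during x1 x2, fe vce(cluster firm)
xtreg depvar i.treated i.during i.treated#i.during x1 x2, fe vce(cluster firm)
```

A single # gives only the interaction terms without the main effects, which is why ## is the safer choice here.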
Fanetti Mazakura

Join Date: Feb 2018

Posts: 48
#7

04 Apr 2018, 12:10

Thank you, Clyde!

If I were to include year-month fixed effects, how would I do that? Moreover, I would have to drop the during variable as well?

Best regards!
Fanetti Mazakura

Join Date: Feb 2018

Posts: 48
#8

04 Apr 2018, 13:18

Code:

xtreg DepVar treatedxduring Controls i.month , fe cluster(firm)

Is that the correct code? Moreover, is there an easy way to absorb the regression output for all the months, since I am not interested in reporting them? (I just figured out that outreg2 has the drop/keep(varlist) function to suppress the output)

Last edited by Fanetti Mazakura; 04 Apr 2018, 13:24.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#9

04 Apr 2018, 13:42

Yes, but, again, I urge you not to create and use treatedxduring as a variable. Better to use i.treated##i.during.

One way to suppress the output of the month variables is to use -areg- and include month in the -absorb()- option. There will be a difference in the way that -areg- calculates degrees of freedom from what you get with -xtreg, fe-, however.
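
Another possibility, not discussed elsewhere in this thread, is the community-contributed -reghdfe- command, which can absorb more than one set of fixed effects at once. The sketch below assumes that package has been installed from SSC and uses placeholder names depvar, x1 and x2.

```stata
* Sketch only: assumes the community-contributed reghdfe package is installed
* (ssc install reghdfe); firm, month, depvar, x1 and x2 are hypothetical names.
* Both firm and month effects are absorbed, so neither set appears in the output.
reghdfe depvar i.treated##i.during x1 x2, absorb(firm month) vce(cluster firm)
```

As with -areg-, degrees-of-freedom adjustments may differ from those of -xtreg, fe-, so standard errors need not match exactly.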
Fanetti Mazakura

Join Date: Feb 2018

Posts: 48
#10

07 Apr 2018, 03:09

I am getting insignificant results for the DD estimators in my regressions. I am not exactly sure whether it is because of the dataset, or whether I am doing something wrong.

Code:

xtreg DepVar treatedxduring Controls i.month, fe cluster(firm)

In this case I have created a variable treatedxduring as the DD estimator. If there is a significant change in the DepVar for the treated group after the treatment, the coefficient should be significant?

Code:

xtreg DepVar i.treated##i.during Controls i.month, fe cluster(firm)

I followed your advice and used factor variable notation. I get the same t-values for 1.treated#1.during. However, in this case 1.treated and the last i.month are omitted. 1.during is not omitted at the expense of i.month, I guess.

Code:

xtreg DepVar i.treated#i.during Controls i.month, fe cluster(firm)

When I use a single #, 1.treated#0.during and 1.treated#1.during are omitted.

I am quite confused and I am not sure whether it is the data or the model that is making my results insignificant. The theory kind of states that there should be observable effects on the pilot group after the treatment. I am not the most proficient Stata user, but I think that the model is ok.

Any advice?
Fanetti Mazakura

Join Date: Feb 2018

Posts: 48
#11

07 Apr 2018, 03:15

Moreover, in several papers that use Difference-in-Differences they report all of the coefficients for Treatment, During and TreatmentxDuring, but also state in the description that firm and time fixed effects are included in the regressions.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#12

07 Apr 2018, 09:50

Re #10: only the second of the three models you show is correct. 1.treated is omitted because of collinearity with the fixed effects, and the last month is omitted due to collinearity with 1.during. These are expected omissions and do not signify any problem.

The other models are misspecified because neither one includes during by itself, and without that, the interaction term is no longer a DID estimator--in fact it has no useful meaning at all!
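
To see why during has to stay in the model, consider a simplified sketch of the fixed-effects specification, ignoring the controls and month indicators. The symbols below (a firm effect, two coefficients and an error term) are notation introduced here for illustration only.

```latex
\begin{align*}
y_{it} &= \alpha_i + \gamma\,\mathrm{During}_t
          + \delta\,(\mathrm{Treated}_i \times \mathrm{During}_t) + \varepsilon_{it} \\
\text{control firms:}\quad E[\Delta y] &= \gamma \\
\text{treated firms:}\quad E[\Delta y] &= \gamma + \delta \\
\text{difference in differences:}\quad (\gamma + \delta) - \gamma &= \delta
\end{align*}
```

If during is dropped, the model forces the expected before-after change for control firms to be zero, so the interaction coefficient picks up the whole before-after change among treated firms rather than the difference between the two groups' changes.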

Moreover, in several papers that use Difference-in-Differences they report all of the coefficients for Treatment, During and TreatmentxDuring, but also state in the description that firm and time fixed effects are included in the regressions.

You can do this by using -regress- instead of -xtreg, fe-, and explicitly including i.firm and i.year in the varlist. In fact, due to the collinearity between treated and i.firm, and between during and i.year, this is just an illusion with no substance. It is, in principle, impossible to identify these baseline effects in the presence of that collinearity. And if you play with which levels of firm and year you choose as the omitted base categories, you will see that the coefficients of treated and during change. And, in fact, if you're good enough at linear algebra, you can concoct ways to modify the model so as to get during and treated coefficients to come out to be anything at all. They are meaningless, and reporting them in results probably means the authors don't understand what they're doing, or some reviewer who didn't understand what he/she was doing insisted on the inclusion of meaningless numbers in the results table.

By the way, none of this has anything to do with the use of Stata. These constraints are imposed by linear algebra and they bind all statistical packages. Different programs may have different default methods for how they deal with them (what they choose to omit, or how they otherwise constrain the model), but there is simply no way to have valid estimates of all of these parameters. Fortunately, for your purposes, only the coefficient of the interaction term is important anyway.