  • #31
    Gaurav:
    welcome to the list.
    As the subject of your query has (too) little to do with the previous one, please start a new thread. Otherwise, those interested in replying might inadvertently skip your question. Thanks.
    Kind regards,
    Carlo
    (StataNow 18.5)

    • #32
      Hi,

      Many thanks for such a great discussion.

      However, could you please clarify what you mean by "Firm ID effect" and "Years effect" in #10?

      Are you still talking about "Firm fixed effects" and "Year fixed effects"? Or do you refer to some sort of categorical/trend variable (for example, a variable that takes value 1 for firm 1, value 2 for firm 2, etc.)?
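      For concreteness, the distinction I have in mind, in Stata syntax, would be something like this (a sketch with a hypothetical outcome y and a numeric identifier firm_id):

      Code:
      * fixed effects: one indicator variable per firm
      regress y i.firm_id

      * a single slope on the ID number itself, which treats the
      * arbitrary firm codes as if they were quantities
      regress y c.firm_id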

      As a consequence, I am not clear about Clyde's answer in #11, which seems to contradict (although I am sure it doesn't) his own replies in #5 and #7.

      Thanks,

      Lukas
      ------
      I use Stata 17

      • #33
        Above Clyde wrote:

        "If you include these both [unit and time fixed effects], you eliminate entirely both the treatment group effect (which is constant within firms over time) and the pre-post effect (which is constant across firms within years, at least in your design). Both the TREAT and POST variables will be dropped. So your model can no longer estimate the impact of the intervention when you do this: it is a ghost of a difference-in-differences model and will provide you with no information about the intervention's impact."
        I have the same intuition that a model with time- and unit-level fixed effects is just a ghost of a DID model, and that the estimate of the DID estimand is not helpful in assessing the impact of the treatment.

        The question I have is: do you know of any published papers I could cite, or that more formally describe the problems with such a model?
        Last edited by Jeff McMullin; 22 Dec 2017, 08:13.

        • #34
          Well, in terms of the classic DID model, I stand by what I said previously. I do not have a reference to cite for this; I'm not sure how you can find a reference to support not doing something in any case.

          That said, another Forum member (whose name, unfortunately, I cannot recall) recently called attention to the generalized DID model, which would allow for this, although one has to look at and interpret the outputs in a more highly ramified way. There is a very clear explanation of this at https://www.ipr.northwestern.edu/wor.../Day%204.2.pdf.

          • #35
            My remarks in the final paragraph of #11 were quoted by another forum member in another thread today. On re-reading it, I see that it is incorrect. What I had in mind when I wrote it was that if you introduce fixed effects for combinations of firm ID and year you will end up with a zero-information model, as everything will be explained by the fixed effects. But that is not what dupont john was proposing to do in any case, so my comment was completely misguided. I apologize for the confusion this has caused and I will endeavor to avoid similar mistakes in the future.

            To be clear, there is no reason you cannot have both firm ID and year effects in a DID model. You can. Interpreting things gets a little trickier, but there is no real problem.
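            As a minimal sketch of what that looks like (hypothetical variable names: treat is a time-invariant treatment-group indicator, post a post-period indicator):

            Code:
            * two-way fixed effects DID: firm effects via -xtreg, fe-,
            * year effects via i.year
            xtset firm year
            xtreg y i.treat##i.post i.year, fe vce(cluster firm)
            * treat is absorbed by the firm fixed effects and post (constant
            * within years in this design) by the year indicators; the
            * interaction 1.treat#1.post survives and carries the DID estimate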

            • #36
              Hello! I have panel data; my model specification is a diff-in-diff estimator that models electricity use conditional on a treatment group indicator, a post-treatment indicator, month-by-year dummy variables, and household fixed effects. This is to be estimated by OLS using the standard fixed-effects estimator, with robust standard errors clustered by household. Is it correct to use the -areg- command to model this? And, if so, how can I code this? Thank you.

              • #37
                When asking for help with code, always show example data. When showing example data, always use -dataex-.

                If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
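                A minimal sketch of the workflow (the variable names here are just placeholders):

                Code:
                * install if needed, then read the help file
                ssc install dataex
                help dataex

                * post an excerpt of the relevant variables, e.g. the first 20 rows
                dataex location date lconsum in 1/20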

                • #38
                  Oh, I am sorry. I am new to Stata, so I tend to do stupid things...

                  As I said, I have panel data for 2017-2018 from an RCT. The treatment (which started on February 2, 2018) is a specific type of bill sent to a household, which includes a comparison between the household's energy use and its neighbors', together with a photograph of the household's home showing its energy use. The treatment is expected to reduce the energy use of treated households.

                  My model specification is a diff-in-diff estimator that models energy use conditional on a treatment group indicator, a post-treatment indicator, month-by-year dummy variables, a non-linear control for temperature (for example, a quadratic temperature term), and household fixed effects. This is to be estimated by OLS using the standard fixed-effects estimator, with robust standard errors clustered by household.

                  I have been asked to implement this specification in Stata code. The variables are:


                  location: household's location ID
                  date
                  year
                  month
                  day
                  lconsum: log of energy consumption
                  randomgrp: treatment group (the variable takes values 0, 1, 2, 3, where the positive values mark the three treatment groups)
                  heatscore: score of a household based on its photograph (can be 1, 2, …, 10)
                  calday: daily date, e.g. 01jan2017
                  calmonth: monthly date, e.g. 2017m1
                  daymntemp and tempsq: temperature and temperature^2
                  treatalt: equals 1 if year==2018 & month>=2 & day>=2 & randomgrp>0, otherwise 0 (just one option for constructing this variable)
                  tr_alt_heat: treatment indicator (the product of heatscore and treatalt)

                  So, I tried this:
                  Code:
                  xtreg lconsum tr_alt_heat#i.calmonth i.calmonth, fe vce(cluster location)
                  areg lconsum tr_alt_heat#i.calmonth i.calmonth, absorb(location) vce(cluster location)

                  I got some results, but the code runs pretty slowly, with many messages like

                  note: 8.tr_alt_heat#706.calmonth omitted because of collinearity
                  note: 9.tr_alt_heat#684b.calmonth identifies no observations in the sample


                  So, I am not sure about my post-treatment variable (I chose i.month), and I do not know how to generate month-by-year dummies. Overall, I am not sure about the coding (xtreg? areg? …). I would appreciate any help.

                  • #39
                    OK. I get the picture overall. I don't understand some of the variables you have created, so the code below creates some new ones from scratch. A couple of them may be the same as some of the variables you already have, but I couldn't tell.

                    I recommend building this model up in stages from the simplest DID analysis through adding covariates one at a time, and then finally adding the interaction between heatscore and treatment effect.

                    Code:
                    //    CREATE A GENUINE STATA DATE VARIABLE
                    gen real_date = daily(date, "DM20Y") // THIS MAY BE THE SAME AS calday
                    assert missing(real_date) == missing(date)
                    format real_date %td

                    //    EXTRACT MONTH-YEAR FROM IT
                    gen mdate = mofd(real_date) // THIS MAY BE THE SAME AS calmonth
                    format mdate %tm

                    //    CREATE INDICATOR FOR PRE VS POST INTERVENTION
                    //    (USE THE REAL DATE VARIABLE, NOT THE STRING date)
                    gen post_intervention = (real_date >= td(2feb2018))

                    //    TREATMENT IS DEFINED BY RANDOMGRP; WE'LL USE THAT AS IS

                    //    BASIC DID MODEL
                    xtreg lconsum i.randomgrp##i.post_intervention, fe vce(cluster location)

                    //    ADD IN MONTH OF YEAR INDICATORS TO ADJUST FOR
                    //    "SEASONAL" (MONTHLY) VARIATION THAT RECURS IN BOTH YEARS
                    xtreg lconsum i.randomgrp##i.post_intervention i.month, fe vce(cluster location)

                    //    NOW ADD LINEAR AND QUADRATIC TERMS FOR TEMPERATURE
                    xtreg lconsum i.randomgrp##i.post_intervention i.month c.daymntemp##c.daymntemp, ///
                        fe vce(cluster location)

                    //    NOW ALLOW FOR A LINEAR TIME TREND OVER BOTH YEARS
                    xtreg lconsum i.randomgrp##i.post_intervention i.month c.daymntemp##c.daymntemp ///
                        c.mdate, fe vce(cluster location)

                    //    FINALLY, ALLOW FOR EFFECT MODIFICATION BY HEATSCORE
                    xtreg lconsum i.heatscore##i.randomgrp##i.post_intervention i.month c.daymntemp##c.daymntemp ///
                        c.mdate, fe vce(cluster location)
                    Now, I have actually altered your model specification somewhat, so I must explain what I did and why. I have eliminated your month-year indicator (dummy) variables and replaced them with month-of-year indicators (1 through 12) to capture seasonal variation, plus a continuous linear time trend in calendar month running from the first month of your study data through the last. Ordinarily I don't do things like this when the content is outside my area of expertise (epidemiology). But I pay energy bills every month and think (perhaps arrogantly) that I know something about this.

                    There is no question that there is seasonal variation. Month-year dummies will capture that only indirectly, and you will not really be able to extract it from your results. In addition, there may be overall time trends in energy consumption which, again, month-year indicators would capture only at great computational expense, and which, again, they would not let you distinguish from seasonal variation. There is another reason I eliminated the month-year variables: the error messages you report suggest that the data are sparse and that certain types of houses (characterized by heatscore) are not represented in your data in certain calendar months. Eliminating the month-year variables will resolve that problem: they try to cut the data too finely.

                    As for the difference between -xtreg- and -areg-, there is a minor difference in the way degrees of freedom are calculated. If your sample is reasonably large, this difference will not be noticeable. In most circumstances where the difference matters, -xtreg, fe- is the appropriate choice. I can't tell from the information you provide whether yours is one of the situations where -areg- would be more appropriate. You might want to read http://www.stata.com/statalist/archi.../msg00596.html for details and see. (But, as I say, if your sample is as large as I imagine it to be, the difference will be very tiny in any case.)
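                    If you want to see the size of the difference in your own data, one quick check is to run the same model both ways and compare the clustered standard errors (a sketch based on the simpler model above):

                    Code:
                    xtreg lconsum i.randomgrp##i.post_intervention i.month, fe vce(cluster location)
                    areg lconsum i.randomgrp##i.post_intervention i.month, absorb(location) vce(cluster location)
                    * the point estimates will match; only the degrees-of-freedom
                    * adjustment, and hence the standard errors, can differ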

                    As for speed of calculation, how large is your sample? A very large sample is going to take a long time to process. Another thing to bear in mind is that when you have a lot of predictors, the computation gets pretty slow. And remember that a variable like heatscore that has 10 possible values counts as 9 predictors, and month of the year counts as 11 predictors. The big interaction term in the final model shown above involves 27 predictors [(10-1)*(4-1)]. Solving for the regression coefficients requires inverting and multiplying matrices calculated from these predictors. The calculation time for that, when done by simple direct algorithms, grows as the cube of the number of predictors. There are special faster algorithms that can reduce that to about the 2.3 power of the number of predictors; I don't know which algorithms Stata uses. But no known algorithm is even as fast as the square of the number of predictors. Bottom line: be patient when fitting regression models with many predictors to very large data sets.

                    I want to ask another question. You treated the heatscore variable in your proposed code as a discrete variable, and I did the same when I introduced it in the final model. But I wonder if that is appropriate. Are these really just 10 different unordered categories? The very name heatscore suggests to me that there is a natural ordering here: heatscore 1 is in some sense "less than" heatscore 2, which is less than heatscore 3, etc. If that is the case, you can reduce the complexity of the modeling, and also speed up the computations considerably, by treating heatscore as a continuous variable instead, possibly with some transformation or a quadratic term if the "distance" between different heatscores is not additive. (But if there is no such ordering, then this approach would be illegitimate.)
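                    In code, that amounts to swapping the i. prefix for c. in the final model above (a sketch; the second regression shows one possible quadratic refinement):

                    Code:
                    * heatscore as continuous
                    xtreg lconsum c.heatscore##i.randomgrp##i.post_intervention i.month ///
                        c.daymntemp##c.daymntemp c.mdate, fe vce(cluster location)

                    * adding a quadratic term in heatscore, if the relationship is curved
                    xtreg lconsum c.heatscore##c.heatscore##i.randomgrp##i.post_intervention i.month ///
                        c.daymntemp##c.daymntemp c.mdate, fe vce(cluster location)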

                    In the future, after using -dataex-, do NOT put a screenshot of its output into the Forum. Copy/paste the -dataex- output directly from Stata's Results window into the Forum editor. Screenshots are not helpful: they cannot be imported into Stata to create a replica of your data to try out code. (They are also often unreadable, but that's a separate issue and not applicable in this case.) Of course, since I couldn't test any of this code it may contain typographical or other errors.

                    • #40
                      I forgot to mention that if you run these models, you will get warnings from Stata that the randomgrp indicator variables are omitted because of collinearity. They are collinear with the fixed effects. This is normal, and you should not be concerned when you see it. In fact, if you don't get this message, it means that there is something wrong with your data! You may also get a similar message with regard to heatscore. I don't quite know what this variable is, but if it is an unchanging attribute of the location, then it, too, will be collinear with the fixed effects (though its interactions with the other variables will not be). And again, if it isn't omitted then, if I have understood heatscore correctly, you have errors in your data.

                      • #41
                        Clyde, thank you very much for such a thorough explanation!!

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input long location str9 date float(year month day lconsum) byte(randomgrp heatscore) float(calday calmonth daymntemp tp)
                        500001 "01-JAN-17" 2017 1 1  4.332219 0 8 20820 684        -12 0
                        500001 "02-JAN-17" 2017 1 2  4.396176 0 8 20821 684     -17.85 0
                        500001 "03-JAN-17" 2017 1 3 4.4483995 0 8 20822 684 -22.141666 0
                        500001 "04-JAN-17" 2017 1 4 4.4339075 0 8 20823 684   -15.2875 0
                        500001 "05-JAN-17" 2017 1 5 4.3300753 0 8 20824 684    -10.625 0
                        end
                        format %td calday
                        format %tm calmonth
                        Here tp = post_intervention; I removed the variables treatalt and tr_alt_heat.

                        1. Indeed, as you suggested, the variables ‘real_date’ and ‘mdate’ are the same as ‘calday’ and ‘calmonth’ respectively. If possible, could you please explain why you eventually added ‘mdate’ (= ‘calmonth’) in your last two regression equations? If I understood you correctly, you were against adding this variable to the model specification (“I have eliminated your month-year indicator (dummy) variables…”).

                        2. The variable ‘heatscore’ can take values from 0 to 10 (0, 1, 2, …, 10) and measures a home’s energy use as assessed from its photograph. You were right that “there is a natural ordering here”: the lower this score, the less energy a household uses. I also found out that, as you said, we can treat this variable as continuous. Honestly, I did not know that when we deal with a ranking/ordering, we can treat a variable with a finite number of values (11 for ‘heatscore’) as a continuous one... In this case, what did you mean by “some transformation or a quadratic term if the "distance" between different heatscores is not additive”? Would you mind providing some preliminary code for that?

                        3. Finally, when I ran the last regression equation
                        xtreg lconsum i.heatscore##i.randomgrp##i.tp i.month c.daymntemp##c.daymntemp c.calmonth, fe vce(cluster location)
                        I did get the ‘normal’ warnings from Stata that
                        2.heatscore omitted because of collinearity

                        10.heatscore omitted because of collinearity
                        and
                        1.randomgrp omitted because of collinearity

                        3.randomgrp omitted because of collinearity

                        However, I also got that
                        heatscore#1.randomgrp omitted because of collinearity
                        heatscore#2.randomgrp omitted because of collinearity
                        heatscore#3.randomgrp omitted because of collinearity

                        and I am worried about these last three messages.

                        • #42
                          Katherine:
                          just a trivial comment about your point 3:
                          if the predictors included in a given interaction get omitted, the interaction itself follows the same destiny. Or am I missing something?
                          Kind regards,
                          Carlo
                          (StataNow 18.5)

                          • #43
                            Re #41.

                            1. The inclusion of mdate (or calmonth) as a continuous variable is intended to represent any ongoing linear trend in the outcome over the study period. So if there is a general increase (or decrease) in energy consumption over time, say it grows by 0.05% per month, this term will incorporate that into the model and thereby reduce error variance. Note that this is different from the month-year indicators I argued against: a single continuous trend term costs one degree of freedom, not one per month. If there is no such general trend, the coefficient will turn out to be (approximately) 0, and you can then just omit the term from the model.

                            2. Actually, I wasn't sufficiently precise or clear in my discussion of how to use heatscore in the model, and part of what I said there is incorrect. I'm sorry about that. What really matters is this: is the relationship between heatscore and lconsum linear, or quadratic, or like 1/x, or sqrt(x), or is lconsum perhaps proportional to log(heatscore), etc.? The correct point is not the one I made about equal distances, but the nature of that relationship. So you might want to explore this graphically, especially in the pre-intervention data, to see how it looks (see the sketch at the end of this post). You then might want to create a new variable from heatscore that does represent the relationship to lconsum in a linear way. If no simple functional transformation seems to do that, you might consider a spline. Of course, maybe you'll be lucky and the relationship is roughly linear, in which case you can just use heatscore as is. I can't offer you specific code for the transformation, because I don't know what the relationship looks like, and I couldn't infer it from a sample of 5 observations at a single location.

                            3. Since heatscore is apparently a fixed attribute of a location that does not change over time, it will be omitted due to collinearity with the fixed effects. That isn't a problem. Similarly, randomgrp is, by design, a fixed property of a location and is collinear with the fixed effects, so it, too, gets omitted. And as Carlo notes in #42, the interaction of two such variables gets omitted for the same reason: heatscore#randomgrp varies across locations but not over time within a location, so it is likewise absorbed by the fixed effects. None of this is a problem. heatscore still enters the model through the three-way interaction with randomgrp and tp (= post_intervention), which does vary over time.
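                            Here is the kind of graphical exploration and spline fallback I have in mind for point 2 (a sketch only; tp == 0 marks the pre-intervention period):

                            Code:
                            * mean log consumption by heatscore, pre-intervention only
                            preserve
                            keep if tp == 0
                            collapse (mean) lconsum, by(heatscore)
                            scatter lconsum heatscore
                            restore

                            * if no simple transformation straightens the relationship,
                            * a restricted cubic spline is one option; hs1, hs2 would
                            * then replace heatscore in the model
                            mkspline hs = heatscore, cubic nknots(3)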

                            • #44
                              "I should add that in economics there is, from what I have seen, a strong preference for consistent estimators and less regard for efficiency, so that random-effects models are nearly always rejected if they do not pass the Hausman test. In some other fields the traditions are different, and if the results of the two estimators appear reasonably similar, a random-effects model will be used even if it fails the Hausman test (particularly if the sample size is very large, so that the Hausman test has the power to pick up tiny but immaterial departures from the assumptions)."
                              Clyde Schechter I've always found this interesting and wondered why that is. Coming from an economics background, [anecdotally] I've noticed a very strong preference for FE estimation regardless of what the Hausman test suggests, whereas I tend to see RE more often in health-related fields.
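                              For readers who want to run the comparison being discussed, the standard Stata workflow is (a sketch with placeholder variables y, x and panel identifier id):

                              Code:
                              xtset id
                              xtreg y x, fe
                              estimates store fixed
                              xtreg y x, re
                              estimates store random
                              hausman fixed random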

                              • #45
                                Since I find myself across both of these research fields, my gut feeling is that economists are more interested in the time-series dimension of the panel (e.g., changes in the unemployment rate within the same country across years), whereas health scientists mainly focus their research on the cross-sectional dimension (e.g., differences in a given health-related quality-of-life score between different patients across years).
                                Kind regards,
                                Carlo
                                (StataNow 18.5)
