  • Collinearity problem in triple difference regression using panel data

    Hello Stata community,

    I am experiencing a perfect collinearity issue in a triple difference regression. Specifically, the triple difference estimator is perfectly collinear with one of the double interaction variables, and I'm not sure why.

    I am analyzing panel data from 2001-2014, where the unit of analysis is the university/year (n = 31,182). I am interacting the following three variables: type (e.g., public vs. private university), treatment (received at the state level), and post-period (which begins at different times for each of the treated states). I am using Stata 14.1 on Windows 10.

    My code and output are as follows:

    *Generate three individual dummy variables*

    gen Type = 0
    replace Type = 1 if campus_type=="Public"

    gen Treat = 0
    replace Treat = 1 if state=="CO"
    replace Treat = 1 if state=="ID"
    replace Treat = 1 if state=="KS"
    replace Treat = 1 if state=="MS"
    replace Treat = 1 if state=="OR"
    replace Treat = 1 if state=="UT"
    replace Treat = 1 if state=="WI"

    gen Post = 0
    replace Post = 1 if state=="CO" & year>=2011
    replace Post = 1 if state=="MS" & year>=2012
    replace Post = 1 if state=="KS" & year>=2014
    replace Post = 1 if state=="OR" & year>=2012
    replace Post = 1 if state=="UT" & year>=2007
    replace Post = 1 if state=="WI" & year>=2012

    *Generate interaction variables*

    gen TypeXTreat = (Type*Treat)
    gen TypeXPost = (Type*Post)
    gen TreatxPost = (Treat*Post)
    gen Triplediff = (Type*Treat*Post)

    *Conduct regression*

    areg rate_y Treat Post Type TypeXTreat TypeXPost TreatxPost Triplediff, absorb(year) r


    Linear regression, absorbing indicators          Number of obs   =     31,183
                                                     F(   5,  31164) =       8.46
                                                     Prob > F        =     0.0000
                                                     R-squared       =     0.0005
                                                     Adj R-squared   =    -0.0001
                                                     Root MSE        =    27.8297

    ------------------------------------------------------------------------------
                 |               Robust
          rate_y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           Treat |   -.781114   .3308301    -2.36   0.018    -1.429554   -.1326737
            Post |  -.1396623   .3242237    -0.43   0.667    -.7751538    .4958292
            Type |   -.991002   .2478075    -4.00   0.000    -1.476715   -.5052894
      TypeXTreat |   .6589614   .3017063     2.18   0.029      .067605    1.250318
       TypeXPost |   .2637135   .1993894     1.32   0.186    -.1270977    .6545247
      TreatxPost |          0  (omitted)
      Triplediff |          0  (omitted)
           _cons |   1.260263   .2462696     5.12   0.000     .7775652    1.742962
    -------------+----------------------------------------------------------------
            year |   absorbed                                   (14 categories)
    ------------------------------------------------------------------------------


    I can remove the collinearity problem by muting one of the assignments of the Treat variable. For example, changing one line in the above code to the following fixes the problem:

    *replace Treat = 1 if state=="CO" // (e.g. mute this line of code, no more collinearity)

    Maybe I've been staring at the data too long, but I don't understand why there is a collinearity problem before muting one such line. Omitting that assignment means my model no longer represents reality, so I'd like to find a way around the issue if possible. Thanks in advance for any insights here.

    Sincerely,
    Jon


  • #2
    Have you checked the joint distribution of Treat, Type, & Post in the estimation sample? You may find the answer there.

    Code:
    table Treat Type Post if e(sample), by(year)
    Also, there is no need to multiply out your own interaction variables. And by doing so, you lose the ability to use -margins- after estimation (once you get this problem sorted out). Use factor variable notation instead (-help fvvarlist- for details) and your regression command simplifies to

    Code:
    areg rate_y i.Treat##i.Type##i.Post, absorb(year) r
    It is also possible that when you do it the factor variable way, Stata will resolve the collinearity in some other way, and the way it chooses to do that may also help you identify its cause.



    • #3
      Clyde,

      Thank you so much for your response. That helps me understand the problem. Unfortunately the factor variable notation does not resolve the collinearity issue because the problem stems from the way I've coded the dummy variables. I would be happy to see if there is an alternative specification strategy you might be able to recommend, but as a disclaimer that has more to do with research design than Stata.

      In any case, to illuminate the issue, the output from the suggested areg command is attached (please forgive the novice information-sharing approach):

      areg rate_y i.Treat##i.Type##i.Post, absorb(year) r
      (see attached photo)

      Regarding the regression output, I understand why Treat#Post has no observations in the 0 1 condition. This is because I only assigned Post = 1 to observations where Treat = 1. In other words, no observations received Post unless they also received Treat. This also explains why Treat#Type#Post is empty in the 0 0 1 and 0 1 1 conditions. I think this is the crux of my issue...
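      Row by row, the implication Post = 1 ⇒ Treat = 1 forces Treat*Post to equal Post, and Type*Treat*Post to equal Type*Post, which is exactly the pair of columns Stata omitted. A minimal Python check on hypothetical rows (a toy sketch, not the actual data) illustrates this:

```python
# Toy rows following the coding scheme in #1: Post = 1 only for treated states.
rows = [
    # (Treat, Type, Post)
    (0, 0, 0), (0, 1, 0),   # control states: Post is always 0
    (1, 0, 0), (1, 1, 0),   # treated states, pre-period
    (1, 0, 1), (1, 1, 1),   # treated states, post-period
]

for treat, type_, post in rows:
    assert treat * post == post                   # TreatxPost duplicates Post
    assert type_ * treat * post == type_ * post   # Triplediff duplicates TypeXPost

print("TreatxPost == Post and Triplediff == TypeXPost in every feasible cell")
```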

      The following is more a discussion on research design than Stata. I took the above approach initially because I am interested in capturing the impact of changes that occurred 1) at the state level, 2) only in some states and not others, and, critically 3) at different points in time within the treatment states. In other words, there is not a single point in time when Pre becomes Post; rather, treatment states experience their Post assignment in different years. I think the solution is that I need to code the Post variable into the control observations as well, but I'm not sure what year to use to do this given that there is no single common year of pre/post transition.

      Many thanks for the insights you have already provided, and for any more you may be willing to share on my research design.

      Sincerely,
      Jon



      • #4
        I think this is the crux of my issue...
        Indeed it is. And once Treat#Post is collinear with the fixed effect, the three-way interaction is doomed as well.
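        To make that concrete: with Post coded only inside the treated group, TreatxPost duplicates Post and Triplediff duplicates TypeXPost, so the design matrix has two redundant columns. A sketch on hypothetical data (numpy assumed available; the actual estimation sample behaves the same way):

```python
import numpy as np
from itertools import product

# Every feasible (Treat, Type, Post) cell under the thread's coding:
# Post can be 1 only where Treat is 1, so two of the eight cells never occur.
rows = [(t, ty, p) for t, ty, p in product([0, 1], repeat=3)
        if not (p == 1 and t == 0)]
treat, type_, post = (np.array(col) for col in zip(*rows))

X = np.column_stack([
    np.ones(len(rows)),      # constant
    treat, post, type_,
    type_ * treat,           # TypeXTreat
    type_ * post,            # TypeXPost
    treat * post,            # TreatxPost == Post (duplicate column)
    type_ * treat * post,    # Triplediff == TypeXPost (duplicate column)
])

print(X.shape[1], np.linalg.matrix_rank(X))  # 8 columns, rank 6
```

The two duplicate columns are why Stata must drop two terms, whichever parameterization is used.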

        I think the solution is that I need to code the Post variable into the control observations as well, but I'm not sure what year to use to do this given that there is no single common year of pre/post transition.
        Yes indeed, this is how difference-in-differences designs founder when there is not a fixed, identifiable start time for the intervention. You don't really have a complete difference-in-differences design here.

        You have a few options here, all of which are different attempts to implement the same underlying principle: the switching-time for Post in the control group should be what the switching time would have been if the same panel entity had been part of the treatment group.

        It isn't always possible to know this. But sometimes it is. Depending on how much information is available about the determinants of the start date in the intervention entities:

        1. There may be some systematic rule, based on observed attributes of the panel entities, that determines when the start date of the intervention was, and that same rule might be applied to the control entities to determine putative start dates had they had the intervention.

        2. There may be no deterministic rule, but there may be attributes of the entities that are associated with the start date, at least probabilistically. In that case, it may be possible to create matched pairs between the intervention and control entities based on those attributes, and assign to each control entity a "start date" equal to the actual start date of its matched intervention entity. The more strongly those attributes are related to the start date in the intervention group, the better. By the way, it is not necessary that there be equal numbers of intervention and control entities to use this approach. If #intervention > #control, some of the intervention entities can be left unmatched. If #intervention < #control, you can match multiple controls to the same intervention entity, provided they are all good matches. This method can founder if there are numerous control entities for which no reasonable matched intervention entity can be found. (This is analogous to the problem of common support when doing propensity score matching.)

        3. If there really is nothing at all to go on, you can assign each non-intervention entity a "start date" drawn at random from the distribution of the actual start dates in the intervention group.

        Obviously approach 1 is stronger and gives more believable results than 2, and 2 more so than 3. Whether reviewers in your discipline will accept these approaches, I cannot say, and you might want to see if others have used them in publications in your field before investing much effort into these approaches.
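        Approach 3 can be sketched in a few lines. The adoption years below follow the Post coding in #1; the control-state names are placeholders, not the actual control group:

```python
import random

random.seed(12345)  # make the sketch reproducible

# Adoption years as coded in the Post variable in #1.
treated_start_years = {"CO": 2011, "MS": 2012, "KS": 2014,
                       "OR": 2012, "UT": 2007, "WI": 2012}

# Placeholder names standing in for the untreated states.
control_states = ["AL", "AZ", "NY"]

# Approach 3: give each control state a putative start year drawn at random
# from the empirical distribution of treated start years.
pool = list(treated_start_years.values())
putative_start = {state: random.choice(pool) for state in control_states}

# A control observation's Post would then be 1 whenever year >= its draw.
for state, start in sorted(putative_start.items()):
    print(state, start)
```

In Stata terms, one would merge the drawn years onto the control observations and code Post there the same way it was coded for the treated states (e.g., replace Post = 1 if year >= start_year, with start_year a hypothetical variable holding the draw).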



        • #5
          Clyde,

          You're right - I now see that there is no real difference-in-differences design at all. Thank you for helping me come to that understanding. Further disclaimer: this post is entirely related to research design and not Stata technique, and I understand if you consider it too off base for this forum to warrant discussion.

          Regarding your three proposals for assigning the intervention to the control entities: the intervention I am trying to model is passage of state legislation, squarely in the field of social science. Thus, I would characterize these events as non-random, but not ones that can be imputed from observed attributes of other entities, or matched to putative start dates. In other words, the forces at work here are (presumably) independent choices by state governments, so unfortunately, though the first and second approaches seem both creative and smart, I don't see intuitively that they are valid for my issue. The third approach would certainly be possible but, as you mentioned, is the least desirable in the hierarchy.

          Which leads me to another way of addressing the issue: can there be a triple-difference approach flexible enough to allow for interventions that occur at different times for different entities? For example, I know the standard, non-flexible diff-in-diff model that assumes one distinct intervention time is:

          1) Y_ist = α + β1(Treat) + λ1(Post) + δ1(Treat×Post) + ε_ist

          While a model that allows (I think) for interventions to occur at different times would be:

          2) Y_ist = α + β1(Treat) + λ1...k(year dummies) + δ1(Treat×Post) + ε_ist

          So, given that my model in its inflexible form is the following:

          3) Y_ist = α + β1(Treat) + β2(Type) + β3(Post) + λ1(Type×Treat) + λ2(Type×Post) + λ3(Treat×Post) + δ1(Type×Treat×Post) + ε_ist

          Can the above model be re-written in such a way that allows for different states to experience the intervention at different times? Here is my (probably incorrect) attempt to extrapolate the same logic used between formulas 1) and 2):

          4) Y_ist = α + β1(Treat) + β2(Type) + β3...k(year dummies) + λ1(Type×Treat) + λ2...k(Type×year dummies) + λ3...k(Treat×year dummies) + δ1(Type×Treat×Post) + ε_ist

          Would be happy to hear your thoughts on the matter.

          Sincerely,
          Jon



          • #6
            Well, while I think your model 2 might well be used in a situation where the intervention begins in different years, I don't think it does what you want. Suppose you have a complete difference-in-differences design, with a real start date in both the treatment and control groups (in your social science context, one could imagine that the control group comprised states that voted on but rejected the legislation in question, the date of the vote being the start date). Then the only reason for including year dummies (or any other variables that might represent time) is to account for secular trends or annual shocks to the outcome variable. It doesn't really have anything to do with the variation in start dates. In fact, even where the intervention date is the same for all entities, one might still want to include some variables to model time if there is reason to believe that there are time-dependent forces that affect the outcome separately from the intervention.

            So I don't think adding time variables and their interactions with the treatment indicator will help you here. I think (though I haven't thought this part through too deeply) that you will get a model where you can estimate all the parameters: you will overcome your collinearity issue. But I don't see how you can use the resulting coefficients to estimate an intervention effect; they are only estimators of the differences between the two groups in each year, and year is not an adequate proxy for implementation of the intervention. At the end of the day, your data, and the models proposed in #5, lack the information needed to separately identify the responses of the control group under would-be intervention and actual control conditions.

            I would be interested to know what others who deal in this kind of problem think, too.

            By the way, while most of the posts on this Forum are about implementing analyses in Stata, design issues are perfectly appropriate here. The Forum is about statistics and Stata, and design is at the very heart of statistics.



            • #7
              Thanks for your thoughtful contributions to my issue. I will take some time to ruminate on the problem and see what, if anything, I can find in the literature to potentially model some solutions off of.
