Interaction term w/o including one of the variables

fatima ht

Join Date: Aug 2015

Posts: 15
#1

29 Apr 2016, 03:03

Hi,

I have a panel of firms observed over 8 years, split into two sub-samples, treated and control. I want to see the effect of treatment on an outcome variable Y by regressing Y on the treatment dummy variable D.

The thing is that I have a third variable Z, which is only defined for treated firms. In other words, it is missing for the firms in the control sample. I want to compare the effect of treatment on Y for different values of Z. My question is what is the best regression to run?

1- reg Y D D*Z (on the full sample and assigning any value to Z for observations in the control sample. Since D=0 for the control sample, the value assigned to Z does not matter)
2- First: reg Y D (on the full sample) and second: reg Y Z (on only the sample of treated firms)
3- reg Y D D*Z Z (on the full sample and assigning 0 to Z for observations in the control sample)

I personally like the first one best and dislike number 3, because assigning 0 to Z and including it in the regression can introduce substantial bias. But I am not sure whether leaving Z out of number 1 is correct.

Thank you very much for your help.
Fatima

Last edited by fatima ht; 29 Apr 2016, 03:11. Reason: interaction missing dummy treated control
Tags: control, interaction, missing, panel, treatment
Clyde Schechter

Join Date: Apr 2014

Posts: 28603
#2

29 Apr 2016, 09:42

I would go with your first option. Here's the logic. Your model is E(Y|D=1, Z) = b0 + b1D, where b1 = c0 + c1Z, and E(Y|D=0, Z) = b0. These equations easily combine to the result E(Y|D, Z) = b0 + c0D + c1D*Z. The main obstacle to estimating this is that missing values of Z would exclude observations with D = 0. But, as you note, when D = 0, D*Z = 0 no matter what Z is, so you can just set Z to some arbitrary value to keep those observations in the estimation sample. By the way, you should use factor variable notation to get the interaction term and not calculate it yourself. So I would code this as -reg Y i.D i.D#i.Z- if Z is categorical, or -reg Y i.D#c.Z- if Z is continuous.
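
A minimal sketch of the placeholder argument, on simulated data. Everything here is an assumption for illustration only: the variable name Zfill, the sample size, and the coefficient values are hypothetical. The continuous-Z model is written as i.D plus 1.D#c.Zfill so that the command maps term by term onto b0 + c0D + c1D*Z.

```stata
* Simulated illustration; names and numbers are hypothetical
clear
set seed 12345
set obs 400
generate byte D = _n > 200                    // 200 controls, then 200 treated
generate double Z = rnormal() if D == 1       // Z is missing for controls
generate double Zfill = cond(D == 1, Z, 0)    // placeholder 0 for controls
generate double Y = 1 + 2*D + 0.5*D*Zfill + rnormal()

* Fit with placeholder 0 in the control group
regress Y i.D 1.D#c.Zfill
estimates store fill_0

* Refit with a very different placeholder in the control group
replace Zfill = 99 if D == 0
regress Y i.D 1.D#c.Zfill
estimates store fill_99

* Side-by-side comparison of coefficients and standard errors
estimates table fill_0 fill_99, b(%9.4f) se(%9.4f)
```

Because 1.D#c.Zfill equals zero whenever D = 0, the regressor is the same column under both placeholders, so the two sets of estimates should coincide.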

But there may be other approaches, depending on what D and Z are. If, for example, Z is undefined in controls because Z is some sort of dosage or intensity of D, then a better model would just be -regress Y Z-, with Z set to 0 for control firms. There are lots of other situations where a different model might be appropriate, and without knowing what D and Z actually are, it's impossible to advise in greater detail.

All of this said, things get more complicated with panel data. Was the treatment in effect during all 8 years in the treatment group? If not, then you really have the data for a difference-in-differences estimator, and you should do it that way. That would mean something like -xtreg Y i.D##i.pre_post i.D#i.pre_post#i.Z- (or c.Z as the case may be).

If the treatment was in effect during all 8 years, then you do not have a difference-in-differences design, just two cohorts with longitudinal data. If you use -xtreg, fe-, D will be constant within firms, and D will be dropped from estimation due to collinearity. But you can't afford to have that happen here because D is your treatment effect conditional on Z = 0, which is important here. So you will need to use a random-effects or between-effects model to avoid this.
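
A hedged sketch of the contrast between the two estimators when D is constant within firms. The panel is simulated, and the names firm, year, and u, the panel dimensions, and the coefficient values are hypothetical. Z is set to 0 for control firms as a placeholder.

```stata
* Simulated panel: 100 hypothetical firms observed for 8 years
clear
set seed 2016
set obs 100
generate int firm = _n
generate byte D = firm > 50                      // treated in every year
generate double u = rnormal()                    // firm-level effect
expand 8
bysort firm: generate int year = 2000 + _n
generate double Z = cond(D == 1, rnormal(), 0)   // placeholder 0 for controls
generate double Y = 1 + 2*D + 0.5*D*Z + u + rnormal()
xtset firm year

* Random effects: the coefficient on the time-invariant D is estimated
xtreg Y i.D 1.D#c.Z, re vce(cluster firm)

* Fixed effects: D does not vary within firm, so Stata omits 1.D
xtreg Y i.D 1.D#c.Z, fe vce(cluster firm)
```

In the fixed-effects output only the interaction survives, which is why that estimator cannot recover the treatment effect at Z = 0 in this design.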
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

29 Apr 2016, 09:52

Maybe I didn't get it right, but I fail to envisage a solution for this issue. First, because you have missing data for all observations under a certain treatment group. Second, because your missing data are absolutely MNAR (missing not at random). I really hope you get more insightful advice here in the Forum. Meanwhile, considering you have panel data, please keep in mind you are supposed to use the - xtreg - instead of the - regress - command.

P.S. I wrote this before being aware of Clyde's message.

Best regards,

Marcos
fatima ht

Join Date: Aug 2015

Posts: 15
#4

29 Apr 2016, 10:22

Thank you very much Clyde for your detailed and complete response. It was very helpful.

Let me explain a bit more what the variables are. D is an event in the firm, which can happen to any firm at any time during the 8-year period. More specifically, it is the realization of a tax obligation. Z can be different things, such as the actual value of the tax, or the characteristics of the event that caused the tax obligation. As long as the event does not happen, Z is not defined or is irrelevant to Y.
D=1 only in the year that the tax obligation is realized and 0 in other years.

I think regressing Y on Z does not estimate what I am looking for. I know (from regressing Y on D) that the event itself has a positive effect on the outcome. But I need to know whether low or high values of Z (dummy or continuous) mitigate or amplify the effect of the event.

Regarding the DiD approach, there is no pre-post dummy other than D itself (before or after the event). I guess I cannot have a firm fixed effect because Y is liquidation and it can only happen once to the firm. But I'm not sure. The event can happen in multiple years. I just cluster at the firm level.

One approach I take is to drop all the observations with D=0 for the firm that is ever treated, and keep only the exact year the treatment happens. (i.e. drop observations with D=0 and keep observations with D=1 if there is at least one observation with D=1 for the firm).
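
The sample restriction described above could be sketched as follows. The toy data and the names firm, year, and ever_treated are hypothetical; only D comes from the thread.

```stata
* Toy data: firm 1 has the event in 2009, firm 2 never has it
clear
input int firm int year byte D
1 2008 0
1 2009 1
1 2010 0
2 2008 0
2 2009 0
2 2010 0
end

* Flag firms that ever have D == 1
bysort firm: egen byte ever_treated = max(D)

* For ever-treated firms keep only the event years; keep all control rows
drop if ever_treated == 1 & D == 0

list, sepby(firm)
```

In this toy example firm 1 keeps only its 2009 row and firm 2 keeps all three rows.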

Another approach is to keep all the observations for all the firms and cluster at the firm level. The results from the two approaches are very similar.

Thanks again,
Fatima
fatima ht

Join Date: Aug 2015

Posts: 15
#5

29 Apr 2016, 10:25

Originally posted by Marcos Almeida

Maybe I didn't get it right, but I fail to envisage a solution for this issue. First, because you have missing data for all observations under a certain treatment group. Second, because your missing data are absolutely MNAR (missing not at random). I really hope you get more insightful advice here in the Forum. Meanwhile, considering you have panel data, please keep in mind you are supposed to use the - xtreg - instead of the - regress - command.

P.S. I wrote this before being aware of Clyde's message.

Thank you Marcos, and sorry for being unclear. Maybe my second message makes it clearer. I use xtreg; I just typed reg for simplicity. But thanks for the reminder!

Fatima
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#6

30 Apr 2016, 07:28

Thank you for clarifying the issue, Fatima.

With regard to the specific commands in Stata, my comment on - xtreg - versus - regress - related to the FAQ's topic 10.

Since you started posting quite recently, I decided to share the excerpt:

Don't say "I ran a regression and then ...", say "I ran regress and then ...".

As I said in #3, I really hope you get a (fine) way out of this challenging issue.

Kind regards,

Marcos
