
  • Interaction terms vs. sample split

    Dear Stata forum,

    I'm currently working on my final thesis and I have the following setting:

    - Panel dataset with 172,000 firm-year observations
    - Interaction terms consisting of a dummy variable that equals one under certain conditions, multiplied by the cash flow of the specific firm in the specific year (N = 27,062 when dummy == 1)
    - I'm running a fixed-effects regression with xtreg, trying to test whether the existence of this condition has an influence on my dependent variable, Investment

    Investment = b1*CF + b2*(interaction term = CF*dummy) + b3*controls + u


    However, when I run the regression in Stata, the interaction term is always insignificant. I have tried various settings and definitions of the interaction term, but the t-values are always very low or near zero, indicating that adding the interaction term to my regression does not explain anything about my dependent variable.

    I think the insignificance has something to do with the distribution of the subsample behind the interaction term and the number of observations. When I run my model separately on the two subsamples in a sample split (when the dummy is one and when it is zero) with:

    Investment = b1*CF + b2*controls if dummy==1

    b1 is significant and lower than in

    Investment = b1*CF + b2*controls if dummy==0
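
    In Stata terms (a minimal sketch with placeholder variable names, assuming the panel is already xtset), the two approaches I mean are roughly:

    Code:
    * interaction approach: one regression on the full sample
    gen interaction = CF*dummy
    xtreg Investment CF interaction dummy control1 control2, fe

    * sample-split approach: separate regressions on each subsample
    xtreg Investment CF control1 control2 if dummy==1, fe
    xtreg Investment CF control1 control2 if dummy==0, fe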

    My first question is: Why is my interaction term always insignificant? I know this can have many causes, but I think (as written above) it has something to do with different distributions being mixed together.


    My second question is: Can I get a useful statement when I estimate the equation separately for the different subsamples, as formulated above?

    I'm really confused here and I hope someone can explain to me what is going wrong.

    Thanks a lot for your help!

    regards

    Johann



  • #2
    Johann:
    as per the FAQ, please report what you typed and what Stata gave you back (within code delimiters, please; they are the #-labelled option among the Advanced editor capabilities. Thanks).
    You seem to be expecting that the interaction included in your regression will turn out to be statistically significant.
    I'm not clear whether the literature in your research field makes you confident about that result, or something else does.
    As you reported, statistical insignificance may have different causes, including that the interaction actually has no effect on your dependent variable (a result that I would consider just as good as statistical significance, anyway).
    As far as subsample regressions are concerned, their acceptability depends on the customary rules for statistical analysis in your research field.
    Last edited by Carlo Lazzaro; 29 Mar 2015, 07:04.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Dear Carlo,

      thanks a lot for your answer. I know that insignificance may be just as justifiable as significance. I'm only worried that the interaction terms are not correct for my model and that a sample split might be more appropriate. Regarding the latter: there are authors in corporate finance who do sample splits, and some who use interaction terms. Anyway, I thought there might be a misspecification in my model, so below I report the code I'm using with the respective results:

      Code:
      xtreg I CF l.Q interaction Dummy control1 control2 i.year, fe cluster(number)
      
      Fixed-effects (within) regression               Number of obs      =     78292
      Group variable: number                          Number of groups   =     11803
      
      R-sq:  within  = 0.1261                         Obs per group: min =         1
             between = 0.1379                                        avg =       6.6
             overall = 0.1381                                        max =        16
      
                                                      F(21,11802)        =    124.66
      corr(u_i, Xb)  = 0.0246                         Prob > F           =    0.0000
      
                                   (Std. Err. adjusted for 11803 clusters in number)
      -------------------------------------------------------------------------------
                   |               Robust
                 I |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+-----------------------------------------------------------------
                CF |   .1653265   .0090504    18.27   0.000     .1475862    .1830667
                   |
                 Q |
               L1. |   .0149219    .000582    25.64   0.000     .0137812    .0160626
                   |
       interaction |  -.0157672   .0100519    -1.57   0.117    -.0354706    .0039362
             Dummy |   .0026931   .0011209     2.40   0.016      .000496    .0048901
          control1 |   .0188166   .0034513     5.45   0.000     .0120515    .0255817
          control2 |  -.0116199   .0012696    -9.15   0.000    -.0141085   -.0091314
                   |
              year |
              1999 |  -.0124077    .001466    -8.46   0.000    -.0152813    -.009534
               ... |
      -------------+-----------------------------------------------------------------
           sigma_u |  .06341354
           sigma_e |   .0445441
               rho |  .66960421   (fraction of variance due to u_i)
      -------------------------------------------------------------------------------
      As you can see, I use a firm fixed-effects model with a dummy for each year (1999 to 2013) and cluster by number, which is an identifier for each firm. Maybe clustering by firm is not appropriate; I do this because I have autocorrelation and heteroskedasticity problems. What problems might occur with interaction terms that vary over time in a fixed-effects model with clustering?

      thanks a lot

      Johann



      • #4
        Well, your model with interaction gives a negative coefficient for interaction. So that model, like the sample-split approach, is saying that the effect of CF on I is lower when Dummy == 1. These results seem consistent, although you don't say how the interaction coefficient compares to the difference in the values of b1 in the split samples. It seems, however, that you are disappointed that the interaction effect is not statistically significant.

        There is nothing obviously wrong with your model, and your command syntax seems to correctly implement it. Your sample size is ample, and you even have a reasonable number of observations per firm. Clustering on firms as you have done is quite usual in this type of work, and almost certainly necessary.

        I'm a little concerned about missing data here. While your estimation sample, at 78,292 observations on 11,803 firms, is quite generous, it is less than half of your full data set, which you say numbers about 172,000 observations. What's that about? It may be that even though you have a nice distribution of Dummy in your full data set, after you drop most of the data to get to the regression sample, the remaining observations may have a lopsided distribution of Dummy. If that's true, your power to estimate the interaction will be poor. Have you run -tab Dummy if e(sample)- after the regression? You should also look at the distribution of CF within each level of Dummy in the estimation sample (e.g. tabstat CF if e(sample), by(Dummy)); if there is very little variation of CF within one (or both) groups, then the interaction will be difficult to estimate precisely.
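
        In code delimiters, those two checks would be something like this (a sketch; Dummy and CF as named in your posted output):

        Code:
        * distribution of Dummy in the estimation sample
        tab Dummy if e(sample)

        * summary of CF within each level of Dummy in the estimation sample
        tabstat CF if e(sample), by(Dummy) statistics(n mean sd min max)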

        A purely stylistic note that is in no way related to the questions you raise: since you are using factor variables for the year effects, why not go whole hog and use them for the dummy and interaction variables as well? When you come back to look at this a year from now, you will need to rummage around in your code to figure out which variables "interaction" is the interaction of. So why not code this transparently as:

        Code:
        xtreg I c.CF##i.Dummy L.Q control1 control2 i.year, fe cluster(number)
        and also reap the benefit of being able to use -margins- in your follow-up to these results.
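
        For example, after the factor-variable specification above, one possible follow-up (a sketch) would be:

        Code:
        * marginal effect (slope) of CF at each level of Dummy
        margins Dummy, dydx(CF)

        * contrast of the two slopes, i.e. the interaction effect itself
        margins r.Dummy, dydx(CF)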

        (By the way, I assume that in your actual work you are not using uninformative names like Dummy, control1 and control2, but have done that just for the purposes of posting here. But if that's not true, it is better style to use names that have some mnemonic value--again, you may have to look at and understand this code sometime in the future when you won't remember much detail.)

        Finally, where does your insistence on a statistically significant result for this interaction come from? Is this a phenomenon that is well-attested in the literature in this field with consistent results in replicated studies? If not, it may well be that your data are giving you a correct answer to the question you have asked of it, just not the one you were hoping for.



        • #5
          Dear Clyde,

          thanks a lot for your answer. It helped me a lot, both (a) to verify that I'm using the correct code and (b) to think about the missing data issue. In my last post I had already dropped a number of observations, arriving at a sample of about 114,000 observations. Anyway, missing data does seem to be an issue, as you pointed out. The first picture attached shows what values the dummy took when estimating the model, while the second shows the distribution of CF after estimation. As you suggested, there seems to be very little variation of CF in both groups. Does this explain the insignificance?
          You are also right about the style; I have adjusted that in my model. The variable names in my real dataset are different and were changed on purpose to post them here.
          There are previous studies in my field that suggest a significant effect of that particular interaction term, but in a very different setting: they use GMM instead of OLS, a different time frame, different countries, etc. So I guess I just expected to find the same effect here, but it seems it doesn't exist in my dataset.

          regards

          Johann



          • #6
            Regarding your initial post:

            (1) Investment = b1*CF + b2*(interaction term= CF*dummy) + b3*controls + u

            is not the same as

            (2) Investment = b1*CF + b2*controls if dummy==1

            (3) Investment = b1*CF + b2*controls if dummy==0

            since all other variables in your model (1) are not interacted with dummy (0/1). In model (1) you force all observations to have the same coefficients on the control variables (and the other independent variables), while in (2) and (3) these variables are allowed to have different coefficients, so the solution probably lies in the controls.
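
            To make the comparison concrete, a fully interacted version of model (1), which lets the slopes of CF and the controls differ by group much as the sample split does (a sketch using the placeholder names from the posted regression; only the firm and year effects remain common to both groups), would be something like:

            Code:
            * generate the lag separately so it can be used in the factor-variable interaction
            gen lagQ = L.Q

            * Dummy interacted with CF and all controls; firm FE and year dummies stay common
            xtreg I i.Dummy##(c.CF c.lagQ c.control1 c.control2) i.year, fe cluster(number)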



            • #7
              Thanks, Harald, for your answer. I estimated model (1) without my controls and get significant results (at the 5% level) there. But I don't think these results are valid because I left out my controls. I also tried other common controls but get insignificant results. So I guess you made a good point about the distribution of my controls in the whole sample versus the subsamples.



              • #8
                Well, I guess there was some miscoding in my control variable. I combined the advice of Clyde and Harald: first I estimated the model, then looked at the different distributions of my control variables via
                Code:
                tabstat controlvar1 if e(sample), by(Dummy)
                and saw that the problem was actually driven by many missing values in the cases where my dummy was 1. I adjusted the generation of the control variable and hence improved my results.
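
                For anyone running into the same issue, a quick way to spot missing values by group is something like (placeholder names again):

                Code:
                * count missing values of the control within each level of the dummy
                bysort Dummy: count if missing(controlvar1)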

                Thanks a lot, everybody, for the help and suggestions. I guess sometimes I just have to take a look at the code twice...
                regards
                Johann

