
  • #46
    Maria:
1) under -xtreg-, you use robust/clustered standard errors when you suspect heteroskedasticity and/or autocorrelation (although the latter usually does not bite that hard with large N, small T panel datasets, for which -xtreg- is appropriate);
2) -help xtoverid- (now that you've installed -xtoverid- you can use -help- instead of -search- to select its helpfile) will point you to the necessary reference.
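For instance, a minimal sketch of that workflow, using the -nlswork- toy dataset that also appears later in this thread (the single covariate is illustrative only):
Code:
* install the user-written command once, then run the robust Hausman-type test
ssc install xtoverid
use http://www.stata-press.com/data/r15/nlswork.dta, clear
xtset idcode year
xtreg ln_wage tenure, re vce(cluster idcode)
xtoverid
A rejection of the null is usually read as evidence in favour of -fe- over -re-.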
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #47
Thank you very much.
When I try to check for heteroskedasticity with
      Code:
       rvfplot
      or
      Code:
       hettest
in my regression (xtreg), it will not work. It only works with -regress-... which is not the same, right? So far I could not find any information on that here.

      Comment


      • #48
        Maria:
        you can visually inspect your residual distribution (ei) to check for heteroskedasticity.
As you can see from the following toy example, with many observations even a minimal departure from normality can formally reject the null, whereas the visual inspection (with a superimposed normal plot) looks more reassuring:
        Code:
        . use http://www.stata-press.com/data/r15/nlswork.dta
        (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
        
        . xtreg ln_wage i.race tenure, vce(robust)
        
        Random-effects GLS regression                   Number of obs     =     28,101
        Group variable: idcode                          Number of groups  =      4,699
        
        R-sq:                                           Obs per group:
             within  = 0.0972                                         min =          1
             between = 0.2079                                         avg =        6.0
             overall = 0.1569                                         max =         15
        
                                                        Wald chi2(3)      =    1797.00
        corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
        
                                     (Std. Err. adjusted for 4,699 clusters in idcode)
        ------------------------------------------------------------------------------
                     |               Robust
             ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                race |
              black  |  -.1345322   .0120266   -11.19   0.000    -.1581039   -.1109605
              other  |   .1039944    .062132     1.67   0.094     -.017782    .2257708
                     |
              tenure |   .0376405   .0009364    40.20   0.000     .0358052    .0394758
               _cons |    1.59266   .0067239   236.86   0.000     1.579481    1.605838
        -------------+----------------------------------------------------------------
             sigma_u |  .33623102
             sigma_e |  .30357621
                 rho |  .55090591   (fraction of variance due to u_i)
        ------------------------------------------------------------------------------
        
        . predict e_res, e
        (433 missing values generated)
        
        . sfrancia e_res
        
                          Shapiro-Francia W' test for normal data
        
            Variable |       Obs       W'          V'        z       Prob>z
        -------------+-----------------------------------------------------
               e_res |    28,101    0.92713   1077.971    19.615    0.00001
        
        Note: The normal approximation to the sampling distribution of W'
              is valid for 10<=n<=5000 under the log transformation.
        
        . histogram e_res, normal
        (bin=44, start=-1.8595626, width=.11358524)
        Last edited by Carlo Lazzaro; 29 Dec 2017, 11:11.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #49
          Dear Carlo,
I have a question about the regression code you suggested to me.
          My old code reads:
          Code:
           xtreg RDlog POST_FINE_DUMMY LENIENCY_DUMMY post_len_inter fine_category fine_cat_inter i.year , fe vce(robust)
with the variables POST_FINE_DUMMY, LENIENCY_DUMMY, fine_category and the interactions between POST_FINE_DUMMY and LENIENCY_DUMMY, as well as POST_FINE_DUMMY and fine_category.

I used
Code:
gen index = 0
replace index = 1 if ...
to recode fine_category, so I have it as a factor variable.
The new code reads:

          Code:
 xtreg RDlog POST_FINE_DUMMY##LENIENCY_DUMMY POST_FINE_DUMMY##index i.year, fe vce(robust)
          or
          Code:
xtreg RDlog i.POST_FINE_DUMMY##i.LENIENCY_DUMMY i.POST_FINE_DUMMY##i.index, fe vce(robust)
However, with the new code, my POST_FINE_DUMMY turns insignificant...
For the old code, I created the interaction terms by
Code:
gen post_len_inter = POST_FINE_DUMMY * LENIENCY_DUMMY
The fine_category variable merely got recoded and turned into the index variable. Why does the significance of the variables change?

          Comment


          • #50
            Maria:
a change in statistical significance (for what it's worth) may depend on the absence of conditional main effects in your previous code.
            Compare in detail the old vs the new regression code and see where they differ.
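For instance, a sketch with your own variable names (the point being that -##- adds the conditional main effects automatically):
Code:
* hand-made product term: the main effects enter only if you list them yourself
gen post_len_inter = POST_FINE_DUMMY * LENIENCY_DUMMY
xtreg RDlog POST_FINE_DUMMY LENIENCY_DUMMY post_len_inter i.year, fe vce(robust)

* factor-variable syntax: main effects and interaction in one go
xtreg RDlog i.POST_FINE_DUMMY##i.LENIENCY_DUMMY i.year, fe vce(robust)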
            Kind regards,
            Carlo
            (Stata 19.0)

            Comment


            • #51
              Dear Carlo,
That's strange. The two regressions should be the same.
One includes only the interaction terms that I created by multiplying the two variables (is that correct to do?),
and the other one includes them the way you suggested, using
Code:
##
Also, looking at the output (exported with outreg2), it looks very confusing (just looking at the interaction terms):
Code:
                                         (1)         (2)
VARIABLES                              RDlog       RDlog
1.POST_FINE_DUMMY                    0.105**     0.105**
                                    (0.0482)    (0.0482)
1o.LENIENCY_DUMMY                          -           -
0b.POST_FINE_DUMMY#0b.LENIENCY_DUMMY       0           0
                                         (0)         (0)
0b.POST_FINE_DUMMY#1o.LENIENCY_DUMMY       0           0
                                         (0)         (0)
1o.POST_FINE_DUMMY#0b.LENIENCY_DUMMY       0           0
                                         (0)         (0)
1.POST_FINE_DUMMY#1.LENIENCY_DUMMY   -0.0354     -0.0354
                                    (0.0905)    (0.0905)
1o.index                                   -           -
2o.index                                   -           -
0b.POST_FINE_DUMMY#0b.index                0           0
                                         (0)         (0)
0b.POST_FINE_DUMMY#1o.index                0           0
                                         (0)         (0)
0b.POST_FINE_DUMMY#2o.index                0           0
                                         (0)         (0)
1o.POST_FINE_DUMMY#0b.index                0           0
                                         (0)         (0)
1.POST_FINE_DUMMY#1.index             0.0667      0.0667
                                    (0.0846)    (0.0846)
1.POST_FINE_DUMMY#2.index             0.0432      0.0432
                                     (0.113)     (0.113)
1997.year                              0.142       0.142
                                    (0.0955)    (0.0955)
1998.year                           0.242***    0.242***
                                    (0.0907)    (0.0907)
1999.year                            0.225**     0.225**
                                    (0.0977)    (0.0977)
2000.year                           0.399***    0.399***
                                    (0.0990)    (0.0990)
2001.year                           0.343***    0.343***
                                     (0.107)     (0.107)
2002.year                           0.320***    0.320***
                                     (0.111)     (0.111)
2003.year                            0.261**     0.261**
                                     (0.111)     (0.111)
2004.year                            0.256**     0.256**
                                     (0.113)     (0.113)
2005.year                            0.252**     0.252**
                                     (0.117)     (0.117)
2006.year                            0.271**     0.271**
                                     (0.122)     (0.122)
2007.year                             0.240*      0.240*
                                     (0.129)     (0.129)
2008.year                              0.178       0.178
                                     (0.134)     (0.134)
2009.year                              0.166       0.166
                                     (0.137)     (0.137)
2010.year                              0.233       0.233
                                     (0.142)     (0.142)
2011.year                             0.272*      0.272*
                                     (0.149)     (0.149)
2012.year                            0.394**     0.394**
                                     (0.159)     (0.159)
2013.year                              0.136       0.136
                                     (0.173)     (0.173)
2014.year                              0.163       0.163
                                     (0.182)     (0.182)
2015.year                            0.517**     0.517**
                                     (0.211)     (0.211)
Constant                            18.81***    18.81***
                                     (0.117)     (0.117)

Observations                           1,446       1,446
R-squared                              0.096       0.096
Number of ID                             145         145
Time FE                                  YES
Year FE                                  YES

Robust standard errors in parentheses
*** p<0.01, ** p<0.05, * p<0.1

I don't think I have to include the year dummies when referring to them at the bottom, right?
Is there a better way to export output tables? What is the preferred and best way?
Thank you

              Comment


                • #53
                  Dear Carlo,
That's strange, the two regressions are the same...
For the first, I just created the interaction terms by multiplying the two variables (is that correct?).
For the second, I used your suggestion with
Code:
##
In the first there is a categorical variable measuring the level of the fine, and in the second the index I created.

Also,
when I run the regression as an RE instead of an FE model, two coefficients (that were dropped in the FE model) become highly significant. However, their interaction terms do not. Here, the interaction terms are of interest, right?
                  Code:
 xtreg RDlog POST_FINE_DUMMY##LENIENCY_DUMMY POST_FINE_DUMMY##index i.year, re vce(robust)
                  
                  Random-effects GLS regression                   Number of obs     =      1,446
                  Group variable: ID                              Number of groups  =        145
                  
                  R-sq:                                           Obs per group:
                       within  = 0.0954                                         min =          7
                       between = 0.2532                                         avg =       10.0
                       overall = 0.2934                                         max =         19
                  
                                                                  Wald chi2(26)     =     284.05
                  corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
                  
                                                                       (Std. Err. adjusted for 145 clusters in ID)
                  ------------------------------------------------------------------------------------------------
                                                 |               Robust
                                           RDlog |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------------------------+----------------------------------------------------------------
                               1.POST_FINE_DUMMY |   .0937966    .047636     1.97   0.049     .0004317    .1871615
                                1.LENIENCY_DUMMY |  -1.188936   .5741838    -2.07   0.038    -2.314315   -.0635563
                                                 |
                  POST_FINE_DUMMY#LENIENCY_DUMMY |
                                            1 1  |   -.032627   .0906532    -0.36   0.719     -.210304      .14505
                                                 |
                                           index |
                                              1  |  -1.414091   .3144783    -4.50   0.000    -2.030457   -.7977244
                                              2  |  -3.318151   .4542453    -7.30   0.000    -4.208456   -2.427847
                                                 |
                           POST_FINE_DUMMY#index |
                                            1 1  |   .0667658   .0847663     0.79   0.431    -.0993731    .2329047
                                            1 2  |    .043198   .1129048     0.38   0.702    -.1780913    .2644874
                                                 |
                                            year |
                                           1997  |   .1428365   .0964652     1.48   0.139    -.0462318    .3319048
                                           1998  |   .2415162   .0897411     2.69   0.007     .0656268    .4174055
                                           1999  |   .2279303   .0960447     2.37   0.018     .0396862    .4161744
                                           2000  |   .4015464   .0971418     4.13   0.000      .211152    .5919409
                                           2001  |   .3466999   .1053512     3.29   0.001     .1402154    .5531845
                                           2002  |   .3250426   .1089631     2.98   0.003     .1114788    .5386064
                                           2003  |   .2719324   .1089493     2.50   0.013     .0583956    .4854692
                                           2004  |   .2688308   .1111449     2.42   0.016     .0509907    .4866708
                                           2005  |     .26701   .1150357     2.32   0.020     .0415442    .4924758
                                           2006  |   .2871169   .1197246     2.40   0.016      .052461    .5217727
                                           2007  |   .2585619   .1270097     2.04   0.042     .0096274    .5074964
                                           2008  |   .1993598   .1317654     1.51   0.130    -.0588956    .4576152
                                           2009  |    .189391    .134246     1.41   0.158    -.0737264    .4525083
                                           2010  |   .2571608   .1396395     1.84   0.066    -.0165276    .5308492
                                           2011  |   .2989827   .1465978     2.04   0.041     .0116562    .5863091
                                           2012  |   .4223348   .1564944     2.70   0.007     .1156114    .7290581
                                           2013  |   .1673456   .1705159     0.98   0.326    -.1668593    .5015505
                                           2014  |   .1950685   .1806357     1.08   0.280     -.158971     .549108
                                           2015  |   .5490416   .2103324     2.61   0.009     .1367977    .9612854
                                                 |
                                           _cons |   19.86893   .2233757    88.95   0.000     19.43112    20.30674
                  -------------------------------+----------------------------------------------------------------
                                         sigma_u |  1.6148069
                                         sigma_e |  .33627256
                                             rho |  .95843715   (fraction of variance due to u_i)
                  Last edited by Maria Kohnen; 30 Dec 2017, 04:11.

                  Comment


                  • #54
I just want to end up with a simple regression... haha... this is becoming very frustrating.
All I want to test is the influence of a fine on the R&D expenses of a firm.

I have RDlog as the DV.
I have POST_FINE_DUMMY as a predictor, comparing the period before and after the fine.
BUT, since the data also contain firms that were granted full leniency and ultimately paid no fine at all, I want to include a LENIENCY_DUMMY with 0 = leniency and 1 = no leniency, and create an interaction between POST_FINE_DUMMY and LENIENCY_DUMMY to control for that and see if the fine is significant.
Further, I want to test if the level of the fine has an impact. I created the categorical variable small, medium and large fine and created the index variable accounting for it with 0, 1, 2.
Now I create an interaction term between the index variable and POST_FINE_DUMMY to check if it is relevant.
Does that make sense? If yes, what regression should I use?

                    in the first model i check:

                    Code:
 xtreg RDlog POST_FINE_DUMMY i.year, fe vce(robust)
just to see if there is an effect. Of course I know the full-leniency firms are included... so I check my second model, including leniency (and also the fine level, though I could make a third model from this):

                    Code:
                     xtreg RDlog POST_FINE_DUMMY LENIENCY_DUMMY post_len_inter fine_category fine_cat_inter i.year , fe vce(robust)
                    or
                    Code:
 xtreg RDlog POST_FINE_DUMMY##LENIENCY_DUMMY POST_FINE_DUMMY##index i.year, fe vce(robust)
Only POST_FINE_DUMMY on its own is significant.

If I run it as an RE model, the index variables are also significant. However, the interaction terms are never significant...

I don't know what is correct to use... maybe you have a suggestion?

                    thank you,
                    best

                    Comment


                    • #55
                      Maria:
1) I would implement the specification (fe or re) supported by the -xtoverid- output;
2) if the interaction is not significant, you can decide to remove it (see the sketch below);
3) making a methodological decision (and justifying it) is often not that easy. However, you're seemingly circling around that issue: hence, break the loop and implement your model!
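On point 2), a joint Wald test of the interaction terms can support that decision; a sketch with your variable names:
Code:
* joint test of all interaction terms after the -fe- estimation
xtreg RDlog i.POST_FINE_DUMMY##i.LENIENCY_DUMMY i.POST_FINE_DUMMY##i.index i.year, fe vce(robust)
testparm i.POST_FINE_DUMMY#i.LENIENCY_DUMMY i.POST_FINE_DUMMY#i.index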
                      Kind regards,
                      Carlo
                      (Stata 19.0)

                      Comment


                      • #56

Thank you, Carlo,
your feedback is much appreciated.

                        Comment


                        • #57
                          Dear Carlo,

                          I have a question about interpretation of my findings.

                          I ran the regression:

                          Code:
                           xtreg RDlog i.POST_FINE_DUMMY##i.LENIENCY_DUMMY i.POST_FINE_DUMMY##i.index i.year, fe vce(robust)
                          and received the output:

                          Code:
. xtreg RDlog i.POST_FINE_DUMMY##i.LENIENCY_DUMMY i.POST_FINE_DUMMY##i.index i.year, fe vce(robust)
                          note: 1.LENIENCY_DUMMY omitted because of collinearity
                          note: 2.index omitted because of collinearity
                          note: 3.index omitted because of collinearity
                          
                          Fixed-effects (within) regression               Number of obs     =      1,446
                          Group variable: ID                              Number of groups  =        145
                          
                          R-sq:                                           Obs per group:
                               within  = 0.0956                                         min =          7
                               between = 0.0973                                         avg =       10.0
                               overall = 0.0007                                         max =         19
                          
                                                                          F(23,144)         =       7.51
                          corr(u_i, Xb)  = -0.0796                        Prob > F          =     0.0000
                          
                                                             (Std. Err. adjusted for 145 clusters in ID)
                          ------------------------------------------------------------------------------
                                       |               Robust
                                 RDlog |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                          -------------+----------------------------------------------------------------
                          1.POST_FIN~Y |   .1048217   .0482441     2.17   0.031     .0094637    .2001798
                          1.LENIENCY~Y |          0  (omitted)
                                       |
                          POST_FINE_~Y#|
                          LENIENCY_D~Y |
                                  1 1  |  -.0353759     .09053    -0.39   0.697    -.2143153    .1435635
                                       |
                                 index |
                                    2  |          0  (omitted)
                                    3  |          0  (omitted)
                                       |
                          POST_FINE_~Y#|
                                 index |
                                  1 2  |   .0667084   .0845967     0.79   0.432    -.1005034    .2339202
                                  1 3  |   .0432342   .1125363     0.38   0.701    -.1792023    .2656706
                                       |
                                  year |
                                 1997  |   .1420507   .0955104     1.49   0.139    -.0467328    .3308342
                                 1998  |   .2417893   .0907284     2.66   0.009     .0624577    .4211208
                                 1999  |   .2254666   .0977008     2.31   0.022     .0323537    .4185795
                                 2000  |   .3989208   .0990069     4.03   0.000     .2032262    .5946153
                                 2001  |   .3433501   .1072809     3.20   0.002     .1313014    .5553989
                                 2002  |   .3195207   .1108762     2.88   0.005     .1003656    .5386757
                                 2003  |    .260894   .1112029     2.35   0.020     .0410931     .480695
                                 2004  |   .2556291   .1134245     2.25   0.026      .031437    .4798211
                                 2005  |    .252128   .1173151     2.15   0.033     .0202459    .4840101
                                 2006  |   .2709381   .1220501     2.22   0.028     .0296969    .5121794
                                 2007  |   .2402154   .1294255     1.86   0.065    -.0156037    .4960346
                                 2008  |   .1780891   .1343336     1.33   0.187    -.0874314    .4436095
                                 2009  |    .166175   .1366889     1.22   0.226    -.1040009    .4363509
                                 2010  |   .2325389   .1420704     1.64   0.104    -.0482739    .5133518
                                 2011  |   .2720732   .1491341     1.82   0.070    -.0227016    .5668479
                                 2012  |   .3936022   .1586722     2.48   0.014     .0799746    .7072298
                                 2013  |   .1361632   .1726021     0.79   0.431    -.2049979    .4773242
                                 2014  |   .1633292   .1820071     0.90   0.371    -.1964214    .5230799
                                 2015  |   .5168145   .2113234     2.45   0.016     .0991178    .9345112
                                       |
                                 _cons |   18.81188   .1168634   160.97   0.000     18.58089    19.04287
                          -------------+----------------------------------------------------------------
                               sigma_u |  2.0608159
                               sigma_e |  .33627256
                                   rho |  .97406464   (fraction of variance due to u_i)
                          ------------------------------------------------------------------------------
As my DV is a log variable (RDlog), and the predictor variable POST_FINE_DUMMY is a dummy comparing two periods of time (0 = 5 years pre-fine, 1 = 5 years post-fine), does the coefficient .1048217 mean that R&D expenses are around 10.5% higher in the post period?

And second, about the index variable I created: I have three categories, small, medium and large fine (1, 2 and 3). Should I rather create two dummies and include them in the regression (taking small as the baseline, and including medium and large), or is it okay to use this index variable as a categorical variable with three levels?
When interpreting this variable in the interaction with POST_FINE_DUMMY, I guess category 1 is taken as the baseline? And the regression compares categories 1 and 2 and categories 1 and 3 and sees if there is a significant difference with respect to the periods?

Or, something else I noticed: when I include all three categories (small, medium, large) as interactions with POST_FINE_DUMMY, the whole model changes...
I mean, usually you leave out one level of the dummies as the baseline; that's why I thought the small category would be left out. But isn't the 0 for each individual dummy already the baseline? So, category small: 0 = no, 1 = yes? Whereas, if taken as a three-category variable, small is the baseline. The results, however, are very, very different... what's the right way?

                          thank you!
                          Last edited by Maria Kohnen; 31 Dec 2017, 01:11.

                          Comment


                          • #58
                            Maria:
your interpretation of the log-linear model is correct (+10.5%).
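For the exact figure you can transform the coefficient directly: with a dummy predictor, the exact percent change is 100*(exp(b) - 1). A quick check on your coefficient:
Code:
* exact effect implied by the coefficient: about 11.05%, close to the 10.5% approximation
display 100*(exp(.1048217) - 1)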
As far as the query about interactions is concerned, you should not include all the categories, to avoid the so-called dummy trap.
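A sketch of what that means with your three-level index: Stata's factor-variable machinery omits the base level automatically, so you never enter all the category dummies at once:
Code:
* i.index expands to indicators for the non-base levels only; the base (small) is
* omitted (in your -fe- model the main effects are then also absorbed, as your
* output's "omitted because of collinearity" notes show)
xtreg RDlog i.POST_FINE_DUMMY##i.index i.year, fe vce(robust)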
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            Comment


                            • #59
                              Dear Carlo,

I hope you had a nice New Year's Eve. Thank you very much for the answer.
Unfortunately, the correct log-linear interpretation means that my model is absolute nonsense. In no way did R&D expenses rise by 10% for any of the companies over the two periods I compared... Also, R-squared and adjusted R-squared are around 0.08... meaning it's a totally useless model, I guess... especially when you go fixed effects... those should usually become quite high, right?

Something with my regressions seems off...
First of all, I rechecked my sample data... the data on R&D, taken from Compustat and Datastream, are correct... but there is no chance that R&D expenses went up by 10.5% between the two periods, 3 years prior to the fine and 3 years after the fine... it seems that my design is off...
Is it correct to do what I do: create a dummy variable which is 0 for the period before the fine and 1 for the period after, and then run
Code:
 xtreg RDlog POST_FINE_DUMMY, fe vce(robust)
to check if there is a significant difference in R&D spending between these two periods?

Thank you very much,
                              best
                              Last edited by Maria Kohnen; 01 Jan 2018, 10:49.

                              Comment


                              • #60
                                Maria:
thanks. I hope the same was true for you and your dear ones.
I would interpret the result of your regression with a further qualification: it holds when adjusted for the remaining predictors.
Please note that, in multiple regression, the statistical significance (for what it's worth) of a given predictor should be considered in the light of (i.e., adjusted for) the remaining ones.
If you use your last regression code to check whether a statistically significant difference exists in R&D spending between the periods included in -POST_FINE_DUMMY-, you're actually running a regression model which is totally different from the previous one.
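Side by side (both commands are taken from earlier in this thread):
Code:
* unadjusted difference between the two periods only
xtreg RDlog POST_FINE_DUMMY, fe vce(robust)
* difference adjusted for leniency, fine level and year effects
xtreg RDlog i.POST_FINE_DUMMY##i.LENIENCY_DUMMY i.POST_FINE_DUMMY##i.index i.year, fe vce(robust)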
Again, I would recommend that you take a look at the literature in your research field and see what others did in the past when presented with the same research topic.
                                Kind regards,
                                Carlo
                                (Stata 19.0)

                                Comment
