
  • #31
    Maria:
so the issue is to choose the regression model that offers the truest and fairest view of the data generating process underlying the sample under investigation.
    Otherwise, you can report both -fe- and -re- specification, explaining in your paper the pros and cons of both.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #32
Thank you. I'll do just that. I just have to know which version is the most true and fair.
Is it okay to take, e.g., a log of the DV and the IVs, since they were heavily skewed to the right?
What would that imply for my results? It seems to matter, since it will change the model I'll use. Are there any downsides to taking either the DV, the IVs, or both as log values? The dummies always stay dummies, right?
      best regards




        • #34
Sorry for the repeated replies. Keyboard stuck.



          • #35
            Maria:
            - usually, literature (not statistics) points you to the best specified model;
- logging a dummy is technically meaningless, as its numbers are really just levels;
- there are no downsides to logging right-skewed variables, but their interpretation changes. See any decent econometrics textbook.
            Kind regards,
            Carlo
            (Stata 19.0)
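As an editorial note on the logging point above, here is a minimal sketch (the variable names RD, ln_sales, and industry_dummy are placeholders, not from the thread): with a logged DV, coefficients on unlogged regressors are read approximately as percentage changes, and a log-log pair is read as an elasticity.
```stata
* sketch: log-transforming a right-skewed outcome
* note: ln() returns missing for zero or negative values,
* so inspect those cases first
generate RD_log = ln(RD)

* with a logged DV, the coefficient b on an unlogged regressor is
* roughly a 100*b percent change in RD; if the regressor is logged
* too, b is an elasticity. Dummies enter untransformed:
regress RD_log ln_sales i.industry_dummy
```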



            • #36
Dear Carlo, how are you? I hope you had a wonderful Christmas!
I have another question about my regression; maybe you have an answer.
I want to divide a variable measuring a fine into three categories, namely small fine, medium fine, and large fine. At first, I wanted to create three dummy variables. Then I used the
              Code:
 g Categorysize=recode(SIZE_AVERAGE_REVENUE,10000000000,50000000000,100000000000)
function. Is that possible?
Now I would like to see whether the level of the fine had an effect on R&D investment. In my regression, I have a pre_fine/post_fine dummy comparing the level of investments before and after the fine.
              Can I just create an interaction variable: categorysize * post_fine_dummy?
              And, do I have to put
              Code:
               i.
              in front of this term?
              best regards



              • #37
                Maria:
                thanks. I do hope the same for you and your dears.
What you have in mind is feasible, as per the following toy example (where -foreign- is basically replaced by -country_car-):
                Code:
                . use "C:\Program Files (x86)\Stata15\ado\base\a\auto.dta"
                (1978 Automobile Data)
                
                . g country_car=recode(foreign ,0,1)
                
                . regress price i.country_car##i.rep78
                note: 1.country_car#1b.rep78 identifies no observations in the sample
                note: 1.country_car#2.rep78 identifies no observations in the sample
                note: 1.country_car#5.rep78 omitted because of collinearity
                
                      Source |       SS           df       MS      Number of obs   =        69
                -------------+----------------------------------   F(7, 61)        =      0.39
                       Model |    24684607         7  3526372.43   Prob > F        =    0.9049
                    Residual |   552112352        61  9051022.16   R-squared       =    0.0428
                -------------+----------------------------------   Adj R-squared   =   -0.0670
                       Total |   576796959        68  8482308.22   Root MSE        =    3008.5
                
                -----------------------------------------------------------------------------------
                            price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                ------------------+----------------------------------------------------------------
                    1.country_car |   2088.167   2351.846     0.89   0.378     -2614.64    6790.974
                                  |
                            rep78 |
                               2  |   1403.125   2378.422     0.59   0.557    -3352.823    6159.073
                               3  |   2042.574   2204.707     0.93   0.358    -2366.011    6451.159
                               4  |   1317.056   2351.846     0.56   0.578    -3385.751    6019.863
                               5  |       -360   3008.492    -0.12   0.905    -6375.851    5655.851
                                  |
                country_car#rep78 |
                             1 1  |          0  (empty)
                             1 2  |          0  (empty)
                             1 3  |  -3866.574   2980.505    -1.30   0.199    -9826.462    2093.314
                             1 4  |  -1708.278   2746.365    -0.62   0.536    -7199.973    3783.418
                             1 5  |          0  (omitted)
                                  |
                            _cons |     4564.5   2127.325     2.15   0.036      310.651    8818.349
                -----------------------------------------------------------------------------------
                Kind regards,
                Carlo
                (Stata 19.0)



                • #38
                  Dear Carlo,

                  thank you for the reply.

                  So my regression looks as follows:
                  Code:
                   xtreg RDlog POST_FINE_DUMMY LENIENCY_DUMMY post_len_inter fine_category fine_cat_inter i.year , fe vce(robust)
where POST_FINE_DUMMY compares the periods before and after the fine,
LENIENCY_DUMMY is whether the firm was granted full leniency,
post_len_inter is the interaction of POST_FINE_DUMMY and LENIENCY_DUMMY,
fine_category is the new variable covering small, medium, and large fines, and
fine_cat_inter is the interaction of fine_category and POST_FINE_DUMMY.

                  Code:
                    xtreg RDlog POST_FINE_DUMMY LENIENCY_DUMMY post_len_inter fine_category fine_cat_inter i.year , fe vce(robust)
                  note: LENIENCY_DUMMY omitted because of collinearity
                  note: fine_category omitted because of collinearity
                  
                  Fixed-effects (within) regression               Number of obs     =      1,446
                  Group variable: ID                              Number of groups  =        145
                  
                  R-sq:                                           Obs per group:
                       within  = 0.0941                                         min =          7
                       between = 0.0547                                         avg =       10.0
                       overall = 0.0000                                         max =         19
                  
                                                                  F(22,144)         =       7.59
                  corr(u_i, Xb)  = -0.0541                        Prob > F          =     0.0000
                  
                                                        (Std. Err. adjusted for 145 clusters in ID)
                  ---------------------------------------------------------------------------------
                                  |               Robust
                            RDlog |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  ----------------+----------------------------------------------------------------
                  POST_FINE_DUMMY |   .1311123   .0464526     2.82   0.005     .0392952    .2229293
                   LENIENCY_DUMMY |          0  (omitted)
                   post_len_inter |  -.0608626   .0858443    -0.71   0.479    -.2305403    .1088152
                    fine_category |          0  (omitted)
                   fine_cat_inter |   .1857077   .7588183     0.24   0.807    -1.314154    1.685569
                                  |
                             year |
                            1997  |   .1408881   .0962422     1.46   0.145    -.0493418     .331118
                            1998  |   .2455628   .0897055     2.74   0.007     .0682532    .4228724
                            1999  |   .2301396   .0954123     2.41   0.017     .0415501    .4187291
                            2000  |   .4015061    .096725     4.15   0.000     .2103219    .5926903
                            2001  |   .3462281   .1049296     3.30   0.001     .1388269    .5536293
                            2002  |   .3215827   .1077026     2.99   0.003     .1087004     .534465
                            2003  |   .2631099   .1081642     2.43   0.016     .0493152    .4769046
                            2004  |   .2568348   .1106968     2.32   0.022     .0380343    .4756352
                            2005  |   .2527749   .1147696     2.20   0.029     .0259241    .4796257
                            2006  |     .27069   .1195602     2.26   0.025     .0343704    .5070096
                            2007  |   .2378767   .1273098     1.87   0.064    -.0137606     .489514
                            2008  |   .1761791   .1325034     1.33   0.186    -.0857237     .438082
                            2009  |   .1640074   .1349029     1.22   0.226    -.1026383    .4306531
                            2010  |   .2286983   .1406743     1.63   0.106    -.0493551    .5067516
                            2011  |   .2681223   .1481219     1.81   0.072    -.0246518    .5608963
                            2012  |   .3901673   .1577436     2.47   0.015     .0783753    .7019594
                            2013  |   .1361936   .1717045     0.79   0.429    -.2031932    .4755803
                            2014  |   .1637635    .181262     0.90   0.368    -.1945144    .5220414
                            2015  |   .5215185   .2107859     2.47   0.015     .1048844    .9381526
                                  |
                            _cons |   18.81203   .1144009   164.44   0.000     18.58591    19.03815
                  ----------------+----------------------------------------------------------------
                          sigma_u |  2.0577645
                          sigma_e |  .33642112
                              rho |  .97396728   (fraction of variance due to u_i)
                  ---------------------------------------------------------------------------------
                  
                  .
does that make sense?



                  • #39
                    Maria:
I would stop at the first interaction and rewrite your code in a more efficient way, relying on -fvvarlist- for creating categorical variables and interactions:

                    Code:
                    xtreg RDlog i.POST_FINE_DUMMY##i.fine_category  i.LENIENCY_DUMMY  i.year , fe vce(robust)
                    Kind regards,
                    Carlo
                    (Stata 19.0)



                    • #40
thank you, Carlo.
I have some trouble understanding the regression correctly. For clarification purposes:
                      1)
                      Code:
                       ##
accounts for the main effects and the interaction effect of the variables, right? So POST_FINE_DUMMY, fine_cat, and their interaction?
                      2) why is the
                      Code:
                       i.
                      in front of the POST_FINE_DUMMY and the LENIENCY_DUMMY needed?
3) what happened to the interaction between POST_FINE_DUMMY and LENIENCY_DUMMY? Was that just left out because my regression looked too confusing and you may have overlooked it, or does it have a reason?
4) the dummies are already created. Would I still need the
                      Code:
                       i.
in front? Or would that be double? I mean, POST_FINE_DUMMY already indicates whether it is pre or post fine. Would I still put the prefix
                      Code:
                      i.
                      in front?
thank you very much for the help!

EDIT: unfortunately, the fine_categories include non-integer values since they are so small (0.02..), since they are ratios, so I cannot make a factor variable out of them, I guess?
                      Last edited by Maria Kohnen; 28 Dec 2017, 09:06.



                      • #41
                        Maria:
1) yes, you're right. The double # allows both conditional main effects and interactions;
2) the -i.- operator tells Stata to treat the variables included in the interactions as categorical. You can omit it when dealing with a two-level categorical variable. However, I've made a habit of always using it, as it is one of the take-home messages I've learnt from -help fvvarlist-.
3) your regression code looked a bit sparse, actually. Besides, if you include another interaction, I suspect that you will lose something along the way due to collinearity.
                        Anyway, assuming it makes sense in your research field, you may want to try:
                        Code:
                        xtreg RDlog i.POST_FINE_DUMMY##i.fine_category i.POST_FINE_DUMMY##i.LENIENCY_DUMMY i.year , fe vce(robust)

                        Kind regards,
                        Carlo
                        (Stata 19.0)
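For readers following the factor-variable notation: -##- is shorthand for the main effects plus the interaction, so the two calls below fit the same model. A sketch on the -auto- dataset shipped with Stata (mirroring the earlier toy example):
```stata
* load the example dataset shipped with Stata
sysuse auto, clear

* a##b expands to the main effects and the interaction ...
regress price i.foreign##i.rep78

* ... which is the same model as spelling the terms out with #
regress price i.foreign i.rep78 i.foreign#i.rep78
```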



                        • #42
                          Dear Carlo,

thank you for the reply. Unfortunately, my fine_category variable, indicating the three different fine levels (small, medium, and large), has non-integer values, because they are ratios and very small. Should I rather just include dummy variables for medium_fine and large_fine (with small as the baseline), or is there something that can be done about the categorical variable?



                          • #43
                            Maria:
                            you can recode -fine_category-.
                            Perhaps the following toy-example can give you some hints about how to do it:
                            Code:
                            . set obs 100
                            number of observations (_N) was 0, now 100
                            
                            . g x=runiform()
                            
                            . pctile tertiles = x, nq(3)
                            
                            . tab tertiles
                            
                            percentiles |
                                   of x |      Freq.     Percent        Cum.
                            ------------+-----------------------------------
                               .3200437 |          1       50.00       50.00
                               .7167162 |          1       50.00      100.00
                            ------------+-----------------------------------
                                  Total |          2      100.00
                            
                            . su x
                            
                                Variable |        Obs        Mean    Std. Dev.       Min        Max
                            -------------+---------------------------------------------------------
                                       x |        100    .4973389      .30869   .0030522   .9874847
                            
                            . g index=0 if x<=.3200437
                            (67 missing values generated)
                            
                            . replace index=1 if x>.3200437 & x<=.7167162
                            (34 real changes made)
                            
                            . replace index=2 if x>.7167162 & x!=.
                            (33 real changes made)
                            
                            . label define index 0 "small" 1 "medium" 2 "large"
                            
                            . label val index index
                            
                            . tab index
                            
                                  index |      Freq.     Percent        Cum.
                            ------------+-----------------------------------
                                  small |         33       33.00       33.00
                                 medium |         34       34.00       67.00
                                  large |         33       33.00      100.00
                            ------------+-----------------------------------
                                  Total |        100      100.00
                            
                            .
                            Kind regards,
                            Carlo
                            (Stata 19.0)
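As an aside, the -pctile-/-replace- sequence above can be collapsed into one step with -xtile-, which assigns each observation directly to its tertile (a sketch on the same simulated data; the group codes run 1-3 here rather than 0-2):
```stata
* sketch: one-step tertile grouping with xtile
set obs 100
generate x = runiform()
xtile size3 = x, nq(3)          // 1 = bottom, 3 = top tertile
label define size3 1 "small" 2 "medium" 3 "large"
label values size3 size3
tab size3
```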



                            • #44
thank you very much Carlo, much appreciated!



                              • #45
                                Dear Carlo,
                                I have a question about the
                                Code:
                                 xtoverid
function. I used it to choose between the -fe- and -re- models, since the Hausman test cannot be done with robust SEs.
1) When do I have to use robust SEs? When I suspect heteroskedasticity? If there is none, is the Hausman test okay?
2) I want to describe why I used the xtoverid function, but I need to base my explanation on the literature. I looked through the Arellano paper and through Wooldridge, but honestly I am not familiar enough with econometrics to interpret these two properly. Could you please explain briefly why xtoverid is used, what the test is based on (the Sargan-Hansen test? Arellano? ...), and whether there are any sources to quote that may be easier to understand?
                                thank you very much

