Interpreting test results on panel data and how to correctly perform fixed effect regression on panel data

Nils Edgren

Join Date: Apr 2019

Posts: 22
#1


27 Apr 2019, 05:31

Hi members,

I have a large panel dataset covering 166 bonds. It contains some of their characteristics (such as currency, issue date, etc.) followed by daily yield data over a five-year period for each bond (although most of these values are missing). I have about 54,000 observations of bond yields.

My goal is to run a regression that shows what variables have an effect on the bond's yield. More specifically, I am trying to run a regression of the yield on a measure of the bond's liquidity to find the unobserved effect that isn't explained by the variable liquidity.

So far I have run various tests to check whether I should use a fixed- or random-effects model, tests for autocorrelation and heteroskedasticity, and an F-test. I am not sure whether I am interpreting the results of these tests correctly, or which model I should choose going forward to perform regressions in Stata.

The regression I am trying to perform is given below (Y is yield, P_i is the fixed-effect estimator, Liquidity is the liquidity variable). Yield has the variable name YIELDDIFF and liquidity is BIDASKSP.

Y_i,t = P_i + Liquidity_i,t + e_i,t, with e being the error term.

I have first run an F-test, with the following result:

Code:

xtreg YIELDDIFF BIDASKSP, fe

Fixed-effects (within) regression               Number of obs     =     44,751
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0485                                         min =         13
     between = 0.0059                                         avg =      269.6
     overall = 0.0504                                         max =      1,178

                                                F(1,44584)        =    2272.59
corr(u_i, Xb)  = 0.0180                         Prob > F          =     0.0000

------------------------------------------------------------------------------
   YIELDDIFF |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    BIDASKSP |  -.2839549   .0059565   -47.67   0.000    -.2956296   -.2722801
       _cons |   .0158317    .000306    51.74   0.000      .015232    .0164315
-------------+----------------------------------------------------------------
     sigma_u |  .15915535
     sigma_e |  .06431106
         rho |   .8596394   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(165, 44584) = 1130.58               Prob > F = 0.0000

1. Am I interpreting this test correctly as saying that my fixed-effect estimator has an explanatory value for the yield of the bonds, as given by u_i=0: F(165, 44584) = 1130.58 - and that I should in fact use a fixed effect model? What does the large F value mean?

2. Is it correct to run xtreg with YIELDDIFF (i.e. yield) as the dependent variable and only BIDASKSP (Liquidity) as the independent variable to isolate the fixed-effect estimator Pi (as outlined in the equation above)?

I then ran the Hausman test which I understand as indicating that I should be using a fixed effect rather than random effect model:

Code:

Test:  Ho:  difference in coefficients not systematic

                  chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =        3.95
                Prob>chi2 =      0.0468

Following that, I tested for Autocorrelation using a Wooldridge test, with the following result:

Code:

xtserial YIELDDIFF BIDASKSP

Wooldridge test for autocorrelation in panel data
H0: no first-order autocorrelation
    F(  1,     165) =     19.676
           Prob > F =      0.0000

I also performed a Modified Wald test for heteroskedasticity

Code:

xttest3

Modified Wald test for groupwise heteroskedasticity
in fixed effect regression model

H0: sigma(i)^2 = sigma^2 for all i

chi2 (166)  =    1.1e+09
Prob>chi2 =      0.0000

3. Am I correct in interpreting these results as having both autocorrelation as well as heteroskedasticity in my data?

As for going forward, my understanding is that I should be running a regression with robust standard errors (after skimming through previous research, it seems that many regressions are performed with White standard errors, but I am unsure what this entails and how to do it in Stata). Would I then run the same xtreg as before, but with robust standard errors added?

Many thanks for your help. It has been a while since I took statistics at my university and I am unfortunately not entirely up to speed on my statistical knowledge.

/N

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

27 Apr 2019, 09:33

Nils:
1) the result of the F-test appearing as a footnote of the outcome table tells you that your dataset shows evidence of panelwise effect; hence a pooled OLS would not be appropriate given your data.
2) As you detected both heteroskedasticity and autocorrelation, you're right in invoking cluster-robust standard errors. However, your next step should be to test whether the -re- specification fits your data via the community-contributed command -xtoverid- (just type -search xtoverid- from within Stata to spot and install it), as -hausman- does not support non-default standard errors.
If the outcome of -xtoverid- reaches statistical significance, you should go -fe-; otherwise, stick with -re-.

To wrap up, you should do something along the following lines, once you have installed -xtoverid-:

Code:

xtreg YIELDDIFF BIDASKSP, re robust
xtoverid
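As a side note on the -robust- option used above: in -xtreg-, -robust- (or -vce(robust)-) is documented as equivalent to clustering on the panel variable, which is why the output header reports standard errors adjusted for clusters. A minimal sketch, assuming the -nlswork- example dataset that ships with Stata rather than the bond data (the scalar names are made up for illustration):

```stata
* Assumption: nlswork example data stand in for the bond panel
webuse nlswork, clear
xtset idcode year

* Fixed effects with the robust option
xtreg ln_wage tenure, fe vce(robust)
scalar se_robust = _se[tenure]

* Fixed effects with explicit clustering on the panel variable
xtreg ln_wage tenure, fe vce(cluster idcode)
scalar se_cluster = _se[tenure]

* The two standard errors should coincide
display se_robust, se_cluster
```

With the bond data, the analogous call would presumably be -xtreg YIELDDIFF BIDASKSP, fe vce(cluster RIC_2)-.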

That said, it seems strange that your model specification is OK with one predictor only. This is something that you can test, as you can see in the following toy example:

Code:

. use "http://www.stata-press.com/data/r15/union.dta"
(NLS Women 14-24 in 1968)

. xtset idcode year
       panel variable:  idcode (unbalanced)
        time variable:  year, 70 to 88, but with gaps
                delta:  1 unit

. xtreg age i.grade, robust

Random-effects GLS regression                   Number of obs     =     26,200
Group variable: idcode                          Number of groups  =      4,434

R-sq:                                           Obs per group:
     within  = 0.1045                                         min =          1
     between = 0.0188                                         avg =        5.9
     overall = 0.0298                                         max =         12

                                                Wald chi2(17)     =          .
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .

                             (Std. Err. adjusted for 4,434 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
         age |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |
          1  |  -6.513248   1.867801    -3.49   0.000    -10.17407   -2.852425
          2  |   1.156889   3.683763     0.31   0.753    -6.063154    8.376932
          3  |   5.386752   1.867801     2.88   0.004     1.725929    9.047575
          4  |  -1.744937   2.538105    -0.69   0.492    -6.719532    3.229658
          5  |  -1.670922   2.542305    -0.66   0.511    -6.653749    3.311905
          6  |  -1.813056   2.122946    -0.85   0.393    -5.973953    2.347841
          7  |  -2.909495   2.064685    -1.41   0.159    -6.956202    1.137213
          8  |  -4.115661   1.958789    -2.10   0.036    -7.954816   -.2765063
          9  |  -3.057316   1.945918    -1.57   0.116    -6.871245     .756613
         10  |   -3.55541   1.907172    -1.86   0.062    -7.293397    .1825779
         11  |  -3.813377   1.899413    -2.01   0.045    -7.536158   -.0905966
         12  |  -3.357136   1.871466    -1.79   0.073    -7.025141    .3108696
         13  |  -.6027537   1.893576    -0.32   0.750    -4.314094    3.108587
         14  |   .5878884   1.894662     0.31   0.756    -3.125581    4.301358
         15  |  -.5306016   1.917914    -0.28   0.782    -4.289644     3.22844
         16  |  -1.235125    1.88085    -0.66   0.511    -4.921523    2.451272
         17  |   2.430688   1.891372     1.29   0.199    -1.276333    6.137708
         18  |   5.683285   1.884807     3.02   0.003     1.989132    9.377438
             |
       _cons |   32.11325   1.867801    17.19   0.000     28.45243    35.77407
-------------+----------------------------------------------------------------
     sigma_u |  3.8130736
     sigma_e |  5.2673733
         rho |  .34384807   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. predict u, xb

. g sq_u=u^2

. xtreg age u sq_u, robust

Random-effects GLS regression                   Number of obs     =     26,200
Group variable: idcode                          Number of groups  =      4,434

R-sq:                                           Obs per group:
     within  = 0.1042                                         min =          1
     between = 0.0189                                         avg =        5.9
     overall = 0.0299                                         max =         12

                                                Wald chi2(2)      =    1305.46
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

                             (Std. Err. adjusted for 4,434 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
         age |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           u |   .8801686   .5719794     1.54   0.124    -.2408904    2.001228
        sq_u |    .001519   .0087168     0.17   0.862    -.0155656    .0186036
       _cons |   2.231944   9.296946     0.24   0.810    -15.98974    20.45362
-------------+----------------------------------------------------------------
     sigma_u |  3.7731301
     sigma_e |  5.3901961
         rho |  .32885822   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. test sq_u

 ( 1)  sq_u = 0

           chi2(  1) =    0.03
         Prob > chi2 =    0.8617
*as the -test- outcome does not reach statistical significance, there's no evidence of model misspecification*
.

Kind regards,
Carlo
(Stata 19.0)


Nils Edgren

Join Date: Apr 2019
Posts: 22

28 Apr 2019, 03:22

Carlo Lazzaro Thank you for your reply.

I have run -xtoverid- giving the following result:

Code:

. xtreg YIELDDIFF BIDASKSP, re robust

Random-effects GLS regression                   Number of obs     =     44,751
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0485                                         min =         13
     between = 0.0059                                         avg =      269.6
     overall = 0.0504                                         max =      1,178

                                                Wald chi2(1)      =      11.27
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0008

                                (Std. Err. adjusted for 166 clusters in RIC_2)
------------------------------------------------------------------------------
             |               Robust
   YIELDDIFF |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    BIDASKSP |  -.2832264   .0843749    -3.36   0.001    -.4485982   -.1178546
       _cons |    .001954   .0123481     0.16   0.874    -.0222477    .0261557
-------------+----------------------------------------------------------------
     sigma_u |  .15764626
     sigma_e |  .06431106
         rho |  .85732453   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. 
. xtoverid

Test of overidentifying restrictions: fixed vs random effects
Cross-section time-series model: xtreg re  robust cluster(RIC_2)
Sargan-Hansen statistic   1.888  Chi-sq(1)    P-value = 0.1694

As we get a P-value of 0.1694, I interpret this as meaning that I should use -fe- rather than -re-, since the test does not reach statistical significance. To clarify: is the reason we perform this test rather than the Hausman test that, because we have both heteroskedasticity and autocorrelation in our data, we should use robust standard errors, and the Hausman test does not work with non-default standard errors?

2. On your last point, I think I should clarify what I am trying to achieve. The main purpose of my bachelor thesis is to isolate and show the value of Pi, i.e. the unobserved effect in the first regression, as well as identifying what variables affect Pi itself. As such, I will be doing two regressions. The first one is the one specified in the original post, where I am trying to isolate the effect of Pi on the YIELDDIFF. I have used a matching method to retrieve matching pairs of bonds which should eliminate most variables that explain any difference in yield, which is why I only have one predictor (BIDASKSP) and an unobserved fixed effect estimator (Pi). So to clarify, in the first regression I am simply trying to "isolate" the unobserved effect Pi.

In the second step, I will run a regression on Pi, using more predictors. These predictors will be some of the characteristics of the bond, i.e. rating, issue date, issue amount, etc. At this stage I am trying to identify what variables explain and affect the unobserved effect in the first regression, Pi. The second regression is what will give me my main result, and so the first regression simply aims to isolate Pi so that I can run a regression with Pi as the dependent variable.

Does this clarify your point about model misspecification? I ran the tests you suggested although I'm not quite sure how to interpret the results. The results were as follows:

Code:

 xtreg YIELDDIFF BIDASKSP, fe robust

Fixed-effects (within) regression               Number of obs     =     44,751
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0485                                         min =         13
     between = 0.0059                                         avg =      269.6
     overall = 0.0504                                         max =      1,178

                                                F(1,165)          =      11.22
corr(u_i, Xb)  = 0.0180                         Prob > F          =     0.0010

                                (Std. Err. adjusted for 166 clusters in RIC_2)
------------------------------------------------------------------------------
             |               Robust
   YIELDDIFF |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    BIDASKSP |  -.2839549   .0847902    -3.35   0.001    -.4513685   -.1165413
       _cons |   .0158317   .0004935    32.08   0.000     .0148573    .0168062
-------------+----------------------------------------------------------------
     sigma_u |  .15915535
     sigma_e |  .06431106
         rho |   .8596394   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. predict u, xb
(248,862 missing values generated)

. g sq_u=u^2
(248,862 missing values generated)

. xtreg YIELDDIFF u sq_u, fe robust

Fixed-effects (within) regression               Number of obs     =     44,751
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0540                                         min =         13
     between = 0.0010                                         avg =      269.6
     overall = 0.0295                                         max =      1,178

                                                F(2,165)          =      16.92
corr(u_i, Xb)  = -0.0394                        Prob > F          =     0.0000

                                (Std. Err. adjusted for 166 clusters in RIC_2)
------------------------------------------------------------------------------
             |               Robust
   YIELDDIFF |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           u |   1.081878   .2274152     4.76   0.000     .6328589    1.530897
        sq_u |  -2.409346   1.230557    -1.96   0.052    -4.839013    .0203221
       _cons |   .0016674   .0046555     0.36   0.721    -.0075246    .0108594
-------------+----------------------------------------------------------------
     sigma_u |  .16169048
     sigma_e |  .06412671
         rho |  .86408553   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. test sq_u

 ( 1)  sq_u = 0

       F(  1,   165) =    3.83
            Prob > F =    0.0519

Does the fact that this test "almost" reaches statistical significance (with an alpha of 5%) mean that I have model misspecification? Since this is only the step 1 regression, and the step 2 regression with Pi as the dependent variable will have more predictors, is this really an issue?

3. My final question would be how to actually retrieve descriptive statistics for the dependent variable in regression 1, Pi? To my understanding the regression only shows how a change in one variable affects the dependent variable. How do I get the implied values of Pi in my data following the first regression? As mentioned, it is my understanding that I need these to run the second regression.

I hope I have been able to make myself clear. I appreciate your help and any input from any one else who cares to chime in.

/N


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

28 Apr 2019, 03:33

Nils:
1) as -xtoverid- does not reach statistical significance, you should go -re-, not -fe-.
2) there's no evidence of misspecification in your step 1 model.
3) If what you want to get from your first step model is the predicted value of -Pi-, you should type after the regression:

Code:

predict <choosethenameyoulike>, xb
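One point worth keeping in mind here: after -xtreg-, the -xb- option of -predict- returns the linear prediction (constant plus coefficient times predictor), whereas the -u- option returns the estimated panel-level component u_i, which is closer to the bond-specific term P_i in the equation of the first post. A minimal sketch of the difference, assuming the -nlswork- example dataset in place of the bond data (variable names xb_hat, u_hat and tag are made up for illustration):

```stata
* Assumption: nlswork example data stand in for the bond panel
webuse nlswork, clear
xtset idcode year
xtreg ln_wage tenure, fe vce(cluster idcode)

* Linear prediction: _cons + b*tenure, varies with tenure over time
predict double xb_hat, xb

* Estimated panel-level effect u_i, constant within each panel
predict double u_hat, u

* Flag one observation per panel within the estimation sample,
* then summarize the estimated effects across panels
egen byte tag = tag(idcode) if e(sample)
summarize u_hat if tag == 1
```

Summarizing over one tagged row per panel avoids weighting each panel by its number of observations.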

Kind regards,
Carlo
(Stata 19.0)

Nils Edgren

Join Date: Apr 2019
Posts: 22

28 Apr 2019, 04:15

Carlo Lazzaro
1) This is surprising to me - the paper that I am drawing inspiration from for my bachelor thesis uses -fe- instead of -re- (and has similar data). The data I have does not hold for a broader category of bonds and I'm looking to show the bond-specific time invariant unobserved effect in my regression, i.e. Pi. From what I have been able to understand so far about fixed effect vs. random effect - wouldn't this point to using a fixed effect model? If so, why does -xtoverid- still show that I should go -re-?

FYI, these are the results I get from both an -fe- and an -re- xtreg in my step 1 regression:
-fe-:

Code:

xtreg YIELDDIFF BIDASKSP, fe robust

Fixed-effects (within) regression               Number of obs     =     44,751
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0485                                         min =         13
     between = 0.0059                                         avg =      269.6
     overall = 0.0504                                         max =      1,178

                                                F(1,165)          =      11.22
corr(u_i, Xb)  = 0.0180                         Prob > F          =     0.0010

                                (Std. Err. adjusted for 166 clusters in RIC_2)
------------------------------------------------------------------------------
             |               Robust
   YIELDDIFF |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    BIDASKSP |  -.2839549   .0847902    -3.35   0.001    -.4513685   -.1165413
       _cons |   .0158317   .0004935    32.08   0.000     .0148573    .0168062
-------------+----------------------------------------------------------------
     sigma_u |  .15915535
     sigma_e |  .06431106
         rho |   .8596394   (fraction of variance due to u_i)
------------------------------------------------------------------------------

-re-:

Code:

xtreg YIELDDIFF BIDASKSP, re robust

Random-effects GLS regression                   Number of obs     =     44,751
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0485                                         min =         13
     between = 0.0059                                         avg =      269.6
     overall = 0.0504                                         max =      1,178

                                                Wald chi2(1)      =      11.27
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0008

                                (Std. Err. adjusted for 166 clusters in RIC_2)
------------------------------------------------------------------------------
             |               Robust
   YIELDDIFF |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    BIDASKSP |  -.2832264   .0843749    -3.36   0.001    -.4485982   -.1178546
       _cons |    .001954   .0123481     0.16   0.874    -.0222477    .0261557
-------------+----------------------------------------------------------------
     sigma_u |  .15764626
     sigma_e |  .06431106
         rho |  .85732453   (fraction of variance due to u_i)
------------------------------------------------------------------------------

The high p-value of the intercept (_cons) strikes me as surprising. How should that be interpreted?

2. I have run predict to generate a variable for the predicted value of Pi. Using summarize I get the following values:

Code:

predict GREENPREMIUM, xb
(248,862 missing values generated)

. summarize GREENPREMIUM

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
GREENPREMIUM |     54,397    .0134974    .0434491  -.2865282   .2611284

To perform the step 2 regression, as outlined above, would I then run a regression with GREENPREMIUM being the dependent variable and the other bond characteristics being my predictors? This regression will have 4 predictors + a dummy variable for industry of the issuer firm.

Last edited by Nils Edgren; 28 Apr 2019, 04:19.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

28 Apr 2019, 04:30

Nils:
1) it may well be that the authors of the paper you mention went -fe- disregarding any test aimed at identifying the best specification for their data.
The constant in -xtreg- is not that relevant; the interpretation is that its value can be <0, but that's all. What should be of some concern is the very low between R-sq, which stems from having one predictor only in your Step 1 regression.
2) Yes, you should go that way.
As an aside, I would recommend that you take all your doubts up with your teacher/supervisor.

Kind regards,
Carlo
(Stata 19.0)

Nils Edgren

Join Date: Apr 2019
Posts: 22

28 Apr 2019, 08:31

Carlo Lazzaro
Thank you for your assistance. I have tried running the step 2 regression but am having trouble creating categorical variables for some of the bond characteristics. I've browsed the forum and saw that you replied to similar posts in the past but I couldn't quite make sense of the syntax required to achieve my desired outcome.

A few of the variables in the Step 2 regressions are strings, namely Currency and Rating. I have generated new numerical variables using -encode name, generate (newname)-. Using -tabulate- to show the categories in the data for these variables I get the following result:

Code:

. tabulate RATING_n

  CORRATING |      Freq.     Percent        Cum.
------------+-----------------------------------
          A |     78,561       25.90       25.90
         A  |      1,827        0.60       26.51
         AA |     63,944       21.08       47.59
        AAA |     71,250       23.49       71.08
        BBB |     36,540       12.05       83.13
        N/A |     51,156       16.87      100.00
------------+-----------------------------------
      Total |    303,278      100.00

. tabulate CURRENCY_n

   CURRENCY |      Freq.     Percent        Cum.
------------+-----------------------------------
        AUD |     12,789        4.22        4.22
        CAD |      3,654        1.20        5.42
        CHF |      5,481        1.81        7.23
        EUR |     89,523       29.52       36.75
        HKD |      9,135        3.01       39.76
        IDR |      1,827        0.60       40.36
        INR |      3,654        1.20       41.57
        NOK |      3,654        1.20       42.77
        NZD |      3,654        1.20       43.98
        SEK |     65,772       21.69       65.66
        USD |    104,135       34.34      100.00
------------+-----------------------------------
      Total |    303,278      100.00

I am trying to make these variables into categorical variables with subgroups for each rating and each currency. The desired base levels (reference categories) are AAA for rating and USD for currency.

1) How do I achieve this desired outcome so that I can regress GREENPREMIUM on these variables (and a few others)?

Again, I appreciate your help. I have scheduled a meeting with my supervisor to discuss my method in the coming days but his experience with using Stata is unfortunately limited. Luckily I found this great forum!


Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

28 Apr 2019, 08:38

Nils:
a possible approach mirrors the following toy-example:

Code:

sysuse auto.dta
tab rep78, gen(new_rep78)
egen overall_dummies=rowtotal( new_rep78*)
replace overall_dummies=1 if rep78==1
replace overall_dummies=2 if rep78==2
replace overall_dummies=3 if rep78==3
replace overall_dummies=4 if rep78==4
replace overall_dummies=5 if rep78==5
replace overall_dummies=. if overall_dummies==0
tab overall_dummies

You should consider -label- and -fvvarlist- notation as well.
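A minimal sketch of the -fvvarlist- route, assuming the -auto- example dataset: the i. prefix lets Stata expand a numeric categorical variable into indicators on the fly, and ib#. chooses the base level. The names origin_str and origin_num, and the local macro base, are made up for illustration.

```stata
sysuse auto, clear

* i.rep78 would use the lowest category as base; ib3. makes category 3 the base
regress price mpg ib3.rep78

* For a string variable: encode it, then look up the numeric code of a label
* (a string copy of foreign is built here purely for illustration)
decode foreign, generate(origin_str)
encode origin_str, generate(origin_num)

* "Foreign":origin_num evaluates to the code carrying the label "Foreign"
local base = "Foreign":origin_num
regress price mpg ib`base'.origin_num
```

Applied to the bond data, the same pattern would presumably look up the codes of "AAA" and "USD". The -tabulate- output above lists "A" and "A " (with a trailing space) as separate ratings, so trimming the string, for instance with -strtrim()-, before -encode- may be worthwhile.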

Kind regards,
Carlo
(Stata 19.0)


Nils Edgren

Join Date: Apr 2019
Posts: 22

28 Apr 2019, 09:00

Carlo Lazzaro
I read through the -help fvvarlist- to try to understand the proper way of achieving this but wasn't able to determine how you actually create the categories within each variable. As I understand it you don't actually create dummy variables but just specify subcategories within an already existing variable?

I tested the approach you gave with the following result:

Code:

egen overall_dummies=rowtotal(RATING_n)

. 
. replace overall_dummies=1 if RATING_n==1
(0 real changes made)

. 
. replace overall_dummies=2 if RATING_n==2
(0 real changes made)

. 
. replace overall_dummies=3 if RATING_n==3
(0 real changes made)

. 
. replace overall_dummies=4 if RATING_n==4
(0 real changes made)

. 
. replace overall_dummies=5 if RATING_n==5
(0 real changes made)

. 
. replace overall_dummies=6 if RATING_n==6
(0 real changes made)
. 
. replace overall_dummies=. if overall_dummies==0
(0 real changes made)

. 
. tab overall_dummies

overall_dum |
       mies |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     78,561       25.90       25.90
          2 |      1,827        0.60       26.51
          3 |     63,944       21.08       47.59
          4 |     71,250       23.49       71.08
          5 |     36,540       12.05       83.13
          6 |     51,156       16.87      100.00
------------+-----------------------------------
      Total |    303,278      100.00

. egen overall_dummies_CURRENCY=rowtotal(CURRENCY_n)

. 
. replace overall_dummies_CURRENCY=1 if CURRENCY_n==1
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=2 if CURRENCY_n==2
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=3 if CURRENCY_n==3
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=4 if CURRENCY_n==4
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=5 if CURRENCY_n==5
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=6 if CURRENCY_n==6
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=7 if CURRENCY_n==7
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=8 if CURRENCY_n==8
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=9 if CURRENCY_n==9
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=10 if CURRENCY_n==10
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=11 if CURRENCY_n==11
(0 real changes made)

. 
. replace overall_dummies_CURRENCY=. if overall_dummies_CURRENCY==0
(0 real changes made)

. 
. tab overall_dummies_CURRENCY

overall_dum |
mies_CURREN |
         CY |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     12,789        4.22        4.22
          2 |      3,654        1.20        5.42
          3 |      5,481        1.81        7.23
          4 |     89,523       29.52       36.75
          5 |      9,135        3.01       39.76
          6 |      1,827        0.60       40.36
          7 |      3,654        1.20       41.57
          8 |      3,654        1.20       42.77
          9 |      3,654        1.20       43.98
         10 |     65,772       21.69       65.66
         11 |    104,135       34.34      100.00
------------+-----------------------------------
      Total |    303,278      100.00

. xtreg GREENPREMIUM overall_dummies overall_dummies_CURRENCY, fe robust
note: overall_dummies omitted because of collinearity
note: overall_dummies_CURRENCY omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =     54,402
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0000                                         min =         18
     between = 0.0350                                         avg =      327.7
     overall =      .                                         max =      1,180

                                                F(0,165)          =          .
corr(u_i, Xb)  =      .                         Prob > F          =          .

                                            (Std. Err. adjusted for 166 clusters in RIC_2)
------------------------------------------------------------------------------------------
                         |               Robust
            GREENPREMIUM |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
         overall_dummies |          0  (omitted)
overall_dummies_CURRENCY |          0  (omitted)
                   _cons |   .0134972          .        .       .            .           .
-------------------------+----------------------------------------------------------------
                 sigma_u |  .03493133
                 sigma_e |  .01669494
                     rho |  .81405202   (fraction of variance due to u_i)
------------------------------------------------------------------------------------------

I have two questions following this:
1) Did this successfully create subcategories within the two variables? Since the variables were removed in the regression, I am unsure whether they are treated as "separate" subcategories that would be given their own beta values.

2) Why are both of these variables removed due to collinearity? I understand that it is due to dependencies between the predictors but I don't quite understand why that happens?


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#10

28 Apr 2019, 09:03

Nils:
can you please post an excerpt of your data via -dataex-? Thanks.

Kind regards,
Carlo
(Stata 19.0)
Nils Edgren

Join Date: Apr 2019

Posts: 22
#11

29 Apr 2019, 02:23

Carlo Lazzaro

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str13 RIC float GREENPREMIUM str3(CORRATING CURRENCY) double(ISSUEAMOUNT MATURITY)
"29874QCW2=" . "AAA" "USD" 6.500e+08 .5561643835616439
"29874QCW2=" . "AAA" "USD" 6.500e+08 .5561643835616439
"29874QDG6=" . "AAA" "USD" 5.000e+08 2.5397260273972604
"302154BZ1=" .015852291 "AA" "USD" 4.000e+08 2.117808219178082
"30216BGU0=" .015900685 "AAA" "USD" 5.000e+08 1.4191780821917808
"50064YAN3=" . "AA" "USD" 6.000e+08 4.567123287671233
"62630CAH4=" .017109547 "A" "USD" 5.000e+08 2.7260273972602738
"690353C21=" .1342553 "N/A" "USD" 47300000 10.715068493150685
"690353E52=" . "N/A" "USD" 37400000 10.715068493150685
"89114QBT4=" . "AA" "USD" 1.000e+09 1.6986301369863013
end

The data includes more currencies than USD, these were simply the ones chosen when selecting a random sample. I also made numeric variables of both currency and rating that can be used in regressions, but couldn't make those work in dataex so I used the string variables here instead. Again, I'm trying to make rating (CORRATING) and currency (CURRENCY) into categorical variables with AAA being the base level for rating and USD the base level for currency.

The variables that are being removed due to collinearity are constant for each bond for the entire time-series, i.e. rating, currency, issueamount, maturity, etc. are the same for all observations of each respective bond. From reading on this forum it seems that this causes these variables to be excluded from the regression? The paper I mentioned previously used these variables as predictors for the dependent variable GREENPREMIUM, as I am trying to do. The variables should be significant in the regression, based on previous research, and so it seems strange to me that I cannot include them in the regression.

Last edited by Nils Edgren; 29 Apr 2019, 02:33.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#12

29 Apr 2019, 02:31

Nils:
this is something you could have easily done yourself with -encode-:

Code:

. encode CORRATING, g(CORRATING_NUM)
. encode CURRENCY, g(CURRENCY_NUM)

Get yourself familiar with -fvvarlist- (and related options) to set the reference categories you're interested in.
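
As a minimal sketch of what setting a reference category can look like: the toy data below are made up, and only the variable names CORRATING and CURRENCY and the two -encode- calls come from this thread. -encode- numbers the categories alphabetically, so the code looks up which integer stands for "USD" and "AAA" rather than assuming it.

```stata
* Toy data (hypothetical values, for illustration only)
clear
input str3 CURRENCY str3 CORRATING
"USD" "AAA"
"EUR" "AA"
"SEK" "AAA"
"USD" "A"
end

* Same -encode- calls as above; the value labels are named after the new variables
encode CORRATING, generate(CORRATING_NUM)
encode CURRENCY, generate(CURRENCY_NUM)
label list CURRENCY_NUM CORRATING_NUM

* Look up the integer codes behind the labels of interest
local usd = "USD":CURRENCY_NUM
local aaa = "AAA":CORRATING_NUM

* Make those codes the default base levels whenever i. is used
fvset base `usd' CURRENCY_NUM
fvset base `aaa' CORRATING_NUM
fvset report CURRENCY_NUM CORRATING_NUM

* One-off alternative inside a single command: ib`usd'.CURRENCY_NUM
```

With -fvset base- in place, a plain i.CURRENCY_NUM in later estimation commands should use USD as the omitted category; the ib#. prefix does the same for one command only.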

Kind regards,
Carlo
(Stata 19.0)

Nils Edgren

Join Date: Apr 2019
Posts: 22

#13

29 Apr 2019, 03:55

Carlo Lazzaro

I probably should've made myself clearer - I have already used -encode- to make numerical variables; I just didn't find a way to include them with labels in -dataex-, so I used the "original" variables instead.

I have read over -help fvvarlist- and have been able to understand it better now that I have some experience (albeit very limited) with Stata. I have managed to create the factor variables I was after.

However, I am running into a problem with collinearity. I am afraid I will have to expose my statistical illiteracy here; it's been a while since I took courses in statistics, so my knowledge needs some brushing up.

I am running xtreg with fixed effects as follows:

Code:

xtreg GREENPREMIUM logISSUEAMOUNT MATURITY i.CURRENCY_n i.RATING_n i.PROJECTTYPE_n, fe robust
note: logISSUEAMOUNT omitted because of collinearity
note: MATURITY omitted because of collinearity
note: 1.CURRENCY_n omitted because of collinearity
note: 2.CURRENCY_n omitted because of collinearity
note: 3.CURRENCY_n omitted because of collinearity
note: 4.CURRENCY_n omitted because of collinearity
note: 5.CURRENCY_n omitted because of collinearity
note: 6.CURRENCY_n omitted because of collinearity
note: 7.CURRENCY_n omitted because of collinearity
note: 8.CURRENCY_n omitted because of collinearity
note: 9.CURRENCY_n omitted because of collinearity
note: 10.CURRENCY_n omitted because of collinearity
note: 1.RATING_n omitted because of collinearity
note: 3.RATING_n omitted because of collinearity
note: 5.RATING_n omitted because of collinearity
note: 6.RATING_n omitted because of collinearity
note: 2.PROJECTTYPE_n omitted because of collinearity
note: 3.PROJECTTYPE_n omitted because of collinearity
note: 5.PROJECTTYPE_n omitted because of collinearity
note: 6.PROJECTTYPE_n omitted because of collinearity
note: 7.PROJECTTYPE_n omitted because of collinearity
note: 8.PROJECTTYPE_n omitted because of collinearity

Fixed-effects (within) regression               Number of obs     =     54,402
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0000                                         min =         18
     between = 0.0350                                         avg =      327.7
     overall =      .                                         max =      1,180

                                                F(0,165)          =          .
corr(u_i, Xb)  =      .                         Prob > F          =          .

                                     (Std. Err. adjusted for 166 clusters in RIC_2)
-----------------------------------------------------------------------------------
                  |               Robust
     GREENPREMIUM |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
   logISSUEAMOUNT |          0  (omitted)
         MATURITY |          0  (omitted)
                  |
       CURRENCY_n |
             AUD  |          0  (omitted)
             CAD  |          0  (omitted)
             CHF  |          0  (omitted)
             EUR  |          0  (omitted)
             HKD  |          0  (omitted)
             IDR  |          0  (omitted)
             INR  |          0  (omitted)
             NOK  |          0  (omitted)
             NZD  |          0  (omitted)
             SEK  |          0  (omitted)
                  |
         RATING_n |
               A  |          0  (omitted)
              AA  |          0  (omitted)
             BBB  |          0  (omitted)
             N/A  |          0  (omitted)
                  |
    PROJECTTYPE_n |
          Energy  |          0  (omitted)
        Land Use  |          0  (omitted)
             N/A  |          0  (omitted)
  Transportation  |          0  (omitted)
Waste Management  |          0  (omitted)
           Water  |          0  (omitted)
                  |
            _cons |   .0134972          .        .       .            .           .
------------------+----------------------------------------------------------------
          sigma_u |  .03493133
          sigma_e |  .01669494
              rho |  .81405202   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------

As you can see, all of my variables are being omitted from the regression. After reading the forum and browsing the Stata manual, I suspect that this is some form of the dummy variable trap? These variables are constant for each bond and do not change over the time period. Since I am specifying the model as a fixed effect, is this problem arising because there are no changes in either the dependent variable or any of the predictor variables? This intuitively makes sense, as the fixed effect specification, if I understand correctly, means that I am basically running a regression on variables that are all constant, which would explain why there is collinearity? Am I thinking correctly and, if so, how do I get around this?

Many thanks for your help Carlo, I really appreciate it.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#14

29 Apr 2019, 04:00

Nils:
your intuition is correct: the -fe- machinery wipes out time-invariant predictors (and, unfortunately, currencies, like family names, are not expected to change as time goes by).
The last resort is to check whether the -re- specification fits your data better than -fe-.
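
The wiping-out can be seen on a small simulated panel. This is only a sketch under made-up assumptions: the variables id, t, x, z and y are hypothetical and are not taken from the bond data. The fixed-effects estimator works on deviations from each unit's own mean, and a predictor that never changes within a unit has deviations that are all zero.

```stata
* Simulated panel: 5 units, 10 periods each (all names here are made up)
clear
set seed 12345
set obs 5
generate id = _n
generate double z = rnormal()            // time-invariant, like currency or rating
expand 10                                // 10 copies of each unit
bysort id: generate t = _n
generate double x = rnormal()            // varies within unit
generate double y = 1 + 0.5*x + 2*z + rnormal()
xtset id t

* Within transformation by hand: subtract each unit's own mean
bysort id: egen double zbar = mean(z)
generate double z_within = z - zbar
summarize z_within                       // zero (up to rounding): no variation left

* -xtreg, fe- applies the same demeaning, so z is dropped while x is estimated
xtreg y x z, fe
```

In this sketch -xtreg, fe- should report z as omitted because of collinearity, which is the same note shown for logISSUEAMOUNT, MATURITY and the factor variables above.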

Kind regards,
Carlo
(Stata 19.0)

Nils Edgren

Join Date: Apr 2019
Posts: 22

#15

29 Apr 2019, 04:26

Carlo Lazzaro
That does make sense, and is something I probably should've realized earlier. In the step 1 regression mentioned earlier I ran a fixed effect regression to obtain the variable GREENPREMIUM. This variable is constant for each bond over the time-series but varies between bonds. Would it be reasonable to use the -re- specification in the step 2 regression even if I used the -fe- specification to obtain the dependent variable in step 2 (i.e. GREENPREMIUM)? Or would a pooled OLS be a better approach? To illustrate, I tried performing both. Doing so yields the following result:

Code:

 xtreg GREENPREMIUM logISSUEAMOUNT MATURITY i.CURRENCY_n i.RATING_n i.PROJECTTYPE_n, re robust

Random-effects GLS regression                   Number of obs     =     54,402
Group variable: RIC_2                           Number of groups  =        166

R-sq:                                           Obs per group:
     within  = 0.0000                                         min =         18
     between = 0.1362                                         avg =      327.7
     overall = 0.1798                                         max =      1,180

                                                Wald chi2(21)     =          .
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =          .

                                     (Std. Err. adjusted for 166 clusters in RIC_2)
-----------------------------------------------------------------------------------
                  |               Robust
     GREENPREMIUM |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
   logISSUEAMOUNT |  -.0034405   .0035123    -0.98   0.327    -.0103245    .0034435
         MATURITY |   .0031035   .0013315     2.33   0.020     .0004938    .0057131
                  |
       CURRENCY_n |
             AUD  |   .0010443   .0071359     0.15   0.884    -.0129418    .0150303
             CAD  |   .0019359   .0057097     0.34   0.735    -.0092549    .0131266
             CHF  |  -.0314472    .014296    -2.20   0.028    -.0594669   -.0034274
             EUR  |   -.011928   .0073211    -1.63   0.103     -.026277     .002421
             HKD  |  -.0405451   .0277118    -1.46   0.143    -.0948592     .013769
             IDR  |   .0288521   .0197445     1.46   0.144    -.0098464    .0675506
             INR  |   .0031791   .0115173     0.28   0.783    -.0193944    .0257526
             NOK  |  -.0261247   .0133215    -1.96   0.050    -.0522344    -.000015
             NZD  |  -.0319061   .0128562    -2.48   0.013    -.0571039   -.0067083
             SEK  |  -.0020055   .0082697    -0.24   0.808    -.0182138    .0142028
                  |
         RATING_n |
               A  |   .0014689   .0067198     0.22   0.827    -.0117016    .0146395
              AA  |  -.0008373   .0058051    -0.14   0.885     -.012215    .0105404
             BBB  |   .0024144   .0048281     0.50   0.617    -.0070485    .0118772
             N/A  |   .0061399   .0102482     0.60   0.549    -.0139461    .0262259
                  |
    PROJECTTYPE_n |
          Energy  |   .0012968   .0087217     0.15   0.882    -.0157975    .0183911
        Land Use  |  -.0037861   .0132794    -0.29   0.776    -.0298132     .022241
             N/A  |   -.009492   .0116256    -0.82   0.414    -.0322778    .0132938
  Transportation  |   .0035156   .0082554     0.43   0.670    -.0126646    .0196958
Waste Management  |   .0213694   .0275832     0.77   0.439    -.0326927    .0754315
           Water  |   .0393904     .01969     2.00   0.045     .0007987     .077982
                  |
            _cons |   .0755063   .0697502     1.08   0.279    -.0612015    .2122142
------------------+----------------------------------------------------------------
          sigma_u |  .03484352
          sigma_e |  .01669494
              rho |  .81328881   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------

Pooled OLS:

Code:

regress GREENPREMIUM logISSUEAMOUNT MATURITY i.CURRENCY_n i.RATING_n i.PROJECTTYPE_n, vce(cluster RIC_2)

Linear regression                               Number of obs     =     54,402
                                                F(20, 165)        =          .
                                                Prob > F          =          .
                                                R-squared         =     0.2052
                                                Root MSE          =     .03874

                                     (Std. Err. adjusted for 166 clusters in RIC_2)
-----------------------------------------------------------------------------------
                  |               Robust
     GREENPREMIUM |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
   logISSUEAMOUNT |  -.0038494   .0055061    -0.70   0.485    -.0147209    .0070222
         MATURITY |   .0047137   .0020394     2.31   0.022     .0006869    .0087405
                  |
       CURRENCY_n |
             AUD  |  -.0003502   .0074541    -0.05   0.963    -.0150678    .0143675
             CAD  |   .0031494   .0084453     0.37   0.710    -.0135254    .0198242
             CHF  |  -.0332557   .0225441    -1.48   0.142    -.0777678    .0112564
             EUR  |  -.0202907   .0113747    -1.78   0.076    -.0427494     .002168
             HKD  |  -.0702176   .0319255    -2.20   0.029    -.1332528   -.0071824
             IDR  |    .039081   .0312817     1.25   0.213     -.022683     .100845
             INR  |   .0084365   .0187356     0.45   0.653    -.0285559    .0454289
             NOK  |   -.050783   .0208695    -2.43   0.016    -.0919887   -.0095772
             NZD  |  -.0248428   .0168512    -1.47   0.142    -.0581146    .0084291
             SEK  |   .0070028   .0106498     0.66   0.512    -.0140246    .0280303
                  |
         RATING_n |
               A  |   .0060837   .0076685     0.79   0.429    -.0090573    .0212246
              AA  |  -.0010465   .0090199    -0.12   0.908    -.0188557    .0167628
             BBB  |   .0103699   .0088591     1.17   0.243    -.0071219    .0278616
             N/A  |   .0175343   .0124369     1.41   0.160    -.0070217    .0420903
                  |
    PROJECTTYPE_n |
          Energy  |   .0162782   .0147251     1.11   0.271    -.0127957    .0453521
        Land Use  |   .0139873   .0244345     0.57   0.568    -.0342573     .062232
             N/A  |  -.0021629   .0177791    -0.12   0.903    -.0372667    .0329409
  Transportation  |   .0109533   .0106392     1.03   0.305    -.0100533    .0319599
Waste Management  |   .0513677   .0341644     1.50   0.135     -.016088    .1188234
           Water  |   .0760223   .0273014     2.78   0.006     .0221171    .1299275
                  |
            _cons |   .0645238   .1117128     0.58   0.564    -.1560471    .2850947
-----------------------------------------------------------------------------------

After reading your replies to earlier posts on the forum, I also ran -xttest0- with the following result:

Code:

xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        GREENPREMIUM[RIC_2,t] = Xb + u[RIC_2] + e[RIC_2,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
               GREENPR~M |   .0018877       .0434479
                       e |   .0002787       .0166949
                       u |   .0012141       .0348435

        Test:   Var(u) = 0
                             chibar2(01) =  8.9e+06
                          Prob > chibar2 =   0.0000

If I understood those replies correctly, this would point to -xtreg, re- being the better choice due to the statistical significance?
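
For reference, -xttest0- tests H0: Var(u) = 0, that is, random effects against pooled OLS; it does not compare -re- with -fe-. The variance components in its output can be tied back to the rho reported by -xtreg, re- with a few lines of arithmetic. The scalar names below are purely illustrative; the two variances are copied from the -xttest0- table above.

```stata
* Variance components as printed by -xttest0- in this thread
scalar var_u = .0012141
scalar var_e = .0002787

* Standard deviations: compare with sigma_u and sigma_e from -xtreg, re-
display "sd(u) = " sqrt(var_u)
display "sd(e) = " sqrt(var_e)

* Share of total variance due to the bond-level effect u_i
* (roughly 0.813, in line with the rho line of the -xtreg, re- output)
display "rho   = " var_u/(var_u + var_e)
```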
