Help fixed effects regression: including new variables turns regressors insignificant

Tisi Regen

Join Date: Nov 2014

Posts: 13
#1

29 Dec 2014, 08:33

Hi folks,

After composing my model for estimating the relationship between the old-age dependency ratio (OADR: the population aged 65+ relative to the working-age population) and GDP per capita, I stumbled upon various problems where I need the help of experts like you. I use (unbalanced) panel data comprising the 28 EU member countries over the period 1970-2012. My dependent variable is GDP per capita and my explanatory variable of interest is the OADR. The other variables I control for are saving rates, the Human Development Index (HDI), the HICP (Harmonised Index of Consumer Prices), and total factor productivity (the part of output not explained by measured inputs, which, broadly speaking, captures technological advance). I retrieved the data from the UN, the World Bank and Eurostat.

First I ran a normal regression, but already here the output was confusing: I was expecting a negative correlation, since the higher the share of elderly dependent people, the lower GDP per capita should be.

Since the results seemed to be significant, I moved on, thinking there might be another explanation. However, I also ran a multiple linear regression in which OADR became insignificant. At first I thought this might be due to multicollinearity, but the results suggest otherwise.

Despite these results, I thought I should move on to either a random-effects or a fixed-effects regression for panel data. The Hausman test suggested (as I expected) a fixed-effects model, so I then tested for autocorrelation and normality of the residuals, and ran a Breusch-Pagan test for heteroskedasticity and a Wald test for groupwise heteroskedasticity.

All the tests seem to suggest that the usual assumptions hold and that my model should be okay. However, if I include all my control variables in the fixed-effects regression, all my regressors become highly insignificant.

My questions for you are:
- Where did I make mistakes?
- How can I improve my model (in order to be able to draw conclusions about whether age structure has an effect on economic performance in already ageing societies)?

As always thank you for your help and have a happy new year 2015 everyone!
Best,
Tisi

P.S.: Sorry I had to use pictures, but copy and paste did not seem to work anymore.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17729
#2

29 Dec 2014, 09:41

Tisi:
I think you should change your mind about the existence of a "best model" for explaining your depvar. You would be better off considering what other researchers have done in the past when analysing similar data.
You do not report the code of the -xtreg, fe- and -xtreg, re- models you compared with the Hausman test, nor do you explain why you decided to run two -xtreg, fe- models with different sets of predictors.
Anyway, your second -xtreg, fe- (the one with the wider set of predictors) looks bewildering, probably because of a huge number of missing values: the number of observations and groups drops dramatically compared with the first -xtreg, fe- model, because Stata by default applies listwise deletion in -xt- commands (and in regression in general), dropping any observation with a missing value in any model variable.
In the first part of your post, you seemingly ignored the panel structure of your data and used -reg- instead of -xtreg-; this may seem to be the source of your problems but, at a second glance, missing values are again the culprit.
Hence, you have to investigate the mechanism and the pattern underlying the missingness and deal with them accordingly (-help mi- can be a good place to start).
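
A minimal sketch of how the missingness could be inspected before reaching for -mi-. It assumes the panel is in memory and declared with -xtset co year-; the variable names follow the output posted later in this thread, except hdi, which is a hypothetical name for the Human Development Index.

```stata
* Assumption: data in memory, -xtset co year- already run.
* hdi is a hypothetical variable name; the others appear later in the thread.
local vars gdppc oadr savingsrate hdi tfp hicp

* How many values are missing in each variable?
misstable summarize `vars'

* Which variables tend to be missing together?
misstable patterns `vars', frequency

* Which observations and countries survive listwise deletion?
xtreg `vars', fe
count if e(sample)
xtdescribe if e(sample)
```

Comparing the -count- result with the total number of observations shows how much of the panel the model actually uses.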

Kind regards,
Carlo
(Stata 19.0)
Roman Mostazir

Join Date: Apr 2014

Posts: 876
#3

29 Dec 2014, 09:57

Suggestion 1: Please read the FAQ on how to post Stata output. You could easily have copied the output and used the CODE delimiters (click A, then click '#' and paste the copied material inside) to show us the output.

Suggestion 2: I will ignore the results from the normal regression: you have panel data, and a normal regression that ignores the panel structure will not give unbiased estimates.

Suggestion 3: The last two outputs make me frown and wonder why the number of observations is 1064 in the model with 'oadr' alone and drops drastically to 29 when the other covariates are added ... missing values?

Suggestion 4: A significant Hausman test does not always mean you must use a fixed-effects model, which ignores between-country effects. A random intercept/coefficient model should also be tried if between-country effects need to be considered.

Just noticed the replies from Carlo and Clyde (we were writing at about the same time, I suppose), and I think we all pointed at the same problem.

Last edited by Roman Mostazir; 29 Dec 2014, 10:23.

Roman
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#4

29 Dec 2014, 10:18

I agree with Carlo that the massive amount of missing data (only 29 complete cases among 15 groups) is the root of Tisi's problem. But I'm not sure how much help MI will be. With so little primary data to start with, the imputed variables will probably exhibit large variance, and this will pass through to large standard errors in the MI analysis. And some people would take a highly skeptical view of any results based on multiply imputing 97% of the observations.

I think that Tisi may have to get more real data to fix this. The variables in question do not strike this non-economist as particularly exotic--they are the stuff often reported and commented on in the lay press, and it is hard to believe that complete data on these for EU countries cannot be obtained somewhere. In fact, it is hard to believe they are not available from the sources noted in the original post. Which leads me to think that the problem will ultimately trace itself back to some mis-step in building the data set being analyzed.

Last edited by Clyde Schechter; 29 Dec 2014, 10:21.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17729
#5

29 Dec 2014, 11:15

Clyde's take is a very good one, as it underlines the importance of double-checking database consistency before starting any statistical procedure (especially inferential ones).
Whenever I did not pay attention to this cautionary tale, I sadly came to regret it.
Just elaborating a bit on Roman's interesting advice, Tisi may want to take a look at -help mixed-.
It goes without saying that a change of statistical model will not in itself fix the substantive missing-data issue Tisi has to deal with.
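
For orientation, a hedged sketch of what a random-intercept and a random-coefficient specification could look like with -mixed-. The variable names are taken from output posted later in this thread, and the choice of oadr as the variable with a country-specific slope is only an illustration.

```stata
* Assumption: data in memory; co identifies the country.

* Random intercept for each country
mixed gdppc oadr savingsrate tfp hicp || co:

* Random intercept plus a country-specific slope on oadr
mixed gdppc oadr savingsrate tfp hicp || co: oadr, covariance(unstructured)
```

Like -xtreg-, -mixed- drops every observation with a missing value in any model variable, so the sample size would be no larger than before.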

Kind regards,
Carlo
(Stata 19.0)
Tisi Regen

Join Date: Nov 2014

Posts: 13
#6

30 Dec 2014, 01:04

I really appreciate your helpful and constructive comments - thank you! I will double-check the available data, but even though it is not necessarily exotic data it is - as far as I recall - difficult to find specific data for such a long time span. Maybe the use of different control variables instead of aggregated indicators such as HDI will be more effective and sensible as well.

In case I do not find more suitable data, would you suggest it makes more sense to reduce the time period of investigation in order to yield a more balanced panel?

Best,
Tisi
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17729
#7

30 Dec 2014, 01:44

Tisi:
Yes, reducing the time span of your investigation may be worth trying, provided that your dataset does not include countries for which data are systematically missing. If it does, you would face a non-ignorable missingness mechanism, which should be properly addressed.
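
One way to check both points (which years are well covered, and whether some countries never have complete data) is sketched below. The variable list is an assumption borrowed from the output posted later in this thread; nmiss, complete and n_complete are hypothetical helper names.

```stata
* Assumption: data in memory with country id co and time variable year.
local vars gdppc oadr savingsrate tfp hicp

* Flag observations with no missing value in any model variable
egen byte nmiss = rowmiss(`vars')
generate byte complete = (nmiss == 0)

* Complete cases per year: helps to choose a sensible start year
tabulate year complete

* Complete cases per country: list countries that never contribute
bysort co: egen n_complete = total(complete)
tabulate co if n_complete == 0
```

Countries listed by the last command would drop out of any model using these variables, whatever time window is chosen.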

Kind regards,
Carlo
(Stata 19.0)

Tisi Regen

Join Date: Nov 2014

Posts: 13
#8

05 Jan 2015, 05:16

Okay, so I reduced the time span from 1970-2012 to 1980-2012, dropped some variables, and added others (e.g. economically active population, "eap") that are less unbalanced. The panel is now only slightly unbalanced and the results look much better, so in this regard, thanks for your help!

However, I am encountering some new issues where I would appreciate your help and comments!

Here are my rather extensive results:

Code:

. xtset co year
       panel variable:  co (strongly balanced)
        time variable:  year, 1980 to 2012
                delta:  1 unit

. xtreg gdppc oadr eap savingsrate tfp hicp, fe

Fixed-effects (within) regression               Number of obs      =       142
Group variable: co                              Number of groups   =        14

R-sq:  within  = 0.5230                         Obs per group: min =         6
       between = 0.0000                                        avg =      10.1
       overall = 0.0071                                        max =        12

                                                F(5,123)           =     26.98
corr(u_i, Xb)  = -0.9793                        Prob > F           =    0.0000

------------------------------------------------------------------------------
       gdppc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        oadr |   847.4269   735.4887     1.15   0.251    -608.4278    2303.282
         eap |   5.366884     .99389     5.40   0.000      3.39954    7.334229
 savingsrate |   284.1763   253.1895     1.12   0.264    -216.9967    785.3493
         tfp |   304.7452   149.5268     2.04   0.044     8.766058    600.7243
        hicp |    218.725   84.10931     2.60   0.010      52.2358    385.2142
       _cons |  -102718.5   22725.99    -4.52   0.000    -147703.2   -57733.77
-------------+----------------------------------------------------------------
     sigma_u |   57047.73
     sigma_e |  4845.7665
         rho |  .99283649   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(13, 123) =    31.79             Prob > F = 0.0000

. estimates store fixed

. xtreg gdppc oadr eap savingsrate tfp hicp, re

Random-effects GLS regression                   Number of obs      =       142
Group variable: co                              Number of groups   =        14

R-sq:  within  = 0.4133                         Obs per group: min =         6
       between = 0.2335                                        avg =      10.1
       overall = 0.2863                                        max =        12

                                                Wald chi2(5)       =     89.59
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       gdppc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        oadr |   44.22357   625.7191     0.07   0.944    -1182.163    1270.611
         eap |   .2064205   .2652732     0.78   0.436    -.3135054    .7263463
 savingsrate |  -54.37277   256.7287    -0.21   0.832    -557.5519    448.8063
         tfp |   21.14092   146.3144     0.14   0.885      -265.63    307.9118
        hicp |   454.3088    76.6919     5.92   0.000     303.9954    604.6222
       _cons |  -19910.43   17764.36    -1.12   0.262    -54727.94    14907.07
-------------+----------------------------------------------------------------
     sigma_u |  8287.9677
     sigma_e |  4845.7665
         rho |  .74524272   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. estimates store random

. hausman fixed random

                 ---- Coefficients ----
             |      (b)          (B)            (b-B)     sqrt(diag(V_b-V_B))
             |     fixed        random       Difference          S.E.
-------------+----------------------------------------------------------------
        oadr |    847.4269     44.22357        803.2033        386.5477
         eap |    5.366884     .2064205        5.160464        .9578348
 savingsrate |    284.1763    -54.37277        338.5491               .
         tfp |    304.7452     21.14092        283.6042        30.82791
        hicp |     218.725     454.3088       -235.5838        34.53589
------------------------------------------------------------------------------
                           b = consistent under Ho and Ha; obtained from xtreg
            B = inconsistent under Ha, efficient under Ho; obtained from xtreg

    Test:  Ho:  difference in coefficients not systematic

                  chi2(5) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                          =       17.34
                Prob>chi2 =      0.0039
                (V_b-V_B is not positive definite)


. xtreg gdppc oadr eap savingsrate tfp hicp, fe vce(cluster co)

Fixed-effects (within) regression               Number of obs      =       142
Group variable: co                              Number of groups   =        14

R-sq:  within  = 0.5230                         Obs per group: min =         6
       between = 0.0000                                        avg =      10.1
       overall = 0.0071                                        max =        12

                                                F(5,13)            =      9.99
corr(u_i, Xb)  = -0.9793                        Prob > F           =    0.0004

                                    (Std. Err. adjusted for 14 clusters in co)
------------------------------------------------------------------------------
             |               Robust
       gdppc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        oadr |   847.4269   1347.151     0.63   0.540    -2062.915    3757.769
         eap |   5.366884   1.690367     3.17   0.007     1.715069      9.0187
 savingsrate |   284.1763   449.7413     0.63   0.538    -687.4308    1255.783
         tfp |   304.7452   313.7781     0.97   0.349    -373.1311    982.6215
        hicp |    218.725   182.9132     1.20   0.253    -176.4349    613.8849
       _cons |  -102718.5   41309.83    -2.49   0.027    -191962.9   -13473.99
-------------+----------------------------------------------------------------
     sigma_u |   57047.73
     sigma_e |  4845.7665
         rho |  .99283649   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. xtreg gdppc oadr eap savingsrate tfp hicp, re vce(cluster co)

Random-effects GLS regression                   Number of obs      =       142
Group variable: co                              Number of groups   =        14

R-sq:  within  = 0.4133                         Obs per group: min =         6
       between = 0.2335                                        avg =      10.1
       overall = 0.2863                                        max =        12

                                                Wald chi2(5)       =     73.55
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

                                    (Std. Err. adjusted for 14 clusters in co)
------------------------------------------------------------------------------
             |               Robust
       gdppc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        oadr |   44.22357   1189.797     0.04   0.970    -2287.736    2376.184
         eap |   .2064205   .2544364     0.81   0.417    -.2922656    .7051066
 savingsrate |  -54.37277   481.4288    -0.11   0.910    -997.9558    889.2103
         tfp |   21.14092   371.7027     0.06   0.955     -707.383    749.6649
        hicp |   454.3088   181.3646     2.50   0.012      98.8407    809.7769
       _cons |  -19910.43   44170.71    -0.45   0.652    -106483.4    66662.57
-------------+----------------------------------------------------------------
     sigma_u |  8287.9677
     sigma_e |  4845.7665
         rho |  .74524272   (fraction of variance due to u_i)
------------------------------------------------------------------------------


. xtreg gdppc oadr eap savingsrate tfp hicp dco1 dco2 dco3 dco4 dco5 dco6 dco7 d
> co8 dco9 dco10 dco11 dco12 dco13 dco14 dco15 dco16 dco17 dco18 dco19 dco20 dco
> 21 dco22 dco23 dco24 dco25 dco26 dco27, re
note: dco3 omitted because of collinearity
note: dco4 omitted because of collinearity
note: dco7 omitted because of collinearity
note: dco10 omitted because of collinearity
note: dco11 omitted because of collinearity
note: dco15 omitted because of collinearity
note: dco16 omitted because of collinearity
note: dco17 omitted because of collinearity
note: dco18 omitted because of collinearity
note: dco20 omitted because of collinearity
note: dco21 omitted because of collinearity
note: dco22 omitted because of collinearity
note: dco23 omitted because of collinearity
note: dco27 omitted because of collinearity

Random-effects GLS regression                   Number of obs      =       142
Group variable: co                              Number of groups   =        14

R-sq:  within  = 0.5230                         Obs per group: min =         6
       between = 1.0000                                        avg =      10.1
       overall = 0.8618                                        max =        12

                                                Wald chi2(18)      =    766.85
corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

------------------------------------------------------------------------------
       gdppc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        oadr |   847.4269   735.4887     1.15   0.249    -594.1044    2288.958
         eap |   5.366884     .99389     5.40   0.000     3.418896    7.314873
 savingsrate |   284.1763   253.1895     1.12   0.262    -212.0659    780.4185
         tfp |   304.7452   149.5268     2.04   0.042     11.67805    597.8123
        hicp |    218.725   84.10931     2.60   0.009      53.8738    383.5762
        dco1 |   134869.6   25056.57     5.38   0.000     85759.62    183979.6
        dco2 |   128720.7   24819.06     5.19   0.000     80076.21    177365.1
        dco3 |          0  (omitted)
        dco4 |          0  (omitted)
        dco5 |   113288.8   24410.77     4.64   0.000     65444.57      161133
        dco6 |   154086.6   26713.02     5.77   0.000       101730    206443.1
        dco7 |          0  (omitted)
        dco8 |   141529.6   26365.18     5.37   0.000     89854.81    193204.4
        dco9 |   7889.392   3849.661     2.05   0.040      344.196    15434.59
       dco10 |          0  (omitted)
       dco11 |          0  (omitted)
       dco12 |   108731.9   24047.56     4.52   0.000     61599.53    155864.2
       dco13 |   166254.5   27899.32     5.96   0.000     111572.8    220936.1
       dco14 |    19912.6   6881.852     2.89   0.004      6424.42    33400.78
       dco15 |          0  (omitted)
       dco16 |          0  (omitted)
       dco17 |          0  (omitted)
       dco18 |          0  (omitted)
       dco19 |   117090.1   21249.31     5.51   0.000     75442.23      158738
       dco20 |          0  (omitted)
       dco21 |          0  (omitted)
       dco22 |          0  (omitted)
       dco23 |          0  (omitted)
       dco24 |   137887.2    27646.2     4.99   0.000     83701.59    192072.7
       dco25 |    43622.1   10102.59     4.32   0.000     23821.39     63422.8
       dco26 |   134276.3    24781.1     5.42   0.000     85706.19    182846.3
       dco27 |          0  (omitted)
       _cons |  -201613.2   36522.14    -5.52   0.000    -273195.3   -130031.1
-------------+----------------------------------------------------------------
     sigma_u |          0
     sigma_e |  4845.7665
         rho |          0   (fraction of variance due to u_i)

Why is it that using robust standard errors is - once again - turning most of my regressors insignificant?

Also, I am a bit confused as the number of observations and groups is still very low and does not cover my entire panel. When I include the country-specific dummies, most of them are being omitted due to collinearity. Does this have something to do with my high intraclass correlation term? And how problematic are these high rho values?

Lastly, I was wondering to what extent this omission is affecting the inferential capability of my model?

Happy new year,
Tisi


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17729
#9

05 Jan 2015, 07:32

Tisi:
1) cluster-robust standard errors (SEs) allow for heteroskedasticity and for serial correlation of the residuals within each cluster, and they are based on the number of clusters (not observations), because under the panel structure the same id is measured repeatedly over the time-series identifier. That said, it is not surprising that most of your predictors fail to reach statistical significance. However, I do not consider this a problem because, as Altman & Bland put it, "absence of evidence is not evidence of absence" (http://www.bmj.com/content/311/7003/...iant=full-text). Moreover, you have few observations relative to the number of predictors (the rule of thumb requires 20 observations per predictor¹, and your data seem close to the limit in this respect): this may be another reason why statistical significance seems out of reach (but Altman & Bland's cautionary tale still holds).

2) Country-specific dummies are omitted due to collinearity because they are already included as the cross-sectional identifier in your -xtset-.

As an aside, I would go -fe- without country-specific dummies, as the Hausman test seems to point you that way.
Finally, since your default SEs differ widely from the clustered ones, you may want to investigate further via a robust Hausman test (http://www.stata.com/statalist/archi.../msg01069.html) which specification (i.e., -fe- or -re-) fits your data better under -xtreg-.
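
A hedged sketch of one common cluster-robust alternative to -hausman-, the Mundlak-type auxiliary regression. It is not necessarily the approach described in the linked post. The variable names are those of the models above; insample and the mean_* variables are hypothetical helper names.

```stata
* Fix the estimation sample first, so the country means
* are computed over the same observations the model uses
xtreg gdppc oadr eap savingsrate tfp hicp, re
generate byte insample = e(sample)

* Country means of each time-varying regressor (Mundlak terms)
foreach v of varlist oadr eap savingsrate tfp hicp {
    bysort co: egen mean_`v' = mean(`v') if insample
}

* Random effects plus Mundlak terms, with cluster-robust SEs
xtreg gdppc oadr eap savingsrate tfp hicp mean_* if insample, re vce(cluster co)

* Joint test of the Mundlak terms: rejection points towards -fe-
testparm mean_*
```

With only 14 clusters the test has limited power, so a non-rejection would be weak evidence for -re-. The user-written -xtoverid- command (available from SSC) is another option after -xtreg, re-.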

¹Katz MH. Multivariable Analysis. Second Edition. NY: Cambridge University Press, 2006: 81.

Kind regards,
Carlo
(Stata 19.0)
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#10

05 Jan 2015, 08:50

So in your latest analyses, you have only 14 groups, and in the analysis that does not include the dummies, the ICC is 0.74, or higher in the other analyses. This is functionally the equivalent of doing a regression on at most a few dozen observations altogether. You simply don't have enough data here--I think it is hopeless no matter how you try to analyze it.

That said, it makes no sense to me to include dummy variables for country at the same time you are incorporating a random effect for country. I wouldn't know what to make of those results even if they looked satisfying to you.
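
For comparison, a minimal sketch of the two equivalent ways of getting country fixed effects, using the variable names from the posted output. Both should give the same slope coefficients, which is why hand-made dummies add nothing to -xtreg, fe-.

```stata
* Within (fixed-effects) estimator: country intercepts are absorbed
xtreg gdppc oadr eap savingsrate tfp hicp, fe

* Least-squares dummy variables: same slopes, explicit country intercepts
regress gdppc oadr eap savingsrate tfp hicp i.co

* Countries that actually enter the estimation sample
tabulate co if e(sample)
```

The -re- output with dummies above reproduces the -fe- coefficients exactly. One reading of the 14 omitted dummies is that only 14 countries have complete data: the dummy of a country with no observations in the sample is zero throughout, and one further category has to serve as the base.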
Tisi Regen

Join Date: Nov 2014

Posts: 13
#11

06 Jan 2015, 02:11

Thanks for the advice!

Clyde,

"You simply don't have enough data here--I think it is hopeless no matter how you try to analyze it."

Would you suggest that I stop my analysis and look for another model/topic, since it is hopeless as you said? Or what would you, and everyone else, recommend in order to fix this (assuming that it is possible)?
Stephen Jenkins

Join Date: Apr 2014

Posts: 1438
#12

06 Jan 2015, 03:11

I agree with Clyde that it's strange to include both country dummy variables and country random effects. In economists' jargon, you are trying to have both fixed and random effects. I don't think you can do this! I suggest that you need to do at least two things for your research project:
(1) Look seriously into getting better data. E.g. without a large number of time periods, it's going to be hard to identify cross-time effects.
(2) Study and learn from the huge literature about this sort of model. In political science, the pioneers were (I think) Beck and Katz (1995, ‘What to do (and not to do) with time-series cross-section data’, The American Political Science Review, 89 (3): 634–647). There are many subsequent papers, including by those authors. What political science researchers call TSCS models are typically called country panels by economists. A fundamental issue is the extent to which you may treat countries as "exchangeable".
Clyde Schechter

Join Date: Apr 2014

Posts: 30155
#13

06 Jan 2015, 08:09

What I am characterizing as hopeless is getting any useful results out of this particular data set with so many missing observations. I have no expertise in your discipline, but thinking generically, unless you are working on a topic that your colleagues view as inappropriate, it usually makes more sense to try to find a better source of data (other sources of the same variables, or other variables that are reasonable proxies) than to abandon your topic. But I don't know what's available to you, so it's hard to advise. I do know, though, that continuing to struggle with this particular data set is likely to be a waste of your time.
Tisi Regen

Join Date: Nov 2014

Posts: 13
#14

06 Jan 2015, 08:22

I appreciate the honesty and all of your help! For now, I will follow your advice and work on the underlying data. I will get back to you as soon as I have new results, when more questions may arise.
Tisi Regen

Join Date: Nov 2014

Posts: 13
#15

20 Jan 2015, 02:58

So I changed my entire model and oriented it towards the common literature, as some of you suggested. Results look very promising and I am actually about to finish the project.

One of my three models to estimate the effect of age structure on economic growth is:

Code:

g_ŷ = δ_0 + δ_1 ŷ + δ_2 g_lab + δ_3 g_pop + δ_4 ln(L/N) + δ_5 X + ε

where g_ŷ denotes the growth rate of GDP per capita, and g_lab and g_pop denote the growth rates of the labour force and of total population, respectively.

To interpret the results, I have to describe the effects of the two growth rates on economic growth. Population growth has a negative effect, whereas the coefficient on labour force growth is positive. Both variables are significant. Now, I want to make a statement about which of the two rates is growing faster and therefore overshadows the effect of the other.

However, I don't really know how to compare growth rates within a panel of 27 countries over the period 1950-2010 in five-year intervals. I considered a GARCH model, but it seemed too complicated for simply comparing the rates.

Does anyone have a suggestion or idea?

Is there any Stata command I could use?
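
A simple descriptive comparison may be enough here. The sketch below assumes the growth rates are stored as g_lab and g_pop, that the panel is declared with -xtset co year- (co and year are hypothetical names for this new dataset), and uses g_gap as a hypothetical helper variable.

```stata
* Assumption: five-year panel in memory, -xtset co year- already run.

* Overall, between-country and within-country variation of each rate
xtsum g_lab g_pop

* Mean gap between labour-force and population growth,
* with standard errors clustered by country
generate double g_gap = g_lab - g_pop
regress g_gap, vce(cluster co)

* How both rates and their gap evolve over the five-year periods
tabstat g_lab g_pop g_gap, by(year) statistics(mean sd n)
```

The constant in the -regress- output is the average gap, and its confidence interval shows whether one rate is systematically larger. If both rates are log differences, the gap is approximately the growth rate of L/N, which may simplify the interpretation.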