Hi! I'm fairly new to Stata, so I hope you can forgive anything that sounds like a stupid question.
I have collected data on average years of schooling (AYS), educational inequality (GINI) and capital (LOGCAP) to use as regressors for the dependent variable LOGGDP in China. The data contain 589 observations on 31 Chinese provinces across 19 time periods (years).
Initial simple OLS estimation indicated heteroskedasticity, so I have used robust standard errors from then on.
Code:
. xtset PROVINCE1 DATE
       panel variable:  PROVINCE1 (unbalanced)
        time variable:  DATE, 1997 to 2015
                delta:  1 unit
Code:
. xtsum LOGGDP AYS GINI LOGCAP

Variable         |      Mean  Std. Dev.        Min        Max |    Observations
-----------------+--------------------------------------------+----------------
LOGGDP   overall |  9.733266   .8980339   6.472346   11.58952 |     N =     589
         between |             .4966316   8.860931   10.94758 |     n =      31
         within  |             .7532411   6.915146   11.17617 |     T =      19
                 |                                            |
AYS      overall |  7.924623    1.28708    2.94794   12.17608 |     N =     589
         between |             1.063565   4.153079   10.39051 |     n =      31
         within  |             .7483536   4.215633   9.710199 |     T =      19
                 |                                            |
GINI     overall |  .2386409    .061669    .126716   .5569839 |     N =     589
         between |             .0547499   .1904903   .4685663 |     n =      31
         within  |             .0299544   .1224043   .4040085 |     T =      19
                 |                                            |
LOGCAP   overall |  5.383967   .9777727          0   6.376727 |     N =     589
         between |             .2456492   4.997763   5.819978 |     n =      31
         within  |             .9473876   .2619565   6.733011 |     T =      19
I am primarily interested in whether geographical location has a fixed effect on GDP that is correlated with the regressors (e.g. lower average years of schooling in Western, rural areas). I initially set up the geography dummy variables EAST, CENTRAL and WEST, with each province assigned 1 for the group in which it falls and 0 otherwise. I then went on to perform fixed-effects estimation instead, which yielded the results below.
Code:
. xtreg LOGGDP AYS GINI LOGCAP, fe robust

Fixed-effects (within) regression               Number of obs     =        589
Group variable: PROVINCE1                       Number of groups  =         31

R-sq:                                           Obs per group:
     within  = 0.6205                                         min =         19
     between = 0.3435                                         avg =       19.0
     overall = 0.3549                                         max =         19

                                                F(3,30)           =      92.62
corr(u_i, Xb)  = -0.7887                        Prob > F          =     0.0000

                            (Std. Err. adjusted for 31 clusters in PROVINCE1)
------------------------------------------------------------------------------
             |               Robust
      LOGGDP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         AYS |   .5883079    .056908    10.34   0.000     .4720862    .7045297
        GINI |  -9.588103   1.711882    -5.60   0.000    -13.08423   -6.091974
      LOGCAP |  -.0774428   .0276631    -2.80   0.009    -.1339384   -.0209471
       _cons |   7.776209   .6808327    11.42   0.000     6.385763    9.166655
-------------+----------------------------------------------------------------
     sigma_u |  .91221865
     sigma_e |  .47764761
         rho |  .78482564   (fraction of variance due to u_i)
------------------------------------------------------------------------------
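For comparison, the dummy-variable specification I set up first could be written along the following lines. This is only a sketch: it assumes pooled OLS with EAST as the omitted base category and standard errors clustered by province, none of which appears in the output above.

```stata
* Sketch only: pooled OLS with region dummies instead of province fixed effects.
* EAST is left out as the base category (an assumption); including all three
* dummies together with the constant would be perfectly collinear.
regress LOGGDP AYS GINI LOGCAP CENTRAL WEST, vce(cluster PROVINCE1)

* Joint test of whether the region dummies matter
test CENTRAL WEST
```

In this version the coefficients on CENTRAL and WEST would be read as differences in LOGGDP relative to EAST, holding the other regressors fixed.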
My question: am I correct in my interpretation that using xtreg, fe accounts for unobserved, time-invariant heterogeneity across provinces (e.g. geographical region, as well as other effects), so that the geography dummy variables are unnecessary? Or would it make my point more clearly to use these dummy variables explicitly rather than the fe option? Further, when interpreting my results, should I read the 'within' or 'overall' R-squared value, or the rho value, to assess my model's fit? Could I improve the fit through another method, such as adding a time trend (which I've read about but can't understand how to apply here)?
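From what I have read, a time trend or year effects would be added to the fixed-effects model roughly as follows. This is a sketch under my own assumptions (that DATE holds the calendar year, as the xtset output suggests, and that clustering on PROVINCE1 matches what the robust option did above); I have not verified that either version is appropriate for these data.

```stata
* Sketch only: year dummies (two-way fixed effects), one indicator per year in DATE
xtreg LOGGDP AYS GINI LOGCAP i.DATE, fe vce(cluster PROVINCE1)

* Joint test of whether the year dummies belong in the model
testparm i.DATE

* Alternative: a single linear time trend, treating DATE as continuous
xtreg LOGGDP AYS GINI LOGCAP c.DATE, fe vce(cluster PROVINCE1)
```

The first version lets each year have its own intercept shift, while the second forces a constant change in LOGGDP per year.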
I have already searched the internet and textbooks extensively for these answers, so I am asking here not out of laziness but out of genuine confusion. Also note that I am aware the panel is reported as unbalanced; however, this is due to missing data in variables I am not using. I also performed the Hausman test to check that fixed effects was the appropriate model.
Thank you very much for any help you can give!