  • Using Fixed Effects estimation on panel data, how to interpret results?

    Hi! I'm fairly new to Stata, so I hope you can forgive anything that may sound like a stupid question.

    I have collected data on the average years of schooling (AYS), educational inequality (GINI) and capital (LOGCAP) to use as regressors for the dependent variable LOGGDP in China. The data contain 589 observations on 31 Chinese provinces across 19 time periods (years).
    An initial simple OLS regression showed evidence of heteroskedasticity, so I have used robust standard errors since then.

    Code:
    . xtset PROVINCE1 DATE
           panel variable:  PROVINCE1 (unbalanced)
            time variable:  DATE, 1997 to 2015
                    delta:  1 unit
    Code:
    . xtsum LOGGDP AYS GINI LOGCAP
    
    Variable         |      Mean   Std. Dev.       Min        Max |    Observations
    -----------------+--------------------------------------------+----------------
    LOGGDP   overall |  9.733266   .8980339   6.472346   11.58952 |     N =     589
             between |             .4966316   8.860931   10.94758 |     n =      31
             within  |             .7532411   6.915146   11.17617 |     T =      19
                     |                                            |
    AYS      overall |  7.924623    1.28708    2.94794   12.17608 |     N =     589
             between |             1.063565   4.153079   10.39051 |     n =      31
             within  |             .7483536   4.215633   9.710199 |     T =      19
                     |                                            |
    GINI     overall |  .2386409    .061669    .126716   .5569839 |     N =     589
             between |             .0547499   .1904903   .4685663 |     n =      31
             within  |             .0299544   .1224043   .4040085 |     T =      19
                     |                                            |
    LOGCAP   overall |  5.383967   .9777727          0   6.376727 |     N =     589
             between |             .2456492   4.997763   5.819978 |     n =      31
             within  |             .9473876   .2619565   6.733011 |     T =      19



    I am primarily interested in whether geographical location has a fixed effect on GDP which is correlated with the regressors (e.g. lower average years of schooling in Western, rural areas). I initially set up geography dummy variables EAST, CENTRAL and WEST, with each province being assigned 1 for the group in which it falls and 0 otherwise. I then went on to perform fixed-effects estimation instead, yielding the results below.



    Code:
    . xtreg LOGGDP AYS GINI LOGCAP, fe robust
    
    Fixed-effects (within) regression               Number of obs     =        589
    Group variable: PROVINCE1                       Number of groups  =         31
    
    R-sq:                                           Obs per group:
         within  = 0.6205                                         min =         19
         between = 0.3435                                         avg =       19.0
         overall = 0.3549                                         max =         19
    
                                                    F(3,30)           =      92.62
    corr(u_i, Xb)  = -0.7887                        Prob > F          =     0.0000
    
                                 (Std. Err. adjusted for 31 clusters in PROVINCE1)
    ------------------------------------------------------------------------------
                 |               Robust
          LOGGDP |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             AYS |   .5883079    .056908    10.34   0.000     .4720862    .7045297
            GINI |  -9.588103   1.711882    -5.60   0.000    -13.08423   -6.091974
          LOGCAP |  -.0774428   .0276631    -2.80   0.009    -.1339384   -.0209471
           _cons |   7.776209   .6808327    11.42   0.000     6.385763    9.166655
    -------------+----------------------------------------------------------------
         sigma_u |  .91221865
         sigma_e |  .47764761
             rho |  .78482564   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------

    My question: am I correct in my interpretation that including (xtreg, fe) accounts for unobserved, time-invariant heterogeneity across provinces (e.g. geographical region) as well as other effects and thus the geography dummy variables are unnecessary, or would it be more clear to my point to specifically use these dummy variables rather than the fe option? Further, when interpreting my results, do I read the 'within' or 'overall' rsquared value, or the rho value, to test the suitability of my model's fit? Could I improve the fit through another method, such as adding in a time-trend (which I've read about but can't understand how to apply here)?

    I have already searched the internet and textbooks extensively for these answers, so I am asking here not out of laziness but out of genuine confusion. Also note I am aware of the panel being unbalanced; however this is due to missing data in variables I am not using. I also performed the Hausman test to check that fixed effects was the appropriate model.

    Thank you very much for any help you can give!

  • #2
    am I correct in my interpretation that including (xtreg, fe) accounts for unobserved, time-invariant heterogeneity across provinces (e.g. geographical region)...
    Yes.

    ...as well as other effects
    No. Only time-invariant effects are adjusted for by the fixed-effects.

    thus the geography dummy variables are unnecessary
    They are not only unnecessary, you can't use them in a fixed-effects model even if you want to. They would be collinear with the fixed effects and would be omitted by Stata if you tried to put them in. You cannot estimate region effects in a province-level fixed-effects model.
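
    A minimal sketch of what that would look like, assuming EAST, CENTRAL and WEST are the 0/1 region indicators described in #1 and that each province stays in the same region for the whole period:

    Code:
    * Region indicators are constant within each province, so the
    * within transformation removes them and Stata drops them
    xtreg LOGGDP AYS GINI LOGCAP CENTRAL WEST, fe vce(cluster PROVINCE1)

    I have not run this on your data, but I would expect both region terms to be reported as omitted because of collinearity.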

    or would it be more clear to my point to specifically use these dummy variables
    If your research goal is specifically to investigate region effects, then you must abandon province-level fixed effects altogether and instead use OLS with the region indicator variables. There are two downsides to this. One is that you are no longer automatically adjusting for time-invariant effects within provinces, so omitted variable bias may creep in. The other is that the errors within regions may not be independent, due to province-level clustering. Using vce(cluster PROVINCE1), with your province identifier as the cluster variable, would help.
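
    As a rough sketch of that alternative, again assuming the EAST, CENTRAL and WEST indicators from #1 (the choice of EAST as the reference category is arbitrary and only for illustration):

    Code:
    * Pooled OLS with region indicators; EAST is the omitted reference category
    regress LOGGDP AYS GINI LOGCAP CENTRAL WEST, vce(cluster PROVINCE1)
    * Joint test of the two region indicators
    test CENTRAL WEST

    The coefficients on CENTRAL and WEST are then differences in expected LOGGDP relative to EAST, conditional on the other regressors.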

    when interpreting my results, do I read the 'within' or 'overall' rsquared value, or the rho value, to test the suitability of my model's fit?
    R2 overall measures the overall fit of the model at the level of individual observations. R2 between measures the fit of the panel-level predictions to the panel-level means. And R2 within is the R2 of the actual fe-regression (which is a within-panel model only). Which is most suitable depends on what you are most interested in fitting. I would guess that in your setting that would be R2 overall. And rho is not a measure of model fit at all. It tells you how much of the unexplained variation in the data occurs at the panel level.
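
    If it helps to see where these numbers come from, here is a short sketch that pulls them from the stored results after your model. It assumes the same specification as in #1; fe robust and fe vce(cluster PROVINCE1) give the same standard errors in xtreg.

    Code:
    quietly xtreg LOGGDP AYS GINI LOGCAP, fe vce(cluster PROVINCE1)
    * The three R-squared values reported in the header
    display "R2 within  = " e(r2_w)
    display "R2 between = " e(r2_b)
    display "R2 overall = " e(r2_o)
    * rho = sigma_u^2 / (sigma_u^2 + sigma_e^2)
    display "rho        = " e(sigma_u)^2 / (e(sigma_u)^2 + e(sigma_e)^2)

    With the values in your output, .91221865^2 / (.91221865^2 + .47764761^2) is about 0.785, which agrees with the rho that Stata reports.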

    Could I improve the fit through another method, such as adding in a time-trend
    Essentially, you can always improve fit by adding more variables. The problem is that at some point you start overfitting the noise in the data. So inclusion of time trends should be based on a theoretical basis for doing so. If previous research or good theory says that there should be time trends affecting your outcome, then your model might well be improved by their inclusion. Exactly how you would do that would depend on the nature of the time trends themselves. So you'll have to do some research into that. If you know about the nature of the time trends and are unsure how to code them, post back with more information about them.
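
    Purely as an illustration of the coding, and not a recommendation of either specification, two common ways to bring time into this model are sketched below. Both assume DATE is the integer year variable you used in xtset.

    Code:
    * Option A: a common linear trend (assumes the trend is linear in DATE)
    xtreg LOGGDP AYS GINI LOGCAP c.DATE, fe vce(cluster PROVINCE1)
    * Option B: year indicators, which absorb any shock common to all provinces in a year
    xtreg LOGGDP AYS GINI LOGCAP i.DATE, fe vce(cluster PROVINCE1)
    * Joint test of the year indicators after Option B
    testparm i.DATE

    Option B makes no assumption about the shape of the time pattern but uses many more degrees of freedom than Option A.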

    I am aware of the panel being unbalanced ...
    Being unbalanced is not a problem for this kind (or most kinds) of analysis. Don't worry about that.

    ...however this is due to missing data in variables I am not using
    Well, that's odd. If the missing data are only in variables you are not using anyway, why not bring those observations back into the sample for estimation? If those observations contain all the variables you are using, they are more information to bring to bear on your analysis. And excluding them could bias your analysis if the reasons for the missingness of those other variables are connected in some way to the variables you are using.
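
    One quick way to check, sketched here on the assumption that the data set in memory is the full one with PROVINCE1 and DATE already xtset:

    Code:
    * Count missing values in the variables actually used in the model
    misstable summarize LOGGDP AYS GINI LOGCAP
    * Show which years are observed for each province
    xtdescribe

    If none of the four model variables has missing values in the dropped observations, there is no reason to leave those observations out.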



    • #3
      Thanks very much for your answer and guidance. Regarding the results, I am happy with the coefficients and p-values for the variables in my robust FE regression (they provide an intuitively sound explanation). However, given the overall R-squared of just 0.35, does this appear to be a poor model for the data that I should somehow improve, or can I proceed with using this outcome? Again, information on the internet regarding the interpretation of R-squared values in fixed-effects models appears to be sparse, which is why I am asking for some clarification.


      • #4
        There are no hard and fast rules here, and it really depends on your research goals. For some purposes R2 = 0.35 would be considered miraculously great, and for others it could be considered pathetic. If the purpose of your model is to provide a close prediction of every individual observation, then this is not particularly good (though it's not terrible as these things go.) But if your purpose is to just identify regional effects on the outcome variable, y, R2 is not particularly important: the focus should be on whether the confidence intervals around those region effects in your regression output are narrow enough for you to feel that you have usefully precise estimates. And, again, if the purpose is identifying regional effects, it is not really reasonable to expect the model to make tight predictions about individual outcomes--there are so many other factors influencing that outcome besides just the region.
