  • Unrealistic R squared in LSDV model

    Hello everybody,

    Situation:
    I use Stata 14.2. I want to investigate the effect of mobile phone penetration (mobile_p100) on the Human Development Index (hdi, an index ranging from 0 to 1). I am using an unbalanced panel data set with N=120 and T=10. My main model uses the GMM estimator. Following related research, I additionally want to use the least-squares dummy variable (LSDV) estimator, including the lagged dependent variable and country and time fixed effects, for comparison purposes.

    Problem:
    - Starting with “regress y1 x1…xn i.year, robust”, R^2 is 0.6, which is reasonable.
    - After modifying this to “regress y1 L.y1 x1…xn i.year i.id, robust”, R^2 rises above 0.99. This also happens when I add only one of the modifications, either the lagged dependent variable (L.y1) or the country fixed effects (i.id).
    - Using “xtreg y1 x1…xn i.year, fe robust” gives an R^2 of 0.80. As soon as I add the lagged dependent variable, R^2 rises above 0.99.

    Solution tried:
    - Related literature: It is not uncommon to report an R^2 of around 0.80 in this field of research, but there is no justification for an R^2 > 0.99.
    - Dataset: I double checked the observations included in the dataset and did not find any irregularities (duplications, unrealistic values etc.)
    - Excluding independent variables: I excluded the independent variables one at a time and ran the model again. Even when only one independent variable is left in the model, R^2 stays at around 0.99.
    - Spurious regression: I suppose this is not the cause of the inflated R^2, since I have neither a multicollinearity problem nor exceptionally high t-values.
    - Detrend dependent variable: Helps to decrease R^2 to a reasonable level, but the results differ completely from my GMM estimation and previous research on the topic at hand.
    - Multicollinearity: Does not seem to be a problem in my model, since VIF is at maximum 2.40

    Questions:
    1) Is it possible that there is a general problem with the dependent variable which could also distort the GMM results?
    2) Do you see any possibilities to overcome the problem described?

    Model: With country fixed effects and lagged dependent variable
    Code:
     regress hdi L.hdi mobile_p100 mobile_gdp gdp_pc_growth gfcf_share fdi_share pop_growth i.year i.id, robust
    
    Linear regression                               Number of obs     =      1,177
                                                    F(135, 1041)      =   44627.47
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.9996
                                                    Root MSE          =      .0032
    
    -------------------------------------------------------------------------------
                  |               Robust
              hdi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
              hdi |
              L1. |   .8331407   .0175466    47.48   0.000       .79871    .8675714
                  |
      mobile_p100 |   .0000201   7.99e-06     2.52   0.012     4.43e-06    .0000358
       mobile_gdp |  -.0000426   .0000288    -1.48   0.138    -.0000991    .0000138
    gdp_pc_growth |   .0004162   .0000522     7.98   0.000     .0003139    .0005185
       gfcf_share |   .0000549   .0000294     1.87   0.062    -2.79e-06    .0001126
        fdi_share |   .0000116   9.64e-06     1.21   0.228    -7.29e-06    .0000305
       pop_growth |   .0001226   .0002455     0.50   0.618    -.0003591    .0006044
                  |
             year |
            2010  |    .000235   .0005479     0.43   0.668    -.0008401    .0013101
            2011  |   .0013773   .0005606     2.46   0.014     .0002772    .0024773
            2012  |   .0018995   .0005998     3.17   0.002     .0007226    .0030764
            2013  |   .0033575   .0006629     5.06   0.000     .0020567    .0046584
            2014  |   .0031978   .0007293     4.38   0.000     .0017668    .0046288
            2015  |   .0033857   .0007334     4.62   0.000     .0019466    .0048248
            2016  |   .0037577   .0007513     5.00   0.000     .0022834    .0052319
            2017  |    .003713   .0008163     4.55   0.000     .0021112    .0053149
            2018  |   .0033513   .0008486     3.95   0.000     .0016862    .0050165
                  |
               id |
             ALB  |   .0339267   .0046612     7.28   0.000     .0247802    .0430732
            [...all 120 countries...]
             ZMB  |   .0005514    .001493     0.37   0.712    -.0023783    .0034811
                  |
            _cons |    .093336   .0089645    10.41   0.000     .0757455    .1109264
    -------------------------------------------------------------------------------

    Model: Without country fixed effects and without lagged dependent variable
    Code:
     regress hdi mobile_p100 mobile_gdp gdp_pc_growth gfcf_share fdi_share pop_growth i.year, robust
    
    Linear regression                               Number of obs     =      1,295
                                                    F(16, 1278)       =     107.59
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.6149
                                                    Root MSE          =     .08985
    
    -------------------------------------------------------------------------------
                  |               Robust
              hdi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
      mobile_p100 |   .0015012   .0000934    16.07   0.000     .0013179    .0016844
       mobile_gdp |  -.0065454   .0006642    -9.85   0.000    -.0078484   -.0052424
    gdp_pc_growth |  -.0069348   .0010797    -6.42   0.000     -.009053   -.0048166
       gfcf_share |  -.0006863   .0003695    -1.86   0.064    -.0014112    .0000386
        fdi_share |   .0003447   .0001267     2.72   0.007     .0000962    .0005932
       pop_growth |  -.0241198   .0023824   -10.12   0.000    -.0287936   -.0194459
                  |
             year |
            2009  |  -.0512804   .0121719    -4.21   0.000    -.0751596   -.0274012
            2010  |  -.0304352   .0109545    -2.78   0.006     -.051926   -.0089443
            2011  |  -.0422082   .0108261    -3.90   0.000    -.0634471   -.0209694
            2012  |  -.0546703   .0110025    -4.97   0.000    -.0762553   -.0330853
            2013  |  -.0529743   .0114669    -4.62   0.000    -.0754703   -.0304783
            2014  |  -.0531553   .0116728    -4.55   0.000    -.0760553   -.0302553
            2015  |  -.0549756   .0122075    -4.50   0.000    -.0789246   -.0310266
            2016  |  -.0543252   .0120781    -4.50   0.000    -.0780203     -.03063
            2017  |  -.0474105   .0118148    -4.01   0.000    -.0705891   -.0242319
            2018  |  -.0504147    .012046    -4.19   0.000    -.0740467   -.0267827
                  |
            _cons |    .709139    .016166    43.87   0.000     .6774242    .7408537
    -------------------------------------------------------------------------------
    Thank you very much and best wishes,
    Patrick


    PS: A similar question was asked here: link. Unfortunately, the recommendations given there did not solve my problem.

  • #2
    You have the same variable on the left-hand side and the right-hand side, which would explain the high R2 statistic: HDI would be highly correlated with its lag. Accounting for time-invariant country effects and country-invariant time effects would also contribute a lot to the high R2, but my money is on the lagged dependent variable (LDV). Notice that your second regression drops not only the fixed effects but also the LDV. Try running

    Code:
    regress hdi l.hdi
    and look at what percentage of variation in HDI is explained by variation in its lag.
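
    To see how much each ingredient contributes on its own, the pieces can also be run separately. The lines below are only a sketch on Stata's bundled Grunfeld panel (webuse grunfeld), not on the HDI data: invest, mvalue, kstock and company are stand-ins for hdi, the regressors and id.

```stata
* Sketch on the bundled Grunfeld panel; variable names are stand-ins, not the HDI data
webuse grunfeld, clear
xtset company year

* (1) Lagged dependent variable only (the first year of each unit is lost to the lag)
quietly regress invest L.invest
display "R2, LDV only:               " %6.4f e(r2)

* (2) Unit dummies only
quietly regress invest i.company
display "R2, unit dummies only:      " %6.4f e(r2)

* (3) Covariates and year dummies, no LDV and no unit dummies
quietly regress invest mvalue kstock i.year
display "R2, covariates + year only: " %6.4f e(r2)
```

    With a slow-moving outcome such as an index, (1) and (2) would each be expected to come close to 1 by themselves, because most of the variation is either persistence over time or level differences between units.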


    • #3
      Hi Andrew,
      Thanks for the very fast response.

      You are absolutely right, the LDV is one factor causing the high R^2. The code you suggested, -regress hdi l.hdi-, gives an R^2 > 0.99. However, when I drop the LDV but keep the country effects in the first model I posted, R^2 stays at >0.99.

      My original idea was to provide results of the LSDV estimator as a supplement to the GMM estimator (although the former potentially suffers from Nickell bias). At this point it is not clear to me how to overcome my R^2 problem. Is it reasonable to keep the LDV and accept the high R^2? Is it preferable to use the model without the LDV and without country effects? Or are there further options that I didn't consider? Do you have any suggestions?


      • #4
        The high R2 is not a problem if you want to present the FE results alongside the GMM results. It is expected given that you have a lagged dependent variable.
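
        The gap reported earlier in the thread between the 0.80 from -xtreg, fe- and the >0.99 from -regress- with i.id is expected as well. Both fit the same slope coefficients, but -regress- with unit dummies reports an R2 that credits the dummies with the between-unit variation, whereas the "within" R2 from -xtreg, fe- is computed after demeaning by unit. A minimal sketch on Stata's bundled Grunfeld panel (a stand-in, not the HDI data; invest, mvalue, kstock and company are that dataset's variables):

```stata
* Sketch on the bundled Grunfeld panel; not the HDI data
webuse grunfeld, clear
xtset company year

* LSDV: unit dummies entered explicitly
quietly regress invest mvalue kstock i.company
scalar r2_lsdv = e(r2)
scalar b_lsdv  = _b[mvalue]

* Within (fixed-effects) estimator on the same specification
quietly xtreg invest mvalue kstock, fe
scalar r2_within = e(r2_w)
scalar b_within  = _b[mvalue]

* Same slope, different R2 definitions
display "slope on mvalue, LSDV:   " %9.6f b_lsdv
display "slope on mvalue, within: " %9.6f b_within
display "R2, LSDV:                " %6.4f r2_lsdv
display "R2, within:              " %6.4f r2_within
```

        The two slopes coincide; only the R2 differs, because the LSDV figure includes what the unit dummies explain.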
