
  • Panel data: how the Mundlak method performs heteroskedasticity tests and cross-sectional correlation tests

    Dear all,

    I have some questions about the Mundlak method. This hybrid model is, in practice, estimated as a random-effects model. I would like to know how to test for heteroskedasticity and cross-sectional correlation after estimating it.

    The purpose of my research is to conduct a longitudinal analysis of the average daily station-level metro ridership and its determinants in Xi'an, China, during December of each year from 2011 to 2019.

    First I need to introduce my data structure:
    Xi'an opened its first line in 2011, and by 2019, it had opened five lines. My research focuses on 88 stations on four lines (one line is excluded) within the study area. The number of stations has increased over time, and data for the 25 new stations that opened in 2019 is available only for a single period.
    Code:
    xtset station_id year
    xtdes
    
    station_id:  1, 2, ..., 88                                   n =         88
        year:  2011, 2012, ..., 2019                             T =          9
               Delta(year) = 1 unit
               Span(year)  = 9 periods
               (station_id*year uniquely identifies each observation)
    
    Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                             1       1       1         4         7       9       9
    
         Freq.  Percent    Cum. |  Pattern
     ---------------------------+-----------
           25     28.41   28.41 |  ........1
           24     27.27   55.68 |  .....1111
           18     20.45   76.14 |  ..1111111
           17     19.32   95.45 |  111111111
            4      4.55  100.00 |  ...111111
     ---------------------------+-----------
           88    100.00         |  XXXXXXXXX
    I read Kasraian D, Maat K, Van Wee B. Urban developments and daily travel distances: Fixed, random and hybrid effects models using a Dutch pseudo-panel over three decades. Journal of Transport Geography, 2018, 72: 228-236. That paper uses POLS, fixed-effects models, random-effects models, and hybrid models (Mundlak approach). The hybrid model can capture both the within-group and the between-group effects of time-varying variables, which fits my research goals well, so I am very interested in using this approach. I therefore selected the same four models for comparative analysis.

    I selected the following variables. Some variables are treated as time-invariant because their data is available for only one year.
    Time-varying variables:
    1. Ridership: station-level ridership for each station per year
    2. Restaurant: the number of restaurants in station catchment areas (SCAs) per year
    3. Accessibility: the number of people reachable within 40 minutes per station per year
    4. School: the number of schools in SCAs per year
    5. Terminal: terminal station (dummy)
    6. Price: mean house price in SCAs per year
    7. Population: Population density in SCAs per year
    8. Bus: the number of bus lines in SCAs per year
    Time-invariant variables:
    1. age20_29: the proportion of the population aged 20-29 in SCAs (single-year data)
    2. Primary: the proportion of the population with primary education in SCAs (single-year data)
    3. timedistri_class5: the stations are divided into five categories based on the number of boardings and alightings at each station in each hour of the day.
    The prefix l indicates log transformation.

    Code:
    sort station_id year
    * Station means (m*) and within-station deviations (d*) for the hybrid model;
    * terminal is included because mterminal and dterminal are used below
    foreach var of varlist lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus {
        egen m`var' = mean(`var'), by(station_id)
        gen d`var' = `var' - m`var'
    }
    
    reg lridership  lrestaurant  laccessibility  lschool  terminal  lprice lpopulation lbus  lage20_29  lPrimary  i.timedistri_class5  i.year, vce(cluster station_id)
    est store pols_rob
    xtreg lridership  lrestaurant  laccessibility  lschool  terminal  lprice lpopulation lbus  i.year,fe vce(cluster station_id)
    est store fe_rob
    xtreg lridership  lrestaurant  laccessibility  lschool  terminal  lprice lpopulation lbus  lage20_29  lPrimary   i.timedistri_class5  i.year,re vce(cluster station_id)
    est store re_rob
    xtreg lridership   dlrestaurant  dlaccessibility  dlschool  dterminal  dlprice dlpopulation dlbus mlrestaurant  mlaccessibility  mlschool  mterminal  mlprice mlpopulation mlbus  lage20_29  lPrimary1500  i.timedistri_class5  i.year, re vce(cluster station_id)
    est store hybrid_rob
    Below are the model results.
    [Image: 01Model estimates based on ordinary least squares.png]
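
    Because every time-varying regressor enters the hybrid model twice, once as a within-station deviation (d*) and once as a station mean (m*), the within and between coefficients can be compared directly. The following is only a minimal sketch, assuming the stored estimates hybrid_rob from the code above are still in memory; the list of pairs is shortened for illustration and can be extended to all seven pairs.

    ```stata
    * Sketch only: Wald test that within (d*) and between (m*) effects are equal.
    * Assumes "est store hybrid_rob" from the hybrid xtreg above has been run.
    estimates restore hybrid_rob

    * Joint test for three illustrative pairs; add the remaining pairs as needed
    test (dlrestaurant = mlrestaurant)       ///
         (dlaccessibility = mlaccessibility) ///
         (dlprice = mlprice)
    ```

    A rejection indicates that the within and between effects differ for at least one of the tested pairs, which is the situation the hybrid specification is meant to accommodate.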


    From what I have learned, the fixed-effect model requires a series of tests, including testing for heteroskedasticity, serial correlation, and cross-sectional dependence, in order to select the appropriate estimation command.
    The three tests I used are as follows:

    xtcd2 // test for cross-sectional dependence (xtcsd, pesaran abs returns no result because 25 stations have only single-period data)
    xttest3 // test for heteroskedasticity
    xtserial y x // test for serial correlation


    My questions are:

    1. Will the single-period data for these 25 stations affect the use of the hybrid model?
    2. The xtcd2 and xttest3 commands can only be used after xtreg,fe. The variables used in the hybrid model differ from those in the FE model, so I believe retesting is necessary. I'd like to know how to test the hybrid model for heteroskedasticity and cross-sectional correlation.
    3. I know I can use xtreg,re vce(cluster station_id) to obtain robust estimates of the hybrid model, but I'm concerned about cross-sectional dependence. If cross-sectional dependence exists, how should I address it? Can I still use the hybrid model?

    I'll show you what I tried to do to test for cross-sectional correlation:
    Code:
    xtreg lridership_new  lcatering1000 lnet40_052102    lZXX05   terminal  lreal_price  lpopden1000 lsmallbus  i.year,fe robust
    xtcd2
    Pesaran (2015) test for cross sectional dependence
    Postestimation. Unbalanced panel detected, test adjusted.
    
    H0: errors are weakly cross sectional dependent. 
    
            CD =   1.286368
       p_value =  .19831467
    I read this as showing that the fixed-effects model does not have a cross-sectional correlation problem. I also made a (rather naive) attempt to run the hybrid specification with the fe option. The time-invariant variables are omitted, and this version does suffer from cross-sectional issues:

    Code:
    xtreg lridership   dlrestaurant  dlaccessibility  dlschool  dterminal  dlprice dlpopulation dlbus mlrestaurant  mlaccessibility  mlschool  mterminal  mlprice mlpopulation mlbus   lage20_29  lPrimary1500  i.timedistri_class5  i.year, fe robust
    
    xtcd2
    Pesaran (2015) test for cross sectional dependence
    Postestimation. Unbalanced panel detected, test adjusted.
    
    H0: errors are weakly cross sectional dependent. 
    
            CD =  3.0094369
       p_value =  .00261732
    Thanks in advance.

    Best regards,
    Chen

  • #2
    The hybrid model is estimated by random-effects GLS, so the residuals e_it you are concerned about are the residuals of the RE model.
    1. For autocorrelation, you can use the xttest1 command to perform LM and adjusted LM tests for random effects with first-order autocorrelation.
    2. For heteroskedasticity, some procedures have been suggested in the literature (e.g. Baltagi et al. (2006), Baltagi et al. (2008)), but there is currently no Stata command that performs them. In related discussions, the Breusch-Pagan/White test procedure is often suggested for the composite error (w_it = u_i + e_it), but the results can be seriously misleading because w_it is correlated over time. One idea is to perform the Breusch-Pagan test for u_i and e_it separately (for u_i, using the time averages of the right-hand-side variables and a constant as regressors).
    3. For cross-sectional dependence: xtcd2 tests H0: errors are weakly cross-sectionally dependent. So a p-value above 0.1 (here p_value = .19831467) means H0 is not rejected, and you may still face weak cross-sectional dependence. In a large-N, small-T panel, a spatial (geographic) dependence structure can be one way to handle weak cross-sectional dependence. Or, more simply, assume that cross-sectional dependence does not give rise to correlation between the errors and the right-hand-side regressors, and then calculate clustered standard errors, or DK standard errors if no suitable clustering variable can be found. In the context of your study, I would look for a clustering variable for these 88 stations to control for some of the cross-sectional dependence, for example clustering by a smaller administrative level of Xi'an. I still have no idea how to handle strong cross-sectional dependence.
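
    The checks in points 1 and 3 could be run along the following lines. This is only a sketch under assumptions: xttest1 and xtcd2 are community-contributed commands (ssc install xttest1, ssc install xtcd2), the variable names are those of the hybrid specification in post #1, and the dummy stubs yd and cd are hypothetical names introduced here. Dummies are built by hand because I am not sure every community-contributed command accepts factor-variable notation.

    ```stata
    * Sketch only: tests after a plain RE fit of the hybrid model.
    * yd* and cd* are hypothetical stub names for hand-made dummies.
    quietly tabulate year, generate(yd)                // yd1 = 2011, ..., yd9 = 2019
    quietly tabulate timedistri_class5, generate(cd)   // cd1, ..., cd5

    xtreg lridership dlrestaurant dlaccessibility dlschool dterminal dlprice ///
        dlpopulation dlbus mlrestaurant mlaccessibility mlschool mterminal   ///
        mlprice mlpopulation mlbus lage20_29 lPrimary1500 cd2-cd5 yd2-yd9, re

    * LM and adjusted LM tests for random effects and first-order serial correlation
    xttest1

    * Idiosyncratic residuals e_it from the RE fit, then the CD test on that variable
    * (see help xtcd2 for whether the noestimation option is needed in your version)
    predict double ehat, e
    xtcd2 ehat
    ```

    The tests use the default (non-clustered) RE fit; the clustered standard errors only change inference on the coefficients, not the residuals being tested.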



    • #3
      Dear Mr. Ba,

      Thank you very much for your helpful reply and suggestions. Following your advice, I used townships in Xi'an city as the clustering variable, and below are the comparative results between using station_id and townships as clustering variables.


      Code:
      xtreg lridership mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
          dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus lage20_29 ///
          lPrimary1500 i.timedistri_class5 i.year, re vce(cluster station_id)
      
      
      Random-effects GLS regression                   Number of obs     =        424
      Group variable: station_id                      Number of groups  =         88
      
      R-sq:                                           Obs per group:
           within  = 0.8711                                         min =          1
           between = 0.8615                                         avg =        4.8
           overall = 0.8590                                         max =          9
      
                                                      Wald chi2(28)     =    2586.57
      corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
      
                                       (Std. Err. adjusted for 88 clusters in station_id)
      -----------------------------------------------------------------------------------
                        |               Robust
             lridership |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ------------------+----------------------------------------------------------------
           mlrestaurant |     .20265    .054843     3.70   0.000      .095159      .31014
        mlaccessibility |     .25927    .083268     3.11   0.002      .096064      .42247
               mlschool |    .044068    .023191     1.90   0.057    -.0013861     .089522
              mterminal |     1.0838     .20105     5.39   0.000       .68972      1.4778
                mlprice |    -.77438     .25711    -3.01   0.003      -1.2783     -.27045
           mlpopulation |    -.09414    .047418    -1.99   0.047      -.18708   -.0012016
                  mlbus |     .28815     .10235     2.82   0.005      .087539      .48876
           dlrestaurant |     .11254     .02326     4.84   0.000      .066951      .15813
        dlaccessibility |     .65164     .15341     4.25   0.000       .35096      .95232
               dlschool |    .082737     .17909     0.46   0.644      -.26827      .43374
              dterminal |     .85962     .21419     4.01   0.000       .43982      1.2794
                dlprice |    -.21553     .21819    -0.99   0.323      -.64318      .21213
           dlpopulation |     1.0708     .53025     2.02   0.043      .031546      2.1101
                  dlbus |     .28538      .1781     1.60   0.109     -.063695      .63446
              lage20_29 |     1.7172     .35404     4.85   0.000       1.0233      2.4111
           lPrimary1500 |      1.141     .60521     1.89   0.059     -.045182      2.3272
                        |
      timedistri_class5 |
                     2  |    -.17806     .11764    -1.51   0.130      -.40864     .052511
                     3  |    .054044     .11249     0.48   0.631      -.16643      .27452
                     4  |     .58107     .13352     4.35   0.000       .31938      .84277
                     5  |     .44735     .10806     4.14   0.000       .23557      .65914
                        |
                   year |
                  2012  |     .17228    .036866     4.67   0.000       .10002      .24453
                  2013  |     .20035     .13142     1.52   0.127     -.057233      .45793
                  2014  |     .26443     .15091     1.75   0.080     -.031359      .56021
                  2015  |     .17797     .16424     1.08   0.279      -.14392      .49987
                  2016  |    .096737     .19412     0.50   0.618      -.28373      .47721
                  2017  |     .36876     .22227     1.66   0.097     -.066885       .8044
                  2018  |     .49949     .25284     1.98   0.048       .00393      .99504
                  2019  |     .29396     .28722     1.02   0.306      -.26899       .8569
                        |
                  _cons |     3.4621     3.1335     1.10   0.269      -2.6794      9.6037
      ------------------+----------------------------------------------------------------
                sigma_u |  .31205097
                sigma_e |  .16789349
                    rho |  .77550717   (fraction of variance due to u_i)
      -----------------------------------------------------------------------------------
      
      
      xtreg lridership mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
          dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus lage20_29 ///
          lPrimary1500 i.timedistri_class5 i.year, re  vce(cluster township)
      
      Random-effects GLS regression                   Number of obs     =        424
      Group variable: station_id                      Number of groups  =         88
      
      R-sq:                                           Obs per group:
           within  = 0.8711                                         min =          1
           between = 0.8615                                         avg =        4.8
           overall = 0.8590                                         max =          9
      
                                                      Wald chi2(28)     =   20644.13
      corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000
      
                                         (Std. Err. adjusted for 39 clusters in township)
      -----------------------------------------------------------------------------------
                        |               Robust
             lridership |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ------------------+----------------------------------------------------------------
           mlrestaurant |     .20265    .054075     3.75   0.000      .096662      .30863
        mlaccessibility |     .25927    .073673     3.52   0.000       .11487      .40366
               mlschool |    .044068    .024787     1.78   0.075    -.0045136      .09265
              mterminal |     1.0838     .15924     6.81   0.000       .77167      1.3959
                mlprice |    -.77438     .23918    -3.24   0.001      -1.2432      -.3056
           mlpopulation |    -.09414    .048551    -1.94   0.053       -.1893    .0010187
                  mlbus |     .28815     .08463     3.40   0.001       .12228      .45402
           dlrestaurant |     .11254    .013993     8.04   0.000      .085113      .13996
        dlaccessibility |     .65164     .15144     4.30   0.000       .35482      .94847
               dlschool |    .082737      .1682     0.49   0.623      -.24692      .41239
              dterminal |     .85962     .22355     3.85   0.000       .42147      1.2978
                dlprice |    -.21553     .22047    -0.98   0.328      -.64764      .21659
           dlpopulation |     1.0708     .44745     2.39   0.017       .19383      1.9478
                  dlbus |     .28538     .18144     1.57   0.116     -.070228      .64099
              lage20_29 |     1.7172     .37969     4.52   0.000       .97298      2.4613
           lPrimary1500 |      1.141     .62523     1.82   0.068     -.084424      2.3664
                        |
      timedistri_class5 |
                     2  |    -.17806     .11608    -1.53   0.125      -.40559     .049457
                     3  |    .054044     .11229     0.48   0.630      -.16605      .27414
                     4  |     .58107     .13603     4.27   0.000       .31447      .84768
                     5  |     .44735     .11162     4.01   0.000       .22859      .66611
                        |
                   year |
                  2012  |     .17228    .034212     5.04   0.000       .10522      .23933
                  2013  |     .20035     .10761     1.86   0.063     -.010559      .41126
                  2014  |     .26443     .12883     2.05   0.040       .01193      .51692
                  2015  |     .17797     .13549     1.31   0.189     -.087575      .44352
                  2016  |    .096737     .15417     0.63   0.530      -.20543      .39891
                  2017  |     .36876     .19443     1.90   0.058     -.012312      .74983
                  2018  |     .49949     .22657     2.20   0.027      .055426      .94355
                  2019  |     .29396      .2633     1.12   0.264       -.2221      .81002
                        |
                  _cons |     3.4621     3.4017     1.02   0.309      -3.2052      10.129
      ------------------+----------------------------------------------------------------
                sigma_u |  .31205097
                sigma_e |  .16789349
                    rho |  .77550717   (fraction of variance due to u_i)
      -----------------------------------------------------------------------------------
      I have a few follow-up questions regarding your suggestions:

      1. Are there specific performance indicators I should use to decide whether clustering by township performs better than clustering by station? What would be the appropriate criteria for this comparison?

      2. When you mentioned "DK standard errors" are you referring to the "xtscc" command in Stata? Can this command also be used to estimate random effects models? If so, could you please tell me the specific code?

      3. I noticed in another post https://www.statalist.org/forums/for...-in-panel-data that Mr. Kolev mentioned: "If we start to worry about countries' residuals being cross sectionally correlated, I do not know what we can assume that can be possibly cross sectionally uncorrelated -- observations coming from alternative universes? I think that with your data you should stick to -xtreg- because your data is more of the large N variety. I would not worry about cross sectional correlation in your case, and just do -xtreg, robust- which will give you standard errors robust to heteroskedasticity and arbitrary within country correlation."
      Given my current analytical capabilities, I find it challenging to address the cross-sectional dependence. Would it be acceptable in my paper to use clustered standard errors as a simplified approach, acknowledging this as a research limitation that may be addressed in future studies?



      • #4
        1. The performance of clustered standard errors depends on the number of clusters and on how appropriate the assumption of independence between clusters is. With 39 clusters, the number of clusters seems statistically acceptable. Allowing for correlation within clusters helps to control for some dependence between stations in the same cluster. But whether stations in one cluster are independent of stations in another cluster needs careful consideration.
        2. Yes, the DK (Driscoll-Kraay) standard error is calculated using the xtscc command in Stata. Version 1.4 of this command (ssc install xtscc) has the re option, which allows estimating DK standard errors for the RE model. The DK standard error was originally designed for large-T panels; however, Monte Carlo evidence shows that for small-T panels it has better properties than the OLS standard error and the Arellano standard error (for details, see Hoechle, D. (2007)).
        3. In the context of a large-N panel with fixed T, one relies on asymptotics in N to make inferences about the parameters. The assumption of independent random sampling by ID may be reasonable in some contexts, but not in others. If the population has 88 stations, the assumption of independent random sampling seems unnatural here! Also, for entities with fixed geographic locations, spatial spillovers between them are widely recognized, and a class of spatial econometric models has been developed to analyze such situations. I think you can compare the results from different treatments of cross-sectional dependence and draw conclusions about how much the results change. You can also keep the assumption of independence between cross-sectional units (stations) and use Arellano standard errors (clustered by ID), as Kolev mentioned.
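
        A comparison along the lines of point 3 might look like the sketch below. It assumes an xtscc version that has the re option mentioned in point 2, uses the variable names from post #3, and introduces the hypothetical dummy stubs dk_yr and dk_tc because I am not certain that xtscc accepts factor-variable notation. The lag length is left at the command's default.

        ```stata
        * Sketch only: hybrid model with Driscoll-Kraay vs. station-clustered SEs.
        * dk_yr* and dk_tc* are hypothetical stub names for hand-made dummies.
        quietly tabulate year, generate(dk_yr)
        quietly tabulate timedistri_class5, generate(dk_tc)

        * Driscoll-Kraay standard errors, RE estimation (requires the re option)
        xtscc lridership mlrestaurant mlaccessibility mlschool mterminal mlprice   ///
            mlpopulation mlbus dlrestaurant dlaccessibility dlschool dterminal     ///
            dlprice dlpopulation dlbus lage20_29 lPrimary1500                      ///
            dk_tc2-dk_tc5 dk_yr2-dk_yr9, re
        estimates store hybrid_dk

        * Same specification with Arellano (station-clustered) standard errors
        xtreg lridership mlrestaurant mlaccessibility mlschool mterminal mlprice   ///
            mlpopulation mlbus dlrestaurant dlaccessibility dlschool dterminal     ///
            dlprice dlpopulation dlbus lage20_29 lPrimary1500                      ///
            dk_tc2-dk_tc5 dk_yr2-dk_yr9, re vce(cluster station_id)
        estimates store hybrid_cl

        * Side-by-side coefficients and standard errors
        estimates table hybrid_dk hybrid_cl, se
        ```

        If the two sets of standard errors lead to the same conclusions for the main variables, that is some reassurance that the treatment of cross-sectional dependence is not driving the results.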



        • #5
          Dear Mr. Ba,

          Thank you once again for your thoughtful response. Your insights have been extremely helpful.

          Previously, I was under the impression that xtscc could only be used for fixed effects models and pooled OLS estimation. Your explanation about the "re" option has been very illuminating. Given the potential lack of theoretical justification for clustering by townships, I think xtscc, re is a good idea.

          Best regards,
          Chen

