Panel data: how the Mundlak method performs heteroskedasticity tests and cross-sectional correlation tests

Ma chen

Join Date: Jan 2024

Posts: 4
#1

24 Aug 2025, 04:48

Dear all,

I have some questions about the Mundlak method. This hybrid model is actually a random-effects model, so I would like to know how heteroskedasticity tests and cross-sectional dependence tests can be carried out for it.

The purpose of my research is to conduct a longitudinal analysis of the average daily station-level metro ridership and its determinants in Xi'an, China, during December of each year from 2011 to 2019.

First I need to introduce my data structure:
Xi'an opened its first line in 2011, and by 2019, it had opened five lines. My research focuses on 88 stations on four lines (one line is excluded) within the study area. The number of stations has increased over time, and data for the 25 new stations that opened in 2019 is available only for a single period.

Code:

xtset station_id year
xtdes

station_id:  1, 2, ..., 88                                   n =         88
      year:  2011, 2012, ..., 2019                           T =          9
             Delta(year) = 1 unit
             Span(year)  = 9 periods
             (station_id*year uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       1         4         7       9       9

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------
       25     28.41   28.41 |  ........1
       24     27.27   55.68 |  .....1111
       18     20.45   76.14 |  ..1111111
       17     19.32   95.45 |  111111111
        4      4.55  100.00 |  ...111111
 ---------------------------+-----------
       88    100.00         |  XXXXXXXXX

I read Kasraian, D., Maat, K., & Van Wee, B. (2018). Urban developments and daily travel distances: Fixed, random and hybrid effects models using a Dutch pseudo-panel over three decades. Journal of Transport Geography, 72, 228-236. This paper uses POLS, fixed-effects models, random-effects models, and hybrid models (Mundlak approach). The hybrid model can capture both the within-group and between-group effects of time-varying variables, which fits my research goals well, so I am very interested in using this approach. I therefore selected the same four models for a comparative analysis.

I selected the following variables. Some variables are treated as time-invariant because data for them are available for only one year.
Time-varying variables:
1. Ridership: station-level ridership for each station per year
2. Restaurant: the number of restaurants in station catchment areas (SCAs) per year
3. Accessibility: the number of people reachable within 40 minutes per station per year
4. School: the number of schools in SCAs per year
5. Terminal: terminal station (dummy)
6. Price: mean house price in SCAs per year
7. Population: Population density in SCAs per year
8. Bus: the number of bus lines in SCAs per year
Time-invariant variables:
1. age20_29: the proportion of the population aged 20-29 in SCAs each year
2. Primary: the proportion of the population with primary education in SCAs each year
3. timedistri_class5: the stations are divided into five categories based on the number of boardings and alightings at each station in each hour of the day.
The prefix l indicates log transformation.

Code:

sort station_id year

* Station-level means (m*) and within-station deviations (d*) for the hybrid model.
* terminal is included in the loop because mterminal and dterminal are used below.
foreach var of varlist lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus {
    egen m`var' = mean(`var'), by(station_id)
    gen d`var' = `var' - m`var'
}

* Pooled OLS
reg lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus ///
    lage20_29 lPrimary i.timedistri_class5 i.year, vce(cluster station_id)
est store pols_rob

* Fixed effects
xtreg lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus ///
    i.year, fe vce(cluster station_id)
est store fe_rob

* Random effects
xtreg lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus ///
    lage20_29 lPrimary i.timedistri_class5 i.year, re vce(cluster station_id)
est store re_rob

* Hybrid (Mundlak) model: within deviations plus station means, estimated by RE
xtreg lridership dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus ///
    mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
    lage20_29 lPrimary1500 i.timedistri_class5 i.year, re vce(cluster station_id)
est store hybrid_rob

Below are the model results.

From what I have learned, the fixed-effect model requires a series of tests, including testing for heteroskedasticity, serial correlation, and cross-sectional dependence, in order to select the appropriate estimation command.
The three tests I used are as follows:

xtcd2 // Test for cross-sectional dependence. Because 25 stations have only one period of data, xtcsd, pesaran abs cannot produce a result.
xttest3 // Test for heteroskedasticity
xtserial y x // Test for serial correlation

My questions are:

1. Will the single-period data for these 25 stations affect the use of the hybrid model?
2. The xtcd2 and xttest3 commands can only be used after xtreg,fe. The variables used in the hybrid model differ from those in the FE model, so I believe retesting is necessary. I'd like to know how to test the hybrid model for heteroskedasticity and cross-sectional correlation.
3. I know I can use xtreg,re vce(cluster station_id) to obtain robust estimates of the hybrid model, but I'm concerned about cross-sectional dependence. If cross-sectional dependence exists, how should I address it? Can I still use the hybrid model?
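
One possible way to probe question 1 is sketched below. It is only a sensitivity check, not a formal test: for a station with a single period, each d* variable equals zero by construction, so such a station can only contribute to the between (m*) part of the model. The sketch assumes the m* and d* variables from the code above already exist and that every row of the dataset enters the estimation sample; n_periods, hybrid_all and hybrid_multi are names introduced here for illustration.

```stata
* Number of observed periods per station (assumes no rows are dropped for missing values)
bysort station_id: generate byte n_periods = _N

* For single-period stations the within deviations should all be zero
summarize dlrestaurant dlaccessibility dlprice if n_periods == 1

* Hybrid model on all stations
xtreg lridership dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus ///
    mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
    lage20_29 lPrimary1500 i.timedistri_class5 i.year, re vce(cluster station_id)
estimates store hybrid_all

* Same model without the single-period stations
xtreg lridership dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus ///
    mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
    lage20_29 lPrimary1500 i.timedistri_class5 i.year if n_periods > 1, re vce(cluster station_id)
estimates store hybrid_multi

* Side-by-side coefficients and standard errors
estimates table hybrid_all hybrid_multi, se
```

If the d* coefficients barely move between the two columns while the m* coefficients shift, that would suggest the single-period stations matter mainly for the between effects.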

I'll show you what I tried to do to test for cross-sectional correlation:

Code:

xtreg lridership_new lcatering1000 lnet40_052102 lZXX05 terminal lreal_price lpopden1000 lsmallbus i.year, fe robust
xtcd2

Pesaran (2015) test for cross sectional dependence
Postestimation.
Unbalanced panel detected, test adjusted.
H0: errors are weakly cross sectional dependent.
    CD = 1.286368
    p_value = .19831467

This suggests that the fixed-effects model does not have a cross-sectional dependence problem. I also made a (rather naive) attempt with the hybrid model specification estimated by FE, in which the time-invariant variables are omitted, and here the test does indicate cross-sectional dependence:

Code:

xtreg lridership dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus ///
    mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
    lage20_29 lPrimary1500 i.timedistri_class5 i.year, fe robust
xtcd2

Pesaran (2015) test for cross sectional dependence
Postestimation.
Unbalanced panel detected, test adjusted.
H0: errors are weakly cross sectional dependent.
    CD = 3.0094369
    p_value = .00261732

Thanks in advance.
Best regards,
Chen
Manh Hoang Ba

Join Date: Aug 2023

Posts: 23
#2

25 Aug 2025, 01:10

The hybrid model is estimated by random-effects GLS, so the residuals e_it you are concerned about are simply the residuals of the RE model.
For autocorrelation, you can use the xttest1 command to perform LM and Adjusted-LM tests for random effects with first-order autocorrelation.

For heteroskedasticity, some procedures have been suggested in the literature (e.g. Baltagi et al. (2006), Baltagi et al. (2008)), but there is currently no Stata command that implements them. In related discussions, the Breusch-Pagan/White test is often suggested for the composite error (w_it = u_i + e_it), but the results can be seriously misleading because w_it is correlated over time. I thought of performing the Breusch-Pagan test for u_i and e_it separately (for u_i, the auxiliary regression uses a constant and the time averages of the right-hand-side variables).
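
The idea of checking u_i and e_it separately could be sketched roughly as follows. This is an informal illustration under stated assumptions, not an established test: variable names follow post #1, the auxiliary regressors are my own choice, and the predicted u_i from xtreg, re are shrunken estimates, so the second regression should be read with caution. ehat, uhat, ehat2, uhat2 and onerow are names introduced here.

```stata
* Regressor lists taken from the hybrid model in post #1
local within  dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus
local between mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus

quietly xtreg lridership `within' `between' lage20_29 lPrimary1500 ///
    i.timedistri_class5 i.year, re

predict double ehat, e          // idiosyncratic residual e_it
predict double uhat, u          // predicted station effect u_i
generate double ehat2 = ehat^2
generate double uhat2 = uhat^2

* e_it: squared residuals on the time-varying regressors, all station-years
regress ehat2 `within' i.year

* u_i: one row per station, squared effect on the station means and time-invariant variables
egen byte onerow = tag(station_id)
regress uhat2 `between' lage20_29 lPrimary1500 i.timedistri_class5 if onerow
```

In each auxiliary regression, a large overall F statistic would hint that the corresponding variance component varies with the regressors.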

For cross-sectional dependence: xtcd2 tests H0: errors are weakly cross-sectionally dependent. So even when the p-value is above 0.1 (p_value = .19831467), H0 is not rejected and you still face weak cross-sectional dependence. In a large-N, small-T panel, a spatial (geographic) dependence structure can be one way to deal with weak cross-sectional dependence. Or, more simply, assume that cross-sectional dependence does not induce correlation between the errors and the right-hand-side regressors, and then compute clustered standard errors, or Driscoll-Kraay (DK) standard errors if no suitable clustering variable can be found. In the context of your study, I would look for a clustering variable for these 88 stations to control for some of the cross-sectional dependence, for example clustering at a smaller administrative level within Xi'an. I still have no idea how to handle strong cross-sectional dependence.

Ma chen

Join Date: Jan 2024
Posts: 4

25 Aug 2025, 07:21

Dear Mr. Ba,

Thank you very much for your helpful reply and suggestions. Following your advice, I used townships in Xi'an city as the clustering variable, and below are the comparative results between using station_id and townships as clustering variables.

Code:

xtreg lridership mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
    dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus lage20_29 ///
    lPrimary1500 i.timedistri_class5 i.year, re vce(cluster station_id)


Random-effects GLS regression                   Number of obs     =        424
Group variable: station_id                      Number of groups  =         88

R-sq:                                           Obs per group:
     within  = 0.8711                                         min =          1
     between = 0.8615                                         avg =        4.8
     overall = 0.8590                                         max =          9

                                                Wald chi2(28)     =    2586.57
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

                                 (Std. Err. adjusted for 88 clusters in station_id)
-----------------------------------------------------------------------------------
                  |               Robust
       lridership |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
     mlrestaurant |     .20265    .054843     3.70   0.000      .095159      .31014
  mlaccessibility |     .25927    .083268     3.11   0.002      .096064      .42247
         mlschool |    .044068    .023191     1.90   0.057    -.0013861     .089522
        mterminal |     1.0838     .20105     5.39   0.000       .68972      1.4778
          mlprice |    -.77438     .25711    -3.01   0.003      -1.2783     -.27045
     mlpopulation |    -.09414    .047418    -1.99   0.047      -.18708   -.0012016
            mlbus |     .28815     .10235     2.82   0.005      .087539      .48876
     dlrestaurant |     .11254     .02326     4.84   0.000      .066951      .15813
  dlaccessibility |     .65164     .15341     4.25   0.000       .35096      .95232
         dlschool |    .082737     .17909     0.46   0.644      -.26827      .43374
        dterminal |     .85962     .21419     4.01   0.000       .43982      1.2794
          dlprice |    -.21553     .21819    -0.99   0.323      -.64318      .21213
     dlpopulation |     1.0708     .53025     2.02   0.043      .031546      2.1101
            dlbus |     .28538      .1781     1.60   0.109     -.063695      .63446
        lage20_29 |     1.7172     .35404     4.85   0.000       1.0233      2.4111
     lPrimary1500 |      1.141     .60521     1.89   0.059     -.045182      2.3272
                  |
timedistri_class5 |
               2  |    -.17806     .11764    -1.51   0.130      -.40864     .052511
               3  |    .054044     .11249     0.48   0.631      -.16643      .27452
               4  |     .58107     .13352     4.35   0.000       .31938      .84277
               5  |     .44735     .10806     4.14   0.000       .23557      .65914
                  |
             year |
            2012  |     .17228    .036866     4.67   0.000       .10002      .24453
            2013  |     .20035     .13142     1.52   0.127     -.057233      .45793
            2014  |     .26443     .15091     1.75   0.080     -.031359      .56021
            2015  |     .17797     .16424     1.08   0.279      -.14392      .49987
            2016  |    .096737     .19412     0.50   0.618      -.28373      .47721
            2017  |     .36876     .22227     1.66   0.097     -.066885       .8044
            2018  |     .49949     .25284     1.98   0.048       .00393      .99504
            2019  |     .29396     .28722     1.02   0.306      -.26899       .8569
                  |
            _cons |     3.4621     3.1335     1.10   0.269      -2.6794      9.6037
------------------+----------------------------------------------------------------
          sigma_u |  .31205097
          sigma_e |  .16789349
              rho |  .77550717   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------


xtreg lridership mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
    dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus lage20_29 ///
    lPrimary1500 i.timedistri_class5 i.year, re  vce(cluster township)

Random-effects GLS regression                   Number of obs     =        424
Group variable: station_id                      Number of groups  =         88

R-sq:                                           Obs per group:
     within  = 0.8711                                         min =          1
     between = 0.8615                                         avg =        4.8
     overall = 0.8590                                         max =          9

                                                Wald chi2(28)     =   20644.13
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

                                   (Std. Err. adjusted for 39 clusters in township)
-----------------------------------------------------------------------------------
                  |               Robust
       lridership |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
     mlrestaurant |     .20265    .054075     3.75   0.000      .096662      .30863
  mlaccessibility |     .25927    .073673     3.52   0.000       .11487      .40366
         mlschool |    .044068    .024787     1.78   0.075    -.0045136      .09265
        mterminal |     1.0838     .15924     6.81   0.000       .77167      1.3959
          mlprice |    -.77438     .23918    -3.24   0.001      -1.2432      -.3056
     mlpopulation |    -.09414    .048551    -1.94   0.053       -.1893    .0010187
            mlbus |     .28815     .08463     3.40   0.001       .12228      .45402
     dlrestaurant |     .11254    .013993     8.04   0.000      .085113      .13996
  dlaccessibility |     .65164     .15144     4.30   0.000       .35482      .94847
         dlschool |    .082737      .1682     0.49   0.623      -.24692      .41239
        dterminal |     .85962     .22355     3.85   0.000       .42147      1.2978
          dlprice |    -.21553     .22047    -0.98   0.328      -.64764      .21659
     dlpopulation |     1.0708     .44745     2.39   0.017       .19383      1.9478
            dlbus |     .28538     .18144     1.57   0.116     -.070228      .64099
        lage20_29 |     1.7172     .37969     4.52   0.000       .97298      2.4613
         lPrimary |      1.141     .62523     1.82   0.068     -.084424      2.3664
                  |
timedistri_class5 |
               2  |    -.17806     .11608    -1.53   0.125      -.40559     .049457
               3  |    .054044     .11229     0.48   0.630      -.16605      .27414
               4  |     .58107     .13603     4.27   0.000       .31447      .84768
               5  |     .44735     .11162     4.01   0.000       .22859      .66611
                  |
             year |
            2012  |     .17228    .034212     5.04   0.000       .10522      .23933
            2013  |     .20035     .10761     1.86   0.063     -.010559      .41126
            2014  |     .26443     .12883     2.05   0.040       .01193      .51692
            2015  |     .17797     .13549     1.31   0.189     -.087575      .44352
            2016  |    .096737     .15417     0.63   0.530      -.20543      .39891
            2017  |     .36876     .19443     1.90   0.058     -.012312      .74983
            2018  |     .49949     .22657     2.20   0.027      .055426      .94355
            2019  |     .29396      .2633     1.12   0.264       -.2221      .81002
                  |
            _cons |     3.4621     3.4017     1.02   0.309      -3.2052      10.129
------------------+----------------------------------------------------------------
          sigma_u |  .31205097
          sigma_e |  .16789349
              rho |  .77550717   (fraction of variance due to u_i)
-----------------------------------------------------------------------------------

I have a few follow-up questions regarding your suggestions:

1. Are there specific criteria I should use to compare the two clustering approaches and decide whether to cluster by township? What would be appropriate criteria for this comparison?

2. When you mentioned "DK standard errors", were you referring to the "xtscc" command in Stata? Can this command also be used to estimate random-effects models? If so, could you please tell me the specific code?

3. I noticed in another post https://www.statalist.org/forums/for...-in-panel-data that Mr. Kolev mentioned: "If we start to worry about countries' residuals being cross sectionally correlated, I do not know what we can assume that can be possibly cross sectionally uncorrelated -- observations coming from alternative universes? I think that with your data you should stick to -xtreg- because your data is more of the large N variety. I would not worry about cross sectional correlation in your case, and just do -xtreg, robust- which will give you standard errors robust to heteroskedasticity and arbitrary within country correlation."
Given my current analytical capabilities, I find it challenging to address the cross-sectional dependence. Would it be acceptable in my paper to use clustered standard errors as a simplified approach to the model, acknowledging this as a research limitation that may be addressed in future studies?


Manh Hoang Ba

Join Date: Aug 2023

Posts: 23
#4

25 Aug 2025, 08:05

The performance of clustered standard errors depends on the number of clusters and on how reasonable the assumption of independence between clusters is. With 39 clusters, the number of clusters seems statistically acceptable. Allowing for arbitrary correlation within clusters helps to control for some of the dependence between stations in the same cluster. But whether stations in one cluster are independent of stations in another cluster needs careful consideration.

Yes, the DK (Driscoll-Kraay) standard error is calculated using the xtscc command in Stata. Version 1.4 of this command (ssc install xtscc) has the re option, which allows estimating DK standard errors for the RE model. The DK standard error was originally designed for large-T panels, but Monte Carlo evidence shows that for small-T panels it has better properties than the OLS standard error and the Arellano standard error (for details, see Hoechle, D. (2007)).
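
A minimal sketch of what this could look like for the hybrid model, using the variable names from post #1. Two things here are assumptions on my part: the year and class dummies are built by hand in case your installed version of xtscc does not accept factor-variable notation, and lag(1) is only an illustrative choice given T = 9. The prefixes yr_ and cls_ are names introduced for this example.

```stata
* ssc install xtscc, replace     // uncomment to install or update xtscc from SSC
xtset station_id year

* Hand-made dummies: yr_1 ... yr_9 and cls_1 ... cls_5 (first category is the base)
tabulate year, generate(yr_)
tabulate timedistri_class5, generate(cls_)

* Hybrid model with Driscoll-Kraay standard errors, RE estimator
xtscc lridership dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus ///
    mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus ///
    lage20_29 lPrimary1500 cls_2-cls_5 yr_2-yr_9, re lag(1)
```

The coefficients should be comparable to those from xtreg, re; it is the standard errors that are expected to differ from the clustered ones.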

In the context of a large-N, fixed-T panel, inference about the parameters relies on asymptotics as N grows. The assumption of independent random sampling by ID may be reasonable in some contexts, but not in others. If the population consists of 88 stations, the assumption of independent random sampling seems unnatural here! Also, for entities with fixed geographic locations, spatial spillovers between them are widely recognized, and a class of spatial econometric models has been developed to analyze such situations. I think you can compare the results from different treatments of cross-sectional dependence and draw conclusions about how much the choice matters. You can also keep the assumption of independence between cross-sectional units (stations) and use Arellano standard errors (clustered by ID), as Kolev mentioned.
Ma chen

Join Date: Jan 2024

Posts: 4
#5

25 Aug 2025, 21:01

Dear Mr. Ba,

Thank you once again for your thoughtful response. Your insights have been extremely helpful.

Previously, I was under the impression that xtscc could only be used for fixed effects models and pooled OLS estimation. Your explanation about the "re" option has been very illuminating. Given the potential lack of theoretical justification for clustering by townships, I think xtscc, re is a good idea.

Best regards,
Chen