Dear all,
I have some questions about the Mundlak approach. The hybrid model is estimated as a random-effects model, and I would like to know how to test it for heteroskedasticity and cross-sectional dependence.
The purpose of my research is to conduct a longitudinal analysis of the average daily station-level metro ridership and its determinants in Xi'an, China, during December of each year from 2011 to 2019.
First I need to introduce my data structure:
Xi'an opened its first line in 2011, and by 2019, it had opened five lines. My research focuses on 88 stations on four lines (one line is excluded) within the study area. The number of stations has increased over time, and data for the 25 new stations that opened in 2019 is available only for a single period.
I read Kasraian D, Maat K, Van Wee B. Urban developments and daily travel distances: Fixed, random and hybrid effects models using a Dutch pseudo-panel over three decades. Journal of Transport Geography, 2018, 72: 228-236. That paper uses pooled OLS (POLS), fixed-effects, random-effects, and hybrid (Mundlak approach) models. The hybrid model can capture both the within-group and the between-group effects of time-varying variables, which fits my research goals well, so I would like to use this approach. I therefore selected the same four models for a comparative analysis.
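For reference, the hybrid specification described above can be written compactly as follows. This is only a sketch in notation assumed here, not taken from the cited paper: y_it is log ridership for station i in year t, x_it are the time-varying variables, x̄_i their station means, z_i the time-invariant variables, δ_t year effects, and u_i the random station effect.

```latex
y_{it} = \alpha
       + \beta_W \,(x_{it} - \bar{x}_i)
       + \beta_B \,\bar{x}_i
       + \gamma \, z_i
       + \delta_t
       + u_i + \varepsilon_{it}
```

Here β_W is the within-station effect and β_B the between-station effect of the same variable.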
I selected the following variables. Some variables are treated as time-invariant because data for them are available for only one year.
Time-varying variables:
1. Ridership: station-level ridership for each station per year
2. Restaurant: the number of restaurants in station catchment areas (SCAs) per year
3. Accessibility: the number of people reachable within 40 minutes per station per year
4. School: the number of schools in SCAs per year
5. Terminal: terminal station (dummy)
6. Price: mean house price in SCAs per year
7. Population: Population density in SCAs per year
8. Bus: the number of bus lines in SCAs per year
Time-invariant variables:
1. age20_29: the proportion of the population aged 20-29 in SCAs (single-year data)
2. Primary: the proportion of the population with primary education in SCAs (single-year data)
3. timedistri_class5: stations are divided into five categories based on the number of boardings and alightings at each station in each hour of the day.
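The prefix l indicates log transformation. In the code, the prefixes m and d denote the station mean of a variable and the deviation from that mean.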
Below are the model results.

From what I have learned, the fixed-effect model requires a series of tests, including testing for heteroskedasticity, serial correlation, and cross-sectional dependence, in order to select the appropriate estimation command.
The three tests I used are as follows:
xtcd2 // tests for cross-sectional dependence (xtcsd, pesaran abs returns no result because 25 stations have only single-period data)
xttest3 // tests for heteroskedasticity
xtserial y x // tests for serial correlation
My questions are:
1. Will the single-period data for these 25 stations affect the use of the hybrid model?
2. The xtcd2 and xttest3 commands can only be used after xtreg,fe. The variables used in the hybrid model differ from those in the FE model, so I believe retesting is necessary. I'd like to know how to test the hybrid model for heteroskedasticity and cross-sectional correlation.
3. I know I can use xtreg,re vce(cluster station_id) to obtain robust estimates of the hybrid model, but I'm concerned about cross-sectional dependence. If cross-sectional dependence exists, how should I address it? Can I still use the hybrid model?
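For question 1, one quick check can be sketched with the d-prefixed deviation variables used in the hybrid model. A station observed only once has a station mean equal to its single value, so all of its within deviations are zero and it can inform only the between (station-mean) terms and the time-invariant variables. The names n_obs and single are hypothetical, and the sketch assumes the d-variables have already been generated.

```stata
* Flag stations that are observed in only one year
bysort station_id: generate n_obs = _N
generate byte single = (n_obs == 1)

* How many observations come from single-period stations
count if single

* For these stations every within deviation should be exactly zero,
* because the station mean equals the one observed value
summarize dlrestaurant dlaccessibility dlschool dlprice dlpopulation dlbus if single
```

If the summaries show a mean and standard deviation of zero, the single-period stations add nothing to the within coefficients.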
I'll show you what I tried to do to test for cross-sectional correlation:
This suggests that the fixed-effects model does not have a cross-sectional dependence problem (CD = 1.29, p = 0.198). I also made a (rather naive) attempt with the hybrid specification, estimated with the fe option so that the time-invariant variables are omitted, and here the test does indicate cross-sectional dependence (CD = 3.01, p = 0.003).
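An alternative to refitting the hybrid model with fe might be to keep the random-effects fit, save its residuals, and test those. This is only a sketch under assumptions: it assumes the stored estimates hybrid_rob are still in memory, that xtcd2 (a community-contributed command) accepts a residual variable as its argument, and e_hybrid is a hypothetical variable name.

```stata
* Bring back the random-effects hybrid estimates stored earlier
* (assumes "est store hybrid_rob" was run in the current session)
estimates restore hybrid_rob

* Save the idiosyncratic residuals e_it for the estimation sample
predict double e_hybrid if e(sample), e

* Apply the CD test to the saved residuals instead of an fe refit
* (assumption: this version of xtcd2 takes a variable name)
xtcd2 e_hybrid
```

Because the residuals come from the re fit, the time-invariant variables and the station means stay in the model being tested.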
Thanks in advance.
Best regards,
Chen
Code:
* Panel structure
xtset station_id year
xtdes

station_id:  1, 2, ..., 88                                   n =         88
      year:  2011, 2012, ..., 2019                           T =          9
             Delta(year) = 1 unit
             Span(year)  = 9 periods
             (station_id*year uniquely identifies each observation)

Distribution of T_i:   min      5%     25%       50%       75%     95%     max
                         1       1       1         4         7       9       9

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------
       25     28.41   28.41 |  ........1
       24     27.27   55.68 |  .....1111
       18     20.45   76.14 |  ..1111111
       17     19.32   95.45 |  111111111
        4      4.55  100.00 |  ...111111
 ---------------------------+-----------
       88    100.00         |  XXXXXXXXX
Code:
* Station means (m prefix) and within-station deviations (d prefix).
* terminal is included so that mterminal and dterminal exist for the hybrid model.
sort station_id year
foreach var of varlist lridership lrestaurant laccessibility lschool lprice lpopulation lbus terminal {
    egen m`var' = mean(`var'), by(station_id)
    gen d`var' = `var' - m`var'
}

* Pooled OLS
reg lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus lage20_29 lPrimary i.timedistri_class5 i.year, vce(cluster station_id)
est store pols_rob

* Fixed effects
xtreg lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus i.year, fe vce(cluster station_id)
est store fe_rob

* Random effects
xtreg lridership lrestaurant laccessibility lschool terminal lprice lpopulation lbus lage20_29 lPrimary i.timedistri_class5 i.year, re vce(cluster station_id)
est store re_rob

* Hybrid model (Mundlak approach): deviations, station means, and time-invariant variables
xtreg lridership dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus lage20_29 lPrimary i.timedistri_class5 i.year, re vce(cluster station_id)
est store hybrid_rob
Code:
* Cross-sectional dependence test after the fixed-effects model
xtreg lridership_new lcatering1000 lnet40_052102 lZXX05 terminal lreal_price lpopden1000 lsmallbus i.year, fe robust
xtcd2

Pesaran (2015) test for cross sectional dependence
Postestimation.
Unbalanced panel detected, test adjusted.
H0: errors are weakly cross sectional dependent.
    CD = 1.286368
    p_value = .19831467
Code:
* Cross-sectional dependence test after the hybrid specification estimated with fe
xtreg lridership dlrestaurant dlaccessibility dlschool dterminal dlprice dlpopulation dlbus mlrestaurant mlaccessibility mlschool mterminal mlprice mlpopulation mlbus lage20_29 lPrimary1500 i.timedistri_class5 i.year, fe robust
xtcd2

Pesaran (2015) test for cross sectional dependence
Postestimation.
Unbalanced panel detected, test adjusted.
H0: errors are weakly cross sectional dependent.
    CD = 3.0094369
    p_value = .00261732