Why can't we cluster on anything we like?

Joro Kolev

Join Date: Aug 2018

Posts: 3050
#16

02 Aug 2020, 09:43

I agree with you in principle. I think that Stata is overly paternalistic. It should calculate whatever we ask for that is calculable, and it should be up to us users to decide whether the calculation makes sense.

I was just saying that they give a good reason for the nesting.

There is no theorem to claim that MLE and GLS estimates have to be numerically the same. However they are both consistent under the same set of conditions. It is also easy to check how different they are in your own dataset.
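
For instance, a minimal sketch of such a check on the nlswork example data. The names gls and ml are just labels picked here for the stored results, and the regressors are only an illustration; substitute your own model.

```stata
* Fit both random-effects estimators on the same data and compare them side by side
webuse nlswork, clear
xtset idcode

* GLS random-effects estimator
quietly xtreg ln_wage age ttl_exp hours, re
estimates store gls

* Maximum-likelihood random-effects estimator
quietly xtreg ln_wage age ttl_exp hours, mle
estimates store ml

* Coefficients and standard errors in adjacent columns, plus the sample size
estimates table gls ml, b(%10.7f) se(%10.7f) stats(N)
```

If the two columns differ only in the fourth or fifth decimal place, the choice between the estimators matters little for that dataset. This is a check on one sample, not a general result.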

Originally posted by paulvonhippel

Documentation aside, there are studies where you want to cluster on a variable that's not nested in the random effects. StataCorp was thinking about a particular data structure when they wrote the documentation, but the right way to cluster depends on the design of the study. In the study I'm working with, it's clear that having student random effects and clustering by teacher is desirable. So I was delighted to see the nonest option.

I'm not sure if -xtreg, re- and -xtreg, mle- give similar estimates in all data. They do in the nlswork data that you've used as an example, but they might give pretty different estimates when there are a lot of missing values.
paulvonhippel

Join Date: Apr 2014

Posts: 502
#17

03 Aug 2020, 20:10

Joro Kolev, you write: "There is no theorem to claim that MLE and GLS estimates have to be numerically the same. However they are both consistent under the same set of conditions."

Is that right? Are they both consistent when there are missing values on the dependent variables? Where can I read more about this?

Joro Kolev

Join Date: Aug 2018
Posts: 3050

#18

03 Aug 2020, 23:34

I think in this context both -xtreg, re- and -xtreg, mle- do what is called complete case analysis; that is, they include in the regression only rows for which the dependent variable and all independent variables are non-missing. As you can see in the regressions below, 1) the two estimators give almost the same estimates whether or not I set a random 10% or so of the dependent variable to missing (the condition runiform()>.9 selects about 10% of the rows, here 2,866), and 2) they use exactly the same number of observations as each other in both cases.

I do not think that -xtreg, mle- does anything special for missing values here. And they are both consistent under the random effects model assumptions.

Code:

. webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. xtset idcode
       panel variable:  idcode (unbalanced)

. xtreg ln_wage age ttl_exp hours, re

Random-effects GLS regression                   Number of obs     =     28,443
Group variable: idcode                          Number of groups  =      4,709

R-sq:                                           Obs per group:
     within  = 0.1373                                         min =          1
     between = 0.2590                                         avg =        6.0
     overall = 0.1801                                         max =         15

                                                Wald chi2(3)      =    5114.31
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0068148   .0006925    -9.84   0.000    -.0081721   -.0054575
     ttl_exp |   .0428326   .0010267    41.72   0.000     .0408202     .044845
       hours |   .0003067   .0002255     1.36   0.174    -.0001353    .0007487
       _cons |   1.597294    .018722    85.32   0.000     1.560599    1.633988
-------------+----------------------------------------------------------------
     sigma_u |  .32309332
     sigma_e |  .29766067
         rho |  .54090192   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. xtreg ln_wage age ttl_exp hours, mle nolog

Random-effects ML regression                    Number of obs     =     28,443
Group variable: idcode                          Number of groups  =      4,709

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =        6.0
                                                              max =         15

                                                LR chi2(3)        =    4678.61
Log likelihood  = -10500.075                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0068087   .0006932    -9.82   0.000    -.0081672   -.0054501
     ttl_exp |   .0428009   .0010297    41.57   0.000     .0407827    .0448191
       hours |   .0003018   .0002256     1.34   0.181    -.0001404    .0007441
       _cons |   1.597463   .0187382    85.25   0.000     1.560737    1.634189
-------------+----------------------------------------------------------------
    /sigma_u |   .3260675    .004153                      .3180286    .3343095
    /sigma_e |   .2984172   .0013726                       .295739    .3011197
         rho |   .5441903   .0069048                      .5306341    .5576952
------------------------------------------------------------------------------
LR test of sigma_u=0: chibar2(01) = 1.2e+04            Prob >= chibar2 = 0.000

. replace ln_wage=. if runiform()>.9
(2,866 real changes made, 2,866 to missing)

. xtreg ln_wage age ttl_exp hours, re

Random-effects GLS regression                   Number of obs     =     25,587
Group variable: idcode                          Number of groups  =      4,645

R-sq:                                           Obs per group:
     within  = 0.1400                                         min =          1
     between = 0.2565                                         avg =        5.5
     overall = 0.1816                                         max =         15

                                                Wald chi2(3)      =    4715.98
corr(u_i, X)   = 0 (assumed)                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0068057   .0007219    -9.43   0.000    -.0082206   -.0053908
     ttl_exp |   .0430682   .0010688    40.29   0.000     .0409733     .045163
       hours |   .0003284   .0002378     1.38   0.167    -.0001377    .0007944
       _cons |   1.594197   .0195405    81.58   0.000     1.555899    1.632496
-------------+----------------------------------------------------------------
     sigma_u |  .32322814
     sigma_e |  .29678027
         rho |  .54257979   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. xtreg ln_wage age ttl_exp hours, mle nolog

Random-effects ML regression                    Number of obs     =     25,587
Group variable: idcode                          Number of groups  =      4,645

Random effects u_i ~ Gaussian                   Obs per group:
                                                              min =          1
                                                              avg =        5.5
                                                              max =         15

                                                LR chi2(3)        =    4307.46
Log likelihood  = -9608.1982                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0068011   .0007224    -9.41   0.000    -.0082171   -.0053851
     ttl_exp |   .0430403   .0010717    40.16   0.000     .0409397    .0451408
       hours |   .0003239   .0002379     1.36   0.173    -.0001424    .0007903
       _cons |   1.594377   .0195561    81.53   0.000     1.556048    1.632706
-------------+----------------------------------------------------------------
    /sigma_u |   .3260184    .004234                      .3178246    .3344234
    /sigma_e |   .2975915   .0014573                      .2947488    .3004615
         rho |   .5454899   .0071001                      .5315489    .5593751
------------------------------------------------------------------------------
LR test of sigma_u=0: chibar2(01) = 1.0e+04            Prob >= chibar2 = 0.000
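
One way to confirm the complete-case behaviour directly is to compare the estimation sample with the number of rows that have no missing value in any model variable. This is a sketch, assuming it is run immediately after one of the -xtreg- commands above, so that e(N) and e(sample) refer to that model. The local macro name complete is arbitrary.

```stata
* Rows with no missing value in the outcome or in any regressor
count if !missing(ln_wage, age, ttl_exp, hours)
local complete = r(N)

* Rows actually used by the most recent estimation command
count if e(sample)

* Both checks should pass if the command simply drops incomplete rows
assert r(N) == `complete'
assert e(N) == `complete'
```

Because the -replace- line above uses runiform() without a preceding -set seed-, the exact counts will change from run to run, but the two assertions should hold either way.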

