Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • variance correction for Sandwich estimators.

    I'm fitting a marginal logistic model for clustered data and have a question about the variance multiplier for the robust variance. my understanding is that I need to specify the nmp option in xtgee to get this, and that the multiplier is M/(M-k), where M is the number of clusters and k is the number of cluster level predictor variables in my model. per the attached output, involving 9000 observations in 26 clusters, I get identical variance estimates in xtgee regardless of whether I specify the nmp option. Does this mean that the variance adjustment is already baked in, and that the nmp option is no longer needed, or that the adjustment is not M/(M-k), but rather N/(N-k), which in this case would be a pretty small adjustment? My understanding from the reading I've been doing is that it should be M (the cluster size) and not N (total sample size). As shown in the attached output, I also ran the same model using logistic with the vce(cluster clname) option and, as expected, matched the xtgee output. this is relevant for me since my actual dataset is 5 times as large as this and hence I can't fit xtgee due to matsize constraints in my version of Stata (I'm limited to 800 but need ~3400). However the model fits fine using the logistic command. unfortunately logistic does not even provide a variance adjustment option, even though in this case it is also using a sandwich estimator. so if it turns out I do need to adjust the variances from this model I presume I can just multiply the resulting SEs by sqrt (M/[M-k]). I realize that in any case I should also be treating the resulting test statistic as a t-statistic and adjust the df accordingly (M-k in this case). thanks for your thoughts. Bill
    Attached Files

  • #2
    Welcome to Statalist, Bill! Unfortunately it's tough to read your doc file. The font is very light- and I can't seem to increase the contrast. Also, some very knowledgeable Statalist members don't use Word.Repost and put all the code and results between CODE delimiters, described in FAQ 12. You can put the quotes from the Manual between QUOTE delimiters,which are or just give the page numbers in the Manual, if version 14.
    Last edited by Steve Samuels; 31 May 2016, 18:08.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2

    Comment


    • #3
      I am running Stata/IC 13.1 for Windows, 64-bit platform. My issue relate to the use of the Huber robust sandwich estimator of the variance covariance matrix, vce. Two questions arise, relating to
      1. Scalar adjustment of the variance estimate
      2. Distributional changes in construction of confidence intervals
      I found a very nice discussion of these issues at
      http://www.stata.com/support/faqs/statistics/sandwich-estimate-of-variance/


      My question relates to the first of these issues. I am trying to use the nmp option to inflate the vce, and am finding that (1) in some contexts use of the option has no impact, and (2) in other contexts there is an impact but it isn't as expected. I give detailed illustrations of this below. I am trying to fit a marginal logistic regression model using xtgee for binary data clustered by the variable clinic (I have 26 clinics). My binary outcome is resultever, my primary variable of interest is a binary variable called intv, andI am adjusting for an 8-level factor called sno, which is a clinic level covariate.

      To illustrate my questions I fit 6 models . Model 1 is a base case that uses the logistic command and ignores clustering.

      Code:
      . set matsize 800
      . xtset clinic                      panel variable:  clinic (unbalanced)
      . logistic resultever intv i.sno
      I then added the vce(cluster clinic) option to adjust for the clustering. this is Model 2

      Code:
      . logistic resultever intv i.sno, vce(cluster clinic)
       
      Logistic regression                               Number of obs   =       8178
                                                                   Wald chi2(8)      =      43.69
                                                                   Prob > chi2        =     0.0000
      Log pseudolikelihood = -3004.8192     Pseudo R2        =     0.0314
       
                                               (Std. Err. adjusted for 26 clusters in clinic)
      -----------------------------------------------------------------------------------------------------------
                                  |                      Robust
                 resultever | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ----------------------+-----------------------------------------------------------------------------------
                           intv |    1.34326   .2257688     1.76   0.079     .9662605    1.867351
                                  |
                           sno |
        Benton County  |          1          (base)
                        CHC  |   .9999088    .698548    -0.00   1.000     .2542735    3.932056
               La Clinica   |   .6170858   .3492471   -0.85   0.394     .2035157    1.871084
                    Mosaic  |   .7265976   .4355323    -0.53   0.594     .2244256    2.352423
       Multnomah Cty  |   1.762125    .971949      1.03   0.304     .5977686    5.194462
                     OHSU  |   .6028504   .4286384    -0.71   0.477     .1496196    2.429017
              Open Door  |   1.121821   .5975083     0.22   0.829     .3949606     3.18635
        Virginia Garcia  |   2.082583   1.115184     1.37   0.171     .7291217    5.948461
                                  |
                       _cons |   .1061458    .056175     -4.24   0.000     .0376206    .2994885
      ------------------------------------------------------------------------------------------------------------
      As expected, this had no impact on OR estimates but a major impact on the SEs. Next, I fit xtgee with independent correlation structure and vce(robust) specified to just verify that, as expected, it would match model 2, and it did (output not shown). this is my Model 3.

      Code:
      . xtgee resultever intv i.sno, family(binomial) corr(indep) eform vce(robust)
      When I added the nmp option, however, it had no effect the output was identical. The code for this Model 4 was

      Code:
      . xtgee resultever intv i.sno, family(binomial) corr(indep) eform vce(robust) nmp
      A colleague at work suggested that maybe the reason the nmp option here had no impact was because I had specified the corr(indep) option. So I reran the model using the default exchangeable covariance structure. First time is without the nmp option; this is Model 5.

      Code:
      . xtgee resultever intv i.sno, family(binomial) eform vce(robust)
       
      Iteration 1: tolerance = .16750697
      ...
      Iteration 5: tolerance = 1.108e-07
       
      GEE population-averaged model                   Number of obs       =      8178
      Group variable:                     clinic                 Number of groups   =        26
      Link:                                       logit                 Obs per group: min  =        92
      Family:                            binomial                                          avg  =     314.5
      Correlation:             exchangeable                                         max =       694
                                                                             Wald chi2(8)            =     35.76
      Scale parameter:                         1                 Prob > chi2              =    0.0000
       
                                                (Std. Err. adjusted for clustering on clinic)
      ------------------------------------------------------------------------------------------------------------
                                  |                     Robust
                 resultever | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ----------------------+-------------------------------------------------------------------------------------
                           intv |   1.380307   .2462459     1.81   0.071     .9730198    1.958075
                                 |
                          sno |
       Benton County  |          1         (base)
                       CHC  |   .8381794   .5731078    -0.26   0.796     .2194448    3.201464
               La Clinica  |   .6331332   .3434769    -0.84   0.399     .2186329    1.833474
                    Mosaic  |   .7462672   .3961804    -0.55   0.581     .2636369    2.112431
       Multnomah Cty  |   1.509214   .7614865      0.82   0.415     .5613972    4.057249
                     OHSU  |   .5199297    .366076     -0.93   0.353      .130805    2.066641
              Open Door  |   1.051037   .5176295     0.10   0.919     .4003194    2.759496
        Virginia Garcia  |   1.902839   .9461655     1.29   0.196     .7180446    5.042577
                                 |
                       _cons |   .1133024   .0560087    -4.41   0.000     .0429997    .2985473
      ------------------------------------------------------------------------------------------------------------
      not unexpectedly, both the OR estimates and their SEs changed. Finally, I refit model 5, but this time with the nmp option and indeed the variances do change now. This is Model 6.

      Code:
      . xtgee resultever intv i.sno, family(binomial) eform vce(robust) nmp
       
      Iteration 1: tolerance = .16743223
      ...
      Iteration 5: tolerance = 1.111e-07
       
      GEE population-averaged model                   Number of obs       =      8178
      Group variable:                     clinic                 Number of groups   =        26
      Link:                                       logit                 Obs per group: min  =        92
      Family:                            binomial                                           avg =     314.5
      Correlation:            exchangeable                                          max =       694
                                                                             Wald chi2(8)            =     35.76
      Scale parameter:                         1                 Prob > chi2             =    0.0000
       
                                                (Std. Err. adjusted for clustering on clinic)
      ------------------------------------------------------------------------------------------------------------
                                  |                      Robust
                 resultever | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
      ----------------------+------------------------------------------------------------------------------------
                           intv |     1.3803   .2462323     1.81   0.071     .9730321    1.958031
                                  |
                           sno |
        Benton County  |          1          (base)
                        CHC  |   .8382422   .5731553    -0.26   0.796     .2194589    3.201739
                La Clinica  |   .6331354   .3434848    -0.84   0.400      .218629    1.833519
                    Mosaic  |   .7462766    .396201      -0.55   0.581     .2636293    2.112545
        Multnomah Cty  |   1.509279   .7615442     0.82   0.415     .5614031    4.057555
                      OHSU  |   .5199509    .366095     -0.93   0.353     .1308083    2.066756
               Open Door  |   1.051067     .51766       0.10   0.919     .4003184    2.759656
         Virginia Garcia  |    1.9029     . 9462247     1.29   0.196     .7180467    5.042888
                                  |
                       _cons |   .1132993   .0560085     -4.41   0.000     .0429975     .298546
      -------------------------------------------------------------------------------------------------------------
      Why does nmp have no impact with corr(indep)?

      Leaving aside the issue of why nmp has no impact if we specify an independent correlation structure, let’s consider why the corr within clinics should be exchangeable, which is the default for xtgee. I get that, if we view the clinics as random effects, then the correlation of two observations within the same clinic is given by the ICC, and hence an exchangeable correlation structure makes sense. But then why even allow for an independence correlation structure, or any of the several other allowable correlation structures? In some sense that correlation is only present if you look at the entire dataset (observations within a cluster are more similar than are observations between clusters due to the shared cluster effect). However if I condition on a given clinic I assume all of the observations are independent (at least for this study), and that is how I interpreted the corr option. So in this sense specifying an independent corr structure makes sense. By contrast for a repeated measures design you might reasonably postulate an autoregressive error structure.

      Let’s look at it another way. If we say the clustering forces an exchangeable correlation, then wouldn’t the converse be true and specification of an independent correlation would imply that the cluster effects are ignorable? Clearly that is not what’s happening. The results from models 2 and 3 above make it quite clear that the logistic command, even with vce(cluster clinic) specified, assumes an independent correlation structure within clinics. And at the same time the results from models 1 and 2 make it very clear that the resulting model is still distinct from the pure independence model. Clearly the adjustment for clinic effects in model 2 has a profound effect on the results even in the absence of an exchangeable correlation structure. Indeed the incremental impact of further assuming an exchangeable correlation structure was much smaller, at least for this example.

      So I am left wondering what is the nature of the adjustment that models 2-4 are making that is distinct from what model 5 is doing? Yes there is the issue of indep vs exch correlation of observations within clinic, but I guess what else is different? Model 2 already adjusts for clinic effects, so is the use of an exch corr adjusting for something other than clinic effects, or is it just a more sophisticated (precise??) adjustment for clinic effects? Is it perhaps just that models 2-4 treat clinic as a fixed effect and model 5 treats it as a random effect? If this were true then we would expect that model 5 would result in higher SEs since we are now adding in an extra component of variability that was ignored in models 2-4. However if you look at the SEs you will see that they aren’t consistently higher in model 5.

      A second issue in relation to this example is the nature of the nmp adjustment. Per Stata’s documentation for xt_vc-_options,

      nmp specifies that the divisor N-P be used instead of the default N, where N is the total number of observations and P is the number of coefficients estimated.
      So this would suggest that Stata really does use total N, and not M=number of clinics. However Stata’s documentation on the robust variance estimator has the following technical note.

      What is written above is asymptotically correct but ignores a finite-sample adjustment to V. For maximum likelihood estimators, when you specify vce(robust) but not vce(cluster clustvar), a better estimate of variance is V* = {N/(N-1)}V. When you also specify the vce(cluster clustvar) option, this becomes V* = {M/(M-1)}V.
      Note the distinction between use of N without clustering and M with clustering, even though this isn’t reflected in the description of the NMP adjustment. These adjustment also subtract by 1 in the denominator, instead of P. It also is not clear if Stata has these adjustments built in already, or if they are just making the observation. I find no option to reflect this specific adjustment. Further, the Stata discussion that I referenced at the beginning of this post clearly states that N should be the number of clusters. This is also supported by the source papers on the topic, including
      Mancl and DeRouen, Biometrics, 57, 126-134, March 2001
      I calculated the ratio of SEs for the terms in model 6 relative to those in model 5, squared these numbers, and then compared against possible formulas for the correction. For this I reran these two models without the eform option so that I would be working with the correct SEs. What I expected to see was (a) the same ratio in each case – what you would expect if indeed it is a constant multiplier of the Covariance matrix, and (b)numbers greater than 1. Instead I got 9 distinct ratios and one of them was<1. I welcome your thoughts.







      Comment


      • #4
        sorry the regr output came out weird. It looked good when I hit post key.

        Comment

        Working...
        X