  • Incorporating survey bootstrap replicate weights in repeated measures design

    Hi,

    Question for Statalist. Please forgive the longish question, but hopefully this is an interesting problem to solve!

    How is clustering of repeated measures handled when using bootstrap replicate weights?

    Say I am interested in analysing a dataset with repeated measures and I would like to obtain a population-averaged model. But suppose the survey data provide only replicate weights, which must be used to estimate the variance.

    I understand that each replicate weight is used to generate a copy of the point estimates (with 50 replicate weights there would be 50 replicated point estimates), and that the distribution of these replicated point estimates is then used to estimate the variance. But from my reading, I get the impression that using bootstrap replicate weights is enough to account for any individual-level clustering, through an "ultimate cluster" assumption under which any variability or clustering nested within the PSU is rolled up and captured in the composite variance estimate.
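
    To make the replicate mechanism concrete, here is a minimal sketch that rebuilds a replicate-weight standard error by hand. It assumes the BRR weights shipped with nhanes2brr (not true bootstrap weights), no Fay adjustment, and the mse form, in which deviations are taken around the full-sample estimate. The scalar names b_full and sumsq are made up for illustration.

    ```stata
    webuse nhanes2brr, clear
    generate byte highbpsys = bpsystol > 140 if !missing(bpsystol)

    * full-sample point estimate using the main sampling weight
    quietly logit highbpsys height weight age female [pw=finalwgt]
    scalar b_full = _b[height]

    * re-estimate once per replicate weight and accumulate squared deviations
    scalar sumsq = 0
    forvalues r = 1/32 {
        quietly logit highbpsys height weight age female [pw=brr_`r']
        scalar sumsq = scalar(sumsq) + (_b[height] - scalar(b_full))^2
    }

    * mse-style replicate variance: mean squared deviation over the 32 replicates
    display "hand-rolled replicate SE of _b[height]: " sqrt(scalar(sumsq)/32)

    * the svy prefix does the same bookkeeping internally
    svyset [pw=finalwgt], brrweight(brr_1-brr_32) vce(brr) mse
    quietly svy: logit highbpsys height weight age female
    display "svy brr SE of _b[height]:               " _se[height]
    ```

    Only the replicated point estimates enter the variance calculation; the model-based standard error reported inside each replicate fit is never used.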

    For example, on page 9 of the following paper (from SAS no less): http://support.sas.com/resources/pap.../0767-2017.pdf
    "Again, it bears repeating that we can carry out this hypothesis test using output from PROC SURVEYMEANS because data from the two years (i.e., the two domains) do not emanate from one or more of the same PSUs—granted, we could have used PROC SURVEYFREQ to get the same figures(see Program 4.1 of Lewis (2016)). If this is not the case, then the standard error of the difference will generally contain a non-zero covariance, which PROC SURVEYMEANS (or PROC SURVEYFREQ) is not designed to estimate. Because the covariance is often positive, thinking back to the formulas presented in Section 4.1, it is advantageous to account for it, because it can serve to reduce the estimated standard error of the difference. One way to implicitly account for the covariance is via the general methodology shown in Section 4.4*. In fact, the methodology works for differences in any kind of parameter, including means and totals"

    [*Where Section 4.4 describes the bootstrap methodology]
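
    The role of the covariance in the quoted passage follows from the variance of a difference. As a brief reminder in generic notation (the symbols below are illustrative and are not taken from the SAS paper), for two estimates \hat\theta_1 and \hat\theta_2 with difference \hat{d} = \hat\theta_1 - \hat\theta_2, and R replicates:

    ```latex
    \operatorname{Var}(\hat\theta_1 - \hat\theta_2)
      = \operatorname{Var}(\hat\theta_1) + \operatorname{Var}(\hat\theta_2)
      - 2\,\operatorname{Cov}(\hat\theta_1, \hat\theta_2)

    \widehat{V}_{\text{rep}}(\hat{d})
      = \frac{c}{R} \sum_{r=1}^{R} \bigl(\hat{d}_r - \hat{d}\bigr)^2,
      \qquad \hat{d}_r = \hat\theta_{1,r} - \hat\theta_{2,r}
    ```

    Here c is a method-specific scaling constant (assumed to be 1 for plain BRR or a simple bootstrap). Because each replicate difference is computed from both estimates under the same replicate weight, the covariance term is absorbed into the replicate variance without being estimated separately.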

    So my question is, what "happens" to the within-person clustering when bootstrap replicates are used? Is within-person clustering implicitly accounted for when I use bootstrap replicate weights? Or, do I need to use both bootstrap weights (to account for complex survey design) AND account for within-person correlation (through the use of GEE or by adding a vce(cluster) option)? In my example below, which standard errors would you tend to go with?

    Code:
    /******************************/
    /*load data*/
    /******************************/
    
    webuse nhanes2brr, clear
    
    /******************************/
    /*generate dummy outcomes - pretend these are repeated measures*/
    /******************************/
    
    *high systolic BP
    *(the !missing() guard matters because Stata treats missing as larger than any number)
    generate byte highbpsys = bpsystol > 140 if !missing(bpsystol)
    tab highbp highbpsys
    
    *high diastolic BP
    generate byte highbpdia = bpdiast > 90 if !missing(bpdiast)
    tab highbp highbpdia
    
    bysort highbp: tab highbpsys highbpdia
    
    /******************************/
    /*stack the data in person-period format*/
    /******************************/
    
    count
    expand 2, generate(measurement)
    count
    
    generate outcome=highbpsys if measurement==0
    replace outcome=highbpdia if measurement==1
    
    /******************************/
    /*population averaged models*/
    /******************************/
    
    *SVY WITH BOOTSTRAP
    svyset [pw=finalwgt], brrweight(brr_1-brr_32) vce(brr) mse
    svy: logistic outcome height weight age female
    
    *SVY WITH CLUSTER BUT NO DESIGN (LINEARIZED)
    svyset sampl [pw=finalwgt]
    svy: logistic outcome height weight age female
    
    *NON-SURVEY LOGISTIC WITH CLUSTER
    logistic outcome height weight age female [pw=finalwgt], vce(cluster sampl)
    
    *GEE
    xtset sampl
    xtgee outcome height weight age female [pw=finalwgt], family(binomial) link(logit) corr(independent) eform
    xtgee outcome height weight age female [pw=finalwgt], family(binomial) link(logit) corr(exchangeable) eform
    
    *GEE with wrapper
    
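    capture program drop geebootstrap
    program geebootstrap, eclass
        version 13
        * accept the wrapped command line plus an optional if-condition and weights
        syntax anything [if] [iw pw]
        * rebuild the weight expression so each replicate weight is passed through to xtgee
        if "`weight'" != "" {
            local wgtexp "[`weight' `exp']"
        }
        set buildfvinfo on
        `anything' `if' `wgtexp', family(binomial) link(logit) corr(independent) eform
    end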
    
    local mycmdline xtgee outcome height weight age female
    svyset [pw=finalwgt], brrweight(brr_1-brr_32) vce(brr) mse
    svy brr _b, eform: geebootstrap `mycmdline'
    
    /******************************/
    *COMPARE ESTIMATES
    /******************************/
    
    *                                                     | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    *SVY WITH BOOTSTRAP:                           height |   .9674585   .0047008    -6.81   0.000     .9579186    .9770935
    *SVY WITH CLUSTER BUT NO DESIGN (LINEARIZED):  height |   .9674585   .0042054    -7.61   0.000     .9592502    .9757371
    *NON-SURVEY LOGISTIC WITH CLUSTER:             height |   .9674585   .0042054    -7.61   0.000     .9592512    .9757361
    *GEE INDEPENDENT:                              height |   .9674585   .0042054    -7.61   0.000     .9592512    .9757361
    *GEE EXCHANGEABLE:                             height |   .9674585   .0042054    -7.61   0.000     .9592512    .9757361
    *GEE INDEPENDENT WITH BOOTSTRAP:               height |   .9674585   .0047008    -6.81   0.000     .9579186    .9770935
    
    *The point estimates are exactly the same across models, but SEs are different. Which are the best SEs to use (0.0047008 versus 0.0042054)?
    
    *Also note that the SEs are equivalent when specifying clustering in the non-GEE models (0.0042054).
    
    *Also note that the SEs when using bootstrap (0.0047008) are the exact same in the SVY LOGISTIC model and the GEE model with bootstrap wrapper. I suppose this is because the point estimates are the same from each model and the bootstrap is working off of the point estimates - thus the "extra" covariance accounted for by the GEE model does not carry forward when using the bootstrap weights? Do I need to "bootstrap" the GEE standard error value itself?
    Last edited by Mischa Perrier; 20 Jun 2018, 00:50.

  • #2
    Welcome to Statalist, Mischa!

    As you can see, the two BRR calculations are identical; they are the only correct ones. All within-replicate variation (not just within-person variation) is accounted for in the between-replicate variation. However, the same is true for non-replicate survey designs: for "replicate", substitute "primary sampling unit".
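
    One way to check this, sketched here under the assumption that the BRR weights in nhanes2brr are used as in the original post, is to fit the same mean model with two different estimation commands. If the replicate variance depends only on the replicated point estimates, the two standard errors should agree up to numerical tolerance. The scalar names se_logit and se_glm are illustrative.

    ```stata
    webuse nhanes2brr, clear
    generate byte highbpsys = bpsystol > 140 if !missing(bpsystol)
    svyset [pw=finalwgt], brrweight(brr_1-brr_32) vce(brr) mse

    * same mean model, two different estimation commands
    quietly svy: logit highbpsys height weight age female
    scalar se_logit = _se[height]

    quietly svy: glm highbpsys height weight age female, family(binomial) link(logit)
    scalar se_glm = _se[height]

    * both SEs are built from the same 32 replicated coefficients
    display "svy: logit SE = " scalar(se_logit) "   svy: glm SE = " scalar(se_glm)
    ```

    The same reasoning applies to the GEE wrapper with an independent working correlation, whose point estimates coincide with those of the weighted logistic model in every replicate.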

    In future posts, please don't show extensive text in italics. I find the quote from SAS practically unreadable.
    Last edited by Steve Samuels; 20 Jun 2018, 23:09.
    Steve Samuels
    Statistical Consulting

    Stata 14.2


    • #3
      Steve, thank you for replying to my post. This is the answer I was looking for!

      I'm having some difficulty finding any references on this beyond the very technical papers describing the early bootstrap methodology. Would you happen to know of any accessible papers that I could use to support this? Thank you again. And I will remember to turn off italics in my next post!


      • #4
        See perhaps Stas Kolenikov's Stata Journal article or Wolter's book:

        Wolter, K. M. (2007). Introduction to Variance Estimation (2nd ed.). Statistics for Social and Behavioral Sciences. New York: Springer.


        Steve Samuels
        Statistical Consulting

        Stata 14.2
