Incorporating survey bootstrap replicate weights in repeated measures design

Mischa Perrier

Join Date: Jun 2018
Posts: 2


20 Jun 2018, 00:21

Hi,

Question for Statalist. Please forgive the longish question, but hopefully this is an interesting problem to solve!

How is clustering of repeated measures handled when using bootstrap replicate weights?

Say I am interested in analysing a dataset with repeated measures and I would like to obtain a population-averaged model. But say there are only replicate weights in the survey data that must be used to estimate the variance.

I understand that each replicate weight is used to produce a replicate of the point estimates (with 50 replicate weights, there would be 50 replicated sets of point estimates), and that the variability of these replicates is then used to estimate the variance. But from my reading, I get the impression that bootstrap replicate weights are enough to account for any individual-level clustering through an "ultimate cluster" assumption, where any variability or clustering nested within the PSU is rolled up and captured in the composite variance estimate.
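
In formula terms, the idea can be sketched as follows. This assumes plain BRR weights with Stata's mse option and no Fay adjustment; the symbols are introduced here for illustration only. With R replicate weights, the model is refit once per replicate to give \hat\theta_{(r)}, and the variance is estimated from the spread of those refits around the full-sample estimate \hat\theta:

```latex
\widehat{V}(\hat\theta) \;=\; \frac{1}{R} \sum_{r=1}^{R} \left( \hat\theta_{(r)} - \hat\theta \right)^{2}
```

Without mse, the deviations are taken from the mean of the replicate estimates instead, and other replicate methods use a similar sum of squares with different multipliers. Nothing in this expression refers to PSUs or persons directly; the design enters only through how the replicate weights were constructed.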

For example, on page 9 of the following paper (from SAS no less): http://support.sas.com/resources/pap.../0767-2017.pdf
"Again, it bears repeating that we can carry out this hypothesis test using output from PROC SURVEYMEANS because data from the two years (i.e., the two domains) do not emanate from one or more of the same PSUs—granted, we could have used PROC SURVEYFREQ to get the same figures(see Program 4.1 of Lewis (2016)). If this is not the case, then the standard error of the difference will generally contain a non-zero covariance, which PROC SURVEYMEANS (or PROC SURVEYFREQ) is not designed to estimate. Because the covariance is often positive, thinking back to the formulas presented in Section 4.1, it is advantageous to account for it, because it can serve to reduce the estimated standard error of the difference. One way to implicitly account for the covariance is via the general methodology shown in Section 4.4*. In fact, the methodology works for differences in any kind of parameter, including means and totals"

[*Where Section 4.4 describes the bootstrap methodology]

So my question is, what "happens" to the within-person clustering when bootstrap replicates are used? Is within-person clustering implicitly accounted for when I use bootstrap replicate weights? Or, do I need to use both bootstrap weights (to account for complex survey design) AND account for within-person correlation (through the use of GEE or by adding a vce(cluster) option)? In my example below, which standard errors would you tend to go with?

Code:

/******************************/
/*load data*/
/******************************/

webuse nhanes2brr, clear

/******************************/
/*generate dummy outcomes - pretend these are repeated measures*/
/******************************/

*high systolic BP (missing readings stay missing rather than being coded 1)
generate highbpsys = bpsystol > 140 if !missing(bpsystol)
tab highbp highbpsys

*high diastolic BP (missing readings stay missing rather than being coded 1)
generate highbpdia = bpdiast > 90 if !missing(bpdiast)
tab highbp highbpdia

bysort highbp: tab highbpsys highbpdia

/******************************/
/*stack the data in person-period format*/
/******************************/

count
expand 2, generate(measurement)
count

generate outcome=highbpsys if measurement==0
replace outcome=highbpdia if measurement==1

/******************************/
/*population averaged models*/
/******************************/

*SVY WITH BOOTSTRAP (this example dataset ships BRR replicate weights, used here as a stand-in for bootstrap weights)
svyset [pw=finalwgt], brrweight(brr_1-brr_32) vce(brr) mse
svy: logistic outcome height weight age female

*SVY WITH CLUSTER BUT NO DESIGN (LINEARIZED)
svyset sampl [pw=finalwgt]
svy: logistic outcome height weight age female

*NON-SURVEY LOGISTIC WITH CLUSTER
logistic outcome height weight age female [pw=finalwgt], vce(cluster sampl)

*GEE
xtset sampl
xtgee outcome height weight age female [pw=finalwgt], family(binomial) link(logit) corr(independent) eform
xtgee outcome height weight age female [pw=finalwgt], family(binomial) link(logit) corr(exchangeable) eform

*GEE with wrapper

capture program drop geebootstrap
program geebootstrap, eclass
    version 13
    * wrapper so that svy brr can pass each replicate weight through to xtgee
    syntax anything [if] [iw pw]
    if "`weight'" != "" {
        local wgtexp "[`weight' `exp']"
    }
    set buildfvinfo on
    `anything' `if' `wgtexp', family(binomial) link(logit) corr(independent) eform
end

local mycmdline xtgee outcome height weight age female
svyset [pw=finalwgt], brrweight(brr_1-brr_32) vce(brr) mse
svy brr _b, eform: geebootstrap `mycmdline'

/******************************/
/*compare estimates*/
/******************************/

*                                                     | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
*SVY WITH BOOTSTRAP:                           height |   .9674585   .0047008    -6.81   0.000     .9579186    .9770935
*SVY WITH CLUSTER BUT NO DESIGN (LINEARIZED):  height |   .9674585   .0042054    -7.61   0.000     .9592502    .9757371
*NON-SURVEY LOGISTIC WITH CLUSTER:             height |   .9674585   .0042054    -7.61   0.000     .9592512    .9757361
*GEE INDEPENDENT:                              height |   .9674585   .0042054    -7.61   0.000     .9592512    .9757361
*GEE EXCHANGEABLE:                             height |   .9674585   .0042054    -7.61   0.000     .9592512    .9757361
*GEE INDEPENDENT WITH BOOTSTRAP:               height |   .9674585   .0047008    -6.81   0.000     .9579186    .9770935

*The point estimates are exactly the same across models, but SEs are different. Which are the best SEs to use (0.0047008 versus 0.0042054)?

*Also note that the SEs are equivalent when specifying clustering in the non-GEE models (0.0042054).

*Also note that the bootstrap SEs (0.0047008) are exactly the same in the SVY LOGISTIC model and in the GEE model with the bootstrap wrapper. I suppose this is because the point estimates are the same in each model and the bootstrap works off the point estimates, so the "extra" covariance accounted for by the GEE model does not carry forward when the bootstrap weights are used? Do I need to "bootstrap" the GEE standard error value itself?

Last edited by Mischa Perrier; 20 Jun 2018, 00:50.


Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

20 Jun 2018, 23:05

Welcome to Statalist, Mischa!

As you can see, the two brr calculations are identical; they are the only correct ones. All within-replicate variation (not just within-person variation) is accounted for in between-replicate variation. However, that is also true for non-replicate survey designs: for "replicate", substitute "primary sampling unit".

In future posts, please don't show extensive text in italics. I find the quote from SAS practically unreadable.
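
One way to see why the two brr results coincide is to rebuild the replicate-based standard error by hand. The following is only a sketch. It assumes that the stacked data from post #1 are in memory, that brr_1-brr_32 are plain BRR weights (no Fay adjustment), and that the mse form of the variance applies. The scalar names b_full and ss are made up for illustration.

```stata
* full-sample point estimate (log odds) for height
quietly logit outcome height weight age female [pw=finalwgt]
scalar b_full = _b[height]

* refit once per replicate weight and accumulate squared deviations
scalar ss = 0
forvalues r = 1/32 {
    quietly logit outcome height weight age female [pw=brr_`r']
    scalar ss = ss + (_b[height] - b_full)^2
}

* BRR (mse) standard error on the log-odds scale, then on the
* odds-ratio scale via the delta method
display "SE of coefficient: " sqrt(ss/32)
display "SE of odds ratio:  " exp(b_full)*sqrt(ss/32)
```

Each pass through the loop refits only the point estimate under a different replicate weight. A working correlation structure or a vce(cluster) option inside the replicated command therefore has no effect on the replicate-based standard error, unless it changes the point estimate itself.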

Last edited by Steve Samuels; 20 Jun 2018, 23:09.

Steve Samuels
Statistical Consulting

Stata 14.2
Mischa Perrier

Join Date: Jun 2018

Posts: 2
#3

21 Jun 2018, 02:41

Steve, thank you for replying to my post. This is the answer I was looking for!

I'm having some difficulty finding any references on this beyond the very technical papers describing the early bootstrap methodology. Would you happen to know of any accessible papers that I could use to support this? Thank you again. And I will remember to turn off italics in my next post!
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#4

21 Jun 2018, 12:26

See perhaps Stas Kolenikov's Stata Journal article or Wolter's book:

Wolter, K. M. (2007). Introduction to Variance Estimation (2nd ed.). Statistics for Social and Behavioral Sciences. New York: Springer.
