Bootstrap percentile confidence interval for survey data?

helenweiss

Join Date: May 2014

Posts: 2
#1

15 May 2014, 06:44

We are using bootstrap variance estimation for survey data with a small number of primary sampling units (n=4), and would like to use the bootstrap percentiles to derive the confidence intervals rather than the normal approximation, which seems to be the default. We can't see a way to do that easily - any suggestions?

We are using the rhsbsample command to generate our bootstrap replication weights, using repeated half-sample bootstrap sampling, and then specifying 'bootstrap' as the variance estimation method in our 'svyset' statement.
P Van Kerm

Join Date: Apr 2014

Posts: 10
#2

16 May 2014, 05:15

First off, a bit of... aheum... self-promotion: rhsbsample is available from the SSC archive (ssc describe rhsbsample) and was described at the 2013 UKSUG meeting (see http://ideas.repec.org/p/boc/usug13/10.html).

There does not seem to be an immediate way to get Stata to show you the percentile CI after -svy bootstrap-, but this mock example should get you there:

Code:

sysuse auto
svyset [pw=weight] , bsrweight(mpg weight) // bsrweight() should contain your rhsbsample-generated replication weights
svy bootstrap , saving(test , replace) mse : regress price headroom
mat b=e(b)
bstat using test , stat(b) mse
mat list e(ci_percentile)
estat bootstrap , all

The alternative of course is to do all calculations and combinations 'by hand'---whether it is easier probably depends on the size and complexity of your estimations.
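
For what it's worth, a minimal sketch of the 'by hand' route for a single coefficient might look like the following. It assumes the replicates were saved to test.dta as in the mock example, and that the headroom replicate is stored under the name _b_headroom (an assumption; -describe- will show the actual names). The percentile definition used by _pctile may differ slightly from the one bstat uses.

```stata
* Sketch only: percentile CI computed directly from the saved replicates.
* Assumes test.dta was written by -svy bootstrap, saving(test, replace)-
* and that the replicate variable is named _b_headroom (check with -describe-).
preserve
use test, clear
describe
_pctile _b_headroom, percentiles(2.5 97.5)
display "95% percentile CI: [" r(r1) ", " r(r2) "]"
restore
```

For many parameters or several estimation commands, the bstat route is probably less work.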

Philippe

Last edited by P Van Kerm; 16 May 2014, 05:25.
helenweiss

Join Date: May 2014

Posts: 2
#3

23 May 2014, 04:06

Thanks so much Philippe - that works fine. Much appreciated.

Helen
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#4

23 May 2014, 16:48

Philippe: If I read your presentation correctly, the rhsbsample command takes repeated samples of size N/2 from N PSUs. Helen has N=4, so the number of possible distinct samples is 6. Is this enough to get good bootstrap standard errors, let alone percentile CIs?
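
The count of 6 can be checked directly, assuming each replicate is built from a half-sample of 2 of the 4 PSUs drawn without replacement within a single stratum:

```latex
\[
\binom{4}{2} = \frac{4!}{2!\,2!} = 6
\]
```

Under that assumption, the bootstrap distribution of any statistic has at most six distinct values, so the 2.5th and 97.5th percentiles can do little more than pick out its smallest and largest values.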

Last edited by Steve Samuels; 23 May 2014, 16:51.

Steve Samuels
Statistical Consulting

Stata 14.2
Alfonso Sánchez-Peñalver

Join Date: Mar 2014

Posts: 432
#5

24 May 2014, 08:03

Steve, is the problem that rhsbsample uses N/2 as the size for the PSUs, or is the problem that Helen has only 4 observations and is trying to bootstrap? I believe the problem is the latter, and that the PSU size set by rhsbsample is not a problem in this case, but your message seems to indicate that the problem is with the command.

Alfonso Sanchez-Penalver
P Van Kerm

Join Date: Apr 2014

Posts: 10
#6

25 May 2014, 04:20

Good point. If Helen indeed has only one stratum (and 4 PSUs), then, yes, this is quite a peculiar setting. In my recollection of Saigo et al. (Survey Methodology 2001), which rhsbsample implements, there is no condition on the number of strata for the variance estimates to be OK (the condition is on the first-stage sample size). That said, if there is just one stratum, with 6 replications, Helen would have a complete enumeration of possible 'repeated half-samples', so the percentile bootstrap CI is likely to be quite problematic.

Philippe
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

25 May 2014, 17:52

Philippe: I'm not very familiar with the theory in this area, but the little reading I've done (Hall, 2014, p. 81) for non-survey data agrees with your assessment.

Alfonso, the problem is not the N/2, because the method will work fine with larger numbers of PSUs in one stratum or larger numbers of strata. In addition to the inaccurate percentile confidence intervals, the N = 4 PSUs is problematic also because it provides only three degrees of freedom. Korn and Graubard (1999, p. 193) discuss some work-arounds for such situations.

Saigo et al. (2001) proposed repeated half-sample bootstrap sampling specifically to handle randomly imputed data.
Helen, if you need it for that purpose, then I don't see a good choice for you. If you don't have imputed data, then perhaps you can get by with an ordinary linearized (non-bootstrap) standard error and (wide) t-confidence intervals.
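
As a rough sketch of that linearized alternative (psuid, wt and y are hypothetical variable names, not taken from Helen's data), the setup might look like this:

```stata
* Sketch only: psuid, wt and y are hypothetical variable names.
* Declare a one-stratum design with linearized (Taylor) variance estimation.
svyset psuid [pweight = wt], vce(linearized)

* The reported CI is t-based, using the design degrees of freedom.
svy: mean y

* Design degrees of freedom (number of PSUs minus number of strata).
display "design df = " e(df_r)
```

With 4 PSUs in a single stratum the design degrees of freedom should come out as 3, which is why the t-intervals will be wide.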

References:

Hall, Peter. 2014. Methodology and Theory for the Bootstrap, lecture notes. Available at: http://anson.ucdavis.edu/~peterh/sta...to-may-16.pdf.

Korn, Edward Lee, and Barry I Graubard. 1999. Analysis of health surveys. New York: Wiley.

Saigo, Hiroshi, Jun Shao, and Randy R Sitter. 2001. A repeated half-sample bootstrap and balanced repeated replications for randomly imputed data. Survey Methodology 27, no. 2: 189-196. available at: http://www.statcan.gc.ca/ads-annonce...x/6095-eng.pdf

Last edited by Steve Samuels; 25 May 2014, 18:20.

Steve Samuels

Join Date: Mar 2014

Posts: 1786
#8

26 May 2014, 08:56

I've read the paper by Saigo et al. (2001) more carefully. If Helen has missing data and randomly re-imputes for each bootstrap replication as in the Saigo et al. article, she can use rhsbsample, though I think that any CIs will still be quite inaccurate, especially for nonlinear estimates. There will be more than 6 distinct replicate values for each estimated parameter, located in six clusters, one for each distinct bootstrap sample. As Saigo et al. show (section 4, p. 192), Helen would need to use the average of the replication estimates (in e(b_bs)), not the usual reported estimates (e(b)). Saigo et al. show that their technique gives good results for n as small as 2 in each stratum. But in that case, the technique benefits from averaging over many strata (32 in their simulations).

Last edited by Steve Samuels; 26 May 2014, 09:04.

joelmfrancis

Join Date: Mar 2014

Posts: 3
#9

27 May 2014, 10:36

Thanks all. Would this work in a situation with 4 strata and 2 primary sampling units in each stratum? Joel
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#10

27 May 2014, 16:29

I don't know. I suggest that you read section 5.2 of the Korn and Graubard reference. With L = 4 strata and \(n_h\) = 2 PSUs in each, the nominal design degrees of freedom is 4 for single parameter problems, but the effective degrees of freedom could be less.
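
The nominal figure follows from the usual rule that the design degrees of freedom equal the number of PSUs minus the number of strata:

```latex
\[
\text{df} = \sum_{h=1}^{L} n_h - L = 4 \times 2 - 4 = 4
\]
```

The same rule gives 4 - 1 = 3 for Helen's single stratum with 4 PSUs.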
