analyzing multiple subsamples

Hiam Saghir

Join Date: Dec 2015

Posts: 17
#1

analyzing multiple subsamples

28 Dec 2015, 08:49

I have a big dataset, and I wish to create say 1000 samples with replacement (multiple bsample (s)), give each sample an id and then analyze each of the samples separately (like in a loop) and save the result for each of these iterations. While I have figured the analysis part, I am still struggling with creating these multiple subsamples with replacement. I will really appreciate any advice on this.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#2

28 Dec 2015, 09:03

It sounds like you are trying to carry out some analysis and bootstrap it with 1000 reps. If that is correct, you don't have to do the sampling and looping yourself. the -bootstrap- command will do it for you. See the online help for the -bootstrap- command. It may reduce to something as simple as:

Code:

bootstrap, saving(myresults, replace) reps(1000): your analysis command here

If that isn't what you need, post back with more details.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

28 Dec 2015, 09:04

It sounds to me like bootstrap might be of assistance, see help bootstrap for more details.

Added in edit: Clyde is clearly one step ahead of me this morning.
Comment
Hiam Saghir

Join Date: Dec 2015

Posts: 17
#4

28 Dec 2015, 10:20

Thank you Clyde and William.
So I want to get the prevalence of HIV positive individuals by age (9 age groups) after adjusting for weight. I tried to use the svy bootstrap but I struggled with the exp_list. I could not figure out how to get the stored row proportions so I ended up doing the following:

svyset [pweight=weight]
svy: tab age HIV, row
estpost svy: tab age HIV, row
esttab, cell("b(f(9))") varlabels(`e(labels)') eqlabels(`e(eqlabels)')
matrix A= e(row)
matrix list A
matrix x=A[1, 11..19]
putexcel A1=matrix(x) using booknew, sheet("sheet1")
end

As you Clyde have mentioned I need to repeat this 1000 times using say samples of size 500 each
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#5

28 Dec 2015, 10:50

So, what you need to do is wrap your command in a short program, because -bootstrap- will not work with an explicit mention of svy:.

Code:

capture program drop myprogram program define myprogram svy:tab age HIV exit end bootstrap, saving(myresults, replace) reps(1000) size(500): myprogram use myresults, clear

The program myprogram stores the results of svy:tab in e(b), and bootstrap automatically picks that up without your having to specify anything. You may want to rename the variables that -bootstrap- gives to something more mnemonic. Note that in most applications of bootstrap, the sample size is the same as the size of the entire data set. If you want to override that to specifically use samples of 500, the above code does that. For the more conventional approach, just omit the -size(500)- option.
1 like
Comment
Hiam Saghir

Join Date: Dec 2015

Posts: 17
#6

28 Dec 2015, 11:36

Much much appreciated Clyde! But it seems that bootstrap even ignores svy within my program and calculates its own "percentages". So "svy: tab age HIV, row" and "svy: tab age HIV, col" and svy: tab age HIV" all return the same results which I guess here are column percent
.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

28 Dec 2015, 11:48

Removed after #8 posted, mine was indeed a naive solution.

Last edited by William Lisowski; 28 Dec 2015, 12:31.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#8

28 Dec 2015, 12:14

Sorry, I should have been clearer. -boostrap- picks up the entries in e(b), which are in all cases the cell proportions. The -row- and -col- options of svy: tab affect what is displayed on the screen, but not what is stored in e(b). So you need to calculate those row percentages yourself. This can be done in the myresults file, but it is probably easier to do it in the program and then pick up the returned values in the bootstrap command. I don't know how your HIV variable is coded. But let's assume it is 0 for no and 1 for yes. Then _b[p11] will be the cell proportion for age group 1, HIV neg, _b[p12] will be the cell proportion for age group 1 HIV positive. So you can do this:

Code:

capture program drop myprogram program define myprogram, rclass svy:tab age HIV forvalues i = 1/9 { return scalar prev`i' = _b[p`i'2]/(_b[p`i'2] + _b[p`i'1]) } exit end bootstrap r(prev1) r(prev2) r(prev3) r(prev4) r(prev5) /// r(prev6) r(prev7) r(prev8) r(prev9), /// saving(myresults, replace) reps(1000) size(500): myprogram use myresults, clear // AND IF YOU WOULD LIKE BETTER VARIABLE NAMES IN MY RESULTS rename _bs_* prev*
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

28 Dec 2015, 15:10

1. Your original svyset statement surprises me, because it omits any reference to strata or primary sampling units. Were there none? Please describe the population and the sampling design. What are the population and sample sizes? Were any post-stratification adjustments applied?

2. Clyde's program will give good results with non-survey, non-weighted, data. However with weighted survey data, it can yield biased estimates. The reason: the survey weights in each bootstrap replicate will not add to the population totals (and post-stratified weights would not add to control totals). Standard errors will also be biased for some designs (Kolenikov, 2010, p. 178).

The proper approach is to construct bootstrap replicates with specially constructed weights. To do this, use Stas Kolenikov's bsweights program (Kolenikov, 2010), which you can locate by typing.

Code:

findit bsweights

After creating the replicates, write a svyset statement appropriate for replicate weights; follow by svy: tabulate.

3. bsweights is not simple. If you are not very familiar with survey replication methods, I suggest that you don't bootstrap at all. Use a svyset statement that reflects the original design, then depend on Stata's default linearization method to estimate standard errors.

Knowing more about your design and study might lead me to modify these suggestions, so please give details.

Reference:

Kolenikov, Stanislav. 2010. Resampling variance estimation for complex survey data. Stata Journal 10, no. 2: 165-199.
http://www.stata-journal.com/article.html?article=st0187

Last edited by Steve Samuels; 28 Dec 2015, 15:41.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#10

28 Dec 2015, 15:21

Steve Samuels

Thanks for this post. I was not aware of this difficulty, but what you say makes perfect sense. I suppose this is (at least part of) the reason that -bootstrap- does not support use of the -svy:- prefix.
Comment
Hiam Saghir

Join Date: Dec 2015

Posts: 17
#11

29 Dec 2015, 01:48

Steve Samuels

Many thanks for your very informative post.
I am using Demographic and Health Survey data: http://www.dhsprogram.com/publicatio...nd-manuals.cfm
These are nationally representative household-based cross-sectional surveys, and you are right the svyset command is actually as follows: svyset psu [pw=weight], strata(sample_domain).
I can actually avoid the whole bootstrapping thing if I can figure out how to repeat the following script 1000 times and saving the result in one excel sheet

bsample 500
svyset psu [pw=weight], strata(sample_domain)
svy: tab age HIV, row
estpost svy: tab age HIV, row
esttab, cell("b(f(9))") varlabels(`e(labels)') eqlabels(`e(eqlabels)')
matrix A= e(row)
matrix list A
matrix x=A[1, 11..19]
putexcel A1=matrix(x) using booknew, sheet("sheet1")
end

However, I will also look into the replicate weight as using the svyset would be easier to execute.

Hiam
Comment
Hiam Saghir

Join Date: Dec 2015

Posts: 17
#12

29 Dec 2015, 04:56

Latest update:

I have been trying different things. I first tried to create replicates and I have used the following script:

bsweights wgtnew, n(3) reps(60) balanced dots replace
bs4rw, rweights(wgtnew*): tab age ob03a, row [pw=finalwgt]

The number of stratas is 6 and the number of PSUs is 12008
This did not take me far as I received a warning that the first-order balance was not achieved

I also tried the following:

forvalues i=1/1000 {
use test
bsample 1000
svyset psu [pw=ob05], strata(pv023)
svy: tab age ob03a, row
estpost svy: tab age ob03a, row
esttab, cell("b(f(9)) lb ub") varlabels(`e(labels)') eqlabels(`e(eqlabels)')
matrix A= e(row)
matrix list A
matrix x=A[1, 11..19]
putexcel A'i'=matrix(x) using booknew, sheet("sheet1")
}

Here STATA refused to reopen the dataset since "the data in memory will be lost"
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#13

29 Dec 2015, 06:58

Thanks for the additional information, but you didn't answer all my questions about your study. The number 12,008 looks more like the number of respondents in a DHS survey, rather than the number of PSUs, which are area units. See some good advice from Emma Slaymaker at: http://www.stata.com/statalist/archi.../msg00906.html and from Stas Kolewnikov at: http://www.stata.com/statalist/archi.../msg00877.html

As I said, bweights is not a simple command. Study again the description of the n() option.

The goal of your analysis is get good estimates of prevalence and good standard errors. The default linearization method will do both. Your bsample attempt, like Clyde's program, will do neither: it will not produce good prevalence estimates because the sampled data sets do not respect the weight structure of the original sample. The standard errors will be completely wrong (much too small), because you resample individuals, whereas the correct unit for resampling is the PSU.

Here is code that will give good estimates of prevalences and standard errors. Any resampling results that you generate will have to be close to these.

Code:

svyset psu [pw=ob05], strata(pv023) svy: tab age ob03a, row se

Last edited by Steve Samuels; 29 Dec 2015, 07:50.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Hiam Saghir

Join Date: Dec 2015

Posts: 17
#14

30 Dec 2015, 01:15

Many thanks for your reply. The number of psus is 1267 and the number of strata is 6
So now I have the following code:

svyset psu [pw=weight], strata(pv023)
bsweights wgtnew, n(-1) reps(100) dots replace
svyset [pw=weight], vce(bootstrap) bsrweight(wgtnew*)
svy:tab age ob03a, row

How do I get the row percentages generated from each individual iteration?
Comment
Hiam Saghir

Join Date: Dec 2015

Posts: 17
#15

30 Dec 2015, 01:25

I can also use the following but I need to input an expression list after bs4rw

bs4rw, rweights(wgtnew*) saving(myresults, replace): tab age ob03a [iw=ob05], row
Comment

Announcement

analyzing multiple subsamples

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment