Avoid Collapsing for adjusted prevalence calculation AND add bootstrap for CIs (all of my colleagues stumped!)

Beck Willis

Join Date: Aug 2014

Posts: 3
#1

06 Aug 2014, 09:50

Hello everyone. This is my first post in the Stata forums, although I'm a regular at Stack Overflow. My question is how to bootstrap an adjusted prevalence calculation. The age-adjusted prevalence is calculated at the cluster level, and the prevalence for an Evaluation Unit (EU) is then the average of the adjusted prevalences of its clusters. I can perform the prevalence calculation with a series of collapse commands, but collapsing discards the resident-level data that I would need in order to bootstrap the calculation and get CIs. Here is the do-file I use for the prevalence calculations (FYI: the age adjustment uses manually entered weights; I don't think there is any way to avoid that bit of manual work):

insheet using "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\YEMEN_CLEAN_lessthan00081.csv", comma
drop if examined!=1
gen tf_any=.
replace tf_any=1 if left_eye_tf=="1" | right_eye_tf=="1"
gen tt_any=.
replace tt_any=1 if left_eye_tt=="1" | right_eye_tt=="1"
gen str5 eu_s = string(eu, "%05.0f")
gen str3 cluster_s = string(cluster, "%03.0f")
gen str2 age_s = string(age, "%02.0f")
gen group= eu_s + cluster_s + age_s
gen res_dum=.
replace res_dum=1 if !missing(instance_id_res)
gen tf_dum=.
replace tf_dum=1 if tf_any==1
gen tt_dum=.
replace tt_dum=1 if tt_any==1
save "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\Yemen_uncollapsed_20140802.dta",replace
collapse (sum) num_1_9=res_dum tf_1_9=tf_dum if age<10 & age>=1, by(group)
gen tf_prev= tf_1_9/num_1_9
gen age_s=substr(group, 9, 2)
destring age_s, generate(age)
drop age_s
gen age_weight=.
replace age_weight=0.130443886097152 if age==1
replace age_weight=0.125628140703518 if age==2
replace age_weight=0.118058239917536 if age==3
replace age_weight=0.113870635227419 if age==4
replace age_weight=0.110005153975003 if age==5
replace age_weight=0.106123566550702 if age==6
replace age_weight=0.102692951939183 if age==7
replace age_weight=0.098779152171112 if age==8
replace age_weight=0.0943982734183739 if age==9
gen tf_adj_prev= tf_prev* age_weight
save "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\yemen_groupcollapsed_tfprev_20140802.dta"
gen eucluster=substr(group,1,8)
collapse (sum) cluster_tf_prev=tf_adj_prev, by(eucluster)
gen eu=substr(eucluster,1,5)
collapse eu_tf_prev=cluster_tf_prev, by(eu)
save "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\Yemen_EU_Adjprev_20140802.dta"

So, to summarize:

1. I need code to perform the calculations without collapsing
2. Next, I need help in choosing the correct bootstrap command and putting it in the correct place.

I have asked several colleagues for assistance in this and everyone is either too busy to allow time to wrap their brains around it or just stumped. Please consider this your challenge for the day and help me! :-) OH- and I'm using Stata 10.1.
Tags: bootstrap, cluster, collapse, weights

Nick Cox

Join Date: Mar 2014
Posts: 33586

06 Aug 2014, 11:27

Please use your full real name on Statalist. This is explained in the FAQ Advice.

Your question is some distance from my usual territory. If no one replies, please give serious thought to rewriting it to use public data and avoiding irrelevant distractions, i.e. stuff intrinsic to your data but not central to your question. Although not tested, the version below is a bit shorter given some cosmetic suggestions. You are creating dummy (indicator) variables as 1 or missing, by the way.

Code:

 
insheet using "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\YEMEN_CLEAN_lessthan00081.csv", comma
drop if examined!=1

gen tf_any = cond(left_eye_tf=="1" | right_eye_tf=="1", 1, .) 
gen tt_any = cond(left_eye_tt=="1" | right_eye_tt=="1", 1, .) 

gen str5 eu_s = string(eu, "%05.0f")
gen str3 cluster_s = string(cluster, "%03.0f")
gen str2 age_s = string(age, "%02.0f")
gen group= eu_s + cluster_s + age_s

gen res_dum= cond(!missing(instance_id_res), 1, .) 
gen tf_dum = tf_any 
gen tt_dum = tt_any

save "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\Yemen_uncollapsed_20140802.dta",replace

collapse (sum) num_1_9=res_dum tf_1_9=tf_dum if age<10 & age>=1, by(group)
gen tf_prev= tf_1_9/num_1_9
gen age_s=substr(group, 9, 2)
destring age_s, generate(age)
drop age_s

gen age_weight=0.130443886097152 if age==1
replace age_weight=0.125628140703518 if age==2
replace age_weight=0.118058239917536 if age==3
replace age_weight=0.113870635227419 if age==4
replace age_weight=0.110005153975003 if age==5
replace age_weight=0.106123566550702 if age==6
replace age_weight=0.102692951939183 if age==7
replace age_weight=0.098779152171112 if age==8
replace age_weight=0.0943982734183739 if age==9

gen tf_adj_prev= tf_prev* age_weight
save "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\yemen_groupcollapsed_tfprev_20140802.dta"

gen eucluster=substr(group,1,8)
collapse (sum) cluster_tf_prev=tf_adj_prev, by(eucluster)
gen eu=substr(eucluster,1,5)
collapse eu_tf_prev=cluster_tf_prev, by(eu)
save "C:\Users\rwillis\Google Drive\Data Analysis\Stata_Practice_201407\Yemen_EU_Adjprev_20140802.dta"
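
On the first part of the question (doing the arithmetic without collapsing), the following is a minimal, untested sketch rather than a definitive solution. It keeps one row per resident and attaches the cell, cluster, and EU prevalences as new variables using egen. The names tf01, cell_tag, and clus_tag are made up for illustration. It assumes that every examined child has a non-missing instance_id_res, so that the mean of a 0/1 indicator equals tf_1_9/num_1_9 from the collapse version. It only reproduces the collapse-based numbers; it says nothing about whether that estimator is the right one for the design.

```stata
* Untested sketch. Assumes the uncollapsed resident-level file is in memory,
* with numeric eu, cluster, and age, and string left_eye_tf / right_eye_tf.
keep if inrange(age, 1, 9)

* 0/1 indicator (not 1/missing), so its mean within a cell is the cell prevalence
gen byte tf01 = (left_eye_tf == "1" | right_eye_tf == "1")
bysort eu cluster age: egen double tf_prev = mean(tf01)

* age weights copied from the do-file above
gen double age_weight = .
local w 0.130443886097152 0.125628140703518 0.118058239917536 ///
        0.113870635227419 0.110005153975003 0.106123566550702 ///
        0.102692951939183 0.098779152171112 0.0943982734183739
forvalues a = 1/9 {
    local wa : word `a' of `w'
    replace age_weight = `wa' if age == `a'
}

* cluster prevalence: sum of weight x age-specific prevalence, one term per age cell
egen byte cell_tag = tag(eu cluster age)
bysort eu cluster: egen double cluster_tf_prev = total(cell_tag * tf_prev * age_weight)

* EU prevalence: unweighted mean of cluster prevalences, one term per cluster
egen byte clus_tag = tag(eu cluster)
bysort eu: egen double eu_tf_prev = mean(cond(clus_tag, cluster_tf_prev, .))
```

The tag() variables mark one row per cell or per cluster, so each cell or cluster contributes exactly once to the sum or mean, as it does after a collapse.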


Steve Samuels

Join Date: Mar 2014

Posts: 1785
#3

09 Aug 2014, 21:45

I don't think this is a coding problem, but rather a conceptual problem. To know what to advise, I need details of the design. There must have been sampling or selection of some sort, so please describe it. What are the "clusters"? How many are there in the EU and in the study sample? What is the range of their sizes? Please accompany your response with your full real name, as Nick requested. (Request a change to your user name with the "contact us" button at the bottom right.) This is long-standing Statalist etiquette, and I consider the practice so important that without a full real name, I'll not respond further.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Beck Willis

Join Date: Aug 2014

Posts: 3
#4

15 Aug 2014, 12:27

Evaluation units generally correspond to “health district” size: 100,000-250,000 residents. Each evaluation unit is mapped using a cluster random sample survey of 20+ clusters, which aims to examine 1019 children aged 1-9 years, plus adults resident in the same households, with sampling specifics guided by local geopolitical divisions and population structure. Since first posting, I've discussed this with a PhD in my division, and based on that discussion, I think what I am supposed to do is bootstrap after I collapse into clusters and before I collapse by EU. So instead of collapsing with ~30 clusters per EU, it will be ~30 * # replications... and then I have to find a 95% confidence interval for each EU prevalence, which is calculated in the last collapse command.
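
For illustration only, the idea of resampling clusters within an EU could be sketched as below. This is untested and assumes the data are in the state just before the final collapse by eu: one observation per cluster, with a string variable eu and the variable cluster_tf_prev. The replication count and seed are arbitrary, and the name eu_prev is made up. The sketch treats clusters as equally weighted, so it does not address any question about sampling weights or cluster sizes.

```stata
* Untested sketch. Assumes one observation per cluster, with string eu and
* cluster_tf_prev (the data just before the final collapse by eu).
levelsof eu, local(eus)
foreach e of local eus {
    display as text "EU `e'"
    preserve
    keep if eu == "`e'"
    * each replication resamples clusters with replacement and averages them
    bootstrap eu_prev = r(mean), reps(1000) seed(20140815) nodots: ///
        summarize cluster_tf_prev
    * percentile-based 95% confidence interval
    estat bootstrap, percentile
    restore
}
```

With only 20 to 30 clusters per EU, bootstrap intervals of this kind are likely to be rough.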
Steve Samuels

Join Date: Mar 2014

Posts: 1785
#5

15 Aug 2014, 14:37

Thank you for changing your name, Beck!

This study design can be described as a multi-stage random sample. The sampling units at each stage are: cluster, household (if sampled within cluster), and individual (if sampled). With due respect to your colleague, the analysis requires survey sampling expertise.

In a survey analysis, inference is based on the design, primarily on variation between primary sampling units (your clusters). A bootstrap with individuals as the analysis units will badly underestimate standard errors. Moreover, the mean-of-means estimate ignores the sizes of the clusters and households, so it is likely to be biased as well.

The analysis is straightforward. You will describe the sampling design to Stata with a svyset statement. This will describe the EUs as "strata" and the clusters as primary sampling units (PSUs), and it will include a sampling weight, which you will need to calculate. The svyset statement will also describe the age-group standardization: you will identify the age group variable with a poststrata() option, and the variable containing the weight for each group with a postweight() option. It will also include a variable that describes the original number of clusters in each EU (not the sampled number); call it nclus.

A likely minimal svyset statement will look something like:

Code:

svyset cluster [pw = sampweight], strata(eu) poststrata(age) postweight(age_weight) fpc(nclus)

(It might also pay to describe information on the household sampling stage, if any.)

You will then estimate prevalences with Stata's descriptive survey commands, svy: tabulate or svy: proportion.
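
Put together, the steps might look like the untested sketch below. It assumes the resident-level data are in memory, that age_weight has already been attached to each child's record (with the same replace statements as in the original do-file), and that sampweight and nclus have been constructed; neither exists in the posted data. The indicator name tf01 is made up. For simplicity the sketch keeps only children aged 1-9; using the subpop() option of svy on the full data would be the more careful way to handle that restriction.

```stata
* Untested sketch. sampweight and nclus are placeholders that must be
* constructed from the sampling design before this will run.
keep if inrange(age, 1, 9)

* 0/1 indicator (not 1/missing), so that its mean is a prevalence
gen byte tf01 = (left_eye_tf == "1" | right_eye_tf == "1")

* declare the design: EUs as strata, clusters as PSUs, age poststratification
svyset cluster [pw = sampweight], strata(eu) poststrata(age) ///
    postweight(age_weight) fpc(nclus)

* prevalence with design-based standard error and confidence interval, by EU
svy: mean tf01, over(eu)
```

svy: proportion tf01, over(eu) would report the same prevalence as the proportion of children with tf01 equal to 1.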

If you are unfamiliar with survey techniques, the best book that I can recommend is that by Groves et al. (2004) (Groves, R, F Fowler Jr, M Couper, J Lepkowski, E Singer, and R Tourangeau. 2004. Survey methodology. Wiley series in survey methodology. Hoboken, NJ: Wiley.)

A couple of issues to clarify:

1. Was selection of clusters done by "simple random sampling" within each EU? If not, describe.
2. Of all the clusters in the evaluation unit, how many were selected?
3. Was there further sampling of households within cluster? If so, was this by simple random sampling, or by systematic sampling? Or what?
4. Were all children 1-9 in the selected households examined? If not, was the probability of selection determined by age (e.g. over-sampling young children)?
5. Were all adult residents in the selected households examined? Was there any sampling at that level?
6. Do you have information on non-response or non-participation at any level?

Calculation of probability weights: The sampling analysis will require calculation of the probability of selection for each studied individual. The inverse of this will be the "sampling" or "probability" weight. How this is calculated will depend on the answers to the questions above.
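
As an illustration of the arithmetic only, suppose (purely as an assumption, since the design has not yet been described) that clusters were chosen by simple random sampling within each EU, households by simple random sampling within each cluster, and that everyone in a selected household was examined. All variable names other than nclus and sampweight are hypothetical.

```stata
* Untested sketch under the assumptions stated above; not the actual design.
* nclus      = number of clusters in the EU
* nclus_samp = number of clusters sampled in the EU        (hypothetical)
* nhh        = number of households in the cluster         (hypothetical)
* nhh_samp   = number of households sampled in the cluster (hypothetical)
gen double p_select   = (nclus_samp / nclus) * (nhh_samp / nhh)

* the sampling weight is the inverse of the selection probability
gen double sampweight = 1 / p_select
```

A different selection scheme at any stage, or adjustments for non-response, would change these formulas.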

Steve

Steve Samuels
Consulting Statistician
18 Cantine's Island Lane
Saugerties NY 12477 USA
Phone: 845-246-0774
Fax: 206-202-4783
[email protected]

Last edited by Steve Samuels; 15 Aug 2014, 14:39.
Beck Willis

Join Date: Aug 2014

Posts: 3
#6

18 Aug 2014, 08:07

This is very helpful, thank you! Let me reach out to a couple of colleagues more involved with the cluster and HH selection and get back to you. I don't want to give you incorrect information.