
  • svyset for cross-sectional data

    Hi,

    My first post here. I'm looking at 5 cross-sectional SMART nutrition surveys, two from 2015 (AN and AW counties in November) and three from 2015 (AN and AW in November, and AN in April also). I have 42-44 clusters in each survey. The clusters chosen are different in each year.

    I think I need to use the svyset command to analyse my data correctly.

    I think it should be as follows:

    svyset cluster [pw=weight], strata(strata)

    My cluster is the cluster variable in the dataset. The PW, I think, is the total number of villages divided by the number of villages chosen as clusters (i.e. 305/44), but I'm not sure what the strata should be.
    Each of my surveys is in a separate dataset; could I put them all in one dataset, recode the clusters so they're all unique numbers, and then use year as my strata? Can I do this if the clusters are different each year?
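    Something along these lines is what I have in mind; the filenames and the svy_id variable are just placeholders, not my actual files:

    Code:
    * append the five survey datasets (placeholder filenames)
    use survey1, clear
    append using survey2 survey3 survey4 survey5, generate(svy_id)
    * make cluster codes unique across surveys before declaring the design
    egen cluster_u = group(svy_id cluster)
    svyset cluster_u [pw = weight], strata(year)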

    Thanks in advance!
    Ciara

  • #2


    Welcome to Statalist, Ciara!

    I know little about SMART surveys, but I did a quick internet search. Your description looks incomplete to me, and your PW calculation is not correct.

    * SMART surveys often have multistage designs, with sampling of households in selected villages

    * In some surveys, villages are selected with probability proportional to size (PPS), important if the villages differ greatly in size


    * Calculating the probability weight ("design weight")

    The design weight for an analysis unit is 1/f, where f is the probability of selecting that unit. If the unit of analysis is the household, for example, the selection probability for HH j in village i would be:

    f_ij = f_i x f_j|i

    where f_i is the probability of selecting village i and f_j|i is the probability of selecting HH j, given that village i is selected.

    The sampling weight for HH ij would be
    w_ij = 1/f_ij
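
    As a rough sketch only, with entirely hypothetical variable names, the two-stage weight could be computed along these lines (PPS at the first stage, SRS of households at the second):

    Code:
    * hypothetical variables: vill_pop (village population), tot_pop (county total),
    * n_clusters (number of villages sampled), hh_sampled and hh_total (households in the village)
    gen double f_village = n_clusters * vill_pop / tot_pop   // first stage: PPS selection of village i
    gen double f_hh      = hh_sampled / hh_total             // second stage: SRS of HH within village i
    gen double wt        = 1 / (f_village * f_hh)            // design weight w_ij = 1/f_ij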


    Some questions:

    1. What are the goals of your analysis? To estimate descriptive statistics? To test hypotheses? To fit models?

    2. Will your denominators be villages? households? individuals? all of the above?

    3. County AN was studied in April and November of "2015" (2016?). Were the same households studied on each occasion?

    4. Did you have household or individual non-response? How much?


    5. Looking ahead: the design weight is often adjusted so that the weighted sample better matches the population. Do you have census information about the populations of the two counties?

    Such post-survey adjustments will be needed if villages were selected with simple random sampling and if, in the population, villages differ greatly in size.
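
    If such an adjustment turns out to be needed, svyset's poststratification options can carry it. A sketch, assuming a county identifier and a variable holding each county's census total (both names hypothetical):

    Code:
    svyset village [pw = wt], strata(stratum) poststrata(county) postweight(county_pop)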

    Given the uncertainties, I am unwilling to offer any concrete advice at this point. I suggest that you repost with a more detailed description of the sampling design.



    Some references:

    http://essedunet.nsd.uib.no/cms/topics/weight/2/6.html

    http://www.restore.ac.uk/PEAS/theoryweighting.php

    Lohr, S. L. (2009). Sampling: Design and Analysis (2nd ed.). Boston, MA: Cengage Brooks/Cole.

    Groves, R. M., Fowler, F. J., Couper, M. P., Lepkowski, J. M., Singer, E., & Tourangeau, R. (2009). Survey Methodology (2nd ed.). Hoboken, NJ: Wiley. (Chapter 4 and Section 10.5)




    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2



    • #3
      Dear Steve,

      I'm so sorry I'm only replying now; I was expecting an email to alert me to a response on the post but don't recall seeing one! Thank you very much for your thorough response and links to references. Since I posted I've done some learning myself. What I ended up doing was using svyset to arrange my data into clusters (villages) and groups (by year and season) as I amalgamated my datasets. I decided I didn't need to use any weighting because my sample sizes were very similar between groups (a difference of 100 people at most in groups of about 600-700). My villages were all selected using PPS.
      1. The goal of this analysis is to look at the relationship between anthropometric variables (GAM prevalence by WHZ) and explanatory variables such as illness, vaccination status, breastfeeding and minimum dietary diversity. So I've been using stepwise logistic regression to come up with a final model.
      2. My denominator is individuals.
      3. '2015' should be 2016, yes, and different villages were chosen randomly each year from an exhaustive list of villages and then households were chosen either by spinning a pen or SRS.
      4. I'm not able to determine the non response rate as I only have data on those who participated and no record of whether there were any people who declined to participate.
      5. I have census information on the population of each county: AW 236,402 and AN 183,186 are the projected figures for 2017.


      Do you think I've used the right approach in the end? Have I provided enough information this time?

      Thank you very much!



      • #4
        Dear Ciara:

        Equal numbers in your five county-year groups are not sufficient evidence of equal weights. In fact, taking the 2017 figures as denominators and the range of sample sizes you quote, we find that the sampling fractions differ.

        f1=600/236402 => 0.002538
        f2 =600/183186 => 0.003275
        f3 =700/236402 => 0.002961
        f4 =700/183186 => 0.003821


        f4/f1 => 1.505586
        f3/f1 => 1.166667
        f2/f1 => 1.29050

        As these were computed at the extremes, the actual fractions should be closer.
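
        These fractions and ratios are easy to reproduce in Stata, for example:

        Code:
        display 600/236402                     // f1
        display 700/183186                     // f4
        display (700/183186) / (600/236402)    // f4/f1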

        To compute the actual probabilities of selection for HH, one needs to multiply the first stage (village) and second stage (HH) probabilities.

        You haven't said how individuals were selected. I assume that information was gathered on all individuals in a HH. If so, they would inherit the household's selection probability and weight. It's unfortunate that in some villages HH were selected by spin-the-pen. With this design, HH selection probabilities and weights cannot be directly computed. And, even with improvements to the original design, selection is biased in favor of HH in lower-density areas (Grais et al., 2007).


        In villages in which SRS of HH was done, the probabilities of HH selection are equal by design. You do need to know the village HH total and the number of HH approached.

        For spin-the-pen villages, you would need to *assume* that probabilities are equal. That is likely to be untrue in practice.


        Stepwise?

        Stepwise is an unfortunate choice. Many studies have shown that the models it produces do not hold up in new data. The p-values are biased downward, and you will be unable to quote an honest p-value in your conclusions.

        Another option for variable selection in Stata is Gareth Ambler's contributed command -plogit- with the "lasso" option. Unfortunately, -plogit- will ignore the village clustering and assume SRS of individuals. Still, it might be a good start; note that I haven't used it. Get -plogit- at http://www.homepages.ucl.ac.uk/~ucakgam/stata.html.

        Perhaps a simpler honest assessment of stepwise results is to split the sample into two groups, with three strata in one group and two in the other. Use stepwise or other exploratory methods on each group; then validate the final model from that group using svy: logit on the other group.
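
        A sketch of that split, with placeholder variable names (stratum, gam, x1-x3, wt, village) standing in for your own:

        Code:
        * put three of the five county-year strata in the exploration half
        gen byte explore = inlist(stratum, 1, 2, 3)
        * ... exploratory variable selection on the exploration half (see the stepwise example below) ...
        * then validate the chosen model on the held-out strata, respecting the design
        svyset village [pw = wt], strata(stratum)
        svy, subpop(if explore == 0): logit gam x1 x2 x3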

        Other solutions are the bootstrap and cross-validation of the exploratory analysis. ( http://ellisp.github.io/blog/2016/06...-cv-strategies)


        Note that you can account for the survey design in stepwise:
        Let county-year group be the variable stratum

        Code:
        tabulate stratum, generate(strd)   // stepwise does not allow factor variables, so create stratum indicators first
        stepwise, pr(.2): logit y x1 x2 x3 strd2-strd5 [pw = ], vce(cluster village)
        For the final validation analyses, svyset the data and use svy: logistic.

        Reference:
        Grais, R. F., Rose, A. M., & Guthmann, J. P. (2007). Don't spin the pen: two alternative methods for second-stage sampling in urban cluster surveys. Emerging themes in epidemiology, 4(1), 8.
        https://ete-online.biomedcentral.com.../1742-7622-4-8
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2



        • #5
          Hi Steve,

          I've just realised I had email notifications turned off! So again, a delay with getting back to you, sorry!

          When I cleaned the data, I was left with unequal numbers in each group in any case, so I've used these final numbers to determine the weighting. Is this fair to do, or should I do the weighting before cleaning?
          I've got the following when I account for the particular county and year:

          F1: 489 weighting:1
          F2: 512 weighting F2/F1: 1.0468
          F3: 501 weighting F3/F1: 0.7940
          F4: 632 weighting F4/F1: 1.2926
          F5: 641 weighting F5/F1: 1.016

          In fact, in each of these surveys, clusters were all randomly selected from an exhaustive list of villages, and households were fortunately all then selected by SRS.

          In 2016, for instance, I have 44 clusters and 14 households chosen in each cluster. If a village was larger than 100 households, it was segmented using natural boundaries and the estimated population of each segment recorded; then a segment was chosen on a PPS basis and the households chosen by SRS.


          'To compute the actual probabilities of selection for HH, one needs to multiply the first stage (village) and second stage (HH) probabilities' - I'm not sure what you mean by this. But you're right, information was gathered on all individuals in a HH.

          'You do need to know the village HH total and the number of HH approached'... might you instead mean I do NOT need to know this?
          If I do indeed need to know this, in some years it's available, in others it's not for village HH total, but I always know the number of households chosen per village.


          I think instead of stepwise I've actually used forward modelling; I might have used the wrong terminology saying stepwise. What I did with the entire dataset was look at each variable against GAM in logistic regression individually; anything with a p-value >0.1 I brought forward to the second stage and kept in my final model... is this the right thing to do?


          I did take your advice to split my dataset. I split it by county and found different associations, i.e. in the West county 'other liquid' is protective of GAM, whereas in the North county not having been sick and having consumed orange fruit and veg is protective. In the initial amalgamated dataset I found that not having been sick, juice consumption and other milk consumption were all protective against GAM.

          But now how do I 'validate the final model from that group using svy: logit on the other group'? I now have my data in two separate Excel spreadsheets; should I be using a combined version instead to do this?

          Thank you for your continued help!
          Ciara



          • #6
            I missed your post, Ciara, sorry. I'm away for a couple of days, but I'll respond when I return.
            Steve Samuels
            Statistical Consulting
            [email protected]

            Stata 14.2



            • #7
              I'm not sure how "cluster" differs from "village". I assume cluster = village. The segmentation you describe, followed by SRS or systematic sampling (but not "spin the pen") should produce identical probabilities of selection for all HH in a given village, county and occasion. For lack of better information, you'll have to assume equal probabilities.

              From your figures, it appears that sampling fractions were similar in F1, F2, and F5, possibly different in F3 & F4.

              Unfortunately, the procedure of selecting for a final model only the variables significant in a bivariate model is also biased. However, your split analysis showed some variables which were predictive in both counties, a finding that strengthens the conclusion that those variables are important.

              To fit the coefficients to another data set, follow Clyde Schechter's suggestion in Post #6 here

              Suppose your split data sets are data1 and data2

              Code:
              use data1, clear
              logistic outcome v1 v2 v3 v4
              matrix bd1 = e(b)                          // coefficients from data 1 (includes _cons)
              matrix colnames bd1 = v1 v2 v3 v4 _cons    // optional: e(b) already carries these column names
              use data2, clear
              matrix score xb21 = bd1                    // linear predictor in data 2 from the data 1 model
              label var xb21 "data2 predicted xb from data1 model"
              Then to validate the data set 2 model, switch "1" and "2" in the above. The virtue of this approach (instead of copying over from Excel) is that you can easily change models, e.g. add interactions, drop variables.
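
              A possible next step, sketched here with the same names, is to see how the data1 model discriminates and calibrates in data2:

              Code:
              roctab outcome xb21      // area under the ROC curve for the data1 model applied to data2
              logit outcome xb21       // rough calibration check: the coefficient on xb21 should be near 1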

              Notes:
              If the variables you selected had been shown in prior studies to affect your outcome, then you could have started with them a priori. Note that if your goal is prediction, then smaller models are often better than big ones.

              You don't appear to have considered interactions. In my experience, strong linear predictors often interact with others.

              Your splitting to validate models is a version of what is called "cross-validation". There is a Stata command cvauroc that fits a logistic regression model and gets cross-validated estimates of ROC curves, sensitivity, and specificity.
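
              A minimal sketch of how cvauroc might be called; it is a user-written command (install from SSC), the variable names below are placeholders, and the option names kfold() and seed() should be checked against help cvauroc:

              Code:
              ssc install cvauroc
              * 10-fold cross-validated AUC for a logistic model (placeholder variables)
              cvauroc gam illness vaccinated dietdiv, kfold(10) seed(12345)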


              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2



              • #8
                Hi Steve,

                Thanks a million for the further advice; plenty more for me to think about. I'll get stuck into Stata again in the next few days. Thanks again!

                Ciara

