Setting up survey design with svyset

John Guitar

Join Date: Feb 2016

Posts: 2
#1


13 Feb 2016, 06:14

I have several questions about a project using data from the IPUMS General Household Survey in Nigeria. I have two samples, for 2006 and 2007.

I need to analyse the socio-economic factors that influence whether a household owns a computer, has access to one, or neither. Therefore, I will use multinomial logistic regression.

The following is the description of the 2006 sample (2007 differs slightly in the total number of included households and EAs): the sample followed a two-stage, replicated and rotatable design in which enumeration areas (EAs) demarcated for the 1991 Population Census served as the primary sampling units and housing units (HUs) as the secondary sampling units. Sixty EAs per state and 30 EAs in the Federal Capital Territory, Abuja, were randomly selected. In each EA, 10 households were selected randomly from a list of all households in the EA. In total, 21,900 housing units from 2,190 enumeration areas were included in the sample. The selected EAs were distributed across urban and rural areas.

The sample is weighted, meaning each record in the sample represents a certain number of households in the population. If I am right, these are post-stratification weights.

My questions concern setting up the survey design with svyset:

1. Should/can I use the weights in the sample as pweights?
2. Should I use the urban/rural variable as strata identifier?
3. Should I pool the data or perform the analysis on each year separately?

Thank you in advance. Any help will be appreciated.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

14 Feb 2016, 08:16

I was not familiar with this survey, so I spent some time looking it up. Ordinarily, the svyset statement for a single year would be something like one of the following:

Code:

svyset EA [pw = ], strata(strata)
// or
svyset EA [pw = ], strata(state)

but neither of these is correct here. The clue is that there is no Enumeration Area variable in the dataset.

According to these pages
https://international.ipums.org/inte...g2006a_tag.xml
https://international.ipums.org/inte...roup?id=h-tech
and
https://international.ipums.org/inte...iption_section

the survey should be analyzed as 100 or 120 independent replicates identified by the variable SUBSAMP. (You'll have to check the actual number.)

"SUBSAMP allocates each case to one of 100 subsample replicates, randomly numbered from 0 to 99. Each subsample is nationally representative and preserves any stratification of the sample from which it is drawn. Users who need a representative subset of a sample can use SUBSAMP to select their cases. For example, to randomly extract 10% of the cases from a sample, select any 10 of the 100 subsamples."

Strata:
The variable for state, GEO1_NG (https://international.ipums.org/inte...iption_section), is not needed for svyset, because the replicates are already stratified. There is also a STRATA variable in the data, but for the same reason I don't see how it can be used in svyset. Urban/Rural is not a stratifying variable, and there is no mention of it as such. It can, however, be an important classification variable for the analysis.

Weights:
There are two weights, household (HHWT) and personal (PERWT). Which one you use depends on the unit you are analyzing. You would use the HH weight if, for example, you created HH summaries such as average income.

You can analyze the combined data, with YEAR as a stratifying variable. Totals will be wrong (they are added over two years), but means and proportions will be okay. You can also analyze the differences between years.

Therefore one of:

Code:

svyset SUBSAMP [pw = PERWT], strata(YEAR)   // person analyses
svyset SUBSAMP [pw = HHWT], strata(YEAR)    // HH analyses

For single year analyses, one of

Code:

keep if YEAR == 2006   // 2006 only
keep if YEAR == 2007   // or: 2007 only

Caveat: To use YEAR as a stratum variable, I've assumed that the replicates were reconstructed randomly each year. If that's not so then it should simply be an analysis variable.

Aside: I find it very inconvenient to have to switch between lowercase Stata syntax and uppercase variable names. Accordingly, I recommend as a first edit step:

Code:

rename *, lower
save new
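
As a rough illustration of how the household-level declaration above might feed into the multinomial model from the original question, here is a minimal sketch on simulated data. Only the roles of the replicate, year and household-weight variables come from the discussion; the lowercase names (as after the rename step), the covariate urban, the outcome coding of computer and all data values are assumptions made for the example.

```stata
* Hypothetical sketch: simulated data stand in for the IPUMS extract.
clear
set seed 12345
set obs 2000
generate subsamp  = mod(_n, 100)              // replicate id 0-99, mimicking SUBSAMP
generate year     = cond(_n <= 1000, 2006, 2007)
generate hhwt     = 4000 + 1000*runiform()    // made-up household weight
generate urban    = runiform() < 0.4          // made-up covariate
generate computer = floor(3*runiform())       // assumed coding: 0 none, 1 access, 2 owns

* Replicates as PSUs, year as stratum, household weight as pweight
svyset subsamp [pw = hhwt], strata(year)

* Survey-adjusted multinomial logit with "none" as the base outcome
svy: mlogit computer i.urban i.year, baseoutcome(0)
```

With real data, the simulated lines would be replaced by loading the one-record-per-household file; the svyset and svy: mlogit lines would stay the same apart from the covariate list.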

Last edited by Steve Samuels; 14 Feb 2016, 08:47.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Ivan Oreskovic

Join Date: Feb 2016

Posts: 24
#3

14 Feb 2016, 08:43

Steve Samuels, thank you for your reply. I created a new profile with my full name because I read that posts may be ignored if a full name is not provided. This is me, the original author of this post (John Guitar = Ivan Oreskovic). Nice to meet you. Later, I saw that I should have contacted the administrators to change the name. I apologize to the forum members for the redundancy.

I have read your guidelines carefully. However, I have some questions, mainly because I am not a statistician by profession. I came across the SUBSAMP variable but didn't know what it actually means or how to use it. Can you explain in simple terms, or point me to a link that explains, what subsample replicates are and what their purpose and effects are?

I don't want to just copy and paste the provided lines; I want to understand the rationale behind them.

Ivan
Ivan Oreskovic

Join Date: Feb 2016

Posts: 24
#4

16 Feb 2016, 01:10

@Steve Samuels

In addition to my previous reply, I read about replicated or interpenetrating sampling, and I'm confused by SUBSAMP being declared as the PSU. Could you please explain the logic behind it?

If I understood correctly, you are suggesting that single year analysis should be performed without svyset? If this is the case, however, when running multinomial logit, I should still use HHWT or PERWT, right?

Thank you so much.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

17 Feb 2016, 08:24

Thanks for reregistering with your real name, Ivan.

Below I give a summary of the random groups method. To answer your two questions: 1) Use svyset whether you analyze one year or two; 2) The weight you use depends on whether your analysis unit is the household or the individual. For the household, you would create a dataset of households alone; for the individual, the data set consists of all observations.

The Method of Random Groups

The method of random groups was originated by Mahalanobis; the groups are designated by many names: "interpenetrating samples", "ultimate clusters", "replicated subsamples", "random groups". The method is discussed in many sampling texts (see the references).

The replicates are, in fact, artificial primary sampling units created for the purpose of calculating standard errors. With modern survey-capable packages like Stata, there's no need to do the simplified calculation. Therefore, you can svyset with the random groups as PSUs and apply Stata's svy commands (Heeringa et al., 2010, p. 107).

Below is a description of the original idea. For a comprehensive exposition, consult Chapter 2 of Wolter, 2007. I use the words "group" and "replicate" to refer to a "subsample replicate."

Suppose the goal is to estimate a parameter \(\theta\). The method creates k subsamples, each of which is a random sample of the original population. In subsample \(\alpha\), the estimate of \(\theta\) is \(\widehat{\theta}_\alpha\). The overall estimate of \(\theta\) is the mean of the k replicate estimates:

\[
\widehat{\overline{\theta}} = \sum_{\alpha=1}^k \widehat{\theta}_\alpha/k
\]

The advantage of this is a greatly simplified estimate of variance, even for complicated designs:
\[
v(\widehat{\overline{\theta}}) = \frac{1}{k(k-1)} \sum_{\alpha=1}^k \left(\widehat{\overline{\theta}}-\widehat{\theta}_\alpha\right)^2
\]

This is just the elementary estimate of the variance of a mean, applied to the \(\widehat{\theta}_\alpha\). There is another advantage to the random groups approach: a look at the individual \(\widehat{\theta}_\alpha\) gives a better idea of variability than a display of the confidence interval alone.
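
To make the arithmetic concrete, here is a small worked example with made-up numbers (they are not from the Nigerian survey). Suppose k = 4 replicates give estimated proportions 0.20, 0.24, 0.22 and 0.26. Taking the mean of the replicate estimates and the sum of squared deviations from it:

\[
\widehat{\overline{\theta}} = \frac{0.20 + 0.24 + 0.22 + 0.26}{4} = 0.23
\]

\[
v(\widehat{\overline{\theta}}) = \frac{(0.03)^2 + (-0.01)^2 + (0.01)^2 + (-0.03)^2}{4 \times 3} = \frac{0.0020}{12} \approx 0.000167
\]

The standard error is the square root, about 0.0129.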

How random groups are created

1. Draw the original sample according to the complex multistage design. Randomly group the primary sampling units (enumeration units) into larger standard error calculation units (SECUs) (Heeringa et al., 2010, pp. 101-107). This is the method used in the Nigerian Household Survey.

2. Draw the random groups as independent systematic samples. This is the method popularized by Deming (1960); it can be used for quite complicated designs. (I've used it myself.) It's the only way to get an unbiased estimate of variance with systematic sampling.

These approaches are very flexible. For example, certainty units are handled by placing copies of their data in each replicate.

Analysis choices when the data consist of random groups

1. Use statsby: estimate parameters for each group; save to a dataset; use summarize to get means and standard errors. This approach works well if the main goal is to estimate descriptive statistics. For regression modeling, statsby is less attractive, because post-estimation commands like margins and predict are not available.

2. svyset with replicates as PSUs. As the individual replicates are already stratified, there is no need to specify strata. This is what I recommended above.
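
The first choice (statsby) can be sketched roughly as follows, on simulated data. The variable names subsamp, hhwt and owns and all data values are assumptions for illustration; the standard error is the random-groups one, that is, the standard deviation of the replicate estimates divided by the square root of the number of replicates.

```stata
* Hypothetical sketch: simulated household file with 100 replicates.
clear
set seed 2016
set obs 5000
generate subsamp = mod(_n, 100)              // replicate id 0-99
generate hhwt    = 4000 + 1000*runiform()    // made-up household weight
generate owns    = runiform() < 0.1          // made-up 0/1 outcome

* One weighted proportion per replicate; data in memory become 100 rows
statsby theta = r(mean), by(subsamp) clear: summarize owns [aweight = hhwt]

* Mean of the replicate estimates and its random-groups standard error
quietly summarize theta
scalar k         = r(N)
scalar theta_bar = r(mean)
scalar se_rg     = r(sd)/sqrt(r(N))          // sqrt( sum of sq. dev. / (k(k-1)) )
display "estimate = " theta_bar "   se = " se_rg
```

Listing theta after the statsby step also gives the replicate-by-replicate view of variability mentioned earlier.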

References:

Deming, W. E. 1960. Sample Design in Business Research. New York: Wiley.

Hansen, M. H., W. N. Hurwitz, and W. Madow. 1953. Sample Survey Methods and Theory. Volume I: Methods and Applications. New York: Wiley.

Heeringa, Steven, Brady T. West, and Patricia A. Berglund. 2010. Applied Survey Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.

Kish, Leslie. 1965. Survey Sampling. New York: Wiley.

Wolter, Kirk M. 2007. Introduction to Variance Estimation. New York: Springer.

Last edited by Steve Samuels; 17 Feb 2016, 09:03.

Ivan Oreskovic

Join Date: Feb 2016

Posts: 24
#6

19 Feb 2016, 01:59

Steve Samuels

Your suggestions and the literature have been very helpful. Thank you very much for your time and effort.

I want to check one more thing about the analysis design with you, to be sure I understood correctly. As stated, the outcome variable is at the household level: whether the household owns a computer, has access to one, or neither.

I'm considering two approaches:

1 - Analysis at household level - the model will include household and person's individual variables. This "person" will be the head of the household. The number of observations will be equal to the number of households in the sample.

2 - Analysis at respondent level - the model will include the same household and individual variables as in 1), but this time for each person living in the household. The number of observations will be equal to the total number of respondents in the sample.

Please correct me if I went wrong somewhere.
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#7

19 Feb 2016, 22:28

From what you say, HH is the proper unit for analysis.

Ivan Oreskovic

Join Date: Feb 2016

Posts: 24
#8

23 Feb 2016, 14:55

Steve Samuels

Does this mean that the HHWTs should be added together for each household in the sample? E.g. for a two-member household, there are two records in the dataset, each with an HHWT of 4547. If I keep only the record of the household head, should I then add these two weights together so that the new HHWT = 9094?
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#9

23 Feb 2016, 15:57

No. It means that the data set should have one observation per household and that you should ignore the individual weights. At this point, your question is totally unrelated to your original topic, so if you want to ask further questions, start another topic.
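
A minimal sketch of building such a one-observation-per-household file, using the two-member household from the question plus one made-up household. The identifiers serial (household id), pernum (person number), relate (assumed coding: 1 = head) and the replicate values are assumptions for illustration; check the names and codes in your own extract.

```stata
* Hypothetical sketch: tiny person-level file, one row per household member.
clear
input serial pernum relate hhwt subsamp
1 1 1 4547 17
1 2 2 4547 17
2 1 1 3900 42
2 2 2 3900 42
2 3 3 3900 42
end

keep if relate == 1      // keep the head's record only; hhwt is NOT summed
isid serial              // errors out unless there is exactly one row per household

* Household-level design declaration, using the household weight as it stands
svyset subsamp [pw = hhwt]
```

The household weight already says how many households each sampled household represents, which is why it is carried over unchanged from the head's record.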

hamzah hasyim

Join Date: Jan 2016

Posts: 17
#10

08 Mar 2016, 07:39

Dear all,

I am a new member; pleased to meet you. I have a basic health research dataset of 50 variables and approximately a million observations, drawn from a basic data set of 52 variables that was collected through a national survey. The survey collected information from 258,366 sampled households and 987,205 sampled household members in order to measure many public health indicators. The survey used a two-stage sampling design. As I understand it, this design requires special treatment: instead of conventional statistics, the data should be analyzed with complex-samples methods (svyset in Stata) so that the two-stage design is taken into account and the results are valid. My project describes the determinants of the prevalence of one communicable disease (a binary outcome, 0 and 1) among respondents who keep livestock in a rural endemic area, using both svy: proportion and xi: svy: logistic. We used a cross-sectional design.

The dataset already contains the variables PSU, Weight, Inflate and Strata. Given this information and the sampling design, I would like to ask some questions.

The first question.
I ran svyset, but I am a bit confused about which form of the command to use. What does each of the commands below mean, and how do their results differ?

1st command svyset
svyset psu [pweight=inflate], strata(strata), vce(linearized), singleunit(missing)
pweight: inflate
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>

or….

2nd command svyset
svyset [pweight = inflate],strata(strata)
pweight: inflate
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: <observations>
FPC 1: <zero>

Or
3rd command svyset
svyset psu [pweight= inflate],psu(psu)
pweight: inflate
VCE: linearized
Single unit: missing
Strata 1: <one>
SU 1: psu
FPC 1: <zero>

or
4th command svyset
svyset [pw=weight], strata(strata) psu(psu)
pweight: weight
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>

The second question.
The survey manual states that the survey design used two-stage sampling, but svydescribe reports only one stage. What does this mean?

. svydescribe

Survey: Describing stage 1 sampling units

pweight: inflate
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: <observations>
FPC 1: <zero>

Could you please advise me?
