Setting my data set as a svy dataset

Sunga Kalemba

Join Date: Jun 2014

Posts: 22
#1


03 Jul 2018, 20:56

I have a confidentialised census sample from the Australian Bureau of Statistics, which is 1% of the population. The data methodology is given here:

HTML Code:

http://www.abs.gov.au/ausstats/abs@.nsf/Latestproducts/2037.0.30.001Main%20Features202011?opendocument&tabname=Summary&prodno=2037.0.30.001&issue=2011&num=&view=

The data are at three levels, i.e. individual, family and dwelling, with the respective IDs ABSPID, ABSFID and ABSHID. They cover the whole country, with areas divided into states (STATE). The household ID is ABSFID, the respondent ID is ABSPID and the dwelling ID is ABSHID, but the design weight is not given. A sample of some key variables, including sex and age, follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str14 ABSHID byte(ABSPID ABSFID Sex) float agegroup long STATE
"CSF11B00000001" 0 1 2 6 1
"CSF11B00000001" 0 1 1 3 1
"CSF11B00000002" 0 1 2 4 1
"CSF11B00000002" 0 1 1 3 1
"CSF11B00000003" 0 1 1 1 1
"CSF11B00000003" 0 1 2 5 1
"CSF11B00000003" 0 1 1 1 1
"CSF11B00000004" 0 1 1 5 1
"CSF11B00000005" 0 1 2 9 1
"CSF11B00000006" 0 1 1 4 1
end
label values Sex SEXP
label def SEXP 1 "1. Male", modify
label def SEXP 2 "2. Female", modify
label values agegroup agegrouplbl
label def agegrouplbl 1 "under 16", modify
label def agegrouplbl 3 "20-29", modify
label def agegrouplbl 4 "30-39", modify
label def agegrouplbl 5 "40-49", modify
label def agegrouplbl 6 "50-59", modify
label def agegrouplbl 9 "85+", modify
label values STATE STATE
label def STATE 1 "NSW", modify

My trouble is how to calculate and apply the weight and eventually survey set my data. Given that we can calculate the weight as (sample size/population size), I am wondering whether doing that won't simply give me a single number for the weights at each level (individual, family and household). The methodology file accompanying the data says the ideal PSU is the dwelling, but I want to use the individual as the unit of analysis. The file goes into detail on what to do to avoid specifications, which I completely understand.

Would you please advise on how I would go about survey setting this dataset using the calculated weights and the individual as the unit of analysis?

Thanks and best regards.

Sunganani Kalemba
PhD Student.
Queensland
Tags: categorical, survey data
Sunga Kalemba

Join Date: Jun 2014

Posts: 22
#2

03 Jul 2018, 22:11

To add clarity to my question.

Kindly note that this database is slightly different from survey data as it is a sample of the existing records in the Census data rather than a sampling applied during data collection. This is why I am at pains to decide whether or not to treat it as survey data and thus having trouble setting the appropriate markers for survey data.

Sunganani Kalemba
PhD Student.
Queensland
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

04 Jul 2018, 09:46

Interesting questions. As the data are a sample of the census files, then yes, you must svyset them.

1. 1% basic Census Sample File (CSF) of private dwellings. Each person in a selected dwelling is studied. designwt = 100

2. 5% expanded CSF of private dwellings: designwt = 20

In 1 & 2 all HH and people in a dwelling get the dwelling design weight.

3. 1% sample of people in non-private dwellings: designwt = 100

Private and non-private dwellings are distinguished by the DWIP ("Dwelling Indicator for Persons") or DWTD variable (I am not sure which is in your data): DWIP (DWTD) = 1 for private dwellings and = 2 for non-private dwellings.

I would guess that the Census lists were ordered geographically and that a systematic sample was drawn. However to protect confidentiality, in the analysis files, dwellings were randomly ordered within regions.

From the document you link to:
" Standard error calculation

Both CSFs can be treated, for the purposes of standard error calculations, as a simple random sample of dwellings from the private dwelling population. For many purposes the non-private dwelling population has only a minor influence on results, and it is sufficient to include each person counted in a non-private dwelling as a separate 'dwelling' when calculating standard errors."

For geographic region, choose the variable you think is appropriate: AREANUM (enumeration area) or, possibly, REGUCP (region of usual residence).

Code:

gen stratum = AREANUM                    // or REGUCP?
gen psu = ABSHID if DWIP==1              // or DWTD == 1
replace psu = ABSPID if DWIP==2          // or DWTD == 2
svyset psu [pw = designwt], strata(stratum)
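A minimal sketch of how the weight and PSU variables might be built before the svyset call, assuming the 1% basic CSF, that DWIP and AREANUM are numeric variables in the file, and that ABSHID is a string and ABSPID numeric (as in the -dataex- listing in #1). Because of that type difference, the person-level PSU is built here as a string. The names designwt, stratum and psu are illustrative, and whether ABSPID identifies a person uniquely within a dwelling is an assumption to check against the data.

```stata
* Sketch only: variable names and codings are assumptions to check against the CSF documentation

* Design weight = 1 / sampling fraction = 1 / 0.01 = 100 for the 1% files
gen designwt = 100

* Stratum: whichever geographic variable is judged appropriate
gen stratum = AREANUM                      // or REGUCP

* PSU: the dwelling for private dwellings, the person for non-private dwellings.
* ABSHID is a string, so the person-level identifier is built as a string too.
gen str20 psu = ABSHID if DWIP == 1
replace psu = ABSHID + "_" + string(ABSPID) if DWIP == 2

svyset psu [pw = designwt], strata(stratum)
svydescribe                                // check strata and PSU counts
```

With the 5% expanded CSF the same sketch would use designwt = 20 instead.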

Your base file is the CSF11BP person file. To create the final analysis file, merge contextual variables like family size from the dwelling and family files. Note that if a respondent was a visitor, then such variables should be set to missing.

Last edited by Steve Samuels; 04 Jul 2018, 09:48.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Sunga Kalemba

Join Date: Jun 2014

Posts: 22
#4

04 Jul 2018, 12:37

Steven

I can’t thank you enough for your quick and insightful response. I have learnt a lot here already.

You are right. I had merged all the data files (family, dwelling and individual) but, having no idea what to do with visitors, I had erroneously recoded them as “other”. I have made the correction.

Am I right to assume that I can do the same for responses not relevant to my analysis, such as “Not adequately described”, “Not applicable” etc.?

Best regards

Sunganani Kalemba
PhD Student.
Queensland
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

05 Jul 2018, 15:03

Yes, those could all be missing. In fact, for the visitors, the HH/Family questions are "not applicable". If you want to keep track of different kinds of missings, consider using Stata's extended missing values. These can be useful when there are several reasons for missing data and you want to be able to know the reason.

Besides "." you can have missing:
.a
.b
.c

For example, you can use ".a" for not applicable and give it a value label:

label define notapp .a "Not Applicable"

The sort order for these missings is:
. .a .b .c
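A short sketch of how this could look in practice. The variable famsize and the source codes 98 and 99 are hypothetical; the actual codes for "not applicable" and "not adequately described" would come from the CSF data dictionary.

```stata
* Sketch only: famsize and the codes 98/99 are hypothetical
label define famsizelbl .a "Not applicable" .b "Not adequately described"

replace famsize = .a if famsize == 98      // not applicable (e.g. visitors)
replace famsize = .b if famsize == 99      // not adequately described

label values famsize famsizelbl
tabulate famsize, missing                  // shows .a and .b as separate rows
```

Estimation commands treat .a and .b like any other missing value, so the distinction only matters for bookkeeping and tabulation.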

As an aside: when values can be missing, you want to be careful with statements like:

Code:

keep if x >10

This will also keep every observation in which x is missing, because in Stata, missing values are larger than any number. SAS has the same kind of problem, because SAS missing values are less than any number. An easy fix is:

Code:

keep if x>10 & x<.
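A toy illustration of the difference, using a made-up variable x that contains one system missing value and one extended missing value:

```stata
* Sketch only: toy data to show how missing values behave in comparisons
clear
input x
5
12
.
.a
end

count if x > 10                  // counts 12, . and .a
count if x > 10 & x < .          // counts only 12
count if x > 10 & !missing(x)    // same result as the line above
```

The condition x < . also excludes .a, .b and so on, because extended missing values sort after the system missing value.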

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Sunga Kalemba

Join Date: Jun 2014

Posts: 22
#6

05 Jul 2018, 19:01

Steve,
Many thanks for pointing that out.
I noticed it when I compared the results before and after cleaning.

best regards

Sunganani Kalemba
PhD Student.
Queensland
