Help - sub-sampling a random representative subset from a large data set (survey data)

Cristian Popa

Join Date: Sep 2017

Posts: 21
#1

10 Sep 2019, 10:56

Hello,

I have a data set of 1500 respondents from a selected area. Because of restrictions related to the topic, the respondents were selected by non-probability quota sampling, with quotas by county (5 counties), urban/rural residence, and 3 age categories. The sample is still biased, however, and I would like to correct that if possible. Instead of post-stratification alone, I'd like to address the probability issue too (yeah, I know...).

Question: How can I draw a random subset (of, say, 800-900 cases) from my data set, using the available census population data (population by county, by urban/rural residence, and by age category) for stratification?

Thank you!
Cristian
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

10 Sep 2019, 11:24

The following presumes you have a variable called "stratum," coded from 1 to Whatever, and that you record the desired N for each stratum in a list of locals.

Code:

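// desired number for each stratum
local N1 = 15
local N2 = 19
...
...
local NWhatever =

forval i = 1/Whatever {
    preserve
    // keep only this stratum first: -sample- with an -if- qualifier
    // would leave every observation outside the stratum in memory
    keep if stratum == `i'
    sample `N`i'', count
    tempfile sample`i'
    save `sample`i''
    restore
}

// stack the per-stratum samples into one data set
clear
forval i = 1/Whatever {
    append using `sample`i''
}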

I can appreciate the ease of using a sample like this, but wouldn't you get equivalent results, with more precision, by keeping all of the original data set and weighting? I'd guess the gain in precision from keeping all 1500 cases would be on the order of sqrt(1500/900), or roughly 1.3, which still might be attractive.
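For comparison, here is a minimal sketch of the weighting alternative. It runs on a toy data set built on the spot, and it assumes a variable popfrac holding each stratum's census population share. That name, the toy strata, and the shares are all hypothetical and not from the original post.

```stata
// Toy data: 1500 respondents in 3 strata of unequal size (750, 500, 250)
clear
set obs 1500
generate stratum = 1 + (_n > 750) + (_n > 1250)

// Hypothetical census shares for the three strata (they sum to 1)
generate double popfrac = cond(stratum == 1, 0.3, cond(stratum == 2, 0.3, 0.4))

// Share of the sample that falls in each stratum
local total = _N
bysort stratum: generate double samplefrac = _N / `total'

// Post-stratification weight: population share over sample share
generate double pswt = popfrac / samplefrac

// Declare the weight so that svy: commands use it
svyset [pweight = pswt]
```

With these made-up numbers the weights are 0.6, 0.9 and 2.4, and they sum to the original 1500 cases, so no respondent is discarded.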

Last edited by Mike Lacy; 10 Sep 2019, 12:07.
Cristian Popa

Join Date: Sep 2017

Posts: 21
#3

10 Sep 2019, 12:29

Thank you, Mike, for the suggestions. I agree that post-stratification / weighting is a good way to deal with my data, but I wanted to try this other approach.
Regarding your code, the (desired) numbers for each stratum would have to be computed by hand. That's OK, but it seems to work for only one stratification variable with N strata, whereas I need to control for three different variables (population in 5 counties, rural-urban population, and 3 age categories). Or maybe I didn't get this...
Cheers!
K.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

10 Sep 2019, 13:47

I was thinking that your population data were available for each combination of county/rural-urban/age group. So there would be 5 counties * 2 rural-urban * 3 age categories, giving you 30 strata, and I'm presuming you would know the population fraction in each one. The sample N for each stratum would just be its population fraction * 900, giving you a sample that reproduces the population fractions. If this isn't on target, go ahead and post again and explain.
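A minimal sketch of that idea follows, run on a toy data set so it can be tried as-is. The variable names county, urban, agecat and popfrac are hypothetical, and the equal population fractions are made up. In real use popfrac would be merged in from the census table for each of the 30 combinations.

```stata
// Toy data: 1500 respondents, exactly 50 in each of the 30 combinations
clear
set obs 1500
generate county = mod(_n - 1, 5) + 1
generate urban  = mod(floor((_n - 1) / 5), 2)
generate agecat = mod(floor((_n - 1) / 10), 3) + 1

// One stratum id per county/urban/age combination (1 to 30)
egen stratum = group(county urban agecat)

// Hypothetical population fraction of each stratum (equal here, for illustration)
generate double popfrac = 1 / 30

// Desired sample size per stratum; rounding can make the total differ slightly from 900
generate target_n = round(popfrac * 900)

// A stratum cannot supply more cases than it contains
bysort stratum: generate avail = _N
assert target_n <= avail

// Random order within stratum, then keep the first target_n cases
set seed 12345
generate double u = runiform()
bysort stratum (u): keep if _n <= target_n
```

Sorting on a random number within each stratum avoids the loop over tempfiles. If the assert fails for some stratum, the quota sample simply does not contain enough cases there, and the overall target of 900 would have to be lowered.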