stratified sampling using a count variable

Ataullah Khan

Join Date: Jun 2017

Posts: 41
#1

16 Jan 2020, 14:10

Hi,

I would like to draw a stratified sample of 250 clusters, using a variable that stores how many clusters should enter the main sample from each stratum.
One solution is to keep each stratum in turn and draw from it with the sample command, for example:

Code:

preserve
keep if id == 1 & gender == "boys"
sample 4, count
tempfile sample1
save `sample1'
restore
keep if id == 1 & gender == "girls"
sample 5, count
tempfile sample2
save `sample2'

and so on for each of the 32 unique ids.

However, repeating this process for every stratum is time-consuming. Is there a way to code this sampling from a count variable that varies by stratum?
I am looking for something like this

Code:

sample countvariable,count by(id gender)

Thank you.
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#2

16 Jan 2020, 16:20

Well, if it's always 4 boys and 5 girls for each id, you can do a simple loop:

Code:

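preserve
forvalues i = 1/32 {
    // boys for this id go to an odd-numbered tempfile
    restore, preserve
    keep if id == `i' & gender == "boys"
    sample 4, count
    local b = 2*`i' - 1
    tempfile sample`b'
    save `sample`b''

    // girls for this id go to the following even-numbered tempfile
    restore, preserve
    keep if id == `i' & gender == "girls"
    sample 5, count
    local g = 2*`i'
    tempfile sample`g'
    save `sample`g''
}
restore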

There is also a way to do a little dance with macros to further shorten the code so that it automatically alternates samples of 4 boys and 5 girls, but the code becomes pretty opaque when you do that, so I think the transparency of this code is worth a few extra lines of typing.

Note: if the number of boys and girls to sample changes from one id to another, this approach will not work as is. The complexity of modifying it would depend on the extent to which the number of each sex to sample is a simple function of the id.
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#3

16 Jan 2020, 20:34

Just curious: What is the input of your countSample?

Code:

*gen countSample = 4 * (gender == "boys") + 5 * (gender == "girls")
set seed 16012020
gen x = runiform()
bys id gender (x): drop if _n > countSample
drop x
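
To show the shuffle-and-truncate idea above running end to end, here is a minimal sketch on made-up data. The toy data set (2 ids with 10 boys and 10 girls each) and the fixed targets of 4 and 5 are assumptions for illustration only; in real use countSample would hold whatever the design calls for in each stratum.

Code:

// Hypothetical toy data: 2 ids, 10 boys and 10 girls in each
clear
set seed 16012020
set obs 40
gen byte id = 1 + (_n > 20)
gen str5 gender = cond(mod(_n, 2), "boys", "girls")

// Assumed targets: 4 boys and 5 girls per id
gen byte countSample = 4 * (gender == "boys") + 5 * (gender == "girls")

// Sort randomly within each stratum and keep the first countSample rows
gen double x = runiform()
bysort id gender (x): drop if _n > countSample
drop x

tabulate id gender

Because the random order is drawn once and the truncation happens within each id-gender group, no preserve/restore cycle or tempfile is needed.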

Last edited by Romalpa Akzo; 16 Jan 2020, 20:39.

Ataullah Khan

Join Date: Jun 2017
Posts: 41

16 Jan 2020, 22:44

Here is the number of clusters to sample (the count variable) for each id:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(id boys girls grandtotal)
 1  6  4 10
 2  5  3  8
 3  3  1  4
 4  7  2  9
 5  8  5 13
 6  7  4 11
 7 10  3 13
 8 11  6 17
 9  .  1  1
10  1  .  1
11  .  1  1
12  1  .  1
13  3  1  4
14  4  3  7
15  5  1  6
16  6  3  9
17  1  .  1
18  5  2  7
19  1  1  2
20  1  .  1
21  5  3  8
22  9  3 12
23 14  9 23
24  6  5 11
25 13 10 23
26  6  1  7
27 10  5 15
28 14  6 20
29  1  1  2
30  1  1  2
31  .  1  1
32  1  .  1
end

So, it changes with each id.


Clyde Schechter

Join Date: Apr 2014
Posts: 30063

16 Jan 2020, 22:58

Romalpa Akzo had a much nicer solution than the one I proposed, so I will just modify hers to allow the desired sample sizes to vary across ids. You provided an example data set for the sample sizes, but nothing for the data you are drawing your samples from. So my code starts by creating a demonstration data set that has enough boys and girls in each unit to ensure that the sample sizes can be (more than) met. Just organize your actual data in the same layout as what I create here and then use it instead of my demonstration data.

Code:

//  CREATE A DEMONSTRATION DATA SET
clear
set obs 32
gen int id = _n
expand 50
by id, sort: gen byte sex = (_n > 25)
label define sex    0   "boy"   1   "girl"
label values sex sex
tempfile to_be_sampled
save `to_be_sampled'

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(id boys girls grandtotal)
 1  6  4 10
 2  5  3  8
 3  3  1  4
 4  7  2  9
 5  8  5 13
 6  7  4 11
 7 10  3 13
 8 11  6 17
 9  .  1  1
10  1  .  1
11  .  1  1
12  1  .  1
13  3  1  4
14  4  3  7
15  5  1  6
16  6  3  9
17  1  .  1
18  5  2  7
19  1  1  2
20  1  .  1
21  5  3  8
22  9  3 12
23 14  9 23
24  6  5 11
25 13 10 23
26  6  1  7
27 10  5 15
28 14  6 20
29  1  1  2
30  1  1  2
31  .  1  1
32  1  .  1
end
mvencode boys girls, mv(0)
tempfile wanted_sample_sizes
save `wanted_sample_sizes'

set seed 1234
use `wanted_sample_sizes', clear
rename boys n0
rename girls n1
drop grandtotal
reshape long n, i(id) j(sex)

merge 1:m id sex using `to_be_sampled', assert(match) nogenerate

gen double shuffle = runiform()
by id sex (shuffle), sort: drop if _n > n
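
One thing to watch with real data: drop if _n > n keeps every observation, without complaint, when a stratum holds fewer than n observations. The sketch below flags such strata. The five-row data set is hypothetical and deliberately leaves stratum (id 1, sex 0) one observation short; the variable name shortfall is likewise just an illustration.

Code:

// Hypothetical toy data: (id 1, sex 0) asks for 3 but has only 2 observations
clear
input byte(id sex n)
1 0 3
1 0 3
1 1 1
1 1 1
1 1 1
end

set seed 1234
gen double shuffle = runiform()
by id sex (shuffle), sort: drop if _n > n

// Flag strata that ended up with fewer observations than requested
bysort id sex: gen byte shortfall = (_N < n)
list id sex n shortfall if shortfall

On actual data, a stricter alternative would be to stop with an error instead of listing, for example with bysort id sex: assert _N == n after the drop.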
