Stratified sampling

Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#1


21 Apr 2023, 09:36

Hi everyone,
I'm hoping to obtain a sample of 2400 records from a population of 106176 records that are spread across 485 schools. I am looking to sample 30 candidates from each of 80 schools. So far I have been sampling manually: selecting one school, drawing a sample of 30 candidates, saving the selection, and then repeating the process for the next school, which has been quite time-consuming. Does anyone know of a quicker approach to this type of sampling?

Last edited by Simwinga Simwinga; 21 Apr 2023, 09:40.
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#2

21 Apr 2023, 09:49

Code:

// CREATE DEMONSTRATION DATA SET
clear*
set obs 80
gen school_id = _n
gen school_size = rpoisson(100)
expand school_size
by school_id, sort: gen record_num = _n
drop school_size

// SAMPLE 30 RECORDS FROM EACH SCHOOL
set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle = runiform()
by school_id (shuffle), sort: keep if _n <= 30
drop shuffle
sort school_id record_num
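
As a possible alternative to the shuffle-and-keep idiom, the official -sample- command accepts a -count- option together with -by()-, which draws a fixed number of observations from each group. The following is only a minimal sketch on made-up demonstration data: the variable names school_id and record_num, the school sizes, and the seed are all assumptions chosen for illustration.

```stata
// Demonstration data (assumption): 80 schools, each with at least 40 records
clear *
set seed 1234                       // arbitrary seed, for reproducibility
set obs 80
gen school_id = _n
gen school_size = 40 + rpoisson(100)
expand school_size
bysort school_id: gen record_num = _n
drop school_size

// Draw exactly 30 records at random within each school
sample 30, count by(school_id)

// Verify that every school contributes 30 records
bysort school_id: assert _N == 30
```

Either approach should give 30 records per school here. The shuffle-and-keep version keeps the random numbers visible if you want to inspect them before dropping observations.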

In the future, when asking for help with code, please use the -dataex- command and show example data. Although sometimes, as here, it is possible to give an answer that has a reasonable probability of being correct, this is usually not the case. Moreover, such answers are necessarily based on experience-based guesses or intuitions about the nature of your data. When those guesses are wrong, both you and the person trying to help you have wasted your time, as you end up with useless code. To avoid this, a -dataex- based example provides all of the information needed to develop and test a solution.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Simwinga Simwinga

Join Date: Apr 2022
Posts: 36

21 Apr 2023, 19:24

Thank you for your response, Clyde. To provide more context, I have a population of 485 schools with different numbers of students. I aim to obtain a sample of exactly 2,400 students, where each of the 80 selected schools contributes exactly 30 students. Therefore, I need to limit my sample to 80 schools only. The code you provided correctly samples 30 students from each school, but it does not ensure that the sample size is exactly 2,400 students or that only 80 schools are selected for the sample.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long StudentID int School_Code byte Paper1
1510320097 1220  5
1510510007 5516 24
1510540059 1288  4
1510570051 1009  2
1510720006 1170 10
1510770085 1264 22
1510770096 1201 10
1510800114 1642  6
1510820033 1010  4
1510870008 1011  1
1510930042 1061  1
1511010040 1262 21
1511070028 2219  9
1511150005 1259  2
1511290011 1267 18
1511300013 2211  3
1511690009 1026 19
1512020083 7006 10
1512130017 1225  2
1512250008 1011  2
1512380022 1254  6
1512410033 1195 10
1512480093 9392  4
1512540072 1016  0
1512560026 1259  0
1512750026 1259  1
1512780085 1593  4
1512850049 1225 10
1512870057 1592  7
1512900021 1269 18
1513050115 9013  2
1513080068 1201  1
1513200037 1225  3
1513200043 1221  7
1513220038 1011  2
1513230064 1012  2
1513260021 1046  7
1513350007 5259 14
1513390044 1231  6
1513430011 1231 12
1513500038 1219  0
1513630107 1201  1
1513700002 7452  1
1513750017 1259 11
1513750019 1009 24
1513750044 5323  6
1513940002 1170 16
1513990034 1254  1
1514060005 9137 19
1514080013 1061  1
end


Joseph Coveney

Join Date: Apr 2014
Posts: 4423

21 Apr 2023, 21:20

Run the following do-file, and follow along with what it does after the "Begin here" comment. (The code above that comment is just to create a dataset that mimics the pertinent features of yours.)

Code:

version 17.0

clear *

// seedem
set seed 539611586

// "a population of 106176 records that are spread across 485 schools" School Code
quietly set obs 485
generate int sid = 1000 + _n

generate int cou = runiformint(30, floor(2 * 106176 / 485 - 30))
sort cou
summarize cou, meanonly
if r(sum) < 0 quietly replace cou = cou + r(sum) in l
else quietly replace cou = cou + r(sum) in 1

// Students, Student ID
quietly expand cou
drop cou

generate long pid = 1e6 + _n

// Paper 1
generate byte sco = runiformint(0, 24)

*
* Begin here
*
// "limit my sample to 80 schools"
frame put sid, into(Schools)
frame Schools {
    contract sid
    quietly keep in 1/80
}
quietly frlink m:1 sid, frame(Schools)
quietly keep if !missing(Schools)

// "each of the 80 selected schools contributes exactly 30 students"
generate double randu = runiform()
isid sid randu, sort
quietly by sid: keep if inrange(_n, 1, 30)

// "ensure that the sample size is exactly 2,400 students "
count
assert r(N) == 2400

exit

Notes:
1. For brevity I use shorter variable names than you do in your example, but the code you need to execute is otherwise the same.

2. Be sure to set the seed to something so that you can reproduce your sample selection. Set it only once, at the top of your do-file, as I have done in the example above.

3. Be sure to use double precision in your random number in order to help avoid duplicates. (On the unlikely chance that you do have a duplicate, then generate two randu variables and sort on them both—see example below.)

4. Be sure to verify that you don't have duplicates by using the isid School_Code randu, sort command. Again, if it fails (unlikely, but not impossible), then generate a second random variable and sort on both:

Code:

generate double randu2 = runiform()
isid School_Code randu randu2, sort
by School_Code: keep if inrange(_n, 1, 30)


Joseph Coveney

Join Date: Apr 2014

Posts: 4423
#5

21 Apr 2023, 21:37

It's inconsequential in the example, but a couple of lines of code among those that created the dataset for illustration should have been more like the following.

Code:

generate int cou = runiformint(30, floor(2 * 106176 / 485 - 30))
sort cou sid // for reproducibility
summarize cou, meanonly
if r(sum) < 106176 quietly replace cou = cou + (106176 - r(sum)) in 1
else quietly replace cou = cou + (106176 - r(sum)) in l

Sorry about that.

Joseph Coveney

Join Date: Apr 2014
Posts: 4423

21 Apr 2023, 23:00

Argh, in my haste, I neglected to sample the schools randomly, which does have consequences.

One last time:

Code:

version 17.0

clear *

// seedem
set seed 539611586

// "a population of 106176 records that are spread across 485 schools" School Code
quietly set obs 485
generate int sid = 1000 + _n

generate int cou = runiformint(30, floor(2 * 106176 / 485 - 30))
sort cou sid
summarize cou, meanonly
if r(sum) < 106176 quietly replace cou = cou + (106176 - r(sum)) in 1
else quietly replace cou = cou + (106176 - r(sum)) in l

// Students, Student ID
quietly expand cou
drop cou

generate long pid = 1e6 + _n

// Paper 1
generate byte sco = runiformint(0, 24)

*
* Begin here
*
// "limit my sample to 80 schools"
frame put sid, into(Schools)
frame Schools {
    contract sid
    generate double randu = runiform()
    isid randu, sort
    quietly keep in 1/80
}
quietly frlink m:1 sid, frame(Schools)
quietly keep if !missing(Schools)

// "each of the 80 selected schools contributes exactly 30 students"
generate double randu = runiform()
isid sid randu, sort
quietly by sid: keep if inrange(_n, 1, 30)

// "ensure that the sample size is exactly 2,400 students "
count
assert r(N) == 2400

exit
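
For readers on a Stata version without frames (frames require Stata 16 or later), the same two-stage idea can be sketched with -by- and -egen- alone. This is an adaptation, not the code above: it uses the variable names from the -dataex- example (School_Code, StudentID), while the mock population, the seed, and the helper variables u_school, school_rank and u_student are assumptions introduced only for illustration. It also assumes every school has at least 30 students; a smaller school would make the final checks fail.

```stata
clear *
set seed 20230421                   // arbitrary seed, for reproducibility

// Mock population (assumption): 485 schools with 100 to 300 students each
set obs 485
generate int School_Code = 1000 + _n
generate int n_students = runiformint(100, 300)
expand n_students
drop n_students
generate long StudentID = 1510000000 + _n

// Stage 1: one random number per school, copied to all of its students
bysort School_Code: generate double u_school = runiform() if _n == 1
bysort School_Code (u_school): replace u_school = u_school[1]

// Rank schools by their random number and keep the first 80
egen long school_rank = group(u_school School_Code)
keep if school_rank <= 80

// Stage 2: shuffle students within each school and keep the first 30
generate double u_student = runiform()
isid School_Code u_student, sort
by School_Code: keep if _n <= 30

// Checks: 80 schools x 30 students = 2400 records
assert _N == 2400
by School_Code: assert _N == 30
```

On real data, the mock-population section would be replaced by loading your own dataset, and it may be worth tabulating school sizes first to see whether any school falls below 30 students.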


Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#7

22 Apr 2023, 20:23

Thanks so much, everyone. I have managed to sample as desired.