Sampling

Matthew Hosseini

Join Date: Jan 2016

Posts: 8
#1

Sampling

17 Mar 2016, 16:05

Hi Forum

I have searched the forum for a possible answer to the following question but to no avail.

If I have a dataset with some rare events, as is often the case when dealing with mortgage defaults (say 2 million observations and only 1K rare events), how best can I take random samples of size 5K while ensuring the rare events make up about 25% of each sample?

Matthew
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

17 Mar 2016, 16:10

I don't understand. You want samples of size 5K, with 25% of each being rare events: that's 1,250 rare events per sample. But you say you only have 1K of these events in the entire data set. How is that supposed to work?

Additional question: assuming we get the sample sizes issue resolved, do you want sampling with replacement or without?
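
The arithmetic above can be checked before drawing anything. A minimal sketch, assuming a 0/1 indicator named DEF (1 marks the rare event) and simulated data so that it runs on its own; the local macro names are made up for illustration:

```stata
* Assumption: DEF is a 0/1 indicator, 1 = rare event
* Toy data so the sketch is self-contained
clear
set seed 2016
set obs 200000
gen byte DEF = runiform() < .005

* Count what is actually available in each group
quietly count if DEF == 1
local n_rare = r(N)
quietly count if DEF == 0
local n_common = r(N)

* Target design: total sample size and share of rare events
local n_total = 5000
local share = 0.25
local need_rare = round(`share' * `n_total')

* Largest total for which both groups can be filled without replacement
local max_total = min(floor(`n_rare' / `share'), floor(`n_common' / (1 - `share')))

display "rare events available: `n_rare', needed: `need_rare'"
display "largest feasible total without replacement: `max_total'"
```

With only 1K rare events and a 25% share, the largest total that can be drawn without replacement is 4,000; anything larger needs sampling with replacement or a smaller share.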
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

17 Mar 2016, 16:23

This seems to me an ideal situation for a "case-control" approach. Take all the households (HH) with the rare event and a sample of the others. If your aim is to study the influence of some predictors on the event, then it would be worthwhile to match non-events to events.
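
A rough, unmatched version of this idea can be sketched with the built-in sample command. The assumptions here are a 0/1 indicator named DEF, simulated data, and a ratio of 3 non-events per event chosen only for illustration. It relies on the documented behaviour that observations not meeting the if condition are all kept. Matching is not shown.

```stata
* Toy data; DEF = 1 marks the rare event (assumed variable name)
clear
set seed 2016
set obs 200000
gen byte DEF = runiform() < .005
gen long id = _n

* Keep every event; draw 3 non-events per event (ratio is an assumption)
quietly count if DEF == 1
local n_controls = 3 * r(N)

set seed 12345
* Observations failing the if condition (the events) are all retained
sample `n_controls' if DEF == 0, count

tab DEF
```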

Steve Samuels
Statistical Consulting

Stata 14.2
Matthew Hosseini

Join Date: Jan 2016

Posts: 8
#4

17 Mar 2016, 16:57

Clyde Schechter: my question is specific to oversampling. I know how to do this in SAS but not in Stata, as I'm new!
Steve Samuels: I will dig in and investigate further. Thanks.
Matthew Hosseini

Join Date: Jan 2016

Posts: 8
#5

19 Mar 2016, 08:26

For anyone who is interested in a resolution to this oversampling problem, see below for details.

Objective: from a population of 1 million in which only 1% (or even less) are rare events, draw a sample made up of 25% rare events and 75% non-rare events.

1. Take a sample of rare events only (here referred to as DEF=1), sized to make up 25% of the final sample:

*use your original dataset
*create a dataset with DEF=1 ONLY
drop if DEF==0
*sample 750 counts of DEF=1 cases
sort DEF
by DEF: count
set seed 12345
by DEF: sample 750, count
save "/Users/mhosseini/Documents/Sample_DEF1.dta", replace

2. Take a sample of non-rare events only (DEF=0), sized to make up the remaining 75%:

*use your original dataset again
*create a dataset with DEF=0 ONLY
drop if DEF==1
*sample 2250 counts of DEF=0 cases
sort DEF
by DEF: count
set seed 12345
by DEF: sample 2250, count
save "/Users/mhosseini/Documents/Sample_DEF0.dta", replace

*now append the two cases
append using "/Users/mhosseini/Documents/Sample_DEF1.dta"

3. Check using tab DEF

Robert Picard

Join Date: Mar 2014
Posts: 1536
#6

19 Mar 2016, 09:30

Note that with randomtag (from SSC) you can avoid all the file gymnastics. In addition, randomtag is much faster than sample. To install, type in Stata's Command window

Code:

ssc install randomtag

randomtag creates an indicator variable that tags observations that are selected. To perform your sampling, all you would need is

Code:

use "main.dta", clear
randomtag if DEF, count(750) gen(rare)
randomtag if !DEF, count(2250) gen(common)
keep if rare | common
save "mysample.dta", replace

Given the same seed, randomtag is guaranteed to pick the same observations as sample would. Here's a replay of the example in #5 to confirm that both techniques choose exactly the same observations:

Code:

clear
set seed 412354
set obs 1000000
gen x = runiform()
gen DEF = x < .01
gen id = _n
tab DEF
save "main.dta", replace

timer clear
timer on 1

* 1. take a 25% sample of rare events (here referred to as DEF=1) only, 
use "main.dta", clear
drop if DEF==0
*sample 750 counts of DEF=1 cases
sort DEF
by DEF: count
set seed 12345
by DEF: sample 750, count
save "Sample_DEF1.dta", replace

* 2. take a 75% sample of non rare events only, 
use "main.dta", clear
drop if DEF==1
*sample 2250 counts of DEF=0 cases
sort DEF
by DEF: count
set seed 12345
by DEF: sample 2250, count
save "Sample_DEF0.dta", replace

*now append the two cases 
append using Sample_DEF1.dta
sort id

* 3. Check using tab DEF

tab DEF
save "Sample_DEF.dta", replace

timer off 1

timer on 2
use "main.dta", clear
set seed 12345
randomtag if DEF, count(750) gen(rare)
set seed 12345
randomtag if !DEF, count(2250) gen(common)
keep if rare | common
save "Sample_3000.dta", replace
timer off 2


timer list

cf x DEF id using  "Sample_DEF.dta", all


Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#7

19 Mar 2016, 10:34

I would note that Matthew has "moved the goalposts." (Nothing wrong with that in this context; if something you planned to do is impossible, change the plan!) The sample he creates has a total size of 3,000, not the 5,000 he requested in #1.
Matthew Hosseini

Join Date: Jan 2016

Posts: 8
#8

27 Mar 2016, 16:04

Just saw responses to my earlier posting.
Clyde Schechter: the percentage remains intact and is the point here, whether it's 3k, 5k, or whatever. If you're missing the point, you're bound to miss the goal post regardless!
@Robert Picard: thanks for the tip.
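
To make the "fixed percentage, any total" idea concrete, here is a sketch that takes the total and the rare-event share as parameters and uses only built-in commands. It assumes a 0/1 indicator named DEF and runs on simulated data; the helper variables u and pick and the local names are hypothetical. It does not reproduce the exact draws of sample or randomtag.

```stata
* Toy population: about 1% rare events (DEF = 1)
clear
set seed 412354
set obs 1000000
gen byte DEF = runiform() < .01

* Design parameters: change n_total and the group counts follow
local n_total = 5000
local share = 0.25
local n_rare = round(`share' * `n_total')
local n_common = `n_total' - `n_rare'

* Random order within each group, then keep the first n of each
set seed 12345
gen double u = runiform()
bysort DEF (u): gen byte pick = (DEF == 1 & _n <= `n_rare') | (DEF == 0 & _n <= `n_common')
keep if pick
drop u pick

tab DEF
```

As long as each group holds at least the requested number of observations, tab DEF should report `n_common' non-events and `n_rare' events, whatever total is chosen.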