  • Sampling

    Hi Forum

    I have searched the forum for a possible answer to the following question but to no avail.

    Suppose I have a dataset with rare events, as is often the case with mortgage defaults: say 2MLN observations, of which only 1K are the rare event. How best can I draw random samples of size 5K in which the rare events make up about 25% of each sample?


    Matthew

  • #2
    I don't understand. You want 5K samples, with 25% of them being the rare events: so that's 1,250 rare events in the sample. But you say you only have 1K of these events in the entire data set. How is that supposed to work?

    Additional question: assuming we get the sample sizes issue resolved, do you want sampling with replacement or without?
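
    The arithmetic behind this objection can be sketched in a few lines. This is only an illustration using the figures from #1; the local macro names are made up for the example.

    ```stata
    * Figures taken from the question in #1
    local n_sample   = 5000    // requested sample size
    local share_rare = 0.25    // requested share of rare events
    local n_rare_pop = 1000    // rare events available in the full data

    * Rare events needed in each sample
    display "rare events needed:    " `share_rare' * `n_sample'
    display "rare events available: " `n_rare_pop'

    * Largest sample that keeps the 25% share without reusing a rare event
    display "largest feasible size: " `n_rare_pop' / `share_rare'
    ```

    Without replacement, the 25% share is only attainable for samples of up to 4,000 observations; anything larger requires drawing some rare events more than once.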



    • #3
      This seems to me an ideal situation for a "case-control" approach. Take all the HH with the rare event and a sample of others. If your study question is to study the influence of some predictors on the event, then it would be worthwhile to match non-events to events.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2
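
      A minimal sketch of the unmatched version of this idea, not necessarily what was intended in #3. It assumes a file called "main.dta" (a hypothetical name), an indicator DEF coded 0/1 with no missing values, and an illustrative ratio of 3 controls per case; matching on predictors is not shown.

      ```stata
      * Hypothetical file name; DEF assumed coded 0/1 with no missing values
      use "main.dta", clear
      set seed 12345

      * Keep every event (case) and draw 3 non-events (controls) per case
      count if DEF == 1
      local n_controls = 3 * r(N)

      * sample keeps all observations that fail the if condition,
      * so every DEF==1 row survives and only DEF==0 rows are sampled
      sample `n_controls' if DEF == 0, count

      tab DEF
      ```

      With 3 controls per case, the events make up 25% of the resulting data, which is the share requested in #1.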



      • #4
        Clyde Schechter my question is specific to oversampling. I know how to do this in SAS but not in Stata, as I'm new!
        Steve Samuels will dig in and investigate further. Thanks.



        • #5
          For anyone who is interested in a resolution to this oversampling problem, see below for details.

          Objective: from a population of 1MLN with only 1% (or even less) rare events, you require a sample comprising 25% rare events and 75% non-rare events.

          1. Sample the rare events (here referred to as DEF=1) only, so that they make up 25% of the final sample:

          *use your original dataset
          *create a dataset with DEF=1 ONLY
          drop if DEF==0
          *sample 750 counts of DEF=1 cases
          sort DEF
          by DEF: count
          set seed 12345
          by DEF: sample 750, count
          save "/Users/mhosseini/Documents/Sample_DEF1.dta", replace

          2. Sample the non-rare events only, so that they make up 75% of the final sample:

          *use your original dataset again
          *create a dataset with DEF=0 ONLY
          drop if DEF==1
          *sample 2250 counts of DEF=0 cases
          sort DEF
          by DEF: count
          set seed 12345
          by DEF: sample 2250, count
          save "/Users/mhosseini/Documents/Sample_DEF0.dta", replace

          *now append the two cases
          append using "/Users/mhosseini/Documents/Sample_DEF1.dta"

          3. Check using tab DEF
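
          The check in step 3 can be made explicit. A minimal sketch, assuming the appended data from steps 1 and 2 are still in memory and that DEF is coded 0/1 with no missing values:

          ```stata
          * Tabulate the outcome: expect 750 rows with DEF==1 and 2250 with DEF==0
          tab DEF

          * Stop with an error if the counts are not what was requested
          count if DEF == 1
          assert r(N) == 750
          count if DEF == 0
          assert r(N) == 2250
          ```

          The assert lines stop a do-file as soon as the sample does not have the intended 25/75 split, which is easier to catch than a table read by eye.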



          • #6
            Note that with randomtag (from SSC) you can avoid all the file gymnastics. In addition, randomtag is much faster than sample. To install, type in Stata's Command window

            Code:
            ssc install randomtag
            randomtag creates an indicator variable that tags observations that are selected. To perform your sampling, all you would need is:

            Code:
            use "main.dta", clear
            randomtag if DEF, count(750) gen(rare)
            randomtag if !DEF, count(2250) gen(common)
            keep if rare | common
            save "mysample.dta", replace
            randomtag is guaranteed to pick the same observations as sample would when the same seed is set. Here's a replay of the example in #5 to confirm that both techniques choose exactly the same observations:

            Code:
            clear
            set seed 412354
            set obs 1000000
            gen x = runiform()
            gen DEF = x < .01
            gen id = _n
            tab DEF
            save "main.dta", replace
            
            timer clear
            timer on 1
            
            * 1. take a 25% sample of rare events (here referred to as DEF=1) only, 
            use "main.dta", clear
            drop if DEF==0
            *sample 750 counts of DEF=1 cases
            sort DEF
            by DEF: count
            set seed 12345
            by DEF: sample 750, count
            save "Sample_DEF1.dta", replace
            
            * 2. take a 75% sample of non rare events only, 
            use "main.dta", clear
            drop if DEF==1
            *sample 2250 counts of DEF=0 cases
            sort DEF
            by DEF: count
            set seed 12345
            by DEF: sample 2250, count
            save "Sample_DEF0.dta", replace
            
            *now append the two cases 
            append using Sample_DEF1.dta
            sort id
            
            * 3. Check using tab DEF
            
            tab DEF
            save "Sample_DEF.dta", replace
            
            timer off 1
            
            timer on 2
            use "main.dta", clear
            set seed 12345
            randomtag if DEF, count(750) gen(rare)
            set seed 12345
            randomtag if !DEF, count(2250) gen(common)
            keep if rare | common
            save "Sample_3000.dta", replace
            timer off 2
            
            
            timer list
            
            cf x DEF id using  "Sample_DEF.dta", all



            • #7
              I would note that Matthew has "moved the goalposts." (Nothing wrong with that in this context; if something you planned to do is impossible, change the plan!) The sample he creates has a total size of 3,000, not the 5,000 he requested in #1.
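
              To make the dependence on the total size explicit, the counts can be derived from the total and the target share. This is a sketch only: it reuses the randomtag syntax shown in #6, "main.dta" is the example file from #6, and the local macro names are made up for the illustration.

              ```stata
              * Illustrative values; change total to 3000, 5000, and so on
              local total = 3000
              local share = 0.25
              local n_rare   = round(`share' * `total')
              local n_common = `total' - `n_rare'

              use "main.dta", clear

              * Stop early if the data hold fewer rare events than requested
              count if DEF == 1
              assert r(N) >= `n_rare'

              set seed 12345
              randomtag if DEF == 1, count(`n_rare') gen(rare)
              randomtag if DEF == 0, count(`n_common') gen(common)
              keep if rare | common
              tab DEF
              ```

              With only 1K rare events, as in #1, the assert fails for any total above 4,000, which is the constraint raised in #2.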



              • #8
                Just saw responses to my earlier posting.
                Clyde Schechter: the percentage remains intact and is the point here, whether it's 3k, 5k, or whatever. If you're missing the point, you're bound to miss the goal post regardless!
                @Robert Picard: thanks for the tip.
