
  • stratified sampling using a count variable

    Hi,

    I would like to draw a stratified sample of 250 clusters, based on a variable that stores how many clusters each stratum should contribute to the main sample.
    One solution is to keep one stratum at a time and draw from it with the sample command, such as

    Code:
    preserve
    keep if id == 1 & gender == "boys"
    sample 4, count
    tempfile sample1
    save `sample1'
    restore
    
    preserve
    keep if id == 1 & gender == "girls"
    sample 5, count
    tempfile sample2
    save `sample2'
    restore
    * ... and so on for the rest of the 32 unique ids

    However, repeating this process for every stratum is time-consuming. Is there a way to code this sampling based on a count variable for each stratum?
    I am looking for something like this:

    Code:
    sample countvariable,count by(id gender)
    Thank you.

  • #2
    Well, if it's always 4 boys and 5 girls for each id, you can do a simple loop:

    Code:
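    preserve
    
    forvalues i = 1/32 {
        * boys for this id go into tempfile number 2*i - 1
        restore, preserve
        keep if id == `i' & gender == "boys"
        sample 4, count
        local k = 2*`i' - 1
        tempfile sample`k'
        save `sample`k''
        * girls for this id go into tempfile number 2*i
        restore, preserve
        keep if id == `i' & gender == "girls"
        sample 5, count
        local k = 2*`i'
        tempfile sample`k'
        save `sample`k''
    }
    restore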
    There is also a way to do a little dance with macros to further shorten the code so that it automatically alternates samples of 4 boys and 5 girls, but the code becomes pretty opaque when you do that, so I think the transparency of this code is worth a few extra lines of typing.

    Note: if the number of boys and girls to sample changes from one id to another, this approach will not work as is. The complexity of modifying it would depend on the extent to which the number of each sex to sample is a simple function of the id.
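
    For the fixed-size case, a shorter route may be the by() option of -sample-, combined with an -if- qualifier. The sketch below assumes a hypothetical demonstration data set (32 ids with 25 boys and 25 girls each) and the same variable names as above. It relies on -sample- keeping every observation that fails the -if- condition, so the two calls do not interfere with each other.

    ```stata
    * Hypothetical demo data: 32 ids with 25 boys and 25 girls each (an assumption)
    clear
    set seed 2020
    set obs 32
    gen int id = _n
    expand 50
    bysort id: gen gender = cond(_n <= 25, "boys", "girls")

    * Keep 4 boys per id; rows failing the -if- condition (the girls) are all kept
    sample 4 if gender == "boys", count by(id)
    * Keep 5 girls per id; the boys already sampled are left untouched
    sample 5 if gender == "girls", count by(id)
    ```

    This leaves the whole sample in memory at once, so no tempfiles need to be appended afterwards. Like the loop, it only handles sample sizes that are the same for every id.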



    • #3
      Just curious: What is the input of your countSample?
      Code:
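      * If countSample does not exist yet, it could be built like this:
      *gen countSample = 4 * (gender == "boys") + 5 * (gender == "girls")
      
      set seed 16012020
      gen x = runiform()
      
      * Sort randomly within each id-gender stratum, keep the first countSample rows
      bys id gender (x): drop if _n > countSample
      drop x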
      Last edited by Romalpa Akzo; 16 Jan 2020, 20:39.



      • #4
        Here is the number to sample (the count variable) for each id:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(id boys girls grandtotal)
         1  6  4 10
         2  5  3  8
         3  3  1  4
         4  7  2  9
         5  8  5 13
         6  7  4 11
         7 10  3 13
         8 11  6 17
         9  .  1  1
        10  1  .  1
        11  .  1  1
        12  1  .  1
        13  3  1  4
        14  4  3  7
        15  5  1  6
        16  6  3  9
        17  1  .  1
        18  5  2  7
        19  1  1  2
        20  1  .  1
        21  5  3  8
        22  9  3 12
        23 14  9 23
        24  6  5 11
        25 13 10 23
        26  6  1  7
        27 10  5 15
        28 14  6 20
        29  1  1  2
        30  1  1  2
        31  .  1  1
        32  1  .  1
        end
        So, it changes with each id.



        • #5
          Romalpa Akzo had a much nicer solution than the one I proposed. So I will just modify hers to allow for the variation in sample sizes desired across id's. You provided an example data set for the sample sizes, but nothing for the data you are drawing your samples from. So my code starts by creating a demonstration data set that has enough boys and girls in each unit to assure that the sample sizes can be (more than) met. Just organize your actual data in the same layout as what I create here and then use it instead of my demonstration data.

          Code:
          //  CREATE A DEMONSTRATION DATA SET
          clear
          set obs 32
          gen int id = _n
          expand 50
          by id, sort: gen byte sex = (_n > 25)
          label define sex    0   "boy"   1   "girl"
          label values sex sex
          tempfile to_be_sampled
          save `to_be_sampled'
          
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input byte(id boys girls grandtotal)
           1  6  4 10
           2  5  3  8
           3  3  1  4
           4  7  2  9
           5  8  5 13
           6  7  4 11
           7 10  3 13
           8 11  6 17
           9  .  1  1
          10  1  .  1
          11  .  1  1
          12  1  .  1
          13  3  1  4
          14  4  3  7
          15  5  1  6
          16  6  3  9
          17  1  .  1
          18  5  2  7
          19  1  1  2
          20  1  .  1
          21  5  3  8
          22  9  3 12
          23 14  9 23
          24  6  5 11
          25 13 10 23
          26  6  1  7
          27 10  5 15
          28 14  6 20
          29  1  1  2
          30  1  1  2
          31  .  1  1
          32  1  .  1
          end
          mvencode boys girls, mv(0)
          tempfile wanted_sample_sizes
          save `wanted_sample_sizes'
          
          set seed 1234
          use `wanted_sample_sizes', clear
          rename boys n0
          rename girls n1
          drop grandtotal
          reshape long n, i(id) j(sex)
          
          merge 1:m id sex using `to_be_sampled', assert(match) nogenerate
          
          gen double shuffle = runiform()
          by id sex (shuffle), sort: drop if _n > n
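
          One thing the shuffle-and-drop step does not do is complain when a stratum has fewer observations than requested; it would just return everything available. A minimal sketch of two checks that might be added around that step, run here on a hypothetical miniature data set (3 ids, 10 boys and 10 girls each, with made-up requested sizes in n):

          ```stata
          * Hypothetical miniature data: 3 ids, 10 boys (sex 0) and 10 girls (sex 1) each
          clear
          set seed 1234
          set obs 3
          gen int id = _n
          expand 20
          bysort id: gen byte sex = (_n > 10)
          * Hypothetical requested sizes: boys id + 1, girls 3 - id (0 means sample none)
          gen byte n = cond(sex == 0, id + 1, 3 - id)

          * Before sampling: every stratum must hold at least as many rows as requested
          bysort id sex: assert _N >= n

          gen double shuffle = runiform()
          bysort id sex (shuffle): drop if _n > n

          * After sampling: every surviving stratum holds exactly the requested number
          bysort id sex: assert _N == n
          * Total kept is (2 + 3 + 4) boys plus (2 + 1 + 0) girls
          assert _N == 12
          ```

          Strata with a requested size of 0 disappear entirely after the drop, so the second check only covers the strata that remain.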
