  • Stratified sampling

    Hi everyone,
    I need to draw a sample of 2,400 records from a population of 106,176 records spread across 485 schools: 30 candidates from each of 80 schools. So far I have been sampling manually: I select one school, draw a sample of 30 candidates, save the selection, and then move on to the next school, which has been quite time-consuming. Does anyone know of a quicker approach to this type of sampling?
    Last edited by Simwinga Simwinga; 21 Apr 2023, 09:40.

  • #2
    Code:
    //  CREATE DEMONSTRATION DATA SET
    clear*
    set obs 80
    gen school_id = _n
    gen school_size = rpoisson(100)
    expand school_size
    by school_id, sort: gen record_num = _n
    drop school_size
    
    //  SAMPLE 30 RECORDS FROM EACH SCHOOL
    set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
    gen double shuffle = runiform()
    by school_id (shuffle), sort: keep if _n <= 30
    drop shuffle
    sort school_id record_num
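
    As a possible alternative to the shuffle-and-keep approach above, Stata's official -sample- command accepts the count and by() options. The sketch below is only an illustration on made-up data (80 schools of 100 records each is an assumption, not the poster's data):

    ```stata
    // Toy data: 80 schools with 100 records each (an assumption for illustration)
    clear
    set seed 1234
    set obs 80
    generate school_id = _n
    expand 100

    // Keep a random 30 observations within each school
    sample 30, count by(school_id)

    // Verify the result
    bysort school_id: assert _N == 30
    assert _N == 2400
    ```

    Either approach should give the same kind of sample; the shuffle version makes the random ordering explicit, which can be easier to audit.
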
    In the future, when asking for help with code, please use the -dataex- command and show example data. Although sometimes, as here, it is possible to give an answer that has a reasonable probability of being correct, this is usually not the case. Moreover, such answers are necessarily based on experience-based guesses or intuitions about the nature of your data. When those guesses are wrong, both you and the person trying to help you have wasted your time, as you end up with useless code. To avoid this, a -dataex- based example provides all of the information needed to develop and test a solution.

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
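
    For readers who have not used it, a minimal sketch of a -dataex- call might look like the following. It uses the auto dataset that ships with Stata purely as a stand-in; substitute your own data and the variables relevant to your question.

    ```stata
    // Load a dataset that ships with Stata (stand-in for your own data)
    sysuse auto, clear

    // Print a copy-and-paste example of three variables, first 20 observations
    dataex make price mpg in 1/20
    ```

    The output is a block of -input- code that can be pasted directly into a forum post.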


    • #3
      Thank you for your response, Clyde. To provide more context, I have a population of 485 schools with different numbers of students. I aim to obtain a sample of exactly 2,400 students, where each of the 80 selected schools contributes exactly 30 students. Therefore, I need to limit my sample to 80 schools only. The code you provided correctly samples 30 students from each school, but it does not ensure that the sample size is exactly 2,400 students or that only 80 schools are selected for the sample.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long StudentID int School_Code byte Paper1
      1510320097 1220  5
      1510510007 5516 24
      1510540059 1288  4
      1510570051 1009  2
      1510720006 1170 10
      1510770085 1264 22
      1510770096 1201 10
      1510800114 1642  6
      1510820033 1010  4
      1510870008 1011  1
      1510930042 1061  1
      1511010040 1262 21
      1511070028 2219  9
      1511150005 1259  2
      1511290011 1267 18
      1511300013 2211  3
      1511690009 1026 19
      1512020083 7006 10
      1512130017 1225  2
      1512250008 1011  2
      1512380022 1254  6
      1512410033 1195 10
      1512480093 9392  4
      1512540072 1016  0
      1512560026 1259  0
      1512750026 1259  1
      1512780085 1593  4
      1512850049 1225 10
      1512870057 1592  7
      1512900021 1269 18
      1513050115 9013  2
      1513080068 1201  1
      1513200037 1225  3
      1513200043 1221  7
      1513220038 1011  2
      1513230064 1012  2
      1513260021 1046  7
      1513350007 5259 14
      1513390044 1231  6
      1513430011 1231 12
      1513500038 1219  0
      1513630107 1201  1
      1513700002 7452  1
      1513750017 1259 11
      1513750019 1009 24
      1513750044 5323  6
      1513940002 1170 16
      1513990034 1254  1
      1514060005 9137 19
      1514080013 1061  1
      end


      • #4
        Run the following do-file, and follow along with what it does after the "Begin here" comment. (The code above that comment is just to create a dataset that mimics the pertinent features of yours.)
        Code:
        version 17.0
        
        clear *
        
        // seedem
        set seed 539611586
        
        // "a population of 106176 records that are spread across 485 schools" School Code
        quietly set obs 485
        generate int sid = 1000 + _n
        
        generate int cou = runiformint(30, floor(2 * 106176 / 485 - 30))
        sort cou
        summarize cou, meanonly
        if r(sum) < 0 quietly replace cou = cou + r(sum) in l
        else quietly replace cou = cou + r(sum) in 1
        
        // Students, Student ID
        quietly expand cou
        drop cou
        
        generate long pid = 1e6 + _n
        
        // Paper 1
        generate byte sco = runiformint(0, 24)
        
        *
        * Begin here
        *
        // "limit my sample to 80 schools"
        frame put sid, into(Schools)
        frame Schools {
            contract sid
            quietly keep in 1/80
        }
        quietly frlink m:1 sid, frame(Schools)
        quietly keep if !missing(Schools)
        
        // "each of the 80 selected schools contributes exactly 30 students"
        generate double randu = runiform()
        isid sid randu, sort
        quietly by sid: keep if inrange(_n, 1, 30)
        
        // "ensure that the sample size is exactly 2,400 students "
        count
        assert r(N) == 2400
        
        exit
        Notes:
        1. For brevity I use shorter variable names than what you do in your example, but the code you need to execute is otherwise the same.

        2. Be sure to set the seed to something so that you can reproduce your sample selection. Set it only once, at the top of your do-file, as I have done in the example above.

        3. Be sure to generate the random number in double precision in order to help avoid duplicates. (In the unlikely event that you do have a duplicate, generate a second random variable and sort on both; see the example below.)

        4. Be sure to verify that you don't have duplicates with the isid School_Code randu, sort command. Again, if it fails (unlikely, but not impossible), then
        Code:
        generate double randu2 = runiform()
        isid School_Code randu randu2, sort
        by School_Code: keep if inrange(_n, 1, 30)
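
        To illustrate why note 2 matters, here is a small sketch showing that the same seed reproduces the same random numbers. The seed is set twice here only to demonstrate the point; in a real do-file it would be set once at the top. The variable names u1 and u2 are made up for the illustration.

        ```stata
        clear
        set obs 5

        // First draw
        set seed 539611586
        generate double u1 = runiform()

        // Resetting the same seed restarts the same sequence
        set seed 539611586
        generate double u2 = runiform()

        // The two draws are identical
        assert u1 == u2
        ```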


        • #5
          It's inconsequential in the example, but a couple of lines of code among those that created the dataset for illustration should have been more like the following.
          Code:
          generate int cou = runiformint(30, floor(2 * 106176 / 485 - 30))
          sort cou sid // for reproducibility
          summarize cou, meanonly
          if r(sum) < 106176 quietly replace cou = cou + (106176 - r(sum)) in 1
          else quietly replace cou = cou + (106176 - r(sum)) in l
          Sorry about that.


          • #6
            Argh, in my haste, I neglected to sample the schools randomly, which does have consequences.

            One last time:
            Code:
            version 17.0
            
            clear *
            
            // seedem
            set seed 539611586
            
            // "a population of 106176 records that are spread across 485 schools" School Code
            quietly set obs 485
            generate int sid = 1000 + _n
            
            generate int cou = runiformint(30, floor(2 * 106176 / 485 - 30))
            sort cou sid
            summarize cou, meanonly
            if r(sum) < 106176 quietly replace cou = cou + (106176 - r(sum)) in 1
            else quietly replace cou = cou + (106176 - r(sum)) in l
            
            // Students, Student ID
            quietly expand cou
            drop cou
            
            generate long pid = 1e6 + _n
            
            // Paper 1
            generate byte sco = runiformint(0, 24)
            
            *
            * Begin here
            *
            // "limit my sample to 80 schools"
            frame put sid, into(Schools)
            frame Schools {
                contract sid
                generate double randu = runiform()
                isid randu, sort
                quietly keep in 1/80
            }
            quietly frlink m:1 sid, frame(Schools)
            quietly keep if !missing(Schools)
            
            // "each of the 80 selected schools contributes exactly 30 students"
            generate double randu = runiform()
            isid sid randu, sort
            quietly by sid: keep if inrange(_n, 1, 30)
            
            // "ensure that the sample size is exactly 2,400 students "
            count
            assert r(N) == 2400
            
            exit
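
            For reference, the same two-stage steps might be sketched with the variable names from the -dataex- example (School_Code, StudentID) as follows. The toy data and the restriction to schools with at least 30 students are assumptions added for illustration; the restriction is one possible way to guarantee that every selected school can contribute exactly 30 students.

            ```stata
            version 17.0
            clear *
            set seed 539611586

            // Toy stand-in for the real file (assumption): 485 schools, 10 to 60 students each
            quietly set obs 485
            generate int School_Code = 1000 + _n
            quietly expand runiformint(10, 60)
            generate long StudentID = 1e6 + _n

            // Stage 1: randomly select 80 schools among those with at least 30 students
            frame put School_Code, into(Schools)
            frame Schools {
                contract School_Code          // one row per school; _freq = number of students
                quietly keep if _freq >= 30   // assumption: only schools large enough are eligible
                assert _N >= 80
                generate double randu = runiform()
                isid randu, sort
                quietly keep in 1/80
            }
            quietly frlink m:1 School_Code, frame(Schools)
            quietly keep if !missing(Schools)

            // Stage 2: randomly select 30 students within each selected school
            generate double randu = runiform()
            isid School_Code randu, sort
            quietly by School_Code: keep if inrange(_n, 1, 30)

            // Checks: 80 schools x 30 students
            by School_Code: assert _N == 30
            assert _N == 2400
            ```

            With the real data already in memory, only the part from "Stage 1" onward would be needed, after setting the seed once at the top of the do-file.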


            • #7
              Thanks so much, everyone. I have managed to sample as desired.
