  • Subset a random sample of N groups

    Hi,
    I have a dataset that looks like this:

    indid villageid

    3521 1
    3521 1

    3861 1
    3861 1
    3861 1

    70011 1
    70011 1
    70011 1
    70011 1
    70011 1

    3763 2
    3763 2
    3763 2
    3763 2

    3463 2
    3463 2

    3464 2
    3464 2
    3464 2

    5464 3
    5464 3
    5464 3
    5464 3

    2464 3
    2464 3

    7464 3
    7464 3
    7464 3

    How can I select the first two indids in each villageid without reshaping the data to wide?
    Thank you!

  • #2
    Assuming the data is presorted as in the dataex example:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long indid byte villageid
     3521 1
     3521 1
     3861 1
     3861 1
     3861 1
    70011 1
    70011 1
    70011 1
    70011 1
    70011 1
     3763 2
     3763 2
     3763 2
     3763 2
     3463 2
     3463 2
     3464 2
     3464 2
     3464 2
     5464 3
     5464 3
     5464 3
     5464 3
     2464 3
     2464 3
     7464 3
     7464 3
     7464 3
    end
    
    gen long obsno=_n
    bys villageid (obsno): gen selected= indid!=indid[_n-1] & sum(indid!=indid[_n-1])<=2
    Result:

    Code:
    . l if selected, sepby(villageid)
    
         +-------------------------------------+
         | indid   villag~d   obsno   selected |
         |-------------------------------------|
      1. |  3521          1       1          1 |
      3. |  3861          1       3          1 |
         |-------------------------------------|
     11. |  3763          2      11          1 |
     15. |  3463          2      15          1 |
         |-------------------------------------|
     20. |  5464          3      20          1 |
     24. |  2464          3      24          1 |
         +-------------------------------------+
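
The one-line expression in #2 may be easier to follow when split into steps. What follows is a minimal sketch, not part of the original answer: the names newid, idcount, and selected_all are made up for illustration, and the data are a shortened version of the example above. It assumes that indid is never missing and that observations of the same indid are contiguous. It also shows one way to flag every observation of the first two indids, rather than only the first observation of each.

```stata
clear
input long indid byte villageid
 3521 1
 3521 1
 3861 1
 3861 1
 3861 1
70011 1
70011 1
 3763 2
 3763 2
 3463 2
 3463 2
 3464 2
end

* Record the original order so it can serve as the sort key
gen long obsno = _n

* newid is 1 on the first observation of each run of identical indid values
* (under -by-, indid[_n-1] is missing when _n == 1, so the test is true there)
bysort villageid (obsno): gen byte newid = indid != indid[_n-1]

* Running count of distinct indid runs seen so far within the village
by villageid: gen long idcount = sum(newid)

* Same flag as in #2: first observation of each of the first two indids
gen byte selected = newid & idcount <= 2

* Illustrative variant: all observations of the first two indids
gen byte selected_all = idcount <= 2

list, sepby(villageid)
```

Using selected_all in place of selected keeps the repeated observations of each chosen indid, which matters if other variables distinguish those rows.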


    • #3
      The question is unclear and strikes me as inappropriate to the data shown. The data consist of a series of repeated observations of the same indid and villageid. So the first two observations in each villageid will actually yield only one person. It is unclear why one would even have a data set like this with so many redundant observations. So I'm going to speculate that in the real data there is more going on, e.g. there is some other variable (probably even more than one) that distinguishes these observations and the request is actually to identify the first two distinct indid's within each village. But this also seems to clash with the title of the post which mentions the word "random." That makes me think that perhaps you want to randomly select two distinct indid's from each village, and then reduce the data set to all of the many observations of those two distinct indid's.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input long indid byte villageid
       3521 1
       3521 1
       3861 1
       3861 1
       3861 1
      70011 1
      70011 1
      70011 1
      70011 1
      70011 1
       3763 2
       3763 2
       3763 2
       3763 2
       3463 2
       3463 2
       3464 2
       3464 2
       3464 2
       5464 3
       5464 3
       5464 3
       5464 3
       2464 3
       2464 3
       7464 3
       7464 3
       7464 3
      end
      
      set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
      frame put indid villageid, into(selection)
      frame selection {
          duplicates drop
          gen double shuffle = runiform()
          by villageid (shuffle), sort: keep if _n <= 2
      }
      frlink m:1 villageid indid, frame(selection)
      keep if !missing(selection)
      drop selection
      frame drop selection
      If this is not what you want, please post back with a clearer explanation.

      In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
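
As a brief illustration, a typical -dataex- call looks like the sketch below. It uses the auto dataset that ships with Stata rather than the poster's data, and the variable list and observation range are arbitrary choices made for the example.

```stata
* Load a dataset that ships with Stata (stand-in for your own data)
sysuse auto, clear

* Print a copy-and-paste-ready example of three variables, first 5 observations
dataex make price mpg in 1/5
```

The output in the Results window, from the Code delimiters down to the end of the block, can be pasted directly into a forum post.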

      Added: Crossed with #2.


      • #4
        Hi,
        Sorry for the poor data sample and the unclear question. Thank you, Clyde and Andrew, for trying to help regardless. As Clyde suspected, there is a third variable, vcode, that varies within indid.
        Below is a sample of my data generated by dataex. My task is to select a random subset of 2 indids in each villageid. One way I can think of is to reshape the data wide on indid, assign a random number to each indid, and select the first 2. Is there another way to do it?

        Thank you!

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input long indid byte villageid float vcode
         3512 1  171
         3521 1  170
         3591 1 1037
         3591 1 9903
         3741 1 2006
         3761 1  140
         3761 1  185
         3761 1  448
         3861 1 2011
         3871 1 2035
         3873 1  143
         3901 1 2006
         3901 1 2035
         3931 1  185
         3931 1 2006
         3961 1  546
         3971 1  185
        70021 1  480
        70021 1 2021
        83451 1  169
        83581 1  193
         2131 2  180
         2131 2  193
         2131 2 2035
         2141 2 1037
         2181 2  167
         2181 2  193
         2181 2  546
         2181 2 1368
         2191 2  185
         2191 2 2006
         2201 2  147
         2261 2  161
         2431 2  161
         2441 2  546
         2901 3  185
         2901 3  193
         2903 3   30
         2903 3   34
         2903 3  153
         2903 3  185
         2903 3  443
         2903 3 1568
         2903 3 2019
         2903 3 2035
         2911 3   30
         2911 3  185
         2931 3   30
         2931 3  167
         2931 3  193
         2971 3  193
         2981 3  185
         2991 3  185
         3021 3  175
         3021 3 2006
         3031 3  140
         3031 3  185
         3031 3 1568
         3041 3 2035
         3221 3 2006
         3251 3  448
         3331 3  157
         3331 3  193
         3331 3  458
         3331 3 1555
         3341 3  193
         3341 3 2006
         3351 3 9903
         3361 3  145
         3361 3  975
         3361 3 1614
         3361 3 2015
         3391 3  185
         3391 3  193
         3401 3  143
         3401 3  157
         3401 3  161
         3401 3 1037
         3411 3  169
         2171 4  480
         2211 4  143
         2211 4  193
         2211 4 2006
         2221 4   30
         2221 4  169
         2221 4  193
         2221 4  448
         2221 4  458
         2231 4  145
         2281 4 1037
         2281 4 2021
         2291 4  516
         2451 4  546
         2471 4  169
         2471 4  193
         2301 5  169
         2361 5 2035
         2491 5 2035
         2501 5  147
         2501 5  185
        end
        label values villageid a2q3
        label def a2q3 1 "ban mon", modify
        label def a2q3 2 "ban nhap", modify
        label def a2q3 3 "nong quynh", modify
        label def a2q3 4 "hua tat", modify
        label def a2q3 5 "ban lech", modify


        • #5
          Code:
          set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
          frame put indid villageid, into(selection)
          frame selection {
              duplicates drop
              gen double shuffle = runiform()
              by villageid (shuffle), sort: keep if _n <= 2
          }
          frlink m:1 villageid indid, frame(selection)
          gen byte selected = !missing(selection)
          drop selection
          frame drop selection
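
Frames require Stata 16 or later. For older versions, a similar selection can be made without frames by drawing one random number per indid and ranking the indids within each village. This is a minimal sketch under stated assumptions, not the method given above: it assumes indid is never missing, it reuses the name shuffle for the temporary variable, and the data are a small subset of the example in #4. Because the random numbers are consumed in a different order, it will generally pick different indids than the frames code does with the same seed.

```stata
clear
input long indid byte villageid float vcode
 3591 1 1037
 3591 1 9903
 3761 1  140
 3761 1  185
 3761 1  448
 3861 1 2011
 2131 2  180
 2131 2  193
 2131 2 2035
 2141 2 1037
 2181 2  167
 2181 2  193
end

set seed 1234 // any seed will do

* One random draw per (villageid, indid), stored on the group's first row
bysort villageid indid: gen double shuffle = runiform() if _n == 1

* Copy that draw to every observation of the same indid
by villageid indid: replace shuffle = shuffle[1]

* Order indids randomly within village; all rows of an indid stay together.
* The running sum counts distinct indids, so <= 2 flags the first two.
bysort villageid (shuffle indid): gen byte selected = sum(indid != indid[_n-1]) <= 2

drop shuffle
list, sepby(villageid)
```

As in #5, selected is 1 on every observation of the two chosen indids in each village, so -keep if selected- reduces the data to those individuals.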


          • #6
            Thank you. It works perfectly!
