Subset a N random number of groups

Lanna Ng

Join Date: Feb 2020

Posts: 41
#1

08 Jun 2023, 12:46

Hi,
I have a dataset that looks like this:

indid villageid

3521 1
3521 1

3861 1
3861 1
3861 1

70011 1
70011 1
70011 1
70011 1
70011 1

3763 2
3763 2
3763 2
3763 2

3463 2
3463 2

3464 2
3464 2
3464 2

5464 3
5464 3
5464 3
5464 3

2464 3
2464 3

7464 3
7464 3
7464 3

How can I select the first two indids in each villageid without reshaping the data to wide?
Thank you!

Andrew Musau

Join Date: Oct 2014
Posts: 10214
#2

08 Jun 2023, 13:21

Assuming the data is presorted as in the dataex example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long indid byte villageid
 3521 1
 3521 1
 3861 1
 3861 1
 3861 1
70011 1
70011 1
70011 1
70011 1
70011 1
 3763 2
 3763 2
 3763 2
 3763 2
 3463 2
 3463 2
 3464 2
 3464 2
 3464 2
 5464 3
 5464 3
 5464 3
 5464 3
 2464 3
 2464 3
 7464 3
 7464 3
 7464 3
end

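* Record the original order so the sort within village keeps it
gen long obsno = _n
* Flag the first observation of each of the first two distinct indids per village
bys villageid (obsno): gen selected = indid != indid[_n-1] & sum(indid != indid[_n-1]) <= 2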

Result:

Code:

. l if selected, sepby(villageid)

     +-------------------------------------+
     | indid   villag~d   obsno   selected |
     |-------------------------------------|
  1. |  3521          1       1          1 |
  3. |  3861          1       3          1 |
     |-------------------------------------|
 11. |  3763          2      11          1 |
 15. |  3463          2      15          1 |
     |-------------------------------------|
 20. |  5464          3      20          1 |
 24. |  2464          3      24          1 |
     +-------------------------------------+


Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#3

08 Jun 2023, 13:23

The question is unclear and strikes me as inappropriate to the data shown. The data consist of a series of repeated observations of the same indid and villageid. So the first two observations in each villageid will actually yield only one person. It is unclear why one would even have a data set like this with so many redundant observations. So I'm going to speculate that in the real data there is more going on, e.g. there is some other variable (probably even more than one) that distinguishes these observations and the request is actually to identify the first two distinct indid's within each village. But this also seems to clash with the title of the post which mentions the word "random." That makes me think that perhaps you want to randomly select two distinct indid's from each village, and then reduce the data set to all of the many observations of those two distinct indid's.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long indid byte villageid
 3521 1
 3521 1
 3861 1
 3861 1
 3861 1
70011 1
70011 1
70011 1
70011 1
70011 1
 3763 2
 3763 2
 3763 2
 3763 2
 3463 2
 3463 2
 3464 2
 3464 2
 3464 2
 5464 3
 5464 3
 5464 3
 5464 3
 2464 3
 2464 3
 7464 3
 7464 3
 7464 3
end

set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
frame put indid villageid, into(selection)
frame selection {
    duplicates drop
    gen double shuffle = runiform()
    by villageid (shuffle), sort: keep if _n <= 2
}
frlink m:1 villageid indid, frame(selection)
keep if !missing(selection)
drop selection
frame drop selection

If this is not what you want, please post back with a clearer explanation.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Added: Crossed with #2.

Lanna Ng

Join Date: Feb 2020
Posts: 41
#4

09 Jun 2023, 05:05

Hi,
Sorry for the bad data sample and the unclear question. Thank you, Clyde and Andrew, for trying to help regardless. As Clyde suspected, there is a third variable, vcode, that varies within indid.
Below is a sample of my data generated by dataex. My task is to select a random subset of 2 indids in each villageid. One way I could think of is to reshape the data wide on indid, assign a random number to each indid, and select the first 2. Is there another way to do it?

Thank you!

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long indid byte villageid float vcode
 3512 1  171
 3521 1  170
 3591 1 1037
 3591 1 9903
 3741 1 2006
 3761 1  140
 3761 1  185
 3761 1  448
 3861 1 2011
 3871 1 2035
 3873 1  143
 3901 1 2006
 3901 1 2035
 3931 1  185
 3931 1 2006
 3961 1  546
 3971 1  185
70021 1  480
70021 1 2021
83451 1  169
83581 1  193
 2131 2  180
 2131 2  193
 2131 2 2035
 2141 2 1037
 2181 2  167
 2181 2  193
 2181 2  546
 2181 2 1368
 2191 2  185
 2191 2 2006
 2201 2  147
 2261 2  161
 2431 2  161
 2441 2  546
 2901 3  185
 2901 3  193
 2903 3   30
 2903 3   34
 2903 3  153
 2903 3  185
 2903 3  443
 2903 3 1568
 2903 3 2019
 2903 3 2035
 2911 3   30
 2911 3  185
 2931 3   30
 2931 3  167
 2931 3  193
 2971 3  193
 2981 3  185
 2991 3  185
 3021 3  175
 3021 3 2006
 3031 3  140
 3031 3  185
 3031 3 1568
 3041 3 2035
 3221 3 2006
 3251 3  448
 3331 3  157
 3331 3  193
 3331 3  458
 3331 3 1555
 3341 3  193
 3341 3 2006
 3351 3 9903
 3361 3  145
 3361 3  975
 3361 3 1614
 3361 3 2015
 3391 3  185
 3391 3  193
 3401 3  143
 3401 3  157
 3401 3  161
 3401 3 1037
 3411 3  169
 2171 4  480
 2211 4  143
 2211 4  193
 2211 4 2006
 2221 4   30
 2221 4  169
 2221 4  193
 2221 4  448
 2221 4  458
 2231 4  145
 2281 4 1037
 2281 4 2021
 2291 4  516
 2451 4  546
 2471 4  169
 2471 4  193
 2301 5  169
 2361 5 2035
 2491 5 2035
 2501 5  147
 2501 5  185
end
label values villageid a2q3
label def a2q3 1 "ban mon", modify
label def a2q3 2 "ban nhap", modify
label def a2q3 3 "nong quynh", modify
label def a2q3 4 "hua tat", modify
label def a2q3 5 "ban lech", modify


Clyde Schechter

Join Date: Apr 2014
Posts: 30117
#5

09 Jun 2023, 09:13

Code:

set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
frame put indid villageid, into(selection)
frame selection {
    duplicates drop
    gen double shuffle = runiform()
    by villageid (shuffle), sort: keep if _n <= 2
}
frlink m:1 villageid indid, frame(selection)
gen byte selected = !missing(selection)
drop selection
frame drop selection
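
For readers on a Stata version without frames (before version 16), the same idea can probably be done in place with by-group tricks: give each distinct indid one random draw, order the indids within each village by that draw, and flag the first two. What follows is only an illustrative sketch and not part of the original reply. The names shuffle, pick and selected_alt are made up for this example, indid is assumed never to be missing, and even with the same seed this need not pick the same indids as the frames code above.

```stata
set seed 1234   // any seed; it only makes the selection reproducible

* One random draw per distinct indid within village, copied to all of its rows
bysort villageid indid: gen double shuffle = runiform() if _n == 1
by villageid indid: replace shuffle = shuffle[1]

* Order the indids within each village by their draw and number them 1, 2, ...
* (indid breaks ties in the unlikely event that two draws coincide)
bysort villageid (shuffle indid): gen long pick = sum(indid != indid[_n-1])

* Flag every observation belonging to the first two indids in each village
gen byte selected_alt = pick <= 2
drop shuffle pick
```

Running -tab villageid selected_alt- and -list if selected_alt, sepby(villageid)- afterwards is a quick way to check that each village contributes exactly two indids, with all of their vcode rows kept.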


Lanna Ng

Join Date: Feb 2020

Posts: 41
#6

09 Jun 2023, 13:03

Thank you. It works perfectly!