Using randomtag by group

Alexander Koplenig

Join Date: Jul 2014

Posts: 39
#1


16 Aug 2019, 01:10

Dear Statalisters,

I have a dataset of individuals belonging to different groups, and I am running a simulation with many repetitions. In each repetition, one individual is randomly drawn from each group using the sample command, e.g.

Code:

*A toy example
clear
*Generate 100 different groups
set obs 100
generate long group=_n
*Generate 1000 individuals per group
expand 1000
bysort group: gen individual=_n
*Sample 1 individual per group
by group: sample 1, count

Since sample relies on sorting the data (which makes the code run rather slowly), I would like to use the user-written command randomtag (from SSC) that tags the same observations that sample would select but does not sort the observations.

My problem is that randomtag does not have a by() option, so I can't use it to sample one individual per group. Does anyone have an idea how to accomplish this with randomtag or with another workaround?

If anyone has any ideas, please let me know, thank you in advance!

Ali

Robert Picard

Join Date: Mar 2014
Posts: 1536

16 Aug 2019, 08:45

Assuming that the data is already sorted by group, you can repeatedly pick one individual per group randomly without sorting via explicit subscripting (help subscripting). All you need is to generate a random observation index within each group. Something like:

Code:

version 15
set seed 3121

clear
set obs 100
gen long group=_n
gen nid = runiformint(10,1000)
expand nid
bysort group: gen individual=_n

by group: gen long pickid = runiformint(1,_N) if _n == 1
by group: gen pick = _n == pickid[1]

by group: replace pickid = runiformint(1,_N) if _n == 1
by group: gen pick2 = _n == pickid[1]

listsome if !mi(pickid) | pick | pick2, sepby(group) max(21)

and the output generated by listsome (from SSC):

Code:

. listsome if !mi(pickid) | pick | pick2, sepby(group) max(21)

       +------------------------------------------------+
       | group   nid   indivi~l   pickid   pick   pick2 |
       |------------------------------------------------|
    1. |     1   558          1      230      0       0 |
   51. |     1   558         51        .      1       0 |
  230. |     1   558        230        .      0       1 |
       |------------------------------------------------|
  559. |     2   705          1      349      0       0 |
  579. |     2   705         21        .      1       0 |
  907. |     2   705        349        .      0       1 |
       |------------------------------------------------|
 1264. |     3   513          1        5      0       0 |
 1268. |     3   513          5        .      0       1 |
 1292. |     3   513         29        .      1       0 |
       |------------------------------------------------|
 1777. |     4   507          1       83      0       0 |
 1859. |     4   507         83        .      0       1 |
 1923. |     4   507        147        .      1       0 |
       |------------------------------------------------|
 2284. |     5    58          1       32      0       0 |
 2315. |     5    58         32        .      0       1 |
 2322. |     5    58         39        .      1       0 |
       |------------------------------------------------|
 2342. |     6   945          1      220      0       0 |
 2561. |     6   945        220        .      0       1 |
 2920. |     6   945        579        .      1       0 |
       |------------------------------------------------|
 3287. |     7   251          1      159      0       0 |
 3421. |     7   251        135        .      1       0 |
 3445. |     7   251        159        .      0       1 |
       +------------------------------------------------+
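
The second line of each pair works because, under a by prefix, _n, _N and explicit subscripts are all evaluated within the current group, so pickid[1] is the value stored on the group's first observation. A minimal sketch on a made-up six-observation data set (the fixed pickid of 2 stands in for the random draw and is used only for illustration):

Code:

```stata
* Two groups of three observations each (made-up data)
clear
set obs 6
gen byte group = ceil(_n/3)
sort group

* Fixed index for illustration; stored on the first row of each group only
by group: gen byte pickid = 2 if _n == 1

* pickid[1] refers to the first row of the current group
by group: gen byte pick = _n == pickid[1]

list, sepby(group)

* Exactly one observation should be tagged per group
count if pick
assert r(N) == 2
```

Here pick is 1 on the second observation of each group and 0 elsewhere. The same count check, compared against the number of groups, can be used after a random draw.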


Alexander Koplenig

Join Date: Jul 2014

Posts: 39
#3

17 Aug 2019, 10:35

Thank you, Robert, that's an elegant solution. A quick check indicates that it saves a significant amount of computing time per repetition compared to the sample approach; I will provide detailed figures on Monday. I'm even more amazed by the listsome ado: I needed this without even knowing it. Kudos!

Alexander Koplenig

Join Date: Jul 2014
Posts: 39

19 Aug 2019, 01:41

As promised, here are some detailed figures. To compare the sample approach with Robert's suggestion, I generated a data set consisting of 1,000 groups with a varying number of individual members (between 1 and 1,000).

Code:

set seed 3121

clear
set obs 1000
gen long group=_n
gen nid = runiformint(1,1000)
expand nid
drop nid
bysort group: gen individual=_n

From this data set of roughly half a million observations, I randomly drew samples of 1 individual per group; this step is repeated 1,000 times.

Code:

*A: Standard approach using sample command
timer clear 1
timer on 1
quietly {
    forvalues i=1/1000 {
        preserve
        by group: sample 1, count
        if mod(`i',100)==0 {
            local picked=_N
            noisily di "repetition: `i' | sample size: `picked'"
        }
        restore
    }
}
timer off 1

*B: Subscripting approach by Robert
timer clear 2
timer on 2
gen long pickid=.
gen pick=.
quietly {
    forvalues i=1/1000 {
        by group: replace pickid = runiformint(1,_N) if _n == 1
        by group: replace pick = _n == pickid[1]
        if mod(`i',100)==0 {
            count if pick==1
            local picked=r(N)
            noisily di "repetition: `i' | sample size: `picked'"
        }
    }
}
timer off 2
timer list

Result:

Code:

.                 timer list
   1:    733.06 /        1 =     733.0620
   2:     77.16 /        1 =      77.1560

The result indicates that Robert's approach is faster by a factor of roughly 9.5 (about 733 versus 77 seconds for 1,000 repetitions). Thanks again!
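
Because the subscripting approach only tags observations, each repetition can work on the full data set with an if qualifier, so no preserve/restore is needed. The following is a sketch under assumptions, not part of the timing comparison: y is a hypothetical outcome variable that does not appear in the thread, and the mean is only a placeholder for whatever statistic the simulation needs.

Code:

```stata
version 15
set seed 3121

* Toy data as above, plus a hypothetical outcome y
clear
set obs 100
gen long group = _n
gen nid = runiformint(1,1000)
expand nid
drop nid
sort group
gen double y = rnormal()

* Collect one statistic per repetition in a temporary results file
tempname sim
tempfile results
postfile `sim' int rep double ymean using `results'

gen long pickid = .
gen byte pick = .
forvalues i = 1/1000 {
    quietly by group: replace pickid = runiformint(1,_N) if _n == 1
    quietly by group: replace pick = _n == pickid[1]
    * Use the tag in an if qualifier instead of dropping observations
    summarize y if pick, meanonly
    post `sim' (`i') (r(mean))
}
postclose `sim'

* One row per repetition
use `results', clear
summarize ymean
```

The results file ends up with one row per repetition, which can then be summarized or plotted like any other data set.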
