Split sample into N buckets

Rochelle Zhang

Join Date: Aug 2025

Posts: 0
#1


10 Mar 2015, 21:46

Dear Statalisters,

I saw someone using

bysort xvar : gen x_Q3=ceil(_n/(_N/3))

to split a sample into 3 buckets.

input gvkey xvar
1002 4
1003 67
1001 29
1004 34
1001 10
1002 6
1003 54
1003 23
1004 43
1003 4
1002 67
1001 5
1001 3
1002 7
end

The data above are made up. With them, I did not get anywhere close to 3 buckets.

I am curious whether I misunderstood how the other person used it successfully.

Thank you,
Rochelle

Nick Cox

Join Date: Mar 2014
Posts: 35754

11 Mar 2015, 04:53

We can't comment easily on what the other person did unless we've seen the same source, nor can we comment on the code you used, as you don't show it, but we can supply a little analysis. I find the technique easier to think about in terms of

Code:

ceil(3 * _n/_N)

The fraction _n/_N has upper limit 1, and 3 times that has upper limit 3. Rounding up with ceil() gives possible answers 1, 2, 3, but they aren't guaranteed to be equally frequent. In fact, with subset size 1 only one bucket or bin is possible, and with subset size 2 only two. Even for subset sizes of 3 or more, equal frequencies (which may well be what you are after) are only possible for multiples of 3.

Here is a demonstration for those who prefer examples to exegesis:

Code:

. clear

. set obs 9 
obs was 0, now 9

. gen group = _n

. l group

     +-------+
     | group |
     |-------|
  1. |     1 |
  2. |     2 |
  3. |     3 |
  4. |     4 |
  5. |     5 |
     |-------|
  6. |     6 |
  7. |     7 |
  8. |     8 |
  9. |     9 |
     +-------+

. expand group
(36 observations created)

. tab group

      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        2.22        2.22
          2 |          2        4.44        6.67
          3 |          3        6.67       13.33
          4 |          4        8.89       22.22
          5 |          5       11.11       33.33
          6 |          6       13.33       46.67
          7 |          7       15.56       62.22
          8 |          8       17.78       80.00
          9 |          9       20.00      100.00
------------+-----------------------------------
      Total |         45      100.00

. bysort group : gen bucket = ceil(3 * _n/_N)

. tab group bucket

           |              bucket
     group |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |         0          0          1 |         1 
         2 |         0          1          1 |         2 
         3 |         1          1          1 |         3 
         4 |         1          1          2 |         4 
         5 |         1          2          2 |         5 
         6 |         2          2          2 |         6 
         7 |         2          2          3 |         7 
         8 |         2          3          3 |         8 
         9 |         3          3          3 |         9 
-----------+---------------------------------+----------
     Total |        12         15         18 |        45

Not the question, but binning regardless of the values of variables or even a repeatable sort order is usually a bad idea.
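
To tie that remark back to the data in #1, here is a minimal sketch of binning on the values of xvar with a repeatable sort order. It assumes the aim is three roughly equal-sized buckets over the whole sample, which may or may not be what the other person intended; the names groupsize and bucket are made up for illustration. It also shows that with bysort xvar : each distinct value of xvar forms its own group, so _N is only 1 or 2 in these data.

```stata
clear
input gvkey xvar
1002 4
1003 67
1001 29
1004 34
1001 10
1002 6
1003 54
1003 23
1004 43
1003 4
1002 67
1001 5
1001 3
1002 7
end

* With bysort xvar, each distinct value of xvar is its own group,
* so _N is the number of observations sharing that value (1 or 2 here).
bysort xvar : gen groupsize = _N
tabulate groupsize

* Assumption: we want buckets over the whole sample, ordered by xvar.
* Sorting on xvar and then gvkey gives a unique, hence repeatable, order.
sort xvar gvkey
gen bucket = ceil(3 * _n/_N)
tabulate bucket
```

With these 14 observations the buckets hold 4, 5 and 5 observations, which is as close to equal as 14 allows. In other data, tied values of xvar could still straddle a bucket boundary, so that is worth checking.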


Rochelle Zhang

Join Date: Aug 2025

Posts: 0
#3

11 Mar 2015, 21:05

Thank you Nick for your detailed response.

Using your example, would you recommend the egen function cut() to split into buckets/bins?
Nick Cox

Join Date: Mar 2014

Posts: 35754
#4

12 Mar 2015, 03:35

I never use that function, for specific reasons documented often in this forum.

Binning is a 19th century device -- think histograms -- and we can usually do better without it.

The biggest fans appear to be people with business data who want to summarize things like the characteristics of the best or worst performing fractions of firms. I respect that objective, but don't share it, or more importantly don't do similar things for the kind of data I work with. If the data come categorised, that is how they are; if they come counted or measured, just use that information.

If you feel an urge to bin, working with k * floor(x / k) or its ceil() sibling is my top recommendation. It's transparent and self-documenting; it's easy to think about.
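
A minimal sketch of that recommendation, using the xvar values from #1 and a bin width of k = 10 chosen purely for illustration (the names xfloor and xceil are made up):

```stata
clear
input gvkey xvar
1002 4
1003 67
1001 29
1004 34
1001 10
1002 6
1003 54
1003 23
1004 43
1003 4
1002 67
1001 5
1001 3
1002 7
end

* Lower bin limit: round xvar down to a multiple of 10
gen xfloor = 10 * floor(xvar / 10)

* The ceil() sibling: upper bin limit, rounding xvar up to a multiple of 10
gen xceil = 10 * ceil(xvar / 10)

sort xvar gvkey
list xvar xfloor xceil, sepby(xfloor)
```

Here 67 is labelled 60 by the floor() version and 70 by the ceil() version, while 10 is labelled 10 by both because it is an exact multiple of the width. The bin limits can be read straight off the code, which is the sense in which the approach documents itself.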