  • Split sample into N buckets

    Dear Statalisters,

    I saw someone using

    bysort xvar : gen x_Q3=ceil(_n/(_N/3))

    to split a sample into 3 buckets.

    input gvkey xvar
    1002 4
    1003 67
    1001 29
    1004 34
    1001 10
    1002 6
    1003 54
    1003 23
    1004 43
    1003 4
    1002 67
    1001 5
    1001 3
    1002 7
    end



    The above is pretend data. Somehow I did not get anywhere close to 3 buckets.

    I am curious whether I have misunderstood how the other person used it successfully.

    Thank you,
    Rochelle

  • #2
    We can't comment easily on what the other person did unless we've seen the same source, nor can we comment on the code you used, as you don't show it, but we can supply a little analysis. I find the technique easier to think about in terms of

    Code:
     
    ceil(3 * _n/_N)
    The fraction _n/_N has upper limit 1 and 3 times that has upper limit 3. Rounding up with ceil() gives possible answers 1, 2, 3, but they aren't guaranteed to be equally frequent. In fact, with subset size 1 only one bucket or bin is possible, and with subset size 2 only two. Even for subset sizes of 3 or more, equal frequencies (which may well be what you are after) are only possible for multiples of 3.

    Here is a demonstration for those who prefer examples to exegesis:

    Code:
     
    . clear
    
    . set obs 9 
    obs was 0, now 9
    
    . gen group = _n
    
    . l group
    
         +-------+
         | group |
         |-------|
      1. |     1 |
      2. |     2 |
      3. |     3 |
      4. |     4 |
      5. |     5 |
         |-------|
      6. |     6 |
      7. |     7 |
      8. |     8 |
      9. |     9 |
         +-------+
    
    . expand group
    (36 observations created)
    
    . tab group
    
          group |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          1        2.22        2.22
              2 |          2        4.44        6.67
              3 |          3        6.67       13.33
              4 |          4        8.89       22.22
              5 |          5       11.11       33.33
              6 |          6       13.33       46.67
              7 |          7       15.56       62.22
              8 |          8       17.78       80.00
              9 |          9       20.00      100.00
    ------------+-----------------------------------
          Total |         45      100.00
    
    . bysort group : gen bucket = ceil(3 * _n/_N)
    
    . tab group bucket
    
               |              bucket
         group |         1          2          3 |     Total
    -----------+---------------------------------+----------
             1 |         0          0          1 |         1 
             2 |         0          1          1 |         2 
             3 |         1          1          1 |         3 
             4 |         1          1          2 |         4 
             5 |         1          2          2 |         5 
             6 |         2          2          2 |         6 
             7 |         2          2          3 |         7 
             8 |         2          3          3 |         8 
             9 |         3          3          3 |         9 
    -----------+---------------------------------+----------
         Total |        12         15         18 |        45
    Not the question, but binning regardless of the values of variables or even a repeatable sort order is usually a bad idea.
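
    To make the last point concrete, here is a minimal sketch using the pretend data from #1. It assumes that the aim is three roughly equal-sized buckets of xvar over the whole sample, which is only a guess at the intent. Note that with bysort xvar: each distinct value of xvar forms its own group, so most groups here have size 1. The sketch instead sorts the whole sample on xvar, with gvkey as an assumed tie-breaker to keep the sort order repeatable.

    ```stata
    clear
    input gvkey xvar
    1002 4
    1003 67
    1001 29
    1004 34
    1001 10
    1002 6
    1003 54
    1003 23
    1004 43
    1003 4
    1002 67
    1001 5
    1001 3
    1002 7
    end

    * sort on the variable being binned; gvkey is an assumed tie-breaker
    sort xvar gvkey

    * no by: prefix, so _n and _N refer to the whole sample
    gen x_Q3 = ceil(3 * _n/_N)

    tab x_Q3
    ```

    With 14 observations the buckets cannot be equal in size: this gives 4, 5 and 5 observations. Tied values of xvar can still fall into different buckets when they straddle a boundary.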



    • #3
      Thank you Nick for your detailed response.

      Using your example, would you recommend egen's cut() function to split into buckets/bins?



      • #4
        I never use that function, for specific reasons documented often in this forum.

        Binning is a 19th century device -- think histograms -- and we can usually do better without it.

        The biggest fans appear to be people with business data who want to summarize things like the characteristics of the best or worst performing fractions of firms. I respect that objective, but don't share it, or more importantly don't do similar things for the kind of data I work with. If the data come categorised, that is how they are; if they come counted or measured, just use that information.

        If you feel an urge to bin, working with k * floor(x / k) or its ceil() sibling is my top recommendation. It's transparent and self-documenting; it's easy to think about.
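
        A small sketch of that recommendation, again with the pretend data from #1. The bin width k = 10 is an arbitrary choice for illustration, and the variable names xlower and xupper are hypothetical.

        ```stata
        clear
        input gvkey xvar
        1002 4
        1003 67
        1001 29
        1004 34
        1001 10
        1002 6
        1003 54
        1003 23
        1004 43
        1003 4
        1002 67
        1001 5
        1001 3
        1002 7
        end

        * label each bin of width 10 by its lower limit: 0, 10, 20, ...
        gen xlower = 10 * floor(xvar / 10)

        * the ceil() sibling labels each bin by its upper limit instead
        gen xupper = 10 * ceil(xvar / 10)

        list xvar xlower xupper, sep(0)
        ```

        The two differ in how they treat values on a boundary: with floor() a value of 10 starts the bin labelled 10, while with ceil() it ends the bin labelled 10.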
