No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Procedure to group variable with minimum number of observations

    Dear Statalists,

    I have a discrete variable, that takes integer values larger or equal to zero. I want to cut the values into categories and am currently trying to write a procedure to find smallest "bin" size such that for each value, there are at least three observations in the data. As I don't have access to the data it has to be automized to check whether the current "bin" width satisfied the criterion and if not increase it by 1.

    Here is some bit of non-working code which might clarify more what I am intending to do. I had the idea to cut the variable with egen, then tab it and loop through the matrix to assert that all values were > 2.

    qui su x
    local end = `r(max)'
    tempname A temp
    local r = 1
    foreach i = 1(1)100{
        cap drop `temp'
        egen `temp' = cut(x, at(0(`i')`end'))
        qui tab `temp', matcell(`A')
        forvalues i = 1(1)`= rowsof(`A')'{
            if `A'[`i',1] > 2 {
           else {
           GO TO LOOP ABOVE
    There is still a lot of parts missing because I am not sure how to implemente the CONTINUE and the GO TO LOOP ABOVE, and also that the loop should stop if the second loop ran through completely. But that was anyway just to give an idea as I am sure there are much more efficient ways to go about this. So, I am happy to hear you ideas. Thank you very much in advance!

    Last edited by Felix Stips; 13 Apr 2018, 10:36. Reason: Added tags

  • #2
    I have never liked egen, cut() as every time someone uses it I have to look at the code to see what the rules are. I prefer just to bin in terms of floors or ceilings.

    This code works. The sandbox is entirely arbitrary. Note that you can get frequencies fairly directly.

    set obs 1000
    set seed 2803
    gen sandbox = floor(exp(rnormal(0.2, 1.2)))
    su sandbox
    local bad = 1
    gen bin = .
    gen freq = .
    local width = 1
    quietly while `bad' {
         replace bin = `width' * floor(sandbox/`width')
         bysort bin : replace freq = _N
         su freq, meanonly
         noisily di "width" %7.0f `width' "{col 12} min `r(min)'"
         if r(min) >= 3 local bad = 0
         else  local ++width


    • #3
      Thank you very much, this works perfectly! And thanks, I learned the trick with bad = 0 to stop the loop