  • Making sure that each histogram bin has more than 5 observations.

    I am working on a server with data on individuals, and there are very strict rules on how data should be aggregated if I want to retrieve figures from the server.

    This is mostly fine, but I am having some trouble making sure that my histograms are in order: I cannot immediately tell how many observations are in each bin. I think I have found a solution, but wanted to hear if someone else has experience with this.

    I found that -histogram- calls another function called -twoway__histogram_gen-, and I found from the ado-file that I can force it to return what it uses to create the bins. I wrote the little loop below, which works well on this small, well-behaved dataset. But my real data have tens of millions of observations, so it is a little more difficult to make the same "sanity" check.

    Does anyone else have experience counting the number of observations in each bin? (And I don't understand why the -break- is not respected in this loop.)

    Code:
    sysuse auto, clear
    twoway__histogram_gen price, return
    local r_bin = r(bin)
    local r_start = r(start)
    local r_width = r(width)
    
    forvalues bin_n = 1(1)`r_bin'{
        qui count if ///
            price >= (`r_start' + `r_width'*(`bin_n' - 1)) & ///
            price < (`r_start' + `r_width'*`bin_n')
        cap assert r(N) > 5
        if _rc {
            di "Not enough observations in bin `bin_n' with `r(N)' obs."
            break
        }
    }
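
    On the -break- question, which the replies do not pick up: the documented way to leave a -forvalues- loop early is -continue, break-. A bare -break- is, as far as I can tell, not a loop-exit command at all, which would explain why the loop keeps running. Below is a minimal sketch of the same loop with that one change; swapping -cap assert- for a plain -if- is my own simplification and not something the original post requires.

    ```stata
    sysuse auto, clear
    twoway__histogram_gen price, return
    local r_bin   = r(bin)
    local r_start = r(start)
    local r_width = r(width)

    forvalues bin_n = 1/`r_bin' {
        * count observations in the half-open interval [lower, upper)
        quietly count if ///
            price >= (`r_start' + `r_width'*(`bin_n' - 1)) & ///
            price <  (`r_start' + `r_width'*`bin_n')
        local n = r(N)
        if `n' <= 5 {
            display "Not enough observations in bin `bin_n' with `n' obs."
            * leave the loop at the first offending bin
            continue, break
        }
    }
    ```

    One caveat with recomputing the bins by hand: the strict < on the upper limit leaves out an observation sitting exactly on the top edge of the last bin, so the count for that bin may come out one short of what the histogram actually draws.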

  • #2
    I am puzzled to know why small bin counts are any more revealing of individual circumstances than any others, but this seems needlessly roundabout.


    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . twoway__histogram_gen price, freq gen(count where)
    gives a variable (count) that tells you directly what the bin frequencies are, alongside a second variable (where) that records where each bin sits.
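
    To turn those generated variables into the check asked for in the first post, something along these lines might do. It is only a sketch: it assumes that count is filled in solely for the rows that describe a bin and is missing elsewhere, and the threshold of 5 is taken from the thread title.

    ```stata
    sysuse auto, clear
    twoway__histogram_gen price, freq gen(count where)

    * show one row per bin: bin position and frequency
    list where count if !missing(count), noobs

    * summarize, meanonly leaves the smallest frequency in r(min)
    summarize count, meanonly
    local smallest = r(min)
    if `smallest' <= 5 {
        display as error "smallest bin has only `smallest' observations"
    }
    ```

    Because this reuses the frequencies that the command itself computes, there is no need to rebuild the bin limits by hand, which should matter more with tens of millions of observations than with the auto data.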



    • #3
      Hey, I don't make the rules.

      But your code works perfectly. I had no idea that you could use the function like that! Is there any documentation on that function (or functions like it) anywhere?



      • #4
        You don't make the rules -- but they appear paranoid to me.

        twoway__histogram_gen is a command, not a function, but either way it is documented. Did you try


        Code:
        help twoway__histogram_gen
        ?



        • #5
          "You don't make the rules -- but they appear paranoid to me."

          Sorry, Nick, but apparently you don't work with potentially sensitive data collected from individuals. Preventing the disclosure of personally identifiable information is indeed a Big Deal. That protection is made more difficult at a time when sensitive results from a survey can be combined with readily available collections of individual data from data harvesters, which can lead to matching the sensitive results to personal identification. This is a topic of serious substantive research among statisticians, information scientists, and others, although one that I follow only casually, so I can't quickly identify a link to a good overview of the current state of this field. (I think what I'm missing is the precise technical name of the overarching concept.)

          The floor on the cell size in tabulations is familiar to me from my work. I have on occasion had to collapse small categories of one or another variable in a crosstab so that no cell was below the threshold.

          Working with socioeconomic geographic data would be more fun for me if it weren't for the people who inhabit the geography.
