  • find_denom available from SSC

    Thanks as always to Kit Baum, a new command find_denom is now available from SSC. Stata 9 (at least) is required.

    The very specific problem tackled is "finding the denominator", that is, determining minimum sample size consistent with reported class percentages.

    I would much appreciate any further references or interesting examples in this territory.

    An old joke with many variants has the following flavour. A naive researcher is reporting on a rather small project: 33% of the sample said A, 33% said B, but the other person refused to answer. It is immediate that the sample size is 3.

    Only a twist more challenging: What denominator or sample size underlies a percentage breakdown of 40, 40, 20? That breakdown is consistent with a sample size of 5, with 2, 2, 1 as class frequencies. It is also consistent with any multiple of 5 and, depending on the amount of rounding, with other exact percentage breakdowns that round to the same reported values. Thus 2001, 1999, 1000 is exactly 40.02, 39.98, 20.00 as a percentage breakdown and so rounds to 40.0, 40.0, 20.0 to 1 decimal place, as would 2002, 1998, 1000, and as would many other possibilities.
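
    The rounding arithmetic here is easy to check directly. The lines below are a minimal Python sketch (not Stata, and not part of find_denom); the helper name percentages() is made up for illustration.

```python
def percentages(freqs, decimals=1):
    """Percentage breakdown of class frequencies, rounded to `decimals` places."""
    total = sum(freqs)
    return [round(100 * f / total, decimals) for f in freqs]

print(percentages([2, 2, 1]))            # [40.0, 40.0, 20.0]
print(percentages([2001, 1999, 1000]))   # [40.0, 40.0, 20.0]
print(percentages([2002, 1998, 1000]))   # [40.0, 40.0, 20.0]
```

    All three frequency breakdowns are reported identically at 1 decimal place, which is why only a minimum sample size can be recovered.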

    Every researcher should know that sample size should always be reported. Every researcher with any experience knows that does not always happen, and the culprits are not confined to advertising, journalism, or politics. Having flagged that this is an ethical issue, we now concentrate on the technicalities of trying to guess the minimum sample size consistent with a reported percentage breakdown. We assume honest and accurate reporting, other than the sample size being suppressed.

    The problem was discussed by Wallis and Roberts (1956, pp.185-189) (hereafter WR) and in much more technical detail by Becker, Chambers, and Wilks (1988) (hereafter BCW). Two ideas arise immediately. First, a complete set of percentages is not needed to say something about minimum sample size. Thus a single percentage reported as 33% implies that the sample size cannot be 2 and must be at least 3. Second, the smallest percentage reported, or if smaller the smallest positive difference between two percentages reported, gives another handle on the minimum sample size. Thus with a percentage breakdown of 40, 30, 30, the smallest positive difference is 10 and equivalently 100/10 = 10 is the minimum sample size.
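
    The second idea can be written down in a few lines. This is only a rough Python sketch of the heuristic as described above, not code from WR or BCW; the function name rough_minimum() is hypothetical, and rounding 100 divided by the smallest gap to the nearest integer is an assumption that follows the wording of this post.

```python
from itertools import combinations

def rough_minimum(percentages):
    """Rough minimum sample size: 100 / (smallest positive percentage or difference)."""
    candidates = [p for p in percentages if p > 0]
    candidates += [abs(a - b) for a, b in combinations(percentages, 2)]
    # Equal percentages give zero differences; skip those (and any float noise).
    smallest = min(c for c in candidates if c > 1e-9)
    return round(100 / smallest)

print(rough_minimum([40, 30, 30]))   # 10
print(rough_minimum([33]))           # 3
```

    With heavily rounded percentages this gives only a first handle on the problem, since rounding error can make the smallest gap misleading.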

    WR (p.186) report a fictitious percentage breakdown

    23.1
    15.4
    30.8
    19.2
    7.7
    3.8

    -- from which both the smallest percentage and the smallest positive difference are 3.8, suggesting a minimum sample size of 100/3.8, which rounds as an integer to 26. The implied frequencies are thus

    6
    4
    8
    5
    2
    1
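
    The arithmetic behind those frequencies can be checked as follows. This is an illustrative Python sketch, not Stata; the variable names are invented for the example.

```python
reported = [23.1, 15.4, 30.8, 19.2, 7.7, 3.8]

n = round(100 / min(reported))                    # 100/3.8 = 26.3..., rounds to 26
freqs = [round(p * n / 100) for p in reported]    # nearest whole frequency for each class
back = [round(100 * f / n, 1) for f in freqs]     # percentages implied by those frequencies

print(n, freqs, sum(freqs))   # 26 [6, 4, 8, 5, 2, 1] 26
print(back == reported)       # True
```

    The frequencies sum to 26 and reproduce every reported percentage to 1 decimal place, so 26 is at least a consistent candidate.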

    Let's get find_denom to do the work:

    Code:
    . find_denom 23.1 15.4 30.8 19.2 7.7 3.8, eps(0.05)
    
    minimum sample size is 26
    frequencies are 6 4 8 5 2 1

    WR (1956, pp.187-188) report percentage breakdowns of movie ratings from Consumer Reports August 1949, p.383. The categories are in turn percentages reporting Excellent, Good, Fair, and Poor. Some examples are

    Alias Nick Beal 6 27 47 20
    Bride of Vengeance 11 22 56 11


    Code:
    .
    . find_denom 6 27 47 20, eps(0.5)
    
    minimum sample size is 49
    frequencies are 3 13 23 10
    
    .
    . find_denom 11 22 56 11, eps(0.5)
    
    minimum sample size is 9
    frequencies are 1 2 5 1

    BCW (p.272) report these percentages for considering vendor for 1986 from a personal computer magazine:

    Ours 14.6
    A 12.2
    B 12.2
    C 7.3
    D 7.3


    Code:
    . find_denom 14.6 12.2 12.2 7.3 7.3, eps(0.05)
    
    minimum sample size is 41
    frequencies are 6 5 5 3 3

    BCW report an algorithm and S code with this recipe for proportions (my wording). The idea is just to bump up the sample size until implied percentages are all consistent with the stated precision.

    It is their algorithm, translated from S to Stata and adapted for percentage input, that is implemented in find_denom.
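
    For readers who want to see the "bump up the sample size" idea in code, here is a minimal brute-force sketch in Python. It is neither the BCW S code nor the source of find_denom. The function name find_min_n(), the max_n cap, and the reading of eps as the largest allowed absolute gap (in percentage points, boundary included) between a reported and an implied percentage are all assumptions made for illustration.

```python
def find_min_n(percentages, eps, max_n=100000):
    """Smallest n such that every percentage is within eps of some k/n, or None."""
    for n in range(1, max_n + 1):
        # Nearest whole frequency for each reported percentage at this n.
        freqs = [int(n * p / 100 + 0.5) for p in percentages]
        # Small tolerance guards against float noise at the boundary.
        if all(abs(100 * k / n - p) <= eps + 1e-9
               for k, p in zip(freqs, percentages)):
            return n, freqs
    return None

print(find_min_n([23.1, 15.4, 30.8, 19.2, 7.7, 3.8], eps=0.05))  # (26, [6, 4, 8, 5, 2, 1])
print(find_min_n([6, 27, 47, 20], eps=0.5))                      # (49, [3, 13, 23, 10])
print(find_min_n([11, 22, 56, 11], eps=0.5))                     # (9, [1, 2, 5, 1])
print(find_min_n([14.6, 12.2, 12.2, 7.3, 7.3], eps=0.05))        # (41, [6, 5, 5, 3, 3])
```

    Note that the sketch does not require the frequencies to sum to n, so it also copes with an incomplete set of percentages, as in the last example.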

    BCW (pp.274-277) further discuss speeding up computations and allowing a certain number of outliers, in essence percentages that do not fit, say because they were reported incorrectly. These elaborations are not implemented here, but should be of interest for a deeper study.
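
    One simple way to picture the outlier idea is to accept a sample size as soon as no more than a chosen number of percentages fail to fit. The Python sketch below is an assumption-laden illustration only, not the BCW method: the function name find_min_n_with_outliers(), the max_outliers parameter, and the example value 32.1 (a hypothetical misprint of 23.1, not taken from WR) are all invented here.

```python
def find_min_n_with_outliers(percentages, eps, max_outliers=0, max_n=100000):
    """Smallest n at which at most max_outliers percentages fail to fit, or None."""
    for n in range(1, max_n + 1):
        misfits = []
        for i, p in enumerate(percentages):
            k = int(n * p / 100 + 0.5)            # nearest whole frequency
            if abs(100 * k / n - p) > eps + 1e-9:
                misfits.append(i)                 # this percentage does not fit at n
        if len(misfits) <= max_outliers:
            return n, misfits
    return None

# Hypothetical misprint: the first percentage is typed as 32.1 instead of 23.1.
print(find_min_n_with_outliers([32.1, 15.4, 30.8, 19.2, 7.7, 3.8],
                               eps=0.05, max_outliers=1))   # (26, [0])
```

    Here the sample size of 26 is still recovered, and index 0 is flagged as the percentage that does not fit.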

    On the problem of how often rounded percentages sum to exactly 100, see Mosteller, Youtz, and Zahn (1967) and Diaconis and Freedman (1979).

    Becker, R. A., J. M. Chambers and A. R. Wilks. 1988. The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth & Brooks/Cole.

    Diaconis, P. and D. Freedman. 1979. On rounding percentages. Journal of the American Statistical Association 74: 359-364.

    Mosteller, F., C. Youtz and D. Zahn. 1967. The distribution of sums of rounded percentages. Demography 4: 850-858. Reprinted in Fienberg, S. E. and D. C. Hoaglin (eds) 2006. Selected Papers of Frederick Mosteller. New York: Springer, 399-411.

    Wallis, W. A. and H. V. Roberts. 1956. Statistics: A New Approach. Glencoe, IL: Free Press.