  • Calculating arbitrary probabilities from sample statistics (mean, SD, and several percentiles, N)

    Suppose someone took a sample from a large population and gave me the following sample statistics on heights:

    Code:
    scalar n    = 5510
    scalar mean = 161.3
    scalar se   = 0.19
    scalar p5   = 149.8
    scalar p10  = 152.5
    scalar p15  = 153.9
    scalar p25  = 156.4
    scalar p50  = 161.3
    scalar p75  = 166.0
    scalar p85  = 168.4
    scalar p90  = 170.2
    scalar p95  = 172.5
    I want to know Pr(A <= Height <= B). Is there a way to use all this information in the calculation?

    The simplest thing I can think of is doing this:

    Code:
    * Back out the sample SD from the reported standard error of the mean
    scalar sd = scalar(se) * sqrt(scalar(n))
    
    capture program drop pr_calc
    program define pr_calc
        * Pr(a <= X <= b) under a normal approximation; stored in the scalar named by p()
        syntax, MEAN(real) SD(real) A(real) B(real) P(string)
        scalar `p' = normal((`b' - `mean')/`sd') - normal((`a' - `mean')/`sd')
        di "Pr(`a' <= X <= `b') = " scalar(`p')
    end
    
    pr_calc, mean(`=scalar(mean)') sd(`=scalar(sd)') a(150) b(170) p(pr)
    However, that does not use all available information since it ignores all percentile data.

    The solution can be fairly slow.

  • #2
    Is that se correct? Seems like 0.09 is more like it.

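    The Generalized Lambda Distribution (GLD) might help, but the distribution you show looks nearly symmetric.

    And if the bounds are just 150 and 170, you already have the answer: those values sit almost exactly on your 5th and 90th percentiles.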
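To make that concrete: with 150 and 170 so close to the reported 5th and 90th percentiles, the probability is roughly 0.90 - 0.05 = 0.85. Here is a minimal sketch that refines this, assuming it is acceptable to interpolate the CDF linearly between the reported percentiles (that is my assumption, and the variable and scalar names are made up for illustration):

```stata
clear
input double(p q)
.05 149.8
.10 152.5
.15 153.9
.25 156.4
.50 161.3
.75 166.0
.85 168.4
.90 170.2
.95 172.5
end

* add the two heights at which the CDF is wanted (p is missing there)
set obs 11
replace q = 150 in 10
replace q = 170 in 11

* linearly interpolate the cumulative probability p on height q
ipolate p q, generate(cdf)

summarize cdf if q == 150, meanonly
scalar Fa = r(mean)
summarize cdf if q == 170, meanonly
scalar Fb = r(mean)

scalar pr_interp = scalar(Fb) - scalar(Fa)
display "Pr(150 <= X <= 170) ~ " %6.4f scalar(pr_interp)
```

This only works for bounds inside the 5th to 95th percentile range; anything in the tails would need a distributional assumption.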


    • #3
      George Ford, the SE seems right to me. These are heights, so the implied SD (0.19 * sqrt(5510)) is about 14.1 cm, which does not set off any alarm bells.

      Can you give me more insight on how the GLD would be useful here to go from sample stats to a CDF? This is not something I have come across before; a quick Google search did not locate anything that seemed germane.


      • #4
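        Based on the percentiles you present, I think the implied SE is about 0.09. Under normality, the 5th and 95th percentiles sit 1.645 SDs from the mean, so dividing each distance by 1.645*sqrt(n) gives the SE: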

        Code:
        di (161.3 - 149.8)/(1.645*sqrt(5510))
        .09417945
        
        di (172.5 - 161.3)/(1.645*sqrt(5510))
        .09172259
        If you run

        Code:
        clear
        set obs 5510                               // same n as the reported sample
        g x = rnormal(161.3,0.094*sqrt(5510))      // SD implied by an SE of 0.094
        summ x, d
        you'll get pretty much exactly the percentile breakdown you provide. You do not if you use 0.19.
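The same check can use all nine percentiles at once rather than just the 5th and 95th. A minimal sketch, assuming the data are approximately normal: regress the reported percentiles on their standard normal z-scores, so the intercept estimates the mean and the slope estimates the SD. The variable and scalar names are mine.

```stata
clear
input double(p q)
.05 149.8
.10 152.5
.15 153.9
.25 156.4
.50 161.3
.75 166.0
.85 168.4
.90 170.2
.95 172.5
end

* z-score that a standard normal assigns to each cumulative probability
generate double z = invnormal(p)

* under normality q = mu + sd*z, so the slope is the SD
regress q z
scalar mu_hat = _b[_cons]
scalar sd_hat = _b[z]
scalar se_hat = scalar(sd_hat)/sqrt(5510)
display "mean = " %6.2f scalar(mu_hat) "  SD = " %5.2f scalar(sd_hat) "  SE = " %5.3f scalar(se_hat)

* normal probability using the fitted mean and SD
scalar pr_fit = normal((170 - scalar(mu_hat))/scalar(sd_hat)) - normal((150 - scalar(mu_hat))/scalar(sd_hat))
display "Pr(150 <= X <= 170) = " %6.4f scalar(pr_fit)
```

With these inputs the slope should come out near 6.9 cm, which corresponds to an SE of roughly 0.09 rather than 0.19.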

        The GLD is a very flexible 4-parameter distribution. Karian and Dudewicz, "Fitting Statistical Distributions", has tables in the back that let you look up the parameters from information about the distribution (either the moments or the percentiles). You can then simulate from a uniform random draw and the 4 parameters. 5510 is likely big enough to get decent results.
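As a rough illustration of the simulation step, here is a sketch using the Ramberg-Schmeiser form of the GLD quantile function, Q(u) = l1 + (u^l3 - (1-u)^l4)/l2. The parameter values are assumptions for illustration only, not a lookup from the tables: l3 = l4 = 0.1349 with l2 = 0.1975/SD is a commonly quoted approximation to a normal, and the SD of 7 is a placeholder.

```stata
clear
set seed 12345
set obs 5510

* illustrative (assumed) GLD parameters, roughly a normal with mean 161.3 and SD 7
scalar l1 = 161.3
scalar l2 = 0.1975/7
scalar l3 = 0.1349
scalar l4 = 0.1349

* inverse-CDF sampling: push uniform draws through the GLD quantile function
generate double u = runiform()
generate double x = scalar(l1) + (u^scalar(l3) - (1 - u)^scalar(l4))/scalar(l2)

* share of simulated heights falling in [150, 170]
quietly count if inrange(x, 150, 170)
scalar pr_gld = r(N)/_N
display "Pr(150 <= X <= 170) ~ " %6.4f scalar(pr_gld)
```

For a skewed or heavy-tailed distribution, l3 and l4 would differ and would come from the tables or a fitting routine.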

        The GLD would be useful if the distribution isn't what you provide (if that's an example). The distribution you provide is about as normal as normal gets.

        If you are actually wanting the answer for that particular distribution, then you have the answer at the values you provide (or really close to it). Or, the rnormal procedure above would allow you to calculate it (do it 1,000 times and get the cutoffs at whatever values you want).
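A sketch of that repeated-simulation idea, assuming the normal approximation with SE 0.094 from above; the program name `sim_pr` and the bounds 150 and 170 are placeholders:

```stata
capture program drop sim_pr
program define sim_pr, rclass
    * one replication: draw 5510 heights and return the share in [150, 170]
    drop _all
    set obs 5510
    generate double x = rnormal(161.3, 0.094*sqrt(5510))
    quietly count if inrange(x, 150, 170)
    return scalar pr = r(N)/_N
end

set seed 12345
simulate pr = r(pr), reps(1000) nodots: sim_pr

* mean across replications estimates the probability; the spread shows sampling noise
summarize pr
```

Changing the bounds inside `inrange()` gives the probability for any other interval.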
        Last edited by George Ford; 12 Dec 2023, 08:20. Reason: fixed a few errors in the rnormal code


        • #5
          I dabbled with the GLD in the past. You'll get a feel for it reading this.

          https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1293952


          • #6
            If the SE is wrong, that would explain why I had trouble replicating the percentiles. Thank you for these suggestions!
