Calculating arbitrary probabilities from sample statistics (mean, SD, and several percentiles, N)

Dimitriy V. Masterov

Join Date: Mar 2014
Posts: 609

08 Dec 2023, 16:14

Suppose someone took a sample from a large population and gave me the following sample statistics on heights:

Code:

scalar n    = 5510
scalar mean = 161.3
scalar se   = 0.19
scalar p5   = 149.8
scalar p10  = 152.5
scalar p15  = 153.9
scalar p25  = 156.4
scalar p50  = 161.3
scalar p75  = 166.0
scalar p85  = 168.4
scalar p90  = 170.2
scalar p95  = 172.5

I want to know Pr(A <= Height <= B). Is there a way to use all this information in the calculation?

The simplest thing I can think of is doing this:

Code:

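* Implied SD from the reported standard error of the mean
scalar sd = scalar(se) * sqrt(scalar(n))

capture program drop pr_calc
program define pr_calc
    * p() names the scalar that will hold the probability
    syntax, MEAN(real) SD(real) A(real) B(real) P(string)
    * Normal-CDF difference: Pr(a <= X <= b) for X ~ N(mean, sd^2)
    scalar `p' = normal((`b' - `mean')/`sd') - normal((`a' - `mean')/`sd')
    di "Pr(`a' <= X <= `b') = " scalar(`p')
end

pr_calc, mean(`=scalar(mean)') sd(`=scalar(sd)') a(150) b(170) p(pr)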

However, that does not use all available information since it ignores all percentile data.
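
One crude way to bring the percentiles in is to treat the reported points as knots of the empirical CDF and interpolate linearly between them. The sketch below is only an illustration of that idea, not an established method for this problem: the variable names (height, cdf, cdf_hat, is_cut) are made up, linearity between adjacent percentiles is an assumption, and it only works when A and B lie between the 5th and 95th percentiles, because the data do not pin down the tails.

```stata
* Sketch: linear interpolation of the CDF through the reported percentiles.
* Assumption: the CDF is roughly linear between adjacent percentiles.
clear
input double(height cdf)
149.8 0.05
152.5 0.10
153.9 0.15
156.4 0.25
161.3 0.50
166.0 0.75
168.4 0.85
170.2 0.90
172.5 0.95
end

* Append the two cutoffs A and B with missing CDF values
local a = 150
local b = 170
generate byte is_cut = 0
set obs `=_N + 2'
replace height = `a' in `=_N - 1'
replace height = `b' in `=_N'
replace is_cut = 1 if missing(cdf)

* Fill in the CDF at A and B by linear interpolation on height
ipolate cdf height, generate(cdf_hat)

* Pr(A <= X <= B) is the difference between the two interpolated values
summarize cdf_hat if is_cut, meanonly
display "Pr(`a' <= X <= `b') = " r(max) - r(min)
```

Because the interpolated CDF is increasing in height, r(max) and r(min) are the values at B and A respectively.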

I do not mind if the solution is fairly slow to run.

George Ford

Join Date: Aug 2014

Posts: 3177
#2

11 Dec 2023, 12:48

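Is that SE correct? Something closer to 0.09 seems more like it.

The Generalized Lambda Distribution (GLD) might help, but the distribution you show looks nearly symmetric.

And if it's just 150 and 170, you already have that: they sit almost exactly at the reported 5th (149.8) and 90th (170.2) percentiles.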
Dimitriy V. Masterov

Join Date: Mar 2014

Posts: 609
#3

11 Dec 2023, 19:52

George Ford, the SE seems right to me. These are heights, so the implied SD of about 14.1 cm (0.19 * sqrt(5510)) does not set off any alarm bells.

Can you give me more insight on how the GLD would be useful here to go from sample stats to a CDF? This is not something I have come across before; a quick Google search did not locate anything that seemed germane.
George Ford

Join Date: Aug 2014

Posts: 3177
#4

12 Dec 2023, 07:43

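Based on the percentiles you present, I think the SE is about 0.09. Under normality the 5th and 95th percentiles sit 1.645 SDs from the mean, so backing the SE out of each gives: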

Code:

. di (161.3 - 149.8)/(1.645*sqrt(5510))
.09417945

. di (172.5 - 161.3)/(1.645*sqrt(5510))
.09172259

If you run

Code:

clear
set obs 5510
g x = rnormal(161.3, 0.094*sqrt(5510))
summ x, d

you'll get pretty much exactly the percentile breakdown you provide. You will not if you use 0.19.
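
A quicker check than simulating is to compute the percentiles a normal distribution would imply under each candidate SE and set them beside the reported ones. This is a minimal sketch that assumes normality; the loop and the local macro names are illustrative.

```stata
* Sketch: normal-implied percentiles under se = 0.094 and se = 0.19
local n    = 5510
local mean = 161.3
foreach se in 0.094 0.19 {
    * Convert the standard error of the mean back to an SD
    local sd = `se' * sqrt(`n')
    display _newline "se = `se', implied sd = " %5.2f (`sd')
    foreach p in 5 10 15 25 50 75 85 90 95 {
        * invnormal() returns the standard normal quantile
        display "  p`p' = " %6.1f (`mean' + `sd' * invnormal(`p'/100))
    }
}
```

With se = 0.094 the implied 5th percentile comes out near the reported 149.8, whereas se = 0.19 puts it around 138.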

The GLD is a very flexible four-parameter distribution. Karian and Dudewicz's "Fitting Statistical Distributions" has tables in the back that let you look up the parameters from information about the distribution (either the moments or the percentiles). You can then simulate from it using uniform random draws and the four parameters. 5510 is likely big enough to get decent results.
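
To make that last step concrete: in the Ramberg-Schmeiser parameterization the GLD is defined through its quantile function, Q(u) = lambda1 + (u^lambda3 - (1 - u)^lambda4)/lambda2, so draws come from plugging uniform random numbers into Q. The sketch below assumes that parameterization and, purely for illustration, uses the lambda values commonly quoted as approximating a standard normal (0, 0.1975, 0.1349, 0.1349), shifted and scaled to a mean of 161.3 and the SD implied by se = 0.094. Parameters looked up from the book's tables would replace them.

```stata
* Sketch: simulate from a GLD (Ramberg-Schmeiser form) via its quantile function
clear
set seed 12345
set obs 5510

* Illustrative parameters (assumption): a normal-like GLD rescaled to the heights
* l1 = location, l2 = inverse scale, l3 and l4 = left- and right-tail shape
local sd = 0.094 * sqrt(5510)
local l1 = 161.3
local l2 = 0.1975 / `sd'
local l3 = 0.1349
local l4 = 0.1349

* Inverse-CDF sampling: uniform draws pushed through the GLD quantile function
generate double u = runiform()
generate double x = `l1' + (u^`l3' - (1 - u)^`l4') / `l2'

* Share of draws between 150 and 170 estimates Pr(150 <= X <= 170)
count if inrange(x, 150, 170)
display "Pr(150 <= X <= 170) ~ " r(N) / _N
```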

The GLD would be useful if your real distribution is not the one you provide (that is, if those numbers are just an example). The distribution you provide is about as normal as normal gets.

If you actually want the answer for that particular distribution, then you already have it at the values you provide (or really close to it). Or, the rnormal procedure above would let you calculate it: run it 1,000 times and get the cutoffs at whatever values you want.
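
The repeated-draw idea can be sketched with simulate. The program name sim_share and the cutoffs 150 and 170 are only placeholders, and the draw assumes the normal distribution with se = 0.094 discussed above.

```stata
* Sketch: repeat the rnormal draw 1,000 times and record the share in [150, 170]
capture program drop sim_share
program define sim_share, rclass
    drop _all
    set obs 5510
    generate double x = rnormal(161.3, 0.094*sqrt(5510))
    count if inrange(x, 150, 170)
    return scalar share = r(N) / _N
end

set seed 12345
simulate share = r(share), reps(1000) nodots: sim_share

* Mean of share estimates the probability; its SD shows the sampling noise
summarize share
```

The mean of share should land close to the analytical normal() calculation with the same mean and SD, roughly 0.84.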

Last edited by George Ford; 12 Dec 2023, 08:20. Reason: fixed a few errors in the rnormal code
George Ford

Join Date: Aug 2014

Posts: 3177
#5

12 Dec 2023, 08:17

I dabbled with the GLD in the past. You'll get a feel for it reading this.

HTML Code:

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1293952
Dimitriy V. Masterov

Join Date: Mar 2014

Posts: 609
#6

13 Dec 2023, 19:26

If the SE is wrong, that would explain why I had trouble replicating the percentiles. Thank you for these suggestions!
