I was looking to understand how Stata computes weighted percentiles. I tried to follow the formula in the documentation (here: https://www.stata.com/manuals13/dpctile.pdf) but I get different results:
I'm not entirely sure why it gives that answer. It seems incorrect:
This shows that the first x such that the cumulative weighted sum is is larger than P is 997; however, Stata seems to think that this requires averaging with the prior observation, which should only happen if 49850.32031 is equal to 49850.55859.
I thought this might be a numerical precision issue, but I get the same result using doubles (the cum sum is 49850.31956 and P is 49850.56176). It is interesting to note that collapse gives what I understand to be the right answer:
Any insights?
Code:
. set seed 1729
. clear
. set obs 100000
number of observations (_N) was 0, now 100,000
. gen w = 100 * runiform()
. gen long x = _n
. sort x w
. gen wsum = sum(w)
. gen P = wsum[_N] * 1 / 100
. assert abs(wsum - P) > 0.1
.
. gen gt = (wsum > P)
. qui sum x if gt
. disp `r(min)'
997
.
. _pctile x [aw = w], p(1)
. return list
scalars:
r(r1) = 996.5
Code:
. format %21.5f wsum P
. disp P
49850.559
. qui count if wsum < P
. list in `=r(N) - 1' / `=r(N) + 1'
+-------------------------------------------------+
| w x wsum P gt |
|-------------------------------------------------|
995. | 22.8744 995 49817.42578 49850.55859 0 |
996. | 32.89221 996 49850.32031 49850.55859 0 |
997. | 25.59445 997 49875.91406 49850.55859 1 |
+-------------------------------------------------+
I thought this might be a numerical precision issue, but I get the same result using doubles (the cum sum is 49850.31956 and P is 49850.56176). It is interesting to note that collapse gives what I understand to be the right answer:
Code:
. collapse (p1) x [aw = w] . disp x 997
