Computing percentiles

John Mullahy

Join Date: Dec 2016
Posts: 752

27 May 2020, 07:29

This may be widely known, but in case not I thought I would share...

Stata has several commands that compute percentiles:

centile
sum, d
_pctile
egen pctile

and perhaps others.

It turns out that these do not always yield the same results, apart from the median (50th percentile). For example, this code:

Code:

preserve
cap drop _all
set obs 20
set seed 23
tempvar y
gen `y'=exp(rnormal(0,1))
qui centile `y', c(10 25 50 75 90)
di r(c_1) _n r(c_2) _n r(c_3) _n r(c_4) _n r(c_5)
qui sum `y',d
di r(p10) _n r(p25) _n r(p50) _n r(p75) _n r(p90)
qui _pctile `y', p(10 25 50 75 90)
di r(r1) _n r(r2) _n r(r3) _n r(r4) _n r(r5)
drop _all
restore

gives these results:

Code:

. preserve

. cap drop _all

. set obs 20
number of observations (_N) was 0, now 20

. set seed 23

. tempvar y

. gen `y'=exp(rnormal(0,1))

. qui centile `y', c(10 25 50 75 90)

. di r(c_1) _n r(c_2) _n r(c_3) _n r(c_4) _n r(c_5)
.29993572
.38304436
1.6890243
2.8531529
5.1466236

. qui sum `y',d

. di r(p10) _n r(p25) _n r(p50) _n r(p75) _n r(p90)
.31345257
.40814352
1.6890243
2.7669318
5.0989532

. qui _pctile `y', p(10 25 50 75 90)

. di r(r1) _n r(r2) _n r(r3) _n r(r4) _n r(r5)
.31345257
.40814352
1.6890243
2.7669318
5.0989532

. drop _all

. restore

.
end of do-file

There is nothing surprising about this if one reads carefully the respective "Methods and Formulas" sections in each command's documentation, as centile uses a different formula than do the others.

Yet the differences may be nontrivial in some contexts (e.g. computation of IQRs), so it is perhaps worth considering which of the competing formulae squares most closely with how the researcher conceives of percentiles.
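
As a rough illustration of the "Methods and Formulas" point, the 25th percentile from the example above can be reproduced by hand. This is only a sketch based on my reading of the manuals, assuming unweighted data: centile by default interpolates at position (n+1)p/100, while summarize and _pctile work from position np/100 and average two adjacent order statistics when that position is a whole number. The plain variable name y is used here purely for illustration.

Code:

preserve
drop _all
set obs 20
set seed 23
gen y = exp(rnormal(0,1))
sort y
* centile default: position (20+1)*25/100 = 5.25,
* so go a quarter of the way from y[5] to y[6]
di y[5] + 0.25*(y[6] - y[5])
qui centile y, c(25)
di r(c_1)
* summarize and _pctile default: position 20*25/100 = 5 is a whole number,
* so average y[5] and y[6]
di (y[5] + y[6])/2
qui _pctile y, p(25)
di r(r1)
* altdef asks _pctile for the interpolation rule instead (no weights allowed)
qui _pctile y, p(25) altdef
di r(r1)
restore

If that reading is right, the first two displayed values should agree with each other up to rounding, as should the next two, and the altdef result should match centile. It would also explain why the medians coincide: position 10.5 under the interpolation rule is halfway between y[10] and y[11], which is the same number the averaging rule returns.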


Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#2

27 May 2020, 08:44

Dear John,

Thank you very much for this very interesting post.

If asked to compute percentiles, I would run a quantile regression on a constant and that produces results that are different from the ones you reported (including for the median). Of course, these variations are caused by the fact that percentiles are not always point identified, but personally I find it troubling that different commands in the same software produce different estimates of the same quantities. I would be interested in knowing what other users think of this; maybe this heterogeneity is valued by some users?
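
A minimal sketch of that comparison, reusing the 20 observations from the first post (the variable name y is an assumption for illustration). With no covariates, the intercept from qreg is the estimated quantile. When nq is a whole number, any value between two adjacent order statistics minimizes the check loss, so qreg can be expected to report one of the data values, whereas _pctile averages the two.

Code:

preserve
drop _all
set obs 20
set seed 23
gen y = exp(rnormal(0,1))
* constant-only quantile regressions: the intercept is the estimated quantile
qui qreg y, quantile(.25)
di _b[_cons]
qui qreg y, quantile(.5)
di _b[_cons]
* the same two percentiles under the default rule of _pctile
qui _pctile y, p(25 50)
di r(r1) _n r(r2)
restore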

Best wishes,

Joao
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#3

27 May 2020, 08:59

I do share Joao's opinion about the interesting topic started by John.
In the past, I found this difference in results really troubling: during my first years with Stata I remember calculating and re-calculating percentiles with different methods (and getting different results) before discovering in the Stata .pdf manual that, as usual, different methods give different results (and unavoidably so).
Now I compute percentiles with -summarize, d- or -tabstat- (in recent years I have preferred the latter), as they are expected to give back the same results; hence, I have parked this nuisance at the back of my mind (despite its having bitten me in the past).
Unlike Joao, I rarely use -qreg-.

Kind regards,
Carlo
(Stata 19.0)
Maarten Buis

Join Date: Mar 2014

Posts: 3458
#4

27 May 2020, 09:22

I like that the option exists to use different definitions of the percentiles; I don't like that it is so "hidden". My guess would be that this diversity just grew "organically" as more and more commands were added to Stata. I can easily understand how that could happen. However, I would prefer to "harmonize" these commands, such that they have the same default and the same option that governs which definition is used. That would make it easier to communicate to the user that these differences exist, and what the choices are. This would fit with what I like about most Stata commands: if you want to do the same thing in different commands then you use the same option. For example, if you want robust standard errors, you add the option vce(robust) regardless of whether you are doing a linear regression, a logistic regression, or anything else. This is what I miss in this "potpourri" of percentile commands.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

27 May 2020, 09:48

There is minor chaos on this in the literature too. A moderately famous, or infamous, paper documents nine different methods for calculating quantiles and software can be found that boasts scope for choosing any. As I recall that paper says nothing whatsoever about generalizing recipes yet further so that weights can be applied too, which is what some of these Stata commands do.

The different commands have different purposes too. The main or distinctive selling point of centile is to provide confidence intervals. At a different end of statistics, Tukey's boxplot suggestions (with a box based on median and quartiles) were based on what he called hinges (and later fourths), which were always either order statistics (values in the data) or half-way between them (so deliberately eschewing any interpolation rule that was more complicated). His name of hinges in particular was based partly on personal whim but positively also as a signal that hinges need not agree exactly with anybody else's idea of how to calculate quartiles.
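
For concreteness, here is a small sketch of one common textbook statement of the hinge rule: the median sits at depth (n+1)/2, and the hinges sit at depth (floor(median depth) + 1)/2 counted in from each end, so they are either data values or midpoints of two adjacent data values. The variable name y and the local macro names are assumptions for illustration, and the data are the 20 observations from the first post.

Code:

preserve
drop _all
set obs 20
set seed 23
gen y = exp(rnormal(0,1))
sort y
* depth of the median, then depth of the hinges
local dm = (_N + 1)/2
local dh = (floor(`dm') + 1)/2
* a depth ending in .5 means "halfway between two order statistics"
local lo = floor(`dh')
local hi = ceil(`dh')
* lower hinge, counting in from the smallest value
di (y[`lo'] + y[`hi'])/2
* upper hinge, counting in from the largest value
di (y[_N + 1 - `lo'] + y[_N + 1 - `hi'])/2
restore

With n = 20 the hinge depth is 5.5, so the hinges are the midpoints of y[5], y[6] and of y[15], y[16]. For other sample sizes they need not match the quartiles reported by any of the commands discussed above.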

We're all in favour of consistency and standardization, except that the detail remains of what is the best standard.

Most or all features of statistical computation – computer hardware, software systems, coding, languages, symbols, terminology, procedures – have much to gain from elimination of pointless variations, redundancies and confusion. Yet pointlessness is not always easy to judge. The only quite satisfying rule of standardization is that you adopt my standards.

Anscombe, F.J. 1981. Computing in Statistical Science through APL. New York: Springer. p.3.

Last edited by Nick Cox; 27 May 2020, 09:52.
