
  • Computing percentiles

    This may be widely known, but in case it is not, I thought I would share.

    Stata has several commands that compute percentiles:

    centile
    sum, d
    _pctile
    egen pctile

    and perhaps others.

    It turns out that these do not always yield the same results, except for the median (50th percentile). For example, this code:
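    preserve
    cap drop _all
    set obs 20
    set seed 23
    tempvar y
    gen `y'=exp(rnormal(0,1))
    * centile: results are returned in r(c_1), ..., r(c_5)
    qui centile `y', c(10 25 50 75 90)
    di r(c_1) _n r(c_2) _n r(c_3) _n r(c_4) _n r(c_5)
    * summarize, detail: results are returned in r(p10), ..., r(p90)
    qui sum `y',d
    di r(p10) _n r(p25) _n r(p50) _n r(p75) _n r(p90)
    * _pctile: results are returned in r(r1), ..., r(r5)
    qui _pctile `y', p(10 25 50 75 90)
    di r(r1) _n r(r2) _n r(r3) _n r(r4) _n r(r5)
    drop _all
    restore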
    gives these results:
    . preserve
    . cap drop _all
    . set obs 20
    number of observations (_N) was 0, now 20
    . set seed 23
    . tempvar y
    . gen `y'=exp(rnormal(0,1))
    . qui centile `y', c(10 25 50 75 90)
    . di r(c_1) _n r(c_2) _n r(c_3) _n r(c_4) _n r(c_5)
    . qui sum `y',d
    . di r(p10) _n r(p25) _n r(p50) _n r(p75) _n r(p90)
    . qui _pctile `y', p(10 25 50 75 90)
    . di r(r1) _n r(r2) _n r(r3) _n r(r4) _n r(r5)
    . drop _all
    . restore
    end of do-file
    There is nothing surprising about this if one carefully reads the "Methods and formulas" section of each command's documentation: centile uses a different formula from the others.

    Yet the differences may be nontrivial in some contexts (e.g. computation of IQRs), so it is perhaps worth considering which of the competing formulae squares most closely with how the researcher conceives of percentiles.
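To make the difference concrete, here is a minimal sketch in Python (not Stata) of the two rules as I understand them from the manuals. The averaging rule at position n*p/100 is assumed to match the unweighted default of -summarize, detail- and -_pctile-, and the interpolation rule at position (n+1)*p/100 is assumed to match the default of -centile-. The function names and the data 1, 2, ..., 20 are made up for illustration; they are not the random draws from the post above.

```python
import math


def pctile_averaging(values, p):
    """Percentile by the averaging rule at position n*p/100.

    Assumed to mirror the unweighted default of -summarize, detail-
    and -_pctile-; requires 0 < p < 100 and at least one value.
    """
    x = sorted(values)
    n = len(x)
    if n == 0 or not 0 < p < 100:
        raise ValueError("need data and 0 < p < 100")
    pos = n * p / 100
    i = math.floor(pos)
    if pos == i:
        # Position lands exactly on an order statistic:
        # average x_(i) and x_(i+1) (1-based indexing).
        return (x[i - 1] + x[i]) / 2
    # Otherwise take the next order statistic, x_(i+1) (1-based).
    return x[i]


def pctile_interpolated(values, p):
    """Percentile by linear interpolation at position (n+1)*p/100.

    Assumed to mirror the default of -centile-; positions below 1
    or at or above n are clamped to the smallest or largest value.
    """
    x = sorted(values)
    n = len(x)
    if n == 0 or not 0 < p < 100:
        raise ValueError("need data and 0 < p < 100")
    pos = (n + 1) * p / 100
    r = math.floor(pos)
    frac = pos - r
    if r < 1:
        return x[0]
    if r >= n:
        return x[-1]
    # Interpolate between x_(r) and x_(r+1) (1-based indexing).
    return x[r - 1] + frac * (x[r] - x[r - 1])


data = list(range(1, 21))  # illustrative data: 1, 2, ..., 20

iqr_averaging = pctile_averaging(data, 75) - pctile_averaging(data, 25)
iqr_interpolated = pctile_interpolated(data, 75) - pctile_interpolated(data, 25)

print(iqr_averaging)     # 10.0
print(iqr_interpolated)  # 10.5
print(pctile_averaging(data, 50) == pctile_interpolated(data, 50))  # True
```

With these 20 evenly spaced values the two rules agree on the median but give IQRs of 10 and 10.5, which is the kind of discrepancy described above.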

  • #2
    Dear John,

    Thank you very much for this very interesting post.

    If asked to compute percentiles, I would run a quantile regression on a constant, and that produces results that differ from the ones you reported (including for the median). Of course, these variations arise because percentiles are not always point identified, but personally I find it troubling that different commands in the same software produce different estimates of the same quantities. I would be interested in knowing what other users think of this; maybe some users value this heterogeneity?
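The point about identification can be illustrated with a small Python sketch (not Stata). It assumes the standard formulation in which a quantile regression on a constant minimizes the sum of check losses; the function name `check_loss` and the data 1, 2, ..., 20 are hypothetical, and the sketch says nothing about which minimizer -qreg- itself reports.

```python
def check_loss(values, b, tau):
    """Sum of quantile-regression check losses for a constant fit b.

    Each residual u = y - b contributes tau * u when u >= 0
    and (tau - 1) * u when u < 0.
    """
    total = 0.0
    for y in values:
        u = y - b
        total += u * (tau - (1 if u < 0 else 0))
    return total


data = list(range(1, 21))  # illustrative data: 1, 2, ..., 20

# Evaluate the objective for the 25th percentile at several candidates.
for b in (4, 5, 5.25, 5.5, 6, 7):
    print(b, check_loss(data, b, 0.25))
```

For these data the loss is 37.5 at every candidate from 5 to 6 and larger outside that interval, so any value between the 5th and 6th order statistics is an equally good "25th percentile" by this criterion. The competing formulas are, in effect, different conventions for picking one point from such an interval.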

    Best wishes,



    • #3
      I do share Joao's opinion about the interesting topic started by John.
      In the past, I found this difference in results really troubling: during my first years with Stata, I remember calculating and recalculating percentiles with different methods (and getting different results) before discovering in the Stata .pdf manual that, as usual, different methods give different results (and unavoidably so).
      Now I compute percentiles with -summarize, d- or -tabstat- (in recent years I have preferred the latter), as they are expected to give the same results; hence, I have parked this nuisance at the back of my mind (despite its having hit me in the past).
      Unlike Joao, I rarely use -qreg-.
      Kind regards,
      (Stata 16.0 SE)


      • #4
        I like that the option exists to use different definitions of the percentiles; I don't like that it is so "hidden". My guess would be that this diversity just grew "organically" as more and more commands were added to Stata. I can easily understand how that could happen. However, I would prefer to "harmonize" these commands, such that they have the same default and the same option that governs which definition is used. That would make it easier to communicate to the user that these differences exist, and what the choices are. This would fit with what I like about most Stata commands: if you want to do the same thing in different commands, then you use the same option. For example, if you want robust standard errors, you add the option vce(robust) regardless of whether you are doing a linear regression, a logistic regression, or anything else. This is what I miss in this "potpourri" of percentile commands.
        Maarten L. Buis
        University of Konstanz
        Department of history and sociology
        box 40
        78457 Konstanz


        • #5
          There is minor chaos on this in the literature too. A moderately famous, or infamous, paper documents nine different methods for calculating quantiles and software can be found that boasts scope for choosing any. As I recall that paper says nothing whatsoever about generalizing recipes yet further so that weights can be applied too, which is what some of these Stata commands do.

          The different commands have different purposes too. The main or distinctive selling point of centile is to provide confidence intervals. At a different end of statistics, Tukey's boxplot suggestions (with a box based on median and quartiles) were based on what he called hinges (and later fourths), which were always either order statistics (values in the data) or half-way between them (so deliberately eschewing any interpolation rule that was more complicated). His name of hinges in particular was based partly on personal whim but positively also as a signal that hinges need not agree exactly with anybody else's idea of how to calculate quartiles.
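As a rough Python sketch of the hinge rule just described (my reading of Tukey's depth-based definition, so treat it as an assumption rather than a definitive statement): the median has depth (n+1)/2, the hinges have depth (floor of the median depth + 1)/2, and a half-integer depth means averaging the two adjacent order statistics. The function name `hinges` is made up for illustration.

```python
import math


def hinges(values):
    """Return (lower hinge, upper hinge) using depth-based counting.

    Each hinge is either an order statistic or the midpoint of two
    adjacent order statistics; no other interpolation is used.
    """
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("need at least one value")
    median_depth = (n + 1) / 2
    hinge_depth = (math.floor(median_depth) + 1) / 2
    lo = math.floor(hinge_depth)  # equals hi when the depth is an integer
    hi = math.ceil(hinge_depth)
    # Depth counts inward from each end of the sorted data (1-based).
    lower = (x[lo - 1] + x[hi - 1]) / 2
    upper = (x[n - lo] + x[n - hi]) / 2
    return lower, upper


print(hinges(range(1, 21)))  # (5.5, 15.5)
print(hinges(range(1, 12)))  # (3.5, 8.5)
```

For the values 1 to 11 the lower hinge is 3.5, whereas a rule that takes position n*p/100 and moves up to the next order statistic would give 3 for the 25th percentile, so hinges and quartiles can differ even on very simple data.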

          We're all in favour of consistency and standardization, except that the detail remains of what is the best standard.

          Most or all features of statistical computation -- computer hardware, software systems, coding, languages, symbols, terminology, procedures -- have much to gain from elimination of pointless variations, redundancies and confusion. Yet pointlessness is not always easy to judge. The only quite satisfying rule of standardization is that you adopt my standards.

          Anscombe, F.J. 1981. Computing in Statistical Science through APL. New York: Springer. p.3.
          Last edited by Nick Cox; 27 May 2020, 09:52.