  • qbplot available from SSC: pedagogic or proganda plot to wean people off box plots

    Thanks as always to Kit Baum, a new command qbplot is available from SSC.

    qbplot is a dedicated pedagogic or propaganda command with the aim of showing that a quantile plot can convey all the key information given by a box plot showing median and quartiles. Indeed, it can convey much more.

    Box plots are very commonly used, but in my view over-used and often not nearly as helpful as other plots might be.

    * The median and quartiles are convenient summary points but otherwise usually lack particular scientific or practical meaning.

    * Box plots often fail to give enough information in the tails.

    * Box plots usually fail to be very helpful, and can even be misleading, given distributions that are extremely skew, or include only a small number of distinct values, or are bimodal.

    * The Tukey convention shows data points individually if and only if they lie more than 1.5 IQR from the nearer quartile, and otherwise extends so-called whiskers to the outermost data points within 1.5 IQR of each quartile. In the 1970s its rationale was being amenable to hand calculation and to drawing plots with paper and pen. Fifty years on, we all have access to computers and can do better. Furthermore, this convention has often proved difficult to explain and understand. The last four books I have reviewed with content on statistical graphics all get it wrong! Other conventions are indeed possible and stripplot implements alternatives that might be congenial.

    * Box plots don't usually show means as extra detail, or other summaries such as geometric means, even when those would be as appropriate or more so. This comes down to how they are implemented, and it is one detail in which the Stata implementations graph box and graph hbox aren't flexible. Particularly bizarre, but remarkably common, are textbook or paper discussions of ANOVA results illustrated by box plots that don't even show the means.
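
    To make the 1.5 IQR convention in the list above concrete, here is a minimal sketch using only official commands. It assumes the quartiles reported by summarize, detail are the ones wanted; the local macro names are just for illustration.

    ```stata
    * Sketch of the 1.5 IQR rule for mpg in the auto data
    sysuse auto, clear
    quietly summarize mpg, detail
    local iqr   = r(p75) - r(p25)
    local lower = r(p25) - 1.5 * `iqr'
    local upper = r(p75) + 1.5 * `iqr'

    * Data points that the convention would show individually
    list make mpg if (mpg < `lower' | mpg > `upper') & !missing(mpg)

    * Whisker ends: the outermost data points inside the fences
    quietly summarize mpg if inrange(mpg, `lower', `upper')
    display "whiskers extend from " r(min) " to " r(max)
    ```

    The point of the sketch is only that the whiskers end at data points, not at the fences themselves, which is the detail that is most often misreported.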

    Back to what qbplot does:

    The main part of the display is a quantile plot, a scatter plot showing ordered values against the so-called plotting positions, labelled "Fraction of the data". Compare official command quantile; community-contributed command qplot (Stata Journal); community-contributed command multqplot (Stata Journal); community-contributed command stripplot (from SSC).

    On the vertical axis, the values of the minimum, maximum, median and quartiles are shown as axis labels. The median and quartiles are connected using horizontal and vertical line segments to plotting positions 0.25, 0.5, 0.75 on the horizontal axis. These segments may be extended mentally to imagine a conventional box showing median and quartiles, as in most variations of box plots.
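
    The idea of the design can be roughed out by hand with official commands. This is only a sketch, not the code of qbplot: it assumes plotting positions (i - 0.5)/n, which may not be the default qbplot uses, and it draws full reference lines rather than the line segments described above.

    ```stata
    * Hand-made approximation to the display, for mpg in the auto data
    sysuse auto, clear
    sort mpg

    * Assumed plotting position (i - 0.5)/n; mpg has no missing values here
    generate pp = (_n - 0.5) / _N

    * Minimum, maximum, median and quartiles for the axis labels
    quietly summarize mpg, detail
    local q1  = r(p25)
    local med = r(p50)
    local q3  = r(p75)
    local min = r(min)
    local max = r(max)

    scatter mpg pp, ms(oh) ///
        yline(`q1' `med' `q3', lstyle(grid)) ///
        xline(0.25 0.5 0.75, lstyle(grid)) ///
        yla(`min' `q1' `med' `q3' `max', ang(h)) ///
        xla(0 0.25 0.5 0.75 1) ///
        xtitle(Fraction of the data) ytitle("`: var label mpg'")
    ```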

    Variations from the basic design may be obtained by particular option choices.

    My suggestion is that this design can work well for samples of small, moderate or large size. If a sample is small, there is enough space to show all the detail in the data; the art is doing so while allowing broad features to be grasped easily. If a sample is very large, many data points often blur into each other, but that doesn't make such plots useless. Marked outliers, gaps, spikes and so forth are likely to be evident when they are present. Either way, quantile plots also convey information on skewness and tail weight.

    Whenever data are strongly skewed, between two and four axis labels may be uncomfortably close to each other. This is often a sign that you might be better off using a transformed scale. Note that ysc(log) is available for on-the-fly logarithmic transformation of entirely positive data. Here we rely on median and quartiles of logarithms of data being usually close enough to logarithms of median and quartiles of data, at least for exploratory purposes. Otherwise you may need to apply a transformation separately.
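
    How close is "close enough" is easy to check directly. The sketch below compares the logarithm of each quartile of wage with the corresponding quartile of ln(wage) in the nlsw88 data; the variable name lnwage is just for illustration. Because the logarithm is monotone, the two can differ only when a quantile is calculated by averaging two adjacent ordered values.

    ```stata
    * Compare ln(quantile of wage) with quantile of ln(wage)
    sysuse nlsw88, clear
    generate double lnwage = ln(wage)

    foreach p in 25 50 75 {
        quietly _pctile wage, p(`p')
        local raw = ln(r(r1))
        quietly _pctile lnwage, p(`p')
        display "p`p': ln(percentile of wage) = " %6.4f `raw' ///
            "   percentile of ln(wage) = " %6.4f r(r1)
    }
    ```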

    Much of the point of this display is that we no longer need to worry about precisely what is shown outside the box of a box plot. We just show all the data points. So, we do not need to implement any particular rule or convention, such as displaying data points individually if and only if they are more than 1.5 IQR from the nearer quartile. Nor do we need to interpret the results of any such rule or convention.

    Literature on so-called quantile-box plots is pertinent here. Parzen (1979a, 1979b, 1982, 1997) hybridised box and quantile plots as quantile-box plots. The help file gives several more references. For yet further references in this territory, see the help for stripplot.


    Enough sales pitch. The code examples in the help are also bundled in qbplot_test.do which is ancillary to the package. Let me show a few.

    [Image: qbplot_1.png]


    That is a first simple example with mpg from the auto data. We are showing minimum, maximum, median and quartiles as annotation to the quantile plot.
    [Image: qbplot_3.png]


    Logarithmic scale does not make much difference in this case, but it is easy to try. We can add any other summary that appeals.
    [Image: qb_6.png]


    Wage from the nlsw88 data is (unsurprisingly) skewed, and we really would be better off on logarithmic scale. We can use standard graphics options to work on presentation details.
    [Image: qb_8.png]


    That's it in essence. The aim, as said, is pedagogy or propaganda, or what unites those: persuasion. Use quantile plots, because annotation can give them the virtues of box plots, with none of the vices.

    Parzen, E. 1979a. Nonparametric statistical data modeling. Journal, American Statistical Association 74: 105-121.

    Parzen, E. 1979b. A density-quantile function perspective on robust estimation. In Launer, R.L. and G.N. Wilkinson (eds) Robustness in Statistics. New York: Academic Press, 237-258.

    Parzen, E. 1982. Data modeling using quantile and density-quantile functions. In Tiago de Oliveira, J. and B. Epstein (eds) Some Recent Advances in Statistics. London: Academic Press, 23-52.

    Parzen, E. 1997. Concrete statistics. In Ghosh, S., W.R. Schucany, and W.B. Smith (eds) Statistics of Quality. New York: Marcel Dekker, 309-332.
    Last edited by Nick Cox; 21 Aug 2024, 02:01.

  • #2
    Oops -- title of thread should mention propaganda -- not proganda. Sorry about that.


    • #3
      Hi Nick Cox, I am thinking about using your plot and I am wondering if it would be possible to highlight one or more data points. E.g., I plot the income of universities and I would like to highlight the position of my university and a bunch of comparators. Would that be possible? Marc


      • #4
        That would be easier using stripplot from SSC.

        (Sorry for slow reply. I did write this a week ago but evidently was distracted just before I was about to send it.)


        • #5
          Hi Nick, Thank you for your reply. I tried to extend -qbplot- a bit and I've been successful.
          I have added two options, casevar and cases, and then I borrowed a bit of code from Asjad's streamplot (SSC or Github), which requires palettes and colrspace from Ben Jann (SSC or Github):
          Code:
          if `"`casevar'"' ~= `""' {
              
              local items : word count `cases'
              
              forval x = 1/`items' {  
          
                  local numcolor = `items'
          
                  local case : word `x' of `cases'
          
                  colorpalette `palette', n(`numcolor') nograph `poptions'
          
                  local x1 =  `x' + 1
          
                  local scatter `scatter' scatter  `y' `pp' if `casevar' == `case' , mcolor("`r(p`x1')'")  msymbol(D) ||
          }
             
              twoway spike `quartiles' `where', base(`min') pstyle(p2) `spike' ///
              || spike `where' `quartiles', horizontal pstyle(p2) `spike' ///
              || scatter `y' `pp', pstyle(p1) ms(oh)  ///
              || `scatter' ///
              , xla(0 "0" 0.25 "0.25" 0.5 "0.5" 0.75 "0.75" 1 "1") yla(`Q') ///
              ytitle(`"`what'"') legend(off) xtitle(Fraction of the data) `options' ///
              || `addplot'
          In your first example I could highlight VW cars:
          Code:
          encode make, gen(make_n)
          qbplot mpg, aspect(1) name(qb1_VW, replace) casevar(make_n) cases(70 71 72 73)
          It gives me this.

          [Image: qb1_VW.png]

          Depending on my real use cases I may like to extend it further - like marker labels, different symbols etc.
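
          For comparison, the same kind of highlighting with marker labels can be sketched with official twoway alone, without any changes to -qbplot-. This is not the modified code above: it assumes plotting positions (i - 0.5)/n and picks out the VW cars by the start of the string variable make rather than by encoded values.

          ```stata
          * Quantile plot of mpg with VW cars highlighted and labelled
          sysuse auto, clear
          sort mpg
          generate pp = (_n - 0.5) / _N

          scatter mpg pp, ms(oh) ///
              || scatter mpg pp if strpos(make, "VW") == 1, ///
                 ms(D) mlabel(make) mlabpos(9) ///
              ||, legend(off) xtitle(Fraction of the data) ///
                 ytitle("`: var label mpg'")
          ```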

          • #6
            Excellent work. If you decide to make your version of the code public, please give it a different program name.
