Thanks as always to Kit Baum, a new command qbplot is available from SSC.
qbplot is a dedicated pedagogic or propaganda command with the aim of showing that a quantile plot can convey all the key information given by a box plot showing median and quartiles. Indeed, it can convey much more.
Box plots are very commonly used, but in my view over-used and often not nearly as helpful as other plots might be.
* The median and quartiles are convenient summary points but otherwise usually lack particular scientific or practical meaning.
* Box plots often fail to give enough information in the tails.
* Box plots usually fail to be very helpful, and can even be misleading, given distributions that are extremely skew, or include only a small number of distinct values, or are bimodal.
* The Tukey convention shows data points individually if and only if they lie more than 1.5 IQR from the nearer quartile, and otherwise extends so-called whiskers to the outermost data points within 1.5 IQR of each quartile. Its rationale in the 1970s was that it was amenable to hand calculation and to drawing plots with paper and pen. Fifty years on, we all have access to computers and can do better. Furthermore, this convention has often proved difficult to explain and understand. The last four books I have reviewed with content on statistical graphics all get it wrong! Other conventions are indeed possible, and stripplot implements alternatives that might be congenial.
* Box plots don't usually show means as extra detail, or other summaries such as geometric means, even when those would be as appropriate or more so. This comes down to how they are implemented, and it is one detail in which the Stata implementations graph box and graph hbox aren't flexible. Particularly bizarre, but remarkably common, are textbook or paper discussions of ANOVA results illustrated by box plots that don't even show the means.
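The Tukey convention described in the list above can be made concrete with a short sketch. This is a language-neutral illustration in Python, not Stata code and not part of qbplot. The function name tukey_box_summary and the sample values are hypothetical, and the quartiles come from Python's statistics.quantiles with method="inclusive", which is only one of several quartile rules and need not agree with what Stata reports.

```python
from statistics import quantiles


def tukey_box_summary(values, k=1.5):
    """Quartiles, whisker ends and individually shown points under the Tukey rule.

    Assumption: quartiles use Python's "inclusive" interpolation rule;
    other software may use a different quartile definition.
    """
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers end at the outermost DATA POINTS within the fences,
    # not at the fences themselves.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outside = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "upper_fence": upper_fence,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outside": outside,
    }


# Invented sample with one value far out in the upper tail
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = tukey_box_summary(sample)
print(summary)
```

With this sample the upper fence is at 14.5, but the upper whisker stops at 9, the largest data point inside the fence, and only 30 is shown individually.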
Back to what qbplot does:
The main part of the display is a quantile plot, a scatter plot showing ordered values against the so-called plotting positions, labelled "Fraction of the data". Compare the official command quantile; the community-contributed command qplot (Stata Journal); the community-contributed command multqplot (Stata Journal); and the community-contributed command stripplot (from SSC).
On the vertical axis, the values of the minimum, maximum, median and quartiles are shown as axis labels. The median and quartiles are connected using horizontal and vertical line segments to plotting positions 0.25, 0.5, 0.75 on the horizontal axis. These segments may be extended mentally to imagine a conventional box showing median and quartiles, as in most variations of box plots.
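As a rough illustration of what a quantile plot displays, here is a minimal sketch in Python (again, not Stata and not the qbplot code). It pairs each ordered value with a plotting position from the common family (i - a) / (n - 2a + 1). The choice a = 0.5, the function names and the data are assumptions made for illustration; the rule that qbplot itself uses should be checked in its help file.

```python
def plotting_positions(n, a=0.5):
    """Plotting positions (i - a) / (n - 2a + 1) for ranks i = 1, ..., n.

    Assumption: a = 0.5 is used here for illustration only.
    """
    return [(i - a) / (n - 2 * a + 1) for i in range(1, n + 1)]


def quantile_plot_points(values, a=0.5):
    """Return (fraction of the data, ordered value) pairs for a quantile plot."""
    data = sorted(values)
    return list(zip(plotting_positions(len(data), a), data))


# Invented data: five values, so the middle one is the median
points = quantile_plot_points([21, 14, 35, 18, 25])
for fraction, value in points:
    print(f"{fraction:.2f}  {value}")
```

In this small example the median, 21, is plotted at fraction 0.5, which is where a horizontal segment from the median label and a vertical segment from 0.5 on the horizontal axis would meet.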
Variations from the basic design may be obtained by particular option choices.
My suggestion is that this design can work well for samples of small, moderate or large size. If a sample is small, there is enough space to show all the detail in the data; the art is doing so while allowing broad features to be grasped easily. If a sample is very large, many data points often blur into each other, but that doesn't make such plots useless. Marked outliers, gaps, spikes and so forth are likely to be evident when they are present. Either way, quantile plots also convey information on skewness and tail weight.
Whenever data are strongly skewed, between two and four axis labels may be uncomfortably close to each other. This is often a sign that you might be better off using a transformed scale. Note that ysc(log) is available for on-the-fly logarithmic transformation of entirely positive data. Here we rely on the median and quartiles of the logarithms of the data usually being close enough to the logarithms of the median and quartiles of the data, at least for exploratory purposes. Otherwise you may need to apply a transformation separately.
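The point about logarithms can be checked with a small sketch in Python (not Stata; the values are invented for illustration). When the median is a single order statistic, as with an odd number of values, taking logarithms and taking the median commute exactly. When the median averages two order statistics, the two routes differ, because averaging logarithms gives a geometric mean rather than an arithmetic mean. The same reasoning applies to quartiles that are interpolated between data points.

```python
import math
from statistics import median

# Invented, strongly skewed, entirely positive values
even_sample = [1.0, 2.0, 4.0, 8.0, 100.0, 1000.0]  # median averages 4 and 8
odd_sample = even_sample[:5]                       # median is the value 4

# Odd n: the median is an order statistic, so log and median commute
median_odd = median(odd_sample)
back_odd = math.exp(median(math.log(x) for x in odd_sample))

# Even n: the median of the logs back-transforms to the geometric mean of 4 and 8
median_even = median(even_sample)
back_even = math.exp(median(math.log(x) for x in even_sample))

print(median_odd, back_odd)
print(median_even, back_even)
```

Here the even-sized sample has median 6 on the original scale, while the median of the logarithms back-transforms to the square root of 32, about 5.66. That is close enough for exploration, but not identical.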
Much of the point of this display is that we no longer need to worry about precisely what is shown outside the box of a box plot. We just show all the data points. So, we do not need to implement any particular rule or convention, such as displaying data points individually if and only if they are more than 1.5 IQR from the nearer quartile. Nor do we need to interpret the results of any such rule or convention.
Literature on so-called quantile-box plots is pertinent here. Parzen (1979a, 1979b, 1982, 1997) hybridised box and quantile plots as quantile-box plots. The help file gives several more references. For yet further references in this territory, see the help for stripplot.
Enough sales pitch. The code examples in the help are also bundled in qbplot_test.do which is ancillary to the package. Let me show a few.

That is a first simple example with mpg from the auto data. We are showing minimum, maximum, median and quartiles as annotation to the quantile plot.

Logarithmic scale does not make much difference in this case, but it is easy to try. We can add any other summary that appeals.

Wage from the nlsw88 data is (unsurprisingly) skewed, and we really would be better off on logarithmic scale. We can use standard graphics options to work on presentation details.

That's it in essence. The aim, as said, is pedagogy or propaganda, or what unites those: persuasion. Use quantile plots, because annotation can give them the virtues of box plots, with none of the vices.
Parzen, E. 1979a. Nonparametric statistical data modeling. Journal, American Statistical Association 74: 105-121.
Parzen, E. 1979b. A density-quantile function perspective on robust estimation. In Launer, R.L. and G.N. Wilkinson (eds) Robustness in Statistics. New York: Academic Press, 237-258.
Parzen, E. 1982. Data modeling using quantile and density-quantile functions. In Tiago de Oliveira, J. and B. Epstein (eds) Some Recent Advances in Statistics. London: Academic Press, 23-52.
Parzen, E. 1997. Concrete statistics. In Ghosh, S., W.R. Schucany, and W.B. Smith (eds) Statistics of Quality. New York: Marcel Dekker, 309-332.