graph box: I can't believe there's no way to change the whiskers

Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#1

23 Apr 2025, 08:39

At this point, I think I've googled this enough to know that there is no answer aside from doing something with -stripplot- (user written by Nick Cox) or laying multiple rbars / rspikes / rcaps over each other (written up by Nick Cox in a Stata Journal article on customizing boxplots), but I'm going to send this out into the world anyway and maybe someone will hear my cry for help. Or maybe simply say "I hear your pain."

graph box has a fatal flaw--the 1.5x IQR whiskers. It's canonical, but it's also not intuitive, and I am not sure anyone uses it as a measure of variation, etc. It would make way more sense to replace it with the 10th and 90th percentiles, or even just to hide the whiskers entirely. I don't pretend to be a programmer, but how hard could it be to have that as an option?

I spent a full day doing nothing but trying to make a box plot with four variables across four different conditions. I failed. I can get most of the way with:

Code:

graph box var1 var2 var3 var4, over(condition, sort(num_condition)) nooutsides

If it wasn't for the IQR x1.5, this gives me exactly what I want. Maybe I want to add the mean on top of that, but that's trivial.

Instead, here are the options I can find if I want to escape Tukey's deranged decision of an IQR x1.5, all of which I think were written up by Nick Cox in some form or another:

1. Use -stripplot-, hide the data points, swap in the percentiles, and adjust the margins so that it looks good. I've done this in the past, and it looks great, but there is one problem: you can't do multiple variables combined with over(). (Ref: https://www.statalist.org/forums/for...olling-margins; see also https://www.statalist.org/forums/for...updated-on-ssc)

2. Stack a zillion different plots on top of each other, hoping you don't make an error (Ref: https://journals.sagepub.com/doi/pdf...867X0900900309)

3. Collapse the dataset into only three variables (p25, p50, p75). This doesn't give me the 10th and 90th percentiles, but it does at least hide the IQR x1.5 (Ref: https://journals.sagepub.com/doi/pdf...36867X19893643)
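
For a single variable, the collapse-and-overlay idea can be sketched in a few lines. This is only an illustration under assumptions, not code from the cited articles: it uses the sp500 dataset shipped with Stata, an arbitrary grouping variable named condition, and p10/p90 as the whisker ends. Several variables per condition would need offset x positions, which is where the stacking gets laborious.

```stata
sysuse sp500, clear
* arbitrary grouping for illustration: four blocks of 62 trading days
gen condition = ceil(_n/62)

* one row per condition, holding five percentiles of close
collapse (p10) p10=close (p25) p25=close (p50) p50=close ///
         (p75) p75=close (p90) p90=close, by(condition)

* spike from p10 to p90 for the whiskers; two hollow bars meeting at the median form the box
twoway (rspike p10 p90 condition, lcolor(black))                           ///
       (rbar p25 p50 condition, barwidth(0.6) fcolor(none) lcolor(black))  ///
       (rbar p50 p75 condition, barwidth(0.6) fcolor(none) lcolor(black)), ///
       legend(off) xlabel(1/4) xtitle("condition") ytitle("Closing price")
```

Drawing the box as two bars that share the median edge avoids a separate plot for the median line.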

Surely, there must be a better solution than these? Several years ago, Michael Blasnik almost solved this problem with his proposed program boxredo (https://www.stata.com/statalist/arch.../msg00205.html), but it never got posted (https://www.stata.com/statalist/arch.../msg00298.html). Must we wait for Nick Cox to bail us out with yet another user-written program to solve something that has been asked about on Statalist for over 20 years? Is it really this hard to add an option to switch in the percentile and / or hide the whiskers entirely? Is this a philosophical decision to ensure that Tukey's brainchild is never altered? How did it come to this, that the tradition of dead generations weighs like a nightmare upon the brains of the living?

Thanks for listening to me rant,
Jonathan
ericmelse

Join Date: May 2014

Posts: 437
#2

23 Apr 2025, 09:11

Dear Jonathan, as per FAQ recommendation 2.2, "What to say about your data", you should provide a minimal example that others can use to replicate your issue and possibly offer a working alternative.

http://publicationslist.org/eric.melse
Chen Samulsion

Join Date: Jan 2018

Posts: 932
#3

23 Apr 2025, 10:19

You can change the whiskers without resorting to Nick's new inventions. You can suppress the whiskers, the adjacent lines, the outside values...

Code:

sysuse sp500
gen condition = ceil(_n/62)
graph box open high low close, over(condition) lines(lwidth(none) lstyle(none)) alsize(0) cwhiskers nooutside

Under Stata 14 you had better add the lstyle(none) option, but at least in Stata 16 you can omit it.
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#4

23 Apr 2025, 12:32

Chen Samulsion yes--that works as long as you're okay with some extra white space at the top and the bottom. But that can be done with the graph editor, which I can then save as a .grec file and replay it later.

Thanks!

Still can't get the p10 and p90 on there but it's better than nothing.
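
The replay step mentioned above can also be scripted rather than done by hand each time. A minimal sketch, assuming a Graph Editor recording has already been saved under the hypothetical name trimspace (the name and whatever edits it contains are not from this thread):

```stata
sysuse sp500, clear
gen condition = ceil(_n/62)
graph box open high low close, over(condition) ///
    lines(lwidth(none) lstyle(none)) alsize(0) cwhiskers nooutsides
* replay edits previously recorded in the Graph Editor;
* "trimspace" is a hypothetical recording name and must already exist
graph play trimspace
```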
Ben Jann

Join Date: Sep 2014

Posts: 269
#5

23 Apr 2025, 13:48

Could misuse violinplot (from SSC) to print IQR boxes without whiskers:

Code:

sysuse sp500
gen condition = ceil(_n/62)
violinplot open high low close, over(condition) vertical nostack ///
    noline nofill nowhisk

The statistics used for the box can be customized, but unfortunately this is not possible for the whiskers. I took a note to change this in a future update.
ben
Ben Jann

Join Date: Sep 2014

Posts: 269
#6

23 Apr 2025, 15:29

A new version of violinplot that allows customizing the whiskers is now available from GitHub (https://github.com/benjann/violinplot). I also sent the update to Kit Baum for release on SSC. Example with whiskers between 5th and 95th percentile:

Code:

sysuse sp500
gen condition = ceil(_n/62)
violinplot open high low close, over(condition) vertical nostack ///
    noline nofill key(box) whisk(stat(p5 p95))
Chen Samulsion

Join Date: Jan 2018

Posts: 932
#7

23 Apr 2025, 16:13

13:48–15:29, the future comes soon enough, great work! Thank you Ben Jann
Nick Cox

Join Date: Mar 2014

Posts: 35810
#8

24 Apr 2025, 00:00

Without wanting to address all the points raised directly or indirectly here, I add reflections of various kinds.

I tend to agree that the 1.5 IQR rule, which is a rule for the maximum allowed length of whiskers, not for their actual length, is no longer ideal after 50 or more years. But that rule remains remarkably popular in what I read and has even morphed, sad to say, into a commonly used rule for identifying outliers to be rejected and removed.
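
To make the distinction between maximum and actual whisker length concrete, here is a rough sketch for one variable: the fences sit 1.5 IQR beyond the quartiles, but the whiskers stop at the most extreme data values inside those fences. It assumes the sp500 dataset shipped with Stata and the quantile definition used by summarize, which may differ in detail from what a given graph command uses.

```stata
sysuse sp500, clear
quietly summarize close, detail
local iqr = r(p75) - r(p25)
* fences: the furthest the whiskers are allowed to reach
local lowerfence = r(p25) - 1.5*`iqr'
local upperfence = r(p75) + 1.5*`iqr'
* adjacent values: the most extreme observations still inside the fences
quietly summarize close if inrange(close, `lowerfence', `upperfence')
display "fences:          " `lowerfence' " to " `upperfence'
display "adjacent values: " r(min) " to " r(max)
```

If no observation lies near a fence, the whisker on that side ends well short of it.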

But even on a graph, leaving out data points beyond the ends of those or any other whiskers is not a direction I personally endorse.

Jonathan's main rant seems to be that graph box should be more flexible, which is a question for StataCorp.

Having explained how to do what Jonathan wants in various ways, I don't feel obliged to add another.

My major point remains that if you have a design based on systematic rules that you like, it's going to be programmable. My own inclination now is more usually to present boxes only as adjuncts to other displays, typically quantile plots or histograms.
Jonathan Horowitz

Join Date: Apr 2015

Posts: 102
#9

24 Apr 2025, 06:58

Originally posted by Chen Samulsion

13:48–15:29, the future comes soon enough, great work! Thank you Ben Jann

I agree--this is great. Probably not good for my own behavior that a somewhat spur-of-the-moment rant led to someone programming the exact solution I wanted within two hours. But it is a great leap forward for...me and everyone else like me!
Ben Jann

Join Date: Sep 2014

Posts: 269
#10

24 Apr 2025, 07:31

Note that using violinplot to draw boxplots is computationally inefficient; the main purpose of violinplot is to draw density curves and the densities will always be estimated, even if their display is suppressed by noline nofill.

Chen Samulsion

Join Date: Jan 2018
Posts: 932

#11

25 Apr 2025, 20:10

Maybe you'll be interested in Ben Jann's -robbox- command, which can produce the standard box plot (standard), the skewness-adjusted box plot (adjusted) based on the medcouple by Hubert and Vandervieren (2008), and the generalized box plot (the default) based on Tukey's g-and-h distribution by Bruffaerts et al. (2014). What's more, it reports a table of statistics such as the median, lower hinge, upper hinge, lower whisker, and upper whisker. By the way, you can draw a box-and-whisker plot yourself using Nick Cox's code in #2 in https://www.statalist.org/forums/for...t-the-raw-data

Code:

. ssc describe robbox

-----------------------------------------------------------------------------------------------------------------
package robbox from http://fmwww.bc.edu/repec/bocode/r
-----------------------------------------------------------------------------------------------------------------

TITLE
      'ROBBOX': module to compute generalized box plots

DESCRIPTION/AUTHOR(S)
      
         -robbox- is a command to produce (robust) box plots. Supported
      are the standard box plot (equivalent to -graph box-), the
      skewness-adjusted box plot based on the medcouple by Hubert and
      Vandervieren (2008), and the generalized box plot based on
      Tukey's g-and-h distribution by Bruffaerts et al. (2014). By
      default, -robbox- computes the generalized box plot.
      
      KW: robust statistics
      KW: box plot
      KW: Tukey's g-and-h distribution.medcouple
      
      Requires: Stata version 11
      
      Distribution-Date: 20190517
      
      Author: Ben Jann, University of Bern
      Support: email [email protected]
      
      Author: Vincenzo Verardi, University of Namur
      Support: email [email protected]
      
      Author: Catherine Vermandele, Universite Libre de Bruxelles
      Support: email
      

INSTALLATION FILES                               (type net install robbox)
      robbox.ado
      robbox.sthlp

ANCILLARY FILES                                  (type net get robbox)
      robbox.zip
-----------------------------------------------------------------------------------------------------------------
(type ssc install robbox to install)


Nick Cox

Join Date: Mar 2014

Posts: 35810
#12

26 Apr 2025, 01:54

Chen Samulsion Thanks for the cross-reference to my post, but the link is to a question in which the percentiles are available as separate variables in an observation. That is not the question here.

In turn, I will repeat propaganda for moving in a different direction. For example, I have never been able to work up enthusiasm for changing the definition of a box. I am much more positive about simpler rules for whiskers.

There is an entire bundle of loosely linked questions here, for which different answers make sense.

One situation is that only certain percentiles or quantiles are available.

Another situation is that the data are available but someone prefers only to show median and quartile boxes. This is programmable but it's hard for me to see why it's attractive without some extra motive, such as for example being shy of giving any details about the tails for reasons of confidentiality, privacy or security -- or (and this may be a misguided motive) to suppress the tails in some search for simplicity.

A more or less traditional median and quartiles box has only one role, to show that the middle half of the data are here, which leaves pretty open precisely where the other half of the data are, except that they lie beyond the quartiles.

Worse, it is common to see misinterpretations of even that idea. Perhaps the easiest to think through is a U-shaped distribution yielding a long box and short whiskers (implying on average lower density in the box zone and higher density in the tails), but even statistical people often don't spot that or interpret it correctly.

https://www.statalist.org/forums/for...ercentile-sets gives examples of quantile-box plots, a term that itself has various interpretations. Following Parzen, the line I follow is plotting all the ordered values together with a box (and often with some other summary, such as a mean or geometric mean). Quantile-box is a fair search term for other Statalist posts.
Ben Jann

Join Date: Sep 2014

Posts: 269
#13

27 Apr 2025, 05:39

Above (#10) I wrote that using violinplot for boxplots is computationally inefficient because density estimates will be computed even if display of density curves is suppressed. I have now changed this; a new version of violinplot is available from SSC that skips density estimation if it is not needed. While looking into this, I noticed that the command dstat summarize, which is called by violinplot, could be unnecessarily slow on small datasets, so I also updated dstat. To obtain the new versions of the two programs, type:

Code:

. ssc install violinplot, replace
. ssc install dstat, replace

ben
Razia Kadwa

Join Date: Apr 2025

Posts: 1
#14

27 Apr 2025, 07:40

how to determine frequency of an outcome between two treatment arms
Nick Cox

Join Date: Mar 2014

Posts: 35810
#15

27 Apr 2025, 07:48

I think #14 needs to start a new thread and to be a longer question.
