  • graph box: I can't believe there's no way to change the whiskers

    At this point, I think I've googled this enough to know that there is no answer aside from doing something with -stripplot- (user written by Nick Cox) or laying multiple rbars / rspikes / rcaps over each other (written up by Nick Cox in a Stata Journal article on customizing boxplots), but I'm going to send this out into the world anyway and maybe someone will hear my cry for help. Or maybe simply say "I hear your pain."

    graph box has a fatal flaw--the 1.5x IQR whiskers. It's canonical, but it's also not intuitive, and I am not sure anyone uses it as a measure of variation, etc. It would make way more sense to replace it with the 10th and 90th percentiles, or even just to hide the whiskers entirely. I don't pretend to be a programmer, but how hard could it be to have that as an option?

    I spent a full day doing nothing but trying to make a box plot with four variables across four different conditions. I failed. I can get most of the way with:

    Code:
    graph box var1 var2 var3 var4, over(condition, sort(num_condition)) nooutsides
    If it wasn't for the IQR x1.5, this gives me exactly what I want. Maybe I want to add the mean on top of that, but that's trivial.

    Instead, here are the options I can find if I want to escape Tukey's deranged decision of an IQR x1.5, all of which I think were written by Nick Cox in some form or another:

    Surely, there must be a better solution than these? Several years ago, Michael Blasnik almost solved this problem with his proposed program boxredo (https://www.stata.com/statalist/arch.../msg00205.html), but it never got posted (https://www.stata.com/statalist/arch.../msg00298.html). Must we wait for Nick Cox to bail us out with yet another user-written program to solve something that has been asked about on Statalist for over 20 years? Is it really this hard to add an option to switch in the percentiles and / or hide the whiskers entirely? Is this a philosophical decision to ensure that Tukey's brainchild is never altered? How did it come to this, that the tradition of dead generations weighs like a nightmare upon the brains of the living?

    Thanks for listening to me rant,
    Jonathan

  • #2
    Dear Jonathan, as per FAQ recommendation 2.2 ("What to say about your data"), it is best to provide a minimal example that others can use to replicate your issue and possibly provide a working alternative.
    http://publicationslist.org/eric.melse


    • #3
      You can change the whiskers without falling back on Nick's inventions as a last resort. You can suppress the whiskers, the adjacent line, the outsides...
      Code:
      sysuse sp500
      gen condition = ceil(_n/62)
      graph box open high low close, over(condition) lines(lwidth(none) lstyle(none)) alsize(0) cwhiskers nooutside
      Under Stata 14 you should add the lstyle(none) option; in Stata 16, at least, you can omit it.
      [Attached image: Graph.png]

      • #4
        Chen Samulsion yes--that works as long as you're okay with some extra white space at the top and the bottom. But that can be done with the graph editor, which I can then save as a .grec file and replay it later.

        Thanks!

        Still can't get the p10 and p90 on there but it's better than nothing.


        • #5
          Could misuse violinplot (from SSC) to print IQR boxes without whiskers:

          Code:
          sysuse sp500
          gen condition = ceil(_n/62)
          violinplot open high low close, over(condition) vertical nostack ///
              noline nofill nowhisk
          The statistics used for the box can be customized, but unfortunately this is not possible for the whiskers. I took a note to change this in a future update.
          ben
          [Attached image: Graph.png]

          • #6
            A new version of violinplot that allows customizing the whiskers is now available from GitHub (https://github.com/benjann/violinplot). I also sent the update to Kit Baum for release on SSC. Example with whiskers between 5th and 95th percentile:

            Code:
            sysuse sp500
            gen condition = ceil(_n/62)
            violinplot open high low close, over(condition) vertical nostack ///
                noline nofill key(box) whisk(stat(p5 p95))
            [Attached image: Graph.png]

            • #7
              13:48—15:29, the future comes soon enough, great works! Thank you Ben Jann


              • #8
                Without wanting to address all the points raised directly or indirectly here, I add reflections of various kinds.

                I tend to agree that the 1.5 IQR rule, which is a rule for the maximum allowed length of whiskers, not for their actual length, is no longer ideal after 50 or more years. But that rule remains remarkably popular in what I read and has even morphed, sad to say, into a commonly used rule for identifying outliers to be rejected and removed.
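
                To make the "maximum allowed length" point concrete, here is a minimal sketch (not the internal code of graph box) that computes the fences and the adjacent values for one variable of the sp500 example data used earlier in the thread. The scalar names are illustrative only, and it assumes the quartiles reported by summarize, detail are close enough to those used by graph box:

                ```stata
                * Sketch: how the 1.5 x IQR rule caps whisker length
                sysuse sp500, clear
                quietly summarize close, detail
                scalar q1      = r(p25)
                scalar q3      = r(p75)
                scalar iqr     = scalar(q3) - scalar(q1)
                scalar lofence = scalar(q1) - 1.5*scalar(iqr)
                scalar upfence = scalar(q3) + 1.5*scalar(iqr)
                * whiskers end at the most extreme observations still inside the fences
                quietly summarize close if inrange(close, scalar(lofence), scalar(upfence))
                scalar lower_adj = r(min)
                scalar upper_adj = r(max)
                display "fences:          " scalar(lofence) " to " scalar(upfence)
                display "adjacent values: " scalar(lower_adj) " to " scalar(upper_adj)
                ```

                The whiskers stop at the adjacent values, which are observed data points, so they are usually shorter than 1.5 IQR and never longer.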

                But even on a graph leaving out data points beyond the ends of those or any other whiskers is not a direction I personally endorse.

                Jonathan's main rant seems to be that graph box should be more flexible, which is a question for StataCorp.

                Having explained how to do what Jonathan wants in various ways, I don't feel obliged to add another.

                My major point remains that if you have a design based on systematic rules that you like it's going to be programmable. My own inclination now is more usually to present boxes only as adjuncts to other displays, typically quantile plots or histograms.
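
                As one illustration of "programmable", the layering approach mentioned in #1 (rbar and rcap from twoway) might look roughly like the sketch below. It is a simplified example under stated assumptions: one variable (close) rather than four, the condition variable built as in #3, and whiskers at the 10th and 90th percentiles. The variable names p10 to p90 are made up for the example:

                ```stata
                * Sketch: boxes with whiskers at the 10th and 90th percentiles
                sysuse sp500, clear
                gen condition = ceil(_n/62)
                * one row of summary statistics per condition
                collapse (p10) p10=close (p25) p25=close (p50) p50=close ///
                         (p75) p75=close (p90) p90=close, by(condition)
                * capped spike from p10 to p90 first, then the two box halves on top
                twoway rcap p10 p90 condition, lcolor(navy)                              ///
                    || rbar p25 p50 condition, barwidth(0.5) fcolor(white) lcolor(navy)  ///
                    || rbar p50 p75 condition, barwidth(0.5) fcolor(white) lcolor(navy)  ///
                       legend(off) xtitle("condition") ytitle("Closing price")
                ```

                Note that collapse replaces the data in memory, so wrap the code in preserve and restore if the original data are still needed.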


                • #9
                  Originally posted by Chen Samulsion
                  13:48—15:29, the future comes soon enough, great works! Thank you Ben Jann
                  I agree--this is great. Probably not good for my own behavior that a somewhat spur-of-the-moment rant led to someone programming the exact solution I wanted within two hours. But it is a great leap forward for...me and everyone else like me!


                  • #10
                    Note that using violinplot to draw boxplots is computationally inefficient; the main purpose of violinplot is to draw density curves and the densities will always be estimated, even if their display is suppressed by noline nofill.


                    • #11
                      Maybe you'll be interested in Ben Jann's -robbox- command, which can produce the standard box plot (standard), the skewness-adjusted box plot (adjusted) based on the medcouple by Hubert and Vandervieren (2008), and the generalized box plot (the default) based on Tukey's g-and-h distribution by Bruffaerts et al. (2014). What's more, it reports a table containing statistics such as the median, lower hinge, upper hinge, lower whisker, and upper whisker. By the way, you can draw a box-and-whisker plot yourself using Nick Cox's code in #2 in https://www.statalist.org/forums/for...t-the-raw-data
                      Code:
                      . ssc describe robbox
                      
                      -----------------------------------------------------------------------------------------------------------------
                      package robbox from http://fmwww.bc.edu/repec/bocode/r
                      -----------------------------------------------------------------------------------------------------------------
                      
                      TITLE
                            'ROBBOX': module to compute generalized box plots
                      
                      DESCRIPTION/AUTHOR(S)
                            
                               -robbox- is a command to produce (robust) box plots. Supported
                            are the standard box plot (equivalent to -graph box-), the
                            skewness-adjusted box plot based on the medcouple by Hubert and
                            Vandervieren (2008), and the generalized box plot based on
                            Tukey's g-and-h distribution by Bruffaerts et al. (2014). By
                            default, -robbox- computes the generalized box plot.
                            
                            KW: robust statistics
                            KW: box plot
                            KW: Tukey's g-and-h distribution.medcouple
                            
                            Requires: Stata version 11
                            
                            Distribution-Date: 20190517
                            
                            Author: Ben Jann, University of Bern
                            Support: email [email protected]
                            
                            Author: Vincenzo Verardi, University of Namur
                            Support: email [email protected]
                            
                            Author: Catherine Vermandele, Universite Libre de Bruxelles
                            Support: email
                            
                      
                      INSTALLATION FILES                               (type net install robbox)
                            robbox.ado
                            robbox.sthlp
                      
                      ANCILLARY FILES                                  (type net get robbox)
                            robbox.zip
                      -----------------------------------------------------------------------------------------------------------------
                      (type ssc install robbox to install)


                      • #12
                        Chen Samulsion Thanks for the cross-reference to my post, but the link is to a question in which the percentiles are available as separate variables in an observation. That is not the question here.

                        In turn, I will repeat propaganda for moving in a different direction. For example, I have never been able to work up enthusiasm for changing the definition of a box. I am much more positive about simpler rules for whiskers.

                        There is an entire bundle of loosely linked questions here, for which different answers make sense.

                        One situation is that only certain percentiles or quantiles are available.

                        Another situation is that the data are available but someone prefers only to show median and quartile boxes. This is programmable but it's hard for me to see why it's attractive without some extra motive, such as for example being shy of giving any details about the tails for reasons of confidentiality, privacy or security -- or (and this may be a misguided motive) to suppress the tails in some search for simplicity.

                        A more or less traditional median and quartiles box has only one role, to show that the middle half of the data are here, which leaves pretty open precisely where the other half of the data are, except that they lie beyond the quartiles.

                        Worse, it is common to see misinterpretations of even that idea. Perhaps the easiest to think through is a U-shaped distribution yielding a long box and short whiskers (implying on average lower density in the box zone and higher density in the tails), but even statistical people often don't spot that or interpret it correctly.
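
                        The U-shaped case can be tried out with simulated data. The sketch below assumes a Beta(0.5, 0.5) distribution as the U-shaped example; the seed, sample size, and graph names are arbitrary choices for illustration:

                        ```stata
                        * Sketch: a U-shaped sample gives a long box and short whiskers
                        clear
                        set seed 12345
                        set obs 1000
                        gen u = rbeta(0.5, 0.5)
                        * box plot: the box spans most of the range
                        graph box u, name(ubox, replace)
                        * histogram: the density is lowest in the middle, highest in the tails
                        histogram u, bin(20) name(uhist, replace)
                        ```

                        Comparing the two graphs side by side shows why a long box should not be read as "most of the data are in the middle".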

                        https://www.statalist.org/forums/for...ercentile-sets gives examples of quantile-box plots, a term that itself has various interpretations. Following Parzen, the line I follow is plotting all the ordered values together with a box (and often with some other summary, such as a mean or geometric mean). Quantile-box is a fair search term for other Statalist posts.
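
                        For readers who want to try the idea with official commands only, here is a rough sketch of one reading of a quantile-box plot: ordered values against a plotting position, with a median and quartile box superimposed. It is not the code behind the linked post; the plotting position (i - 0.5)/n, the variable name p, and the local macro names are assumptions made for the example:

                        ```stata
                        * Sketch: ordered values plus a median and quartiles box
                        sysuse sp500, clear
                        sort close
                        gen p = (_n - 0.5) / _N
                        quietly summarize close, detail
                        local q1  = r(p25)
                        local med = r(p50)
                        local q3  = r(p75)
                        * box outline and median line drawn from immediate coordinates (y x pairs)
                        twoway scatteri `q1' 0.25 `q3' 0.25 `q3' 0.75 `q1' 0.75 `q1' 0.25,   ///
                               recast(line) lcolor(gs8)                                       ///
                            || scatteri `med' 0.25 `med' 0.75, recast(line) lcolor(gs8)       ///
                            || scatter close p, msymbol(oh)                                   ///
                               legend(off) xtitle("Fraction of the data") ytitle("Closing price")
                        ```

                        The box covers the middle half of the data on both axes, while the points show where the other half actually lies.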


                        • #13
                          Above (#10) I wrote that using violinplot for boxplots is computationally inefficient because density estimates will be computed even if display of density curves is suppressed. I have now changed this; a new version of violinplot is available from SSC that skips density estimation if it is not needed. While looking into this, I noticed that the command dstat summarize, which is called by violinplot, could be unnecessarily slow on small datasets, so I also updated dstat. To obtain the new versions of the two programs, type:

                          Code:
                          . ssc install violinplot, replace
                          . ssc install dstat, replace
                          ben


                          • #14
                            how to determine frequency of an outcome between two treatment arms


                            • #15
                              I think #14 needs to start a new thread and to be a longer question.
