
  • Varying box plots


    I'm struggling a little to adapt the approach outlined in Nick Cox's excellent paper, Speaking Stata: Creating and varying box plots. I have an ordinary data set, resembling the table below, and I would like to generate a set of box plots highlighting a selected observation on the chart. In particular, I would like to generate a set of three box plots, each corresponding to one indicator, and additionally have OBS004 plotted on each box plot as a dot.

    Observations  Indicator1  Indicator2  Indicator3
    OBS001        4234123     123         21.1
    OBS002        21.1        4156        467
    OBS003        123         4858        301
    OBS004        3667        543         623
    ...           ...         ...         ...
    Kind regards,
    Version: Stata/IC 13.1

  • #2
    Done to be close to graph box, it's the code in that paper with a superimposed scatter. See also for an important correction. Both are reprinted in but both are also free to download. (Those aware of the 3-year window policy of the Stata Journal should note that the 2013 correction is an exception, as a correction.)

    Another way to approach this is via stripplot (SSC), which is an alternative to dotplot. The twist to your problem is a common one: you want to highlight a particular observation. For that purpose highlighting where it lies within a distribution is appealing. You can add boxes too. It's my prejudice that box plots are oversold as such: they all too often leave out a lot of detail that could be interesting or important. With three variables (your problem) there should be room to show much more.

    Here's a dopey example of a loosely similar kind. Naturally some study of the help is needed to understand all the options. I increasingly prefer W.S. Cleveland's suggestion that box whiskers are just drawn to specified quantiles. Here pctile(5) implies whiskers out to the 5% and 95% points. If you are showing all the data points anyway, you need not worry too much about thresholds for outliers, outside values, etc.

    sysuse auto, clear
    set scheme s1color
    gen is42 = _n == 42
    stripplot turn trunk, box(barwidth(0.12)) separate(is42) stack vertical  ///
    boffset(-0.1) pctile(5) legend(order(6 "Plymouth Arrow") ring(0) pos(1)) ///
    xla(, noticks) yla(, ang(h)) variablelabels height(0.4)
    Last edited by Nick Cox; 05 Jun 2014, 03:27.


    • #3
      Nick, thank you very much for your helpful reply.
      Kind regards,
      Version: Stata/IC 13.1


      • #4
        I very much like this approach and am wondering if it's possible to further customize by specifying different colors for each box&whisker (or box&pctile). I have tried doing this in the past with graph box, asyvars, but I don't think it's possible to overlay the scatter plot in that context.


        • #5
          Sorry to disappoint, but stripplot (SSC) isn't written to support different colours for different boxes (or whiskers). I never thought of doing that, as it would have no obvious function. The boxes are disjoint and identified by text, so you don't need extra variations to discriminate them. That's my view.


          • #6
            I should perhaps emphasise that there are choices on three levels here for box plots.

            1. From StataCorp, graph box (here including graph hbox) offers various choices (and various constraints). In particular, graph {h}box does not allow combination with twoway graphs.

            2. At another extreme is the idea of making your own box plots using different twoway elements to draw the box, the whiskers, whatever. That's some work putting together the details, but offers much more flexibility, e.g. in showing means too. That's the key point about my 2009 paper which Konrad cited.

            3. In between are user-written programs with particular canned choices. Others may exist that I've forgotten about or don't know but stripplot (SSC) is a kind of but unsurprisingly with emphasis on what the author thought interesting or useful, so there are limits too. On the other hand, you could always clone the code and write your own variations.
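A minimal sketch of option 2 above, using the auto data rather than the original poster's. It is not the code from the 2009 paper. The variable names (is42, med, loq, upq, lo, hi, x) are illustrative, and for simplicity the whiskers run to the minimum and maximum rather than to the adjacent values that graph box uses.

```stata
* Home-made box plot from twoway elements, with one observation highlighted.
sysuse auto, clear
gen byte is42 = _n == 42            // hypothetical flag for the point to highlight

* Summary statistics stored as constant variables
egen med = median(turn)
egen loq = pctile(turn), p(25)
egen upq = pctile(turn), p(75)
egen lo  = min(turn)
egen hi  = max(turn)
gen byte x = 1                      // horizontal position of the box

* Two spikes for the whiskers, two bars for the box, one scatter for the point
twoway rspike loq lo x, lcolor(navy)                              ///
    || rspike upq hi x, lcolor(navy)                              ///
    || rbar loq med x, barwidth(0.3) fcolor(none) lcolor(navy)    ///
    || rbar med upq x, barwidth(0.3) fcolor(none) lcolor(navy)    ///
    || scatter turn x if is42, mcolor(maroon) msize(large)        ///
       xscale(range(0.5 1.5)) xlabel(1 "turn", noticks)           ///
       xtitle("") ytitle("Turn circle (ft.)") ylabel(, angle(h))  ///
       legend(order(5 "Observation 42") ring(0) pos(1))
```

Because each element is a separate twoway call with its own lcolor(), repeating the calls at x = 2, 3, ... for further variables would allow a different colour per box, at the cost of more code.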


            • #7

              If boxplots are oversold, who has been doing the selling? The basic sources "sell" the boxplot as a graphical version of John Tukey's 5-number summary (min, lower fourth [quartile], median, upper fourth, max) with the added feature of showing "outside values" individually. No claim that boxplots are a panacea. A boxplot may (and often should) be accompanied by a plot that shows the data in full detail, as in your examples.

              The use of boxplots unfortunately suffers from considerable anarchy. Many people (Bill Cleveland may be among them) use a definition other than the standard one, but present the result as a boxplot without explaining the definition they have used. That attitude greatly complicates the task of interpreting "boxplots."

              David Hoaglin


              • #8
                The answer to your first question is, first of all but not only, authors of many introductory texts and teachers of many introductory courses who often (for example) push box plots for comparing just two groups, where far more detail is possible through other displays. (One text even presents box plots coloured one colour if they are right skewed and one colour if they are left skewed, which reaches some peak of silliness, although is also a rather trivial detail.) Some of these texts do seem to be pushing box plots as almost a panacea. You don't need to read introductory texts for any reason, I guess, but the standard is not that high in this respect.

                It is also very common, even standard, to find modern introductory texts pushing box plots in conjunction with analysis of variance. Looking at the data clearly beats not looking at the data but as analysis of variance is based on means not medians and SD-like quantities not IQR, the connection is at best a little indirect and at worst misleading. On the other hand, I remain naively amazed that plotting means alone is still a widely accepted standard.

                My perspective on box plots is inevitably a little unusual, at least for Stata users. As an academic geographer I am aware of a tradition of using box plot-like displays (usually called dispersion diagrams) that goes back to the 1930s if not earlier in geography and climatology. (Climatology is not only a part of geography, but even more a part or sibling of meteorology.) I first encountered these in geography textbooks circa 1967-1968 as utterly standard plots and perhaps about 1971 or 1972 first read of John Tukey's reinvention of the idea. As an amateur historian of statistics I am aware that the five-number summary and developments on it were favourites of Arthur Bowley more than 100 years ago and that he emphasised in one of his textbooks how it was a good basis for graphics. But these ideas did not have anything like the total impact that John Tukey's advocacy, followed by the texts and papers of people such as yourself, achieved from the middle 1970s on.

                It is slightly odd to sense here and there -- I'll let you confirm or refute whether this is your own position, although if it is the wording here would not be your style and is colourful only to add emphasis -- an idea that there is One True Way of doing box plots which is based on (3/2) IQR and so forth. The literature, published and samizdat, does show John Tukey and others experimenting with different possibilities. I do think that it's vital to be clear if you are using some unusual design, although what is unusual to some is simple and rational to others -- and this was a point I made firmly in my 2009 paper in the Stata Journal. I don't worry about anarchy that is explained. Bill Cleveland certainly was crystal clear in explaining his form. I don't think he pushed it in many places, as on the whole he has placed much more emphasis on quantile plots (which are a great personal favourite too).
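To make the two conventions concrete, here is a small sketch, again on the auto data, that computes the (3/2) IQR limits and the 5% and 95% points side by side. The scalar names are illustrative. Note that graph box draws whiskers to the most extreme data values inside the limits, not to the limits themselves.

```stata
* Compare the 1.5 * IQR rule with whiskers drawn to the 5% and 95% points.
sysuse auto, clear
quietly summarize turn, detail

* Save results now: the later count command overwrites r()
scalar q1  = r(p25)
scalar q3  = r(p75)
scalar p5  = r(p5)
scalar p95 = r(p95)
scalar iqr = scalar(q3) - scalar(q1)
scalar lower_limit = scalar(q1) - 1.5 * scalar(iqr)
scalar upper_limit = scalar(q3) + 1.5 * scalar(iqr)

* "Outside values" under the 1.5 * IQR rule (missing values excluded)
count if (turn < scalar(lower_limit) | turn > scalar(upper_limit)) & !missing(turn)

display "1.5 IQR limits:    " scalar(lower_limit) " to " scalar(upper_limit)
display "5% and 95% points: " scalar(p5) " to " scalar(p95)
```

Whichever rule is used, stating it next to the graph is what lets a reader interpret the whiskers.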

                P.S. In my first post I said

                "It's my prejudice that box plots are oversold as such"

                and that really should be "often oversold"; clearly there have long been sober exceptions.
                Last edited by Nick Cox; 05 Jun 2014, 16:25.


                • #9
                  From my perspective of producing research and analytical products for non-academic audiences, box plots are used because they require relatively little explanation compared with less common dispersion diagrams. As such, I would welcome suggestions on how to visualise effectively and simply:
                  1. Variance and distribution of the indicator
                  2. Position of a given data point in that indicator
                  In practice, the chart should answer the following questions:
                  1. How narrowly the values cluster together
                  2. What the highest and lowest values are
                  3. Where the given observation is
                  A histogram would be the most obvious choice, but it is not usable in this context as it is not easy for non-expert audiences to understand. A dotplot is also not usable as it creates an impression of a rank, and this should be avoided in the context of this assignment (task requirement, not preference). So by looking at the chart the reader should be able to infer:
                  1. Whether the given observation is in the highest/lowest/middle group
                  2. Whether the given observation can be informally classified as an outlier
                  3. But not be able to infer the precise position of the observation in the distribution
                  I'm currently using this code to make a box plot with a simple line:

                  /* Box plot with line */
                  // EDIT - define indicators to graph together
                  local indicators some_varlist
                  * Set colour scheme.
                  set scheme s1color
                  foreach var of varlist `indicators' {
                      // get value for the line to macro.
                      levelsof `var' if(group == "Group 112"), local(ylinevalue)
                      // Get indicator title
                      local varlbl : var label `var'
                      // draw a nice graph
                      graph box `var',   ///
                          yline(`ylinevalue', lwidth(medthick) lcolor(maroon) lpattern(dash)) ///
                              lines(lwidth(thin) lpattern(solid)) ///
                          title("`var'", size(medsmall) ///
                              margin(vsmall) position(12)) ///
                          box(1, lwidth(medthin)) ///
                          ytitle("Unit", size(small) margin(vsmall)) ///
                          ylabel(, labsize(vsmall)) ///
                          plotregion(lstyle(none)) ///
                          legend(on order(- "OBS") pos(10) ring(0) color(maroon) ///
                              size(small) region(fcolor(none) lcolor(none))) ///
                          name(box_`var', replace)
                  }
                  Kind regards,
                  Version: Stata/IC 13.1


                  • #10

                    I don't know what you understand by "dispersion diagrams". If I showed you some from the geographical and related literature, your reaction would probably be that they are really box plots. So, you must mean something else, but I don't know what it is.

                    More importantly, I really don't understand the objection to dot or strip plots here. If you have measurements they define an order or ranking, possibly with ties. How far that order is explicit varies between plots but it's there in the raw data.

                    My experience does not match yours on "non-experts" finding box plots easier to understand than histograms. In fact, I find more problems with statistically-minded people misreading unusual box plots than with the corresponding histograms!

                    So, we seem to have executed a perfect circle. You are playing with box plots with added extras, and posts in this thread have already documented ways to do that in Stata.


                    • #11
                      By dispersion diagrams I mean commonly used charts that in some way show the dispersion of the indicator, such as box plots, histograms, barcode charts and similar. As for dot and strip plots, it was decided that they are not desirable in this context, so it's a task requirement rather than my preference or objection; personally I find those charts informative and easy to understand.
                      Kind regards,
                      Version: Stata/IC 13.1


                      • #12
                        I see you are in a difficulty then, but that's between you and your boss(es).