  • Midgap plots, anyone? Or, box plots without boxes

    I recently stumbled across a reference to "midgap plots" and wondered what they are. Here's the reference that Google gives as #1 (ahead of numerous hits for other uses of "midgap" in physics):

    Stock, W. A., & Behrens, J. T. (1991). Box, line, and midgap plots: Effects of display characteristics on the accuracy and bias of estimates of whisker length. Journal of Educational Statistics, 16(1), 1–20. https://doi.org/10.2307/1165096
    Examined the accuracy and bias of estimates of whisker length on box-and-whisker plots based on box, line, and midgap plots. For each type of graph, a different sample of undergraduates (58 Ss total) viewed 48 single-plot graphs. For each plot, Ss were given the length of an interquartile spread and asked to estimate the length of a whisker. Plots varied in spatial orientation (horizontal or vertical), interquartile spread, the ratio of whisker length to interquartile spread, and whisker judged. Estimates of whisker length for box and line plots were more accurate and less biased than those for midgap plots. Interquartile spread, the ratio of whisker length to interquartile spread, and the interaction of these 2 factors significantly influenced both accuracy and bias. Boxplots displayed a predicted pattern of over- and underestimation. Midgap plots are judged to be less optimal displays than box and line plots.

    That paper is fairly widely accessible. People with access to JSTOR will find it easily. You can tell from the abstract that the authors aren't positive about the design, but to me the question being focused on is of little or no interest: The point about a box plot is not whether I can estimate distances by eye exactly; if I want to know those distances I look at the numbers! The point about a box plot is whether it helps me get a good idea of the main features of a distribution. Also, with respect, whether a captive audience of undergraduates is good with a design they may not have seen before is of interest and importance, but not my main concern, which is communicating results to myself and to other researchers.

    But what is a midgap plot? It appears to be the authors' own term for a design mentioned, and not even very enthusiastically, by Edward Tufte in what I still think is his best single book on visualization (https://www.edwardtufte.com/tufte/books_vdqi). Tufte's name for it is quartile plot.

    It's a box plot -- without the box -- but with a marker symbol for the median, and whiskers between each quartile and the extreme beyond. This is minimalism so minimal that minimalism is too long a word for it. But the information the box gives is implied by the other information.
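
    The definition above can be sketched with official commands only. This is a minimal illustration, not the code behind any of the cited designs: the variable names p25, p50, p75, lo, hi and tag are made up here, and the /// continuations assume the code is run from a do-file.

    ```stata
    * Sketch of a midgap (quartile) plot: median marker plus whiskers
    * running from each quartile to the extreme beyond it.
    * All new variable names are illustrative assumptions.
    sysuse auto, clear
    egen p25 = pctile(mpg), p(25) by(foreign)
    egen p50 = median(mpg), by(foreign)
    egen p75 = pctile(mpg), p(75) by(foreign)
    egen lo  = min(mpg), by(foreign)
    egen hi  = max(mpg), by(foreign)
    * one observation per group is enough for plotting summaries
    egen tag = tag(foreign)

    twoway rspike lo p25 foreign if tag, lcolor(black)               ///
        || rspike p75 hi foreign if tag, lcolor(black)               ///
        || scatter p50 foreign if tag, ms(Dh) mc(black)              ///
        xlabel(0 "Domestic" 1 "Foreign") xscale(range(-0.5 1.5))     ///
        xtitle("") ytitle("Mileage (mpg)") legend(off)
    ```

    The gap between the two spikes is where the box would have been, which is the sense in which the box is implied by the rest.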

    I think it's a much better idea than any of these authors imply, especially if the data are also shown.

    * Boxes in a box plot show emphatically where the median and quartiles are, but sometimes the emphasis is too strong. The quartiles are not magic thresholds at which anything happens beyond the cumulative probability passing 25% and 75%. This can bite very hard, as with a U-shaped distribution in which the top 25% and the bottom 25% are shown only by short whiskers. Even experienced statisticians have misread such box plots. (To be fair, John Tukey in Exploratory Data Analysis has a salutary example showing the superiority of dot plots over box plots where the data are basically two groups.)

    * Boxes take up space, but you can control that by making them thin. The ultimate in control of box width is to make them invisible.

    * What is going on in the middle of a distribution is not necessarily the feature that needs most emphasis. The tails are as or more important for many problems.
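
    The U-shaped case in the first point is easy to try with simulated data. What follows is only a sketch under assumptions of my choosing: a beta distribution with both shape parameters 0.3, an arbitrary seed, and 1000 observations. Nothing here comes from the papers or books cited.

    ```stata
    * Simulated U-shaped data: most values pile up near 0 and near 1
    clear
    set obs 1000
    set seed 2803
    gen u = rbeta(0.3, 0.3)

    * the quartiles sit close to the extremes, so the whiskers are short
    summarize u, detail

    * the box dominates the display although the data cluster in the tails
    graph box u, name(box, replace)

    * a histogram shows the two clusters that the box plot hides
    histogram u, bin(20) name(hist, replace)
    ```

    Comparing the two graphs side by side shows how a long box with short whiskers can be read, wrongly, as a distribution concentrated in the middle.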

    For box plots I sometimes use graph box or graph hbox but more often I reach for stripplot from SSC. I thought about hitting the code to add a distinct new option, but the syntax is complicated enough already and it's possible to get there without too much extra work. I like to show the data and summaries too, and I don't mind if that design is accused of being repetitive or redundant.


    Code:
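    * stripplot is community-contributed: ssc install stripplot
    sysuse auto, clear
    set scheme s1color
    * group medians, to be shown as diamonds beside the whiskers
    egen median = median(mpg), by(foreign)
    * horizontal position for the medians, matching boffset(-0.07) below
    gen where = foreign - 0.07
    * box(barw(0)) makes the boxes invisible; pctile(0) takes the whiskers to the extremes
    stripplot mpg, over(foreign) stack box(barw(0)) pctile(0) boffset(-0.07) vertical addplot(scatter median where, ms(Dh) mc(black)) ms(Sh) height(0.2)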


    [Figure: midgapplot.png]

    I used diamonds partly for fun, whereas all the authors mentioned above used circles, but I think that really is detail at a designer's discretion. It's helpful, however, to be able to say in a caption: medians are shown by diamonds, or whatever else you choose, and whiskers join quartiles and extremes. (Don't use the same marker symbol for data and medians.)

    Next time I will rotate the y axis labels to horizontal and lose the tick marks on the horizontal axis.
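
    Those two changes might look like this. It is a sketch that assumes stripplot passes the standard twoway axis-label options yla(, ang(h)) and xla(, noticks) through unchanged; the rest of the command is unchanged from the code in this post.

    ```stata
    * Requires stripplot from SSC: ssc install stripplot
    sysuse auto, clear
    egen median = median(mpg), by(foreign)
    gen where = foreign - 0.07
    * yla(, ang(h)) turns the y axis labels horizontal;
    * xla(, noticks) drops the tick marks on the horizontal axis
    * (assumed to be passed through to twoway)
    stripplot mpg, over(foreign) stack box(barw(0)) pctile(0) boffset(-0.07) vertical addplot(scatter median where, ms(Dh) mc(black)) ms(Sh) height(0.2) yla(, ang(h)) xla(, noticks)
    ```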

  • #2
    This is just to mention that plotting means, geometric means, trimmed means, whatever as well as medians could be fine too — just not all at once.
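
    As one instance, here is a sketch that swaps in geometric means. The variable names lnmpg and gmean and the triangle marker are my choices for illustration; the stripplot call otherwise follows #1.

    ```stata
    * Requires stripplot from SSC: ssc install stripplot
    sysuse auto, clear
    * geometric mean by group = exp(mean of logs); mpg is always positive here
    gen lnmpg = ln(mpg)
    egen gmean = mean(lnmpg), by(foreign)
    replace gmean = exp(gmean)
    gen where = foreign - 0.07
    * same display as in #1, with geometric means as hollow triangles
    stripplot mpg, over(foreign) stack box(barw(0)) pctile(0) boffset(-0.07) vertical addplot(scatter gmean where, ms(Th) mc(black)) ms(Sh) height(0.2)
    ```

    The caption should then say which summary the marker shows, as the whiskers still join quartiles and extremes.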



    • #3
      As documented at #7 in https://www.statalist.org/forums/for...updated-on-ssc the files for stripplot on SSC have been updated to mention this design. The term midgap plot is not original to the paper cited in #1 of this thread.

      This is an update of a different kind. 37 years on, Edward Tufte has revisited this design on pp.100-101 of his 2020 book https://www.edwardtufte.com/tufte/se...ith-fresh-eyes

      He is now negative about the design as over-simplified -- but that applies if it is presented as the plot of the data.

      General quotations:

      "Detailed data moves closer to the truth. No more binning, less cherry-picking, less truncation." (p.100)

      "To improve learning from data, credibility, and integrity, show the data." (p.101)

      I suggest that showing such a summary doesn't violate such advice if you show the original data too.







      • #4
        Thank you for sharing this book and the powerful command stripplot. What a legend.



        • #5
          This is a cross-reference to https://www.statalist.org/forums/for...updated-on-ssc where, on 11 July, I posted an update to stripplot with a new option to do this. (I changed my mind.)
