  • Announcing box_logscale: Stata module to create box plots on the log scale (generalising Tukey's definition of whiskers)

    Dear Statalisters,

    I have just posted my program box_logscale on SSC.

    box_logscale presents box plots on a log scale. It log10-transforms the yvar(s), runs graph box on the transformed values, and labels the numeric axis with nice original-scale numbers.

    It is an implementation of Nick Cox's advice in this FAQ: https://www.stata.com/support/faqs/g...rithmic-scales
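For readers who want to see the idea by hand, here is a minimal sketch of the three steps (transform, plot, relabel). It is not the box_logscale source code. The variable name logy and the tick positions are my own choices for illustration, and the simulated data are the same as in the example further down.

```stata
* Hand-rolled sketch of the idea; NOT the box_logscale source code
clear all
set seed 999
set obs 999
generate y = 10^rnormal(0, 0.3)

* Step 1: log10-transform the variable (logy is an illustrative name)
generate logy = log10(y)

* Steps 2 and 3: box plot of the transformed values, with ticks placed at
* log10 positions but labelled with the original-scale values
graph box logy, ylabel(-1 "0.1" 0 "1" 1 "10") ytitle("y")
```

The tick at 0 is labelled "1" because log10(1) = 0, and likewise for the others. Choosing and typing such labels by hand is the tedious part that the program takes over.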

    It creates the graph on the right below. It is an improvement on the graph on the left (which was created with Stata’s graph box command).
    [Image: boxplots.png]

    Left graph: many high outside values, no low outside values, and a badly labelled numeric axis.
    Right graph: just a few high outside values, just a few low outside values, and a nicely labelled numeric axis.

    Nearly all options of graph box can be specified.

    Feedback welcome.

    Best wishes, Mark


    Code:
    ssc install box_logscale
     
    *Generate a dataset with a lognormal variable
    clear all
    set seed 999
    set obs 999
    gen y = 10^rnormal(0,0.3)
     
    *Create a box plot in two different ways
    graph box y, yscale(log)  //  Stata's inbuilt command
    box_logscale y  //  new community-contributed command

    Detail:
    box_logscale calculates the quartiles of log10(y) in the usual manner described in [G-2] graph box. Call them q1log, q2log and q3log. It then calculates Ulog = q3log + 1.5*(q3log - q1log) and Llog = q1log - 1.5*(q3log - q1log). Adjacent values for log10(y) are defined in the usual manner described in [G-2] graph box: the upper adjacent value is the largest value of log10(y) not exceeding Ulog, and the lower adjacent value is the smallest value of log10(y) not less than Llog. The plot is then drawn, and the labels on the numeric axis are placed so that they correspond to nice (possibly user-specified) original-scale values.
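The calculation described above can be reproduced roughly as follows. This is a sketch, not the program's own code. It assumes that the quartiles from summarize, detail match those used by graph box, and the scalar names are mine.

```stata
* Sketch of the whisker calculation on the log scale; NOT the box_logscale source
clear all
set seed 999
set obs 999
generate y = 10^rnormal(0, 0.3)
generate logy = log10(y)

* Quartiles of log10(y) (assumed to match those used by graph box)
quietly summarize logy, detail
scalar q1log = r(p25)
scalar q3log = r(p75)

* Limits beyond which points count as outside values
scalar Ulog = q3log + 1.5*(q3log - q1log)
scalar Llog = q1log - 1.5*(q3log - q1log)

* Adjacent values: the most extreme observations still inside the limits
* (for continuous data, > and >= give the same lower adjacent value)
quietly summarize logy if logy <= scalar(Ulog)
scalar upper_adj_log = r(max)
quietly summarize logy if logy >= scalar(Llog)
scalar lower_adj_log = r(min)

* Back-transform to the original scale for reporting
display "Upper adjacent value of y: " 10^scalar(upper_adj_log)
display "Lower adjacent value of y: " 10^scalar(lower_adj_log)

* Number of outside values, i.e. points plotted individually
count if logy > scalar(Ulog) | logy < scalar(Llog)
```

The back-transformed adjacent values are where the whiskers end when the axis is read in original-scale units.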

    The result looks like a plot of the untransformed data in which the box shows q1 = 10^q1log, q2 = 10^q2log and q3 = 10^q3log, and whiskers are drawn from the box to the adjacent values. On the original scale, the upper adjacent value is the largest value of y not exceeding U, and the lower adjacent value is the smallest value of y not less than L, where
    U = 10^Ulog = q3 * (q3/q1)^1.5, and
    L = 10^Llog = q1 / (q3/q1)^1.5.
    It could be argued this results in a generalisation of Tukey's definition of whiskers.
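For anyone who wants the algebra behind U and L spelled out, it follows directly from the definitions of Ulog and Llog and from q1 = 10^q1log, q3 = 10^q3log. The subscripted symbols below are just typeset versions of the names used above.

```latex
\begin{align*}
U &= 10^{U_{\log}}
   = 10^{\,q_{3\log} + 1.5\,(q_{3\log} - q_{1\log})}
   = 10^{q_{3\log}} \left( \frac{10^{q_{3\log}}}{10^{q_{1\log}}} \right)^{1.5}
   = q_3 \left( \frac{q_3}{q_1} \right)^{1.5}, \\[4pt]
L &= 10^{L_{\log}}
   = 10^{\,q_{1\log} - 1.5\,(q_{3\log} - q_{1\log})}
   = 10^{q_{1\log}} \left( \frac{10^{q_{3\log}}}{10^{q_{1\log}}} \right)^{-1.5}
   = \frac{q_1}{\left( q_3 / q_1 \right)^{1.5}}.
\end{align*}
```

So where Tukey's usual limits add and subtract 1.5 times the interquartile range q3 - q1, these limits multiply and divide by the interquartile ratio q3/q1 raised to the power 1.5.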

    Reference: Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.