Dear Statalisters,
I have just posted my program box_logscale on SSC.
box_logscale presents box plots on a log scale. It log10-transforms the yvar(s), runs graph box, and labels the numeric axis with nice original-scale numbers.
It is an implementation of Nick Cox's advice in this FAQ: https://www.stata.com/support/faqs/g...rithmic-scales
It creates the graph on the right below. It is an improvement on the graph on the left (which was created with Stata’s graph box command).

Left graph: many high outside values, no low outside values, and a badly labelled numeric axis.
Right graph: just a few high outside values, just a few low outside values, and a nicely labelled numeric axis.
Nearly all options of graph box can be specified.
Feedback welcome.
Best wishes, Mark
Detail:
box_logscale calculates the quartiles of log10(y) in the usual manner described in [G-2] graph box. Call them q1log, q2log and q3log. It then calculates Ulog = q3log + 1.5*(q3log - q1log) and Llog = q1log - 1.5*(q3log - q1log). Adjacent values for log10(y) are defined in the usual manner described in [G-2] graph box: the upper adjacent value is the largest value of log10(y) not exceeding Ulog, and the lower adjacent value is the smallest value of log10(y) exceeding Llog. The plot is then drawn, and the numeric axis is labelled at positions that correspond to nice (possibly user-specified) original-scale values.
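The calculation described above can be reproduced by hand. The following is only a sketch, not the program's actual source: it assumes that the percentiles returned by summarize, detail agree with the quartiles graph box uses, and the scalar names upper_adj and lower_adj are made up for illustration.

```stata
* Sketch: quartiles, Ulog/Llog and adjacent values computed by hand
clear all
set seed 999
set obs 999
generate y = 10^rnormal(0, 0.3)
generate double logy = log10(y)

* Quartiles of log10(y) (assumed to match graph box's definition)
quietly summarize logy, detail
scalar q1log = r(p25)
scalar q3log = r(p75)

* Limits on the log scale
scalar Ulog = scalar(q3log) + 1.5*(scalar(q3log) - scalar(q1log))
scalar Llog = scalar(q1log) - 1.5*(scalar(q3log) - scalar(q1log))

* Adjacent values: most extreme observations within the limits,
* back-transformed to the original scale
quietly summarize logy if logy <= scalar(Ulog) & logy > scalar(Llog)
scalar upper_adj = 10^r(max)
scalar lower_adj = 10^r(min)

display "upper adjacent value of y: " %9.4f scalar(upper_adj)
display "lower adjacent value of y: " %9.4f scalar(lower_adj)
```

Observations of y above upper_adj or below lower_adj are the ones that would be drawn as outside values.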
The result looks like a plot of the untransformed data: the box shows q1 = 10^q1log, q2 = 10^q2log and q3 = 10^q3log, and whiskers are drawn between the box and the adjacent values. On the original scale, the upper adjacent value is the largest value of y not exceeding U, and the lower adjacent value is the smallest value of y exceeding L, where
U = 10^Ulog = q3 * (q3/q1)^1.5, and
L = 10^Llog = q1 / (q3/q1)^1.5.
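For anyone who wants the intermediate step, the expressions for U and L follow directly from the definitions of Ulog and Llog together with q1 = 10^q1log and q3 = 10^q3log. The derivation is written in LaTeX notation, using the same names as in the text:

```latex
\[
\begin{aligned}
U &= 10^{\text{Ulog}}
   = 10^{\,\text{q3log} + 1.5\,(\text{q3log} - \text{q1log})}
   = 10^{\text{q3log}} \left( \frac{10^{\text{q3log}}}{10^{\text{q1log}}} \right)^{1.5}
   = q_3 \left( \frac{q_3}{q_1} \right)^{1.5}, \\[4pt]
L &= 10^{\text{Llog}}
   = 10^{\,\text{q1log} - 1.5\,(\text{q3log} - \text{q1log})}
   = 10^{\text{q1log}} \left( \frac{10^{\text{q3log}}}{10^{\text{q1log}}} \right)^{-1.5}
   = \frac{q_1}{\left( q_3 / q_1 \right)^{1.5}}.
\end{aligned}
\]
```

So the additive step of 1.5 times the interquartile range on the log scale becomes a multiplicative step of (q3/q1)^1.5 on the original scale.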
It could be argued this results in a generalisation of Tukey's definition of whiskers.
Reference: Tukey, J.W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Code:
ssc install box_logscale

* Generate a dataset with a lognormal variable
clear all
set seed 999
set obs 999
gen y = 10^rnormal(0,0.3)

* Create a box plot in two different ways
graph box y, yscale(log) // Stata's inbuilt command
box_logscale y // new community-contributed command
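For comparison, here is a rough by-hand version of the same idea (transform, plot, relabel). It is only a sketch of the approach, not what box_logscale does internally. The variable name logy and the tick values 0.1, 0.3, 1, 3 and 10 are arbitrary choices made for this simulated dataset.

```stata
* Sketch: box plot of log10(y) with the axis labelled in original units
clear all
set seed 999
set obs 999
generate y = 10^rnormal(0, 0.3)
generate double logy = log10(y)

* Place each tick at log10(value) and show the original-scale value as its label
graph box logy, ///
    ylabel(`=log10(0.1)' "0.1" `=log10(0.3)' "0.3" `=log10(1)' "1" ///
           `=log10(3)' "3" `=log10(10)' "10") ///
    ytitle("y")
```

The tick positions have to be chosen anew for each variable, which is the step box_logscale takes care of.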