Challenge with getting nice logarithmic scale for box plot.

Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#1


29 Jul 2014, 03:10

Colleagues,

I'm trying to generate a presentable box plot for the attached variable. yscale(log) produces rather meagre results:

I'm trying to follow the FAQ on creating box plots with a logarithmic scale, and I'm using mylabels (SSC). I drafted the following code:

Code:

// Create nice log scale
clonevar logmeasure = measure
replace logmeasure = log10(measure)

// Define labels
mylabels 0(20)100 500(2000)10000 15000(30000)115963, ///
    myscale(log10(@)) local(labels)

// Graph box plot with population and area size
graph box logmeasure, ///
    ytitle("Measure", size(small) margin(vsmall)) ///
    ylabel(, labsize(vsmall) angle(horizontal)) ///
    plotregion(lstyle(none)) ///
    lines(lwidth(vthin)) ///
    ylabel(`labels', angle(h)) ///
    title("mylabels + log10") ///
    name(box_measure, replace)

// Drop log scale var
drop logmeasure

which produces:

The resulting box plot looks slightly more like the one suggested in the FAQ, but it's far from perfect. Could someone advise me how I should adjust the mylabels syntax to obtain a nice set of 8 to 10 labels that would highlight the distribution of the attached data and spread evenly across the graph? The attached data are in tab-delimited format because, oddly, I couldn't attach a .dat or .csv file.
Attached Files

boxvar.txt (63.4 KB, 1 view)

Last edited by Konrad Zdeb; 29 Jul 2014, 03:11. Reason: box plot

Kind regards,
Konrad
Version: Stata/IC 13.1
Tags: box plot, graph, scale, ssc
Nick Cox

Join Date: Mar 2014

Posts: 35709
#2

29 Jul 2014, 09:46

mylabels (SSC) can be useful in this and other problems, and I am as fond of it as anybody else, but it's a distraction here.

First off note that

1. Konrad's data (which define a single variable measure) range from about 1 to about 100000, so immediately 6 labels for the powers of ten 1, 10, 100, 1000, 10000, 100000 are natural.

2. Attempting to put 0 on a log scale is doomed.

3. Despite that FAQ, I would argue that using 1.5 IQR to determine what to plot outside the box is especially awkward on a log scale. Many readers would guess wrong what was being shown, and explaining it simply is a challenge. There is an easy alternative: draw whiskers to specified quantiles or percentiles, as the commutative property log of quantiles = quantile of logs is exact in principle and in practice compromised only slightly by any use of averaging adjacent values to get quantiles at specific % points.

4. Often with box plots what they leave out can be as interesting or important as what they show. So, showing a box together with something more detailed is usually a good idea.
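
Point 3 can be checked directly on the data. The following is only a sketch, assuming the variable measure is in memory and strictly positive; the temporary variable and the local macro names are made up for illustration. It prints log10 of selected percentiles of measure next to the same percentiles of log10(measure); the two columns should agree, apart from small differences wherever a percentile is obtained by averaging two adjacent values.

```stata
* Sketch: assumes a strictly positive variable named measure is in memory
tempvar lm
generate double `lm' = log10(measure)

* log10 of the percentiles of measure, stored in locals raw1..raw5
_pctile measure, percentiles(1 25 50 75 99)
forvalues i = 1/5 {
    local raw`i' = log10(r(r`i'))
}

* percentiles of log10(measure), shown next to the values above
_pctile `lm', percentiles(1 25 50 75 99)
forvalues i = 1/5 {
    display %9.4f `raw`i'' "  " %9.4f r(r`i')
}
```

Run this from a do-file or the do-file editor so that the temporary variable and locals are handled in one go.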

For these and other reasons I would suggest stripplot (also SSC) as in this respect more versatile than graph box.

If a log scale is natural, or at least convenient, binning is ipso facto natural, or at least convenient, on that scale. That being so, it is a good idea to use

Code:

gen log_measure = log10(measure)

After some playing around here is one possibility:

Code:

stripplot log_measure, box(barw(0.1)) boffset(-0.1) pctile(1) ///
    xla(0 "1" 1 "10" 2 "100" 3 "1000" 4 "10000" 5 "100000") ///
    width(0.01) ms(sh) msize(*.25) stack xtitle(measure)

Here the whiskers are drawn to the 1% and 99% percentiles. What is more intriguing is the apparent secondary mode on the log scale.

If you wanted more axis labels, then variations on 1 3 10 30 100 ... or 1 2 5 10 20 50 100 ... spring to mind, with the risk that the graph gets too busy. For those mylabels might indeed be useful.
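
As a rough illustration of the 1 3 10 30 100 ... variation, the sketch below lets mylabels build the label text and passes the result to stripplot. It assumes that mylabels and stripplot (both SSC) are installed and that log_measure has been generated as above; the local macro name labels is arbitrary. Eleven labels may already be too busy, so thin the list to taste.

```stata
* Sketch only: assumes mylabels and stripplot (SSC) are installed and that
* log_measure = log10(measure) already exists
mylabels 1 3 10 30 100 300 1000 3000 10000 30000 100000, ///
    myscale(log10(@)) local(labels)

* Same graph as before, with the generated labels in place of the
* hand-typed powers of ten
stripplot log_measure, box(barw(0.1)) boffset(-0.1) pctile(1) ///
    xla(`labels') width(0.01) ms(sh) msize(*.25) stack xtitle(measure)
```

The /// line continuations work in a do-file, not when typed in the Command window.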
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#3

29 Jul 2014, 09:55

Nick,

Thanks, stripplot is often a very nice alternative. I have this and some other variables with similar distributions; my ambition was to box plot them on log scales and then contrast them on one chart. Having said that, I agree that this is not the easiest distribution to show. What I'm trying to say here is:
the vast majority of values cluster within a narrow range

but there are some values that are way off from what we would normally expect

some other variables have similar characteristics (but others don't); similar box plots with different scales would follow here

Having said that, I can combine some stripplots with box plots and that will work as well. Thanks for the useful suggestion.

Kind regards,
Konrad
Version: Stata/IC 13.1
Nick Cox

Join Date: Mar 2014

Posts: 35709
#4

29 Jul 2014, 09:59

Also, check out multqplot (SJ).
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#5

30 Jul 2014, 01:25

Originally posted by Nick Cox

Also, check out multqplot (SJ).

I tried it and it returned the following picture.

I think that I will go with the first chart that was suggested, as it's the most readable to me. On another matter, I found that qplot is required to run this:

Code:

. multqplot measure
unrecognized command:  qplot
r(199);

Kind regards,
Konrad
Version: Stata/IC 13.1