  • Challenge with getting nice logarithmic scale for box plot.

    Colleagues,

    I'm trying to generate a presentable box plot for the attached variable. yscale(log) produces rather meagre results:



    I'm trying to follow the FAQ on creating box plots with a logarithmic scale, and I'm using mylabels (SSC). I drafted the following code:

    Code:
    // Create nice log scale
    clonevar logmeasure = measure
    replace logmeasure = log10(measure)
    
    // Define labels
    mylabels 0(20)100 500(2000)10000 15000(30000)115963, ///
        myscale(log10(@)) local(labels)
    
    // Graph box plot with population and area size
    graph box logmeasure, ///
        ytitle("Measure", size(small) margin(vsmall)) ///
        ylabel(, labsize(vsmall) angle(horizontal)) ///
        plotregion(lstyle(none)) ///
        lines(lwidth(vthin)) ///
        ylabel(`labels', angle(h)) ///
        title("mylabels + log10") ///
        name(box_measure, replace)
    
    
    // Drop log scale var
    drop logmeasure

    which produces:


    The resulting box plot looks slightly more like the one suggested in the FAQ, but it's far from perfect. Consequently, I was wondering if someone could advise me how I should adjust the mylabels syntax in order to obtain 8 to 10 nice labels that would highlight the distribution of the attached data and spread evenly across the graph. With respect to the attachment, the data are in tab-delimited format as, oddly, I couldn't attach a .dat or .csv file.
    Last edited by Konrad Zdeb; 29 Jul 2014, 03:11. Reason: box plot
    Kind regards,
    Konrad
    Version: Stata/IC 13.1

  • #2
    mylabels (SSC) can be useful in this and other problems, and I am as fond of it as anybody else, but it's a distraction here.

    First off, note that

    1. Konrad's data (which define a single variable measure) range from about 1 to about 100000, so immediately 6 labels for the powers of ten 1, 10, 100, 1000, 10000, 100000 are natural.

    2. Attempting to put 0 on a log scale is doomed.

    3. Despite that FAQ, I would argue that using 1.5 IQR to determine what to plot outside the box is especially awkward on a log scale. Many readers would guess wrong what was being shown, and explaining it simply is a challenge. There is an easy alternative: draw whiskers to specified quantiles or percentiles, as the commutative property log of quantiles = quantile of logs is exact in principle and in practice compromised only slightly by any use of averaging adjacent values to get quantiles at specific % points.

    4. Often with box plots what they leave out can be as interesting or important as what they show. So, showing a box together with something more detailed is usually a good idea.
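
    The point in 3 about quantiles can be checked directly. What follows is only a minimal sketch, assuming the variable is named measure as in the question and is strictly positive; log_check and the scalar names are made up for illustration. It compares the log of selected percentiles with the same percentiles of the logged variable, which should agree up to small differences wherever a percentile is obtained by averaging two adjacent values.

    Code:
    * Hypothetical check variable (not part of the original data); dropped at the end
    generate double log_check = log10(measure)

    * Percentiles of the raw variable
    quietly summarize measure, detail
    scalar raw_p1  = r(p1)
    scalar raw_p50 = r(p50)
    scalar raw_p99 = r(p99)

    * Percentiles of the logged variable, shown next to log10 of the raw percentiles
    quietly summarize log_check, detail
    display "p1:  " log10(scalar(raw_p1))  "  vs  " r(p1)
    display "p50: " log10(scalar(raw_p50)) "  vs  " r(p50)
    display "p99: " log10(scalar(raw_p99)) "  vs  " r(p99)

    drop log_check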

    For these and other reasons I would suggest stripplot (also SSC) as in this respect more versatile than graph box.

    If a log scale is natural, or at least convenient, binning is ipso facto natural, or at least convenient, on that scale. That being so, it is a good idea to use

    Code:
    gen log_measure = log10(measure)

    After some playing around, here is one possibility:

    Code:
    stripplot log_measure, box(barw(0.1)) boffset(-0.1) pctile(1) ///
        xla(0 "1" 1 "10" 2 "100" 3 "1000" 4 "10000" 5 "100000") ///
        width(0.01) ms(sh) msize(*.25) stack xtitle(measure)


    [Image: stripplot.png]

    Here the whiskers are drawn to the 1st and 99th percentiles. What is more intriguing is the apparent secondary mode on the log scale.

    If you wanted more axis labels, then variations on 1 3 10 30 100 ... or 1 2 5 10 20 50 100 ... spring to mind, with the risk that the graph gets too busy. For those mylabels might indeed be useful.
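
    As a sketch of that last suggestion, and no more than that: the following reuses the mylabels options from the original question (myscale() and local()) with the 1 3 10 30 ... series, and assumes log_measure has been created as above. The particular label values are just one choice and may well be too many for a small graph.

    Code:
    * Labels show 1 3 10 30 ... but are placed at log10() of those values
    mylabels 1 3 10 30 100 300 1000 3000 10000 30000 100000, ///
        myscale(log10(@)) local(labels)

    * Same stripplot call as above, with the generated labels on the x axis
    stripplot log_measure, box(barw(0.1)) boffset(-0.1) pctile(1) ///
        xla(`labels') width(0.01) ms(sh) msize(*.25) stack xtitle(measure)

    Note that 0 is deliberately absent from the list, as log10(0) is missing.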




    • #3
      Nick,

      Thanks, stripplot is often a very nice alternative. I have this and some other variables with similar distributions; my ambition was to draw box plots of them on log scales and then contrast them on one chart. Having said that, I agree that this is not the easiest distribution to show. What I'm trying to do here is to say:
      • the vast majority of values cluster around a few typical values
      • but there are some values that are way off from what we would normally expect
      • some other variables show similar characteristics (but others don't); similar box plots with different scales would follow here
      Having said that, I can combine some stripplots with box plots and that will work as well. Thanks for the useful suggestion.
      Kind regards,
      Konrad
      Version: Stata/IC 13.1



      • #4
        Also, check out multqplot (SJ).



        • #5
          Originally posted by Nick Cox
          Also, check out multqplot (SJ).
          I tried it and it returned the following picture.
          [Image: __000001.png]

          I think that I will go with the first chart that was suggested, as it's the most readable to me. On another matter, it turns out that qplot is required to run this:

          Code:
          . multqplot measure
          unrecognized command:  qplot
          r(199);
          Kind regards,
          Konrad
          Version: Stata/IC 13.1
