
  • Generate a legend explaining a box plot

    Hi all,

    I'm using Stata 13.1 and generating a series of box plots (209 in total). I understand how a box plot works, but my intended audience may not. I would therefore like to include in each graph an example box plot as a legend that explains what each part of the box plot shows. Something similar to this:




    Is anyone aware of a way of generating something like this in Stata which I can then place in/on my chart?

    For a working example of a box plot, I've borrowed an example from Nick Cox that uses twoway rbar to build a box plot:

    ***BEGIN EXAMPLE****
    sysuse lifeexp, clear
    label var lexp "Life expectancy (years)"
    egen median = median(lexp), by(region)
    egen upq = pctile(lexp), p(75) by(region)
    egen loq = pctile(lexp), p(25) by(region)

    egen iqr = iqr(lexp), by(region)
    egen upper = max(min(lexp, upq + 1.5 * iqr)), by(region)
    egen lower = min(max(lexp, loq - 1.5 * iqr)), by(region)
    egen mean = mean(lexp), by(region)


    #delim ;
    twoway rbar med upq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35) ||
    rbar med loq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35) ||
    rspike upq upper region, pstyle(p1) ||
    rspike loq lower region, pstyle(p1) ||
    rcap upper upper region, pstyle(p1) msize(*2) ||
    rcap lower lower region, pstyle(p1) msize(*2) ||
    scatter mean region, ms(Dh) msize(*2) ||
    scatter lexp region if !inrange(lexp, lower, upper), ms(Oh) mla(country)
    legend(off)
    xla(1 `""Europe and" "Central Asia""' 2 "North America" 3 "South America", noticks)
    yla(, ang(h)) ytitle(Life expectancy (years)) xtitle("") ;
    #delim cr


    Thanks

    Tim

  • #2
    Tim: Please use CODE delimiters, as previously requested of you.

    http://www.statalist.org/forums/foru...ighlighted-bar

    You are copying from SJ 9: 478-496 (2009) (which is fine by me, but the reference might be of interest to others).

    Corrected code appears in SJ 13:398-400 (2013).

    Both 2009 and 2013 papers are in my Speaking Stata graphics from Stata Press.

    The two lines

    Code:
    egen upper = max(min(lexp, upq + 1.5 * iqr)), by(region)
    egen lower = min(max(lexp, loq - 1.5 * iqr)), by(region)
    should be

    Code:
    egen upper = max(lexp / (lexp < upq + 1.5 * iqr)), by(region)
    egen lower = min(lexp / (lexp > loq - 1.5 * iqr)), by(region)
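A minimal sketch of why the division form matters, using a made-up four-value variable `x` and a limit of 5 (both are illustrative assumptions, not from the thread). A true condition evaluates to 1 and a false one to 0, so dividing by the condition leaves values inside the limit unchanged and turns values outside it into missing, which egen's max() and min() then ignore.

```stata
clear
input x
1
2
3
10
end

* x / (x < 5) equals x where x < 5 and is missing (division by zero) elsewhere
generate inside = x / (x < 5)

* First form: 10 is clipped to the limit, so the result is the limit itself (5)
egen wrong = max(min(x, 5))

* Second form: 10 becomes missing and is ignored, so the result is the
* largest observed value inside the limit (3)
egen right = max(x / (x < 5))

list
```

With the first form the whisker would end at the limit even though no data point lies there; with the second it ends at an observed value.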
    In principle, once you use graph twoway to generate a box plot, you can put text anywhere you like using a text() option, and arrows similarly using pcarrowi.
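As a rough sketch of the text() and pcarrowi route, the code below draws the box for a single region and labels its three parts. The choice of region 2, the x coordinates of the arrows and labels (2.55, 2.2, 2.57) and the widened x-axis range are all illustrative assumptions that would need adjusting for a real graph.

```stata
sysuse lifeexp, clear
keep if region == 2

* Quartiles of the one group shown, kept as locals for the annotation coordinates
summarize lexp, detail
local loq = r(p25)
local med = r(p50)
local upq = r(p75)
generate loq = `loq'
generate med = `med'
generate upq = `upq'

#delim ;
twoway rbar med upq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35) ||
rbar med loq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35) ||
pcarrowi `upq' 2.55 `upq' 2.2  `med' 2.55 `med' 2.2  `loq' 2.55 `loq' 2.2,
    lcolor(black) mcolor(black)
text(`upq' 2.57 "Upper quartile" `med' 2.57 "Median" `loq' 2.57 "Lower quartile",
    placement(e))
legend(off) xscale(range(1.5 3.5))
xla(2 "North America", noticks)
yla(, ang(h)) ytitle("Life expectancy (years)") xtitle("") ;
#delim cr
```

Each pcarrowi group of four numbers is a start point (y, x) followed by an end point (y, x), so every arrow here runs horizontally from the label towards the edge of the box.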

    In practice, I make it my own business never to do this with box plots, which in their Tukey form are often oversold (beyond, I believe, what Tukey would have approved).

    1. The verbal explanation that the central box shows the median and the two quartiles is a reasonable start.

    2. For my own empirical work, I now never use the arbitrary cut-offs based on lower quartile - 1.5 IQR and upper quartile + 1.5 IQR. That didn't stop me explaining to those wishing to reproduce Stata's defaults how to do it, as cited above. But they are just too arbitrary and awkward to explain to anyone who does not know them already (and undermined by people drawing box plots with quite different conventions, which I often approve, as below).

    3. Better practices include drawing whiskers to specific quantiles (percentiles) (e.g. 1 and 99% points, or 5 and 95%, an idea that goes back to the 1930s at least) and the hybrid quantile-box plots originally suggested by the late Emanuel Parzen. Both are supported by stripplot (SSC), which is quite often mentioned in this forum.

    I'd add that the quantile-box plot makes it much easier to explain that half the points are inside the box and half outside, and one quarter in ... each quarter, as you can see the points, at least collectively.
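The percentile-whisker idea in point 3 can be sketched with official commands only, by swapping the 1.5 IQR limits for 5% and 95% points in the twoway code from earlier in the thread. This is an illustration under that assumption rather than the stripplot route recommended above; the variable names `p05` and `p95` are made up for the example.

```stata
sysuse lifeexp, clear
egen median = median(lexp), by(region)
egen upq = pctile(lexp), p(75) by(region)
egen loq = pctile(lexp), p(25) by(region)

* Whiskers end at the 5% and 95% points instead of the 1.5 IQR limits
egen p95 = pctile(lexp), p(95) by(region)
egen p05 = pctile(lexp), p(5) by(region)

#delim ;
twoway rbar median upq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35) ||
rbar median loq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35) ||
rspike upq p95 region, pstyle(p1) ||
rspike loq p05 region, pstyle(p1) ||
scatter lexp region if !inrange(lexp, p05, p95), ms(Oh)
legend(off)
xla(1 `""Europe and" "Central Asia""' 2 "North America" 3 "South America", noticks)
yla(, ang(h)) ytitle("Life expectancy (years)") xtitle("") ;
#delim cr
```

The explanation for readers then reduces to "the box holds the middle half of the values and the whiskers hold the middle 90%", with the remaining points plotted individually.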

    Last edited by Nick Cox; 08 Mar 2016, 04:35.



    • #3
      Nick,

      I take your point about the box plot using an arbitrary value of 1.5. I considered just calculating 95% confidence intervals around the mean, but I have since discounted this because I'm looking at GP practices within a local geography and trying to show variation within that geography. If I calculated 95% CIs, all that would happen is that I would get narrow CIs for geographies with many GP practices and wider ones for those with few. Ultimately I want to show variation within a geography. Perhaps the standard box plot defaults are still not the best to use. In reference to my original comment, I could tweak the legend and include some rudimentary information as an explanation of what I'm presenting - I've had a stab below, using CODE delimiters:

      Code:
      sysuse lifeexp, clear
      label var lexp "Life expectancy (years)"
      egen median = median(lexp), by(region)
      egen upq = pctile(lexp), p(75) by(region)
      egen loq = pctile(lexp), p(25) by(region)
      egen iqr = iqr(lexp), by(region)
      egen upper = max(lexp / (lexp < upq + 1.5 * iqr)), by(region)
      egen lower = min(lexp / (lexp > loq - 1.5 * iqr)), by(region)
      egen mean = mean(lexp), by(region)

      #delim ;
      twoway (rbar med upq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35))
      (rbar med loq region, pstyle(p1) blc(gs15) bfc(gs8) barw(0.35))
      (rspike upq upper region, pstyle(p1))
      (rspike loq lower region, pstyle(p1))
      (rcap upper upper region, pstyle(p1) msize(*2))
      (rcap lower lower region, pstyle(p1) msize(*2))
      (scatter mean region, ms(Dh) msize(*2))
      (scatter lexp region if !inrange(lexp, lower, upper), ms(Oh) mla(country)),
      xla(1 `""Europe and" "Central Asia""' 2 "North America" 3 "South America", noticks)
      yla(, ang(h)) ytitle(Life expectancy (years)) xtitle("")
      legend(order(1 "IQR" 4 "Sample range" 7 "Average" 8 "Outliers") row(1) region(lwidth(none))) ;
      #delim cr
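On the confidence-interval point earlier in this post, a small sketch of the arithmetic, assuming the usual t-based interval for a mean: the half-width is a t multiplier times sd / sqrt(n), so it shrinks as the number of observations in a group grows even when the spread of the data does not change. The variable names `halfwidth`, `lower95` and `upper95` are made up for the example.

```stata
sysuse lifeexp, clear

* One row per region: mean, standard deviation and number of observations
collapse (mean) mean = lexp (sd) sd = lexp (count) n = lexp, by(region)

* Half-width of a 95% t-based confidence interval for the mean
generate halfwidth = invttail(n - 1, 0.025) * sd / sqrt(n)
generate lower95 = mean - halfwidth
generate upper95 = mean + halfwidth

list region n sd halfwidth lower95 upper95, noobs
```

Comparing the `sd` and `halfwidth` columns shows the difference between variation within a group and uncertainty about its mean.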



      Thanks Tim
