  • graph box acts weird

    What is going on here? According to the graph box documentation, the whiskers should end at the lower and upper adjacent values. Meanwhile:

    clear all
    input v
    32
    55
    60
    61
    62
    64
    64
    68
    73
    75
    75
    76
    78
    78
    79
    79
    80
    80
    82
    83
    84
    85
    88
    90
    92
    93
    95
    98
    end
    list

    qui su v, d
    local lower_adjacent_value `r(p25)' - (3/2)*(`r(p75)'-`r(p25)')
    local upper_adjacent_value `r(p75)' + (3/2)*(`r(p75)'-`r(p25)')
    di `lower_adjacent_value'
    di `upper_adjacent_value'

    graph hbox v , ylabel(30(10)120) ///
    text(38.25 25 "L") ///
    text(112.25 25 "U")

    * And:

    qui su v, d
    graph hbox v , text(`r(p1)' 25 "x") ///
    text(`r(p5)' 25 "L") ///
    text(`r(p99)' 25 "U")

    * Apart from the problem described above, I am also confused about why the code below does not work:

    qui su v, d
    local lower_adjacent_value `r(p25)' - (3/2)*(`r(p75)'-`r(p25)')
    local upper_adjacent_value `r(p75)' + (3/2)*(`r(p75)'-`r(p25)')
    di `lower_adjacent_value'
    di `upper_adjacent_value'

    graph hbox v , ylabel(30(10)120) ///
    text(`lower_adjacent_value' 25 "L") ///
    text(`upper_adjacent_value' 25 "U")
    Last edited by Maciej Koniewski; 28 Nov 2023, 06:13.

  • #2
    According to the Tukey criteria you're using, the whiskers should be drawn to the outermost data points within the interval from P25 - 1.5 IQR to P75 + 1.5 IQR, not to those so-called adjacent values. Any data points beyond that interval should be plotted individually -- and with these data only 32 qualifies.

    FWIW, I suspect from reading and teaching that

    * few people who publish box plots like this fully understand these criteria -- or (if that is unfair) they don't always explain them clearly, which would be a good idea if only because other criteria are in use for box plots too

    * few people who hear or read this definition grasp it easily or remember it for long (and in any case 1.5 is mysterious unless explained as a rule of thumb)

    Backing up: John Tukey circa 1970 was concerned with something that could be drawn quickly as a pen (*) and paper task. His self-set limit was that calculations should use only sorting, counting and at most halving. That isn't most people's typical computing context 50 years later.

    I will here flag that Tukey re-invented dispersion diagrams used by geographers from 1933 and routine for them long before M.E. Spear published a range-bar chart in 1952, itself recycling K.W. Haemer's idea from 1948.

    I recommend -- if you use a box plot at all -- that

    1. You explain your criteria for whiskers in your report. I have found that whiskers to the extremes or to 5% and 95% points work well, and are much easier to explain than the Tukey criteria, so long as you follow the rest of the advice:

    2. Plot all the data points separately regardless, alongside or underneath.

    3. If you have so much data that the points just blur into each other, well and good, and possibly some other display will work as well or better.

    (*) According to legend, Tukey was especially positive about ballpoint pens with multiple colours.
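
The whisker rule described at the start of this reply can be sketched in Stata as follows. This is only an illustration, assuming the variable v from #1 is in memory; the macro names lower_fence, upper_fence, lower_whisker and upper_whisker are made up for the example. Run it as one block from a do-file so that the local macros remain defined.

```stata
* Fences under the Tukey criteria: P25 - 1.5 IQR and P75 + 1.5 IQR.
* Note the = sign, so each macro holds a number, not an expression.
quietly summarize v, detail
local lower_fence = r(p25) - 1.5 * (r(p75) - r(p25))
local upper_fence = r(p75) + 1.5 * (r(p75) - r(p25))

* The whiskers end at the outermost data points inside the fences.
quietly summarize v if inrange(v, `lower_fence', `upper_fence')
local lower_whisker = r(min)
local upper_whisker = r(max)

display "fences:   `lower_fence' to `upper_fence'"
display "whiskers: `lower_whisker' to `upper_whisker'"

* Points beyond the fences are the ones plotted individually.
list v if !inrange(v, `lower_fence', `upper_fence') & !missing(v)
```

With the data in #1 the fences are the 38.25 and 112.25 shown there, so the whiskers should end at 55 and 98, and 32 is the only value listed as lying outside.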



    • #3
      Nick Cox, thank you very much! "The whiskers should be drawn to the outermost data points within the interval from P25 - 1.5 IQR to P75 + 1.5 IQR, not to those so-called adjacent values" - this is a straight-to-the-point definition that clarifies what is a rather blurry concept for many people.

      As for the second part of my problem, do you see anything in the code below that prevents it from running properly?

      qui su v, d
      local lower_adjacent_value `r(p25)' - (3/2)*(`r(p75)'-`r(p25)')
      local upper_adjacent_value `r(p75)' + (3/2)*(`r(p75)'-`r(p25)')
      di `lower_adjacent_value'
      di `upper_adjacent_value'

      graph hbox v , ylabel(30(10)120) ///
      text(`lower_adjacent_value' 25 "L") ///
      text(`upper_adjacent_value' 25 "U")



      • #4
        The problem lies in your use of local macros. Consider this one. It's the same problem for both. The statement


        Code:
        local lower_adjacent_value `r(p25)' - (3/2)*(`r(p75)'-`r(p25)')
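        assigns text to the local macro after substituting the values of the quartiles; the arithmetic itself is not carried out. display will evaluate the resulting expression, but graph hbox will not do that, so text() is handed an expression where it expects a number.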

        You needed

        Code:
        local lower_adjacent_value = `r(p25)' - (3/2)*(`r(p75)'-`r(p25)')

        See 18.3.1 in https://www.stata.com/manuals/u18.pdf noting that

        local four 2 + 2
        local four = 2 + 2


        are not two ways of saying the same thing.

        Note that display does not just echo or show what is in local macros; it evaluates them too. macro list is one way to see what is inside.
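
A minimal sketch of the difference, followed by a corrected version of the code from #3. It assumes the variable v from #1 is in memory; the macro names four_text, four_value, lower_fence and upper_fence are made up for the example. Run it as one block from a do-file, since /// continuation only works there.

```stata
* Without =, the macro holds the text "2 + 2"; with =, it holds the result 4.
local four_text 2 + 2
local four_value = 2 + 2
macro list _four_text _four_value

* display evaluates what it is given, so both lines below show 4.
display `four_text'
display `four_value'

* Corrected code from #3: evaluate with = before passing values to text().
quietly summarize v, detail
local lower_fence = r(p25) - (3/2) * (r(p75) - r(p25))
local upper_fence = r(p75) + (3/2) * (r(p75) - r(p25))
graph hbox v, ylabel(30(10)120) ///
    text(`lower_fence' 25 "L") ///
    text(`upper_fence' 25 "U")
```

Here macro list shows the stored contents directly, which is the quickest way to confirm whether a macro holds an expression or a number before it is passed to an option such as text().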

