  • Version 15.1 vs. 13.1: One vs two outliers in box-plot demo

    The following is a little exercise I use in an intro biostats class--it is borrowed from the website of Burt Gerstman, author of Basic Biostatistics.

    Code:
    * 3.11 The median is more robust than the mean. Body weights (n = 10)
    * expressed as "percentage of ideal" for 10 individuals are
    * {99, 101, 107, 114, 116, 119, 121, 125, 152, 155}.
    
    clear
    input BW_pct_ideal  
     99  
    101  
    107  
    114  
    116  
    119  
    121  
    125  
    152  
    155  
    end
    
    * Calculate the mean & median.
    tabstat BW_pct_ideal, stat(n mean median)
    
    * Make a boxplot of the data and identify the two outliers in the dataset.
    graph box BW_pct_ideal
    
    * With the two outliers excluded, recalculate the mean and median. What
    * effect did removing the outliers have on the mean and median?
    
    tabstat BW_pct_ideal if BW_pct_ideal < 152, stat(n mean median)
    
    * When we used all of the data, mean > median (120.9 > 117.5).
    * But when we excluded the two outliers, mean < median (112.75 < 115).
    When I first did the problem, I was using Stata 13, and indeed, the box-plot showed the two highest scores (152, 155) as potential outliers. But an eagle-eyed student in this year's class pointed out that with Stata 15, she was seeing only one outlier (155). I am currently using 15.1, but still have 13.1 installed. So I tried it with both. Version 13.1 shows 2 outliers, version 15.1 shows one. Furthermore, when I use version control in 15.1 (e.g., version 13: graph box BW_pct_ideal), I see only one outlier.

    I have looked at help whatsnew, and searched for <boxplot> and <outlier>, but thus far have not found anything that indicates a change in the rules for identifying outliers. Does anyone here have any thoughts on what might be causing this discrepancy?

    Thanks,
    Bruce

    [Image: Ex02-Q3.11C_boxplot_in_2_Stata_versions.png]
    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 18.5 (Windows)

  • #2
    This is probably a subtle fix so that an outlier marker does not coincide with the end of the upper whisker. In this case, the upper extreme is at 152, so there is only 1 outlier. In any case, the current behavior makes more sense to me.
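
    The arithmetic behind the boundary case can be checked by hand. The following is only a sketch: it assumes that graph box uses the same quartile definition as summarize, detail, and the scalar names (q1, q3, iqr, upper_fence, lower_fence) are made up for illustration.

    ```stata
    * Sketch: reproduce the 1.5 x IQR fences by hand.
    * Assumption: graph box uses the same quartiles as summarize, detail.
    clear
    input BW_pct_ideal
     99
    101
    107
    114
    116
    119
    121
    125
    152
    155
    end

    quietly summarize BW_pct_ideal, detail
    scalar q1  = r(p25)
    scalar q3  = r(p75)
    scalar iqr = scalar(q3) - scalar(q1)
    scalar upper_fence = scalar(q3) + 1.5*scalar(iqr)
    scalar lower_fence = scalar(q1) - 1.5*scalar(iqr)
    display "Q1 = " scalar(q1) "  Q3 = " scalar(q3) "  IQR = " scalar(iqr)
    display "lower fence = " scalar(lower_fence) "  upper fence = " scalar(upper_fence)

    * Points strictly beyond the upper fence vs. points on or beyond it
    count if BW_pct_ideal >  scalar(upper_fence)
    count if BW_pct_ideal >= scalar(upper_fence)
    ```

    If the quartiles come out as 107 and 125, the upper fence is 125 + 1.5 × 18 = 152, which is exactly the second-largest value. Whether 152 is drawn as an outlier then depends on whether the comparison with the fence is strict, which would be consistent with one version showing two markers and the other showing one.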


    Code:
    graph box BW_pct_ideal, ylab(100 110 120 135 140 145 152 160, angle(vertical))

    [Image: bx2.png]

    Last edited by Andrew Musau; 24 Jan 2018, 07:53.


    • #3
      Not the question, but this points up the arbitrary nature of the "rule" that identified points are those at least 1.5 IQR from the nearer quartile.

      That was only ever a rule of thumb. For Tukey and his collaborators, the intended outcome of viewing many box plots was that you think about a transformation that makes sense if you identify skewness and/or outliers first time round. (The 1.5 evolved after experiments with small or moderate datasets in which, as Tukey explained informally, 1 was found to be too low and 2 too high.)

      That rule (good grief! as Snoopy used to say; convention prohibits my reaching to the depths of my vocabulary) is even taken in some quarters as a threshold beyond which you should automatically delete or ignore points as being outliers!

      I don't see that we need follow all these little rituals. We can show all the data quite easily, and a box, and even means too. (The extra line is the mean. Some people identify the mean on a box plot by an extra point symbol; I prefer a line.)

      As one of many alternatives, this quantile-box plot makes clear the main point at issue, namely two distinct high values that make us worry.
      [Image: weaver.png]

      Code:
      clear
      input BW_pct_ideal  
       99  
      101  
      107  
      114  
      116  
      119  
      121  
      125  
      152  
      155  
      end
      
      * to install if not done previously:
      * ssc inst stripplot
      stripplot BW, cumul box refline vertical centre aspect(2) yla(, ang(h))
      Last edited by Nick Cox; 24 Jan 2018, 11:19.


      • #4
        Yet another prejudice: my mantra with students and colleagues is child-like:

        If half of the data points are inside the box, then half are outside the box too.

        The conventions of an opaque box, often given a strong colour, and wispy whiskers and perhaps some identified points often seem to mislead naive and even some experienced readers.

        Their take-home message is often cruder and what's appropriately called a half-truth: the data are concentrated in the box!

        But the half of the data points outside the box often include the really interesting, important or dangerous points.

        So, show the data. And make boxes transparent.

        Sure, if you have 100, 1000, 10000, ... points rather than 10, the data points often mush together. That's fine too. You can still check for outliers and gaps and clusters and other structure.
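
        As a rough illustration of the transparent-box advice using only official graph box options, here is a minimal sketch with the data from #1. The particular choices (no fill, black outline, hollow circle markers) are assumptions made for the example, not a recommendation from the thread.

        ```stata
        * Sketch: draw the box with no fill so it does not dominate the plot.
        clear
        input BW_pct_ideal
         99
        101
        107
        114
        116
        119
        121
        125
        152
        155
        end

        * fcolor(none) removes the box fill; lcolor(black) keeps the outline.
        * msymbol(Oh) draws any identified points as hollow circles.
        graph box BW_pct_ideal, box(1, fcolor(none) lcolor(black)) marker(1, msymbol(Oh))
        ```

        This only changes how the box is drawn. It does not add the individual data points, for which something like the stripplot call in #3 is needed.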
