Version 15.1 vs. 13.1: One vs two outliers in box-plot demo

Bruce Weaver

Join Date: May 2014

Posts: 1128
#1


24 Jan 2018, 07:08

The following is a little exercise I use in an intro biostats class. It is borrowed from the website of Burt Gerstman, author of Basic Biostatistics.

Code:

* 3.11 The median is more robust than the mean. Body weights (n = 10)
* expressed as "percentage of ideal" for 10 individuals are
* {99, 101, 107, 114, 116, 119, 121, 125, 152, 155}.
clear
input BW_pct_ideal
99
101
107
114
116
119
121
125
152
155
end
* Calculate the mean & median.
tabstat BW_pct_ideal, stat(n mean median)
* Make a boxplot of the data and identify the two outliers in the dataset.
graph box BW_pct_ideal
* With the two outliers excluded, recalculate the mean and median. What
* effect did removing the outliers have on the mean and median?
tabstat BW_pct_ideal if BW_pct_ideal < 152, stat(n mean median)
* When we used all of the data, mean > median (120.9 > 117.5).
* But when we excluded the two outliers, mean < median (112.75 < 115).

When I first did the problem, I was using Stata 13, and indeed, the box-plot showed the two highest scores (152, 155) as potential outliers. But an eagle-eyed student in this year's class pointed out that with Stata 15, she was seeing only one outlier (155). I am currently using 15.1, but still have 13.1 installed, so I tried it with both: version 13.1 shows two outliers, version 15.1 shows one. Furthermore, when I use version control in 15.1 (e.g., version 13: graph box BW_pct_ideal), I still see only one outlier.

I have looked at help whatsnew, and searched for <boxplot> and <outlier>, but thus far have not found anything that indicates a change in the rules for identifying outliers. Does anyone here have any thoughts on what might be causing this discrepancy?

Thanks,
Bruce

--
Bruce Weaver
Version: Stata/MP 18.5 (Windows)
Andrew Musau

Join Date: Oct 2014

Posts: 10168
#2

24 Jan 2018, 07:43

Probably a subtle fix so that an outlier marker does not coincide with the upper extreme (the end of the upper whisker). In this case, the upper extreme is at 152, so only 155 is plotted as an outlier. In any case, the current behavior makes more sense to me.

Code:

graph box BW_pct_ideal, ylab(100 110 120 135 140 145 152 160, angle(vertical))
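To check the arithmetic behind that, here is a minimal sketch. It assumes that graph box takes its quartiles the same way summarize, detail does, which is my reading of the manual entry but worth confirming for your version; the scalar names are just ones I made up for illustration.

```stata
clear
input BW_pct_ideal
99
101
107
114
116
119
121
125
152
155
end

* Quartiles as reported by -summarize, detail- (assumed to match -graph box-)
quietly summarize BW_pct_ideal, detail
scalar q1  = r(p25)
scalar q3  = r(p75)
scalar iqr = q3 - q1

* Fences at 1.5 IQR beyond each quartile (scalar names are illustrative)
scalar lower_fence = q1 - 1.5*iqr
scalar upper_fence = q3 + 1.5*iqr
display "Q1 = " q1 "  Q3 = " q3 "  IQR = " iqr
display "lower fence = " lower_fence "  upper fence = " upper_fence

* Values strictly beyond the upper fence, then values on or beyond it
count if BW_pct_ideal >  scalar(upper_fence) & !missing(BW_pct_ideal)
count if BW_pct_ideal >= scalar(upper_fence) & !missing(BW_pct_ideal)
```

If that assumption holds, Q1 = 107 and Q3 = 125, so the IQR is 18 and the upper fence is 125 + 1.5*18 = 152. The value 152 then sits exactly on the fence: the strict comparison counts one value (155) and the non-strict one counts two, which would be consistent with the one-versus-two difference reported in #1.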

Last edited by Andrew Musau; 24 Jan 2018, 07:53.
Nick Cox

Join Date: Mar 2014

Posts: 35585
#3

24 Jan 2018, 11:16

Not the question, but this points up the arbitrary nature of the "rule" that identified points are those more than 1.5 IQR from the nearer quartile.

That was only ever a rule of thumb. For Tukey and his collaborators, the intended outcome of viewing many box plots was that you think about a transformation that makes sense if you identify skewness and/or outliers first time round. (The 1.5 evolved after experiment with small or moderate datasets in which, as Tukey explained informally, 1 was found to be too low and 2 too high.)

That rule (good grief! as Snoopy used to say; convention prohibits my reaching to the depths of my vocabulary) is even taken in some quarters as a threshold beyond which you should automatically delete or ignore points as being outliers!
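If the rule is applied at all, flagging seems safer than dropping. A minimal sketch, assuming the quartiles from summarize, detail are the ones wanted; the scalar names and the indicator name beyond are hypothetical, chosen only for this example.

```stata
clear
input BW_pct_ideal
99
101
107
114
116
119
121
125
152
155
end

* Fences at 1.5 IQR beyond each quartile (names are illustrative)
quietly summarize BW_pct_ideal, detail
scalar lo = r(p25) - 1.5*(r(p75) - r(p25))
scalar hi = r(p75) + 1.5*(r(p75) - r(p25))

* Flag, do not drop: 1 if strictly beyond either fence, 0 otherwise
generate byte beyond = (BW_pct_ideal < scalar(lo)) | (BW_pct_ideal > scalar(hi)) ///
    if !missing(BW_pct_ideal)

* Every observation stays in the dataset and can be inspected
list BW_pct_ideal beyond, sep(0)
tabstat BW_pct_ideal, by(beyond) stat(n mean median)
```

The /// continuation needs a do-file or the do-file editor; typed interactively, the generate command would have to go on one line. Nothing is deleted here, so any later analysis can be run with and without the flagged values and the two sets of results compared.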

I don't see that we need follow all these little rituals. We can show all the data quite easily, and a box, and even means too. (The extra line is the mean. Some people identify the mean on a box plot by an extra point symbol; I prefer a line.)

As one of many alternatives, this quantile-box plot makes clear the main point at issue, namely two distinct high values that make us worry.

Code:

clear
input BW_pct_ideal
99
101
107
114
116
119
121
125
152
155
end
* to install if not done previously:
* ssc inst stripplot
stripplot BW, cumul box refline vertical centre aspect(2) yla(, ang(h))

Last edited by Nick Cox; 24 Jan 2018, 11:19.
Nick Cox

Join Date: Mar 2014

Posts: 35585
#4

24 Jan 2018, 12:25

Yet another prejudice: my mantra with students and colleagues is child-like:

If half of the data points are inside the box, then half are outside the box too.

The conventions of an opaque box, often given a strong colour, and wispy whiskers and perhaps some identified points often seem to mislead naive and even some experienced readers.

Their take-home message is often cruder and what's appropriately called a half-truth: the data are concentrated in the box!

But the half of the data points outside the box often include the really interesting, important or dangerous points.

So, show the data. And make boxes transparent.

Sure, if you have 100, 1000, 10000, ... points rather than 10, the data points often mush together. That's fine too. You can still check for outliers and gaps and clusters and other structure.
