Detecting clusters of outliers by repeatedly dropping observations

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1


27 Jul 2018, 08:32

Hi All,

I have data that resembles the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(output year)
 213 1999
 123 2000
 231 2001
3212 2002
2313 2003
3213 2004
2132 2005
 123 2006
 321 2007
 321 2008
 231 2009
end

In the above, I have data on output by year. What I wish to do is detect outliers by brute force (defined for my purposes as abnormal increases over past years, say 200%). There are two classes of problem, one of which is easy to solve. In the case where there is a single outlying observation, I do the following:

Code:

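gen logoutput = log(output)
* absolute log change from the previous year; 2 log points is roughly a 7.4-fold change
gen percentagechange = abs(logoutput[_n] - logoutput[_n-1])
* exclude missing values, which Stata treats as larger than any number
drop if percentagechange > 2 & !missing(percentagechange)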

In the above dataset, among the high values this only gets rid of the first outlying observation (3212 in 2002). Since 2313 is not an outlier with respect to 3212, it will not be identified as one. Yet the years 2002 to 2005 form an unusually large cluster of outliers with respect to the variable generally. As such, I would like to drop outliers repeatedly. After the first iteration, I obtain the dataset as above, but without the year 2002. I then want to repeat the procedure, so that the percentage change of 2003 is now calculated with respect to 2001, 2003 is identified as an outlier with respect to 2001, and so on. I want this to continue until it reaches 2006, which will not be identified as an outlier with respect to 2001. This way, I will be able to identify large clusters of outlying observations. Any suggestions for this are much appreciated.

Kind Regards,
CS
Joseph Coveney

Join Date: Apr 2014

Posts: 4397
#2

27 Jul 2018, 18:57

Originally posted by Chinmay Sharma

I would like to repeatedly drop outliers over and over again

You can set up a -while r(N) >0- loop containing a -replace- and with a -count if- at the end. You'd need to prime it (use code analogous to what you show); Mata has a -do while-, but Stata doesn't.
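
For what it's worth, here is a minimal sketch of that kind of loop. It rests on a few assumptions that are not in the original post: observations are marked with -replace- rather than dropped, only the earliest candidate is marked on each pass (otherwise the fall back to 123 in 2006 would be flagged against 2005 before the cluster is cleared), and the names outlier, ref, change, flag, csum and nflag are made up for illustration. The count is copied into a local macro rather than testing r(N) directly, and setting that macro to 1 beforehand is what primes the loop.

```stata
clear
input float(output year)
 213 1999
 123 2000
 231 2001
3212 2002
2313 2003
3213 2004
2132 2005
 123 2006
 321 2007
 321 2008
 231 2009
end

sort year
gen double logoutput = log(output)
gen byte   outlier   = 0    // 1 once an observation has been marked
gen double ref       = .    // log output of the last earlier unmarked year
gen double change    = .
gen byte   flag      = 0
gen long   csum      = 0

local nflag = 1             // prime the loop so the first pass runs
while `nflag' > 0 {
    * -replace- works through the observations in order, so ref[_n-1]
    * has already been updated when it is used here
    quietly replace ref = cond(outlier[_n-1] == 0, logoutput[_n-1], ref[_n-1]) if _n > 1
    quietly replace change = abs(logoutput - ref)
    * candidates: not yet marked, change above 2 log points, change not missing
    quietly replace flag = outlier == 0 & change > 2 & !missing(change)
    * mark only the earliest candidate found on this pass
    quietly replace csum = sum(flag)
    quietly replace outlier = 1 if flag & csum == 1
    quietly count if flag
    local nflag = r(N)
}

list year output outlier, noobs
```

On the example data this should mark 2002 through 2005 and leave 2006 alone, because by then 2006 is compared with 2001. Each pass that finds a candidate marks exactly one observation, so the loop cannot run more times than there are observations. Follow with -drop if outlier- if the marked years really should go.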

But, if I were interested in what affected output, then a 10- to 15-fold increase, sustained over a period of four years, would be about the last thing that I would want to omit from the dataset.
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#3

30 Jul 2018, 14:32

There are also a number of diagnostic tools for outliers documented in the regress and regress postestimation sections of the manuals. However, as Joseph notes, you really need to know what is going on when you get an order-of-magnitude change from one year to the next. I wonder whether someone changed the measurement scale or something similar. This is unlikely to be a random coding error when it hits a run of consecutive years.
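
As a rough illustration of those diagnostics, the following assumes the example data from #1 are in memory and fits a simple linear trend of output on year. That model is only a placeholder for whatever regression is actually of interest, and the variable names rstu, lev and cook are invented here.

```stata
* Assumes the example dataset from #1 is already in memory
regress output year

predict double rstu, rstudent    // studentized residuals
predict double lev,  leverage    // diagonal elements of the hat matrix
predict double cook, cooksd      // Cook's distance

list year output rstu lev cook, noobs
```

Large values flag years worth a closer look; they say nothing about why those years differ.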