Hi All,
I have data that resembles the following:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(output year)
 213 1999
 123 2000
 231 2001
3212 2002
2313 2003
3213 2004
2132 2005
 123 2006
 321 2007
 321 2008
 231 2009
end
In the above, I have data on output by year. What I wish to do is detect outliers by brute force (defined for my purposes as abnormal increases over past years, say 200%). There are two classes of problem, one of which is easy to solve. In the case where there is a single outlying observation, I do the following:
Code:
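gen logoutput = log(output)
gen percentagechange = abs(logoutput[_n] - logoutput[_n-1])
* missing counts as larger than any number in Stata, so exclude it explicitly
drop if percentagechange > 2 & !missing(percentagechange)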
In the above dataset, this will just get rid of the first outlying observation (3212 in 2002). However, since 2313 is not an outlier with respect to 3212, it will not be identified as such, even though 2002 to 2005 form an unusually large cluster of outliers relative to the variable's usual level. I would therefore like to drop outliers repeatedly. For instance, after the first iteration I obtain the dataset above without the year 2002. I then want to repeat the procedure, so that the percentage change of 2003 with respect to 2001 is calculated, 2003 is identified as an outlier with respect to 2001, and so on. I want this to continue until it reaches 2006, which will not be identified as an outlier with respect to 2001. This way, I will be able to identify large clusters of outlying observations. Any suggestions for this are much appreciated.
Kind Regards,
CS
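One possible way to get the repeated procedure without actually looping is to carry forward the last accepted value as a reference and compare each year against it. The following is only a sketch under some assumptions: the variables ref and outlier and the local threshold are names made up for illustration, output is assumed to be positive and non-missing, and the signed log difference is used because outliers are defined above as increases (wrap the difference in abs() to flag large drops as well). It relies on replace working through observations in order, so ref[_n-1] already holds the updated reference when each row is evaluated.

```stata
clear
input float(output year)
 213 1999
 123 2000
 231 2001
3212 2002
2313 2003
3213 2004
2132 2005
 123 2006
 321 2007
 321 2008
 231 2009
end

sort year
local threshold 2                    // same cutoff as the log difference used above

gen double logoutput = log(output)

* ref starts as each year's own value; a flagged year inherits the
* reference of the last accepted year instead
gen double ref = logoutput
replace ref = ref[_n-1] if _n > 1 & logoutput - ref[_n-1] > `threshold'

* a year is an outlier when its reference is not its own value
gen byte outlier = ref != logoutput

list year output outlier, noobs
* drop if outlier                    // uncomment to remove the flagged cluster
```

On the example data this flags 2002 through 2005, each measured against 2001, and stops at 2006, which becomes the new reference. Flagging first and dropping afterwards keeps the original data available for inspection.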