  • Detecting clusters of outliers by repeatedly dropping observations

    Hi All,

    I have data that resembles the following:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(output year)
     213 1999
     123 2000
     231 2001
    3212 2002
    2313 2003
    3213 2004
    2132 2005
     123 2006
     321 2007
     321 2008
     231 2009
    end

    In the above, I have data on output by year. I wish to detect outliers by brute force, where an outlier for my purposes is an abnormal increase over the previous year (200%, say). There are two classes of problem, one of which is easy to solve. Where there is a single outlying observation, I do the following:
    Code:
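    sort year
    gen logoutput=log(output)
    * signed log change, so that only increases are flagged (abs() would also flag the fall in 2006)
    gen percentagechange=logoutput[_n]-logoutput[_n-1]
    * missing counts as larger than any number, so the first year must be excluded explicitly
    drop if percentagechange>2 & !missing(percentagechange)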

    In the above dataset, this gets rid of only the first outlying observation (3212 in 2002). Because 2313 is not an outlier with respect to 3212, 2003 is not identified as one, even though 2002 to 2005 form an unusually large cluster of outliers relative to the rest of the series. I would therefore like to drop outliers repeatedly. After the first iteration I have the dataset above without the year 2002. Repeating the procedure, the change for 2003 is now calculated with respect to 2001, so 2003 is identified as an outlier and dropped, and so on. I want this to continue until it reaches 2006, which will not be identified as an outlier with respect to 2001. This way, I will be able to identify large clusters of outlying observations. Any suggestions are much appreciated.
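
    To make the intended iterations concrete, the log changes of 2002 to 2006 relative to 2001 can be checked by hand. This is only a sketch using the figures from the example data and the threshold of 2 from the code above; note that a log change of 2 corresponds to roughly a 7.4-fold increase, which is a far stricter bar than 200%. The inline comments assume the lines are run from a do-file.

```stata
* Hand check of the intended iterations, using only figures from the example data.
display exp(2)              // factor implied by a log change of 2, about 7.39
display log(3212/231)       // 2002 vs 2001: about 2.63, above 2
display log(2313/231)       // 2003 vs 2001: about 2.30, above 2
display log(3213/231)       // 2004 vs 2001: about 2.63, above 2
display log(2132/231)       // 2005 vs 2001: about 2.22, above 2
display log(123/231)        // 2006 vs 2001: about -0.63, not above 2
```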


    Kind Regards,
    CS

  • #2
    Originally posted by Chinmay Sharma
    I would like to repeatedly drop outliers over and over again
    You can set up a -while r(N) >0- loop containing a -replace- and with a -count if- at the end. You'd need to prime it (use code analogous to what you show); Mata has a -do while-, but Stata doesn't.
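
    A minimal sketch of such a loop on the example data, as one possible reading of the suggestion: it uses -drop- and a local macro to carry the count instead of -replace- and r(N) directly. The names dropped and dlog and the threshold of 2 log points are assumptions for illustration, and the change is taken as signed so that only increases are flagged.

```stata
clear
input float(output year)
 213 1999
 123 2000
 231 2001
3212 2002
2313 2003
3213 2004
2132 2005
 123 2006
 321 2007
 321 2008
 231 2009
end

sort year
gen double logoutput = log(output)

local dropped = 1                          // prime the loop
while `dropped' > 0 {
    * signed log change relative to the previous year still in the data
    gen double dlog = logoutput - logoutput[_n-1]
    * exclude missing, which Stata treats as larger than any number
    count if dlog > 2 & !missing(dlog)
    local dropped = r(N)                   // save the count before -drop- runs
    drop if dlog > 2 & !missing(dlog)
    drop dlog
}

list year output, noobs
```

    On the example data each pass removes one year of the cluster (2002, then 2003, 2004 and 2005), and the loop stops when 2006 is compared with 2001. To keep the observations, the same idea can be run under -preserve- or on a copy of the data, with the surviving years merged back as a flag.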

    But, if I were interested in what affected output, then a 10- to 15-fold increase, sustained over a period of four years, would be about the last thing that I would want to omit from the dataset.


    • #3
      There are also a number of outlier diagnostics documented in the regress and regress postestimation sections of the manuals. However, as Joseph notes, you really need to know what is going on when you get an order-of-magnitude change from year to year. I wonder whether someone changed the measurement scale or something similar. This is unlikely to be a random coding error when it affects several consecutive years.
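
      As a rough illustration of the kind of diagnostics meant here, the sketch below fits a simple linear trend to the example data and lists three standard case statistics. The model (output on year) and the variable names rstu, lev and cook are assumptions for illustration only; with a cluster of consecutive outliers, single-case statistics like these may understate the problem, so they are a starting point and not a substitute for finding out what happened in those years.

```stata
clear
input float(output year)
 213 1999
 123 2000
 231 2001
3212 2002
2313 2003
3213 2004
2132 2005
 123 2006
 321 2007
 321 2008
 231 2009
end

* hypothetical model: a linear time trend
regress output year

predict double rstu, rstudent      // studentized residuals
predict double lev, leverage       // leverage (diagonal of the hat matrix)
predict double cook, cooksd        // Cook's distance

list year output rstu lev cook, noobs
```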
