Is there any way to identify outliers using STATA?

Hello!
There are many ways to identify outliers. Here is a good document for reference: http://www3.nd.edu/~rwilliam/stats2/l24.pdf
Anton

You have to give much more specific detail on exactly what you are interested in to make fuller answers likely.
Ignoring graphics here is a personal choice, as would be ignoring questions based on such a blinkered attitude.
Please also note our preference for using full real names and for the correct spelling "Stata". See the FAQ Advice for more detail on this and other advice on posing questions.
Last edited by Nick Cox; 23 Sep 2014, 12:00.

Hi Nick Cox,
Thank you for the answer!
I have a database with many variables, and I want to compare the values between two groups of countries. I want to compare the mean, minimum, maximum and SD. But I want to eliminate the outliers, because I see that some values are too high.
And I chose not to use graphics because I have thousands of observations, so it would be more difficult to identify outliers! So I want to know if there is any command I can use that will say that a value, for example, more than 500, is an outlier.

Thanks for trying to provide detail, but my answer remains pretty much the same.
In effect, you are asking if there is a Stata command that will tell you if values are "too high". If you can translate that into some statistical criterion, then there will be Stata code to do it.
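As a minimal sketch of what applying such a criterion could look like: the cutoff of 500 is the example value from the question, while the variable name y, the indicator name toohigh and the simulated data are purely illustrative assumptions.

```stata
clear
set seed 2014
set obs 1000
* hypothetical right-skewed variable, so that a few values are large
generate y = exp(rnormal(4, 1))
* the criterion (here: more than 500) is the analyst's choice, not Stata's
generate byte toohigh = (y > 500) if !missing(y)
count if toohigh == 1
list y if toohigh == 1
```

This only flags observations; what to do with them afterwards is a separate decision.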
In any case, eliminating outliers is a highly debatable tactic. It's just one of several possible actions and in my view usually one of the worst imaginable.
There are entire books and many, many articles on the treatment of outliers; the discussion by Richard Williams that Anton cited in #3 is good and linked to Stata; another discussion is at http://stats.stackexchange.com/quest...ierswithmean
On graphics: I think you have it precisely the wrong way round. The more data you have, the easier it usually is to identify possible outliers or, more importantly, decide what to do given skewed or heavy-tailed distributions.
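A rough illustration of the graphical route, assuming the auto dataset shipped with Stata as a stand-in (it has only 74 observations; the variables price and foreign are just convenient examples, not the poster's data):

```stata
sysuse auto, clear
* box plots for two groups; points beyond the whiskers are plotted individually
graph box price, over(foreign)
* quantile plot: every value is shown against its fraction of the data
quantile price
```

Both displays work the same way with thousands of observations, since each shows the whole distribution rather than a list of values.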
Last edited by Nick Cox; 23 Sep 2014, 12:42.

litosbrito: you got it the wrong way around: you need to tell Stata when a value is "too high". "Too high" is necessarily a subjective statement, so it can only be made by humans. You can think of a criterion and ask a computer (Stata) to apply that criterion, but you, and only you, can choose the criterion. But before you start down that road, try to answer this question: How can you hope to find anything new if you first remove all surprising observations from your data?
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl


Litosbrito (please, as per the FAQ, re-register with your full name and surname. Just click on the Contact us button at the bottom-right corner of the screen):
While I second all the previous sound advice, one option for detecting outliers is to loop over a variable list, as in the following toy example:
Code:
set obs 100
g A=runiform()
g B=runiform()
g C=runiform()
foreach var of varlist A-C {
    quietly summarize `var'
    * the aim of Z_`var' is to detect the values beyond a threshold value you decide to set (let's say >3 standard deviations apart)
    g Z_`var'=1 if `var'>3*r(sd)
    replace Z_`var'=0 if Z_`var'==.
    list `var' Z_`var' if Z_`var'==1
}
Kind regards,
Carlo
(Stata 16.0 SE)

Carlo wrote code for an indicator variable flagging values more than 3 times the standard deviation (SD). But consider a bundle of countries with life expectancy mean 60 years and SD 10 years. Then all values >30 years would be flagged as outliers, but not those with <30 years (which on most other criteria would be staggering outliers).
I guess Carlo was just plucking a criterion out of the air as an example, or he was thinking about some criterion of the form value - mean > k SD, but made a slip in his coding. Even then the mean and SD are both likely to be strongly affected by outliers when they exist, so wouldn't we be better off using median and interquartile range (IQR), say, as the basis for any rule of thumb?
Hang on: we are rediscovering box plot criteria. For those who want tables, I wrote extremes (SSC) but don't use it much. It deliberately (or so I suppose) doesn't offer hooks for dropping outliers, which is almost always bad practice in my view.
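A sketch of one box plot style rule, using quartiles and the IQR, might look like this. The multiplier 1.5 is the usual convention rather than anything fixed, and the auto dataset, the variables price and mpg, and the out_ prefix are illustrative assumptions only.

```stata
sysuse auto, clear
foreach var of varlist price mpg {
    quietly summarize `var', detail
    * fences at 1.5 IQR beyond the lower and upper quartiles
    local iqr = r(p75) - r(p25)
    local lower = r(p25) - 1.5 * `iqr'
    local upper = r(p75) + 1.5 * `iqr'
    * indicator is 1 outside the fences, 0 inside, missing if the value is missing
    generate byte out_`var' = (`var' < `lower' | `var' > `upper') if !missing(`var')
    list make `var' if out_`var' == 1
}
```

As with any such rule, this flags values for a closer look; it does not drop anything.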
P.S. A side effect of Carlo's code is that missing values will be flagged as 1. Taking his criterion literally, the code might be rewritten
Code:
foreach var of varlist A-C {
    quietly summarize `var'
    g Z_`var'= (`var' > 3*r(sd)) if `var' < .
    list `var' Z_`var' if Z_`var' == 1
}
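For comparison, here is a sketch of a criterion based on distance from the mean, flagging values more than k SD away in either direction. Whether this is what was originally intended is a guess; k = 3, the auto dataset and the far_ prefix are assumptions for illustration, and the caveat that the mean and SD are themselves affected by outliers still applies.

```stata
sysuse auto, clear
local k = 3
foreach var of varlist price mpg {
    quietly summarize `var'
    * flag values more than k SDs away from the mean, in either direction
    generate byte far_`var' = (abs(`var' - r(mean)) > `k' * r(sd)) if !missing(`var')
    list make `var' if far_`var' == 1
}
```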