Outliers on Stata

litosbrito

Join Date: Jul 2014

Posts: 14
#1

23 Sep 2014, 10:08

Is there any way to identify outliers using STATA?
Nick Cox

Join Date: Mar 2014

Posts: 35652
#2

23 Sep 2014, 10:16

There are many, many ways, depending on your definition of outliers. A good one is to plot your data and think about data points that seem surprising.
1 like
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#3

23 Sep 2014, 10:17

Hello!

There are many ways to identify outliers. Here is a good document for reference: http://www3.nd.edu/~rwilliam/stats2/l24.pdf

Anton
1 like
litosbrito

Join Date: Jul 2014

Posts: 14
#4

23 Sep 2014, 10:22

Thank you for the answers!
I want to know if there is any STATA command that I can use! I don't want to use graphics!
Nick Cox

Join Date: Mar 2014

Posts: 35652
#5

23 Sep 2014, 10:40

You have to give much more specific detail on exactly what you are interested in to make fuller answers likely.

Ignoring graphics here is a personal choice, as would be ignoring questions based on such a blinkered attitude.

Please also note our preference for using full real names and for the correct spelling "Stata". See the FAQ Advice for more detail on this and other advice on posing questions.

Last edited by Nick Cox; 23 Sep 2014, 11:00.
litosbrito

Join Date: Jul 2014

Posts: 14
#6

23 Sep 2014, 11:08

Hi Nick Cox,
Thank you for the answer!
I have a dataset with many variables, and I want to compare the values between two groups of countries. I want to compare the mean, minimum, maximum and SD. But I want to eliminate the outliers, because I see that some values are too high.

And my reason for not choosing graphics is that I have thousands of observations, so it will be more difficult to identify outliers! So I want to know if there is any command I can use that can say that a value of, for example, more than 500 is an outlier.
Anton Ivanov

Join Date: Sep 2014

Posts: 267
#7

23 Sep 2014, 11:29

Keep in mind that you need strong theoretical justification in order to eliminate outliers from the analysis.
1 like
Nick Cox

Join Date: Mar 2014

Posts: 35652
#8

23 Sep 2014, 11:34

Thanks for trying to provide detail, but my answer remains pretty much the same.

In effect, you are asking if there is a Stata command that will tell you if values are "too high". If you can translate that into some statistical criterion, then there will be Stata code to do it.
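As a minimal sketch of what "translating into a criterion" can look like, take the cutoff of 500 mentioned in #6. The variable name value, the cutoff, and the simulated data are all assumptions for illustration; whether 500 is a sensible cutoff is a substantive question that the code cannot answer.

```stata
* Toy data only: "clear" drops whatever is in memory, so run this in a fresh session
clear
set seed 12345
set obs 1000
generate value = exp(rnormal(4, 1))

* Indicator: 1 if above the chosen cutoff, 0 otherwise, missing if value is missing
generate byte toohigh = value > 500 if !missing(value)

* How many observations meet the criterion, and which ones they are
count if toohigh == 1
list value if toohigh == 1, noobs
```

The indicator only marks observations for a closer look; nothing is dropped.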

In any case, eliminating outliers is a highly debatable tactic. It's just one of several possible actions and in my view usually one of the worst imaginable.

There are entire books and many, many articles on the treatment of outliers; the discussion by Richard Williams that Anton cited in #3 is good and linked to Stata; another discussion is at http://stats.stackexchange.com/quest...iers-with-mean

On graphics: I think you have it precisely the wrong way round. The more data you have, the easier it usually is to identify possible outliers or -- more importantly -- decide what to do given skewed or heavy-tailed distributions.
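A rough sketch of the kind of plots that scale well to thousands of observations, using official Stata graph commands. The variable names value and group and the simulated, right-skewed data are assumptions for illustration.

```stata
* Toy data only: "clear" drops whatever is in memory, so run this in a fresh session
clear
set seed 12345
set obs 5000
generate byte group = runiform() < 0.5
generate value = exp(rnormal(4, 1))

* Histograms, one panel per group
histogram value, by(group)

* Box plots side by side for the two groups
graph box value, over(group)

* Quantile plot: every observation is shown, ordered by size
quantile value
```

With more observations the shape of the distribution, including the tails, becomes clearer in each of these plots rather than harder to read.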

Last edited by Nick Cox; 23 Sep 2014, 11:42.
1 like
litosbrito

Join Date: Jul 2014

Posts: 14
#9

23 Sep 2014, 11:45

Thanks for the answers!

Yes, that is what I want to know: whether it is possible for Stata to tell me that a value is "too high"!!!

I will follow your suggestion, and see if I can resolve my problem.

Thank you once again for all the comments!!
1 like
Nick Cox

Join Date: Mar 2014

Posts: 35652
#10

23 Sep 2014, 11:58

"Think on a logarithmic scale" solves many more problems than eliminating outliers.
1 like
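One way to act on the advice in #10 to think on a logarithmic scale, sketched under assumptions: the variable names value, log_value and group and the simulated lognormal data are hypothetical, and the approach only makes sense for strictly positive variables.

```stata
* Toy data only: "clear" drops whatever is in memory, so run this in a fresh session
clear
set seed 12345
set obs 1000
generate byte group = runiform() < 0.5
generate value = exp(rnormal(4, 1))

* Natural logarithm; ln() returns missing for zero or negative values
generate log_value = ln(value)

* Summary statistics by group, on the raw and on the logarithmic scale
tabstat value log_value, by(group) statistics(mean sd min p50 max) columns(statistics)

* Arithmetic, geometric and harmonic means; the geometric mean is exp(mean of logs)
ameans value
```

On the logarithmic scale, values that looked "too high" in a right-skewed variable often sit much closer to the rest of the data.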
Maarten Buis

Join Date: Mar 2014

Posts: 3449
#11

23 Sep 2014, 13:20

litosbrito: you have it the wrong way around: you need to tell Stata when a value is "too high". "Too high" is necessarily a subjective statement, so it can only be made by humans. You can think of a criterion and ask a computer (Stata) to apply that criterion, but you, and only you, can choose the criterion. But before you start down that road, try to answer this question: how can you hope to find anything new if you first remove all surprising observations from your data?

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
3 likes
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17704
#12

23 Sep 2014, 22:57

Litosbrito (please, as per the FAQ, re-register with your full name and surname. Just click on the Contact us button at the bottom-right corner of the screen):
While I second all the previous sound advice, one option for detecting outliers is to loop over a variable list, as in the following toy example:

Code:

set obs 100
g A=runiform()
g B=runiform()
g C=runiform()
foreach var of varlist A-C {
    quietly summarize `var'
    // the aim of Z_`var' is to detect the values beyond a threshold value you decide to set (let's say >3 standard deviations apart)
    g Z_`var'=1 if `var'>3*r(sd)
    replace Z_`var'=0 if Z_`var'==.
    list `var' Z_`var' if Z_`var'==1
}

Kind regards,
Carlo

(Stata 19.0)
2 likes
litosbrito

Join Date: Jul 2014

Posts: 14
#13

24 Sep 2014, 00:52

Thank you for the suggestion, Carlo Lazzaro! I will try it!

Thank you Maarten Buis for the suggestion of analysis!
Nick Cox

Join Date: Mar 2014

Posts: 35652
#14

24 Sep 2014, 03:37

Carlo wrote code for an indicator variable flagging values more than 3 times the standard deviation (SD). But consider a bundle of countries with life expectancy mean 60 years and SD 10 years. Then all values >30 years would be flagged as outliers, but not those with <30 years (which on most other criteria would be staggering outliers).

I guess Carlo was just plucking a criterion out of the air as an example, or he was thinking about some criterion of the form |value - mean| > k SD, but made a slip in his coding. Even then the mean and SD are both likely to be strongly affected by outliers when they exist, so wouldn't we be better off using median and interquartile range (IQR), say, as the basis for any rule of thumb?
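For illustration only, a symmetric criterion of the form |value - mean| > k SD might be coded as follows. The choice k = 3, the variable name lifeexp, and the simulated data (mean 60, SD 10, echoing the life expectancy example) are assumptions, and the caveat that the mean and SD are themselves affected by outliers still applies.

```stata
* Toy data only: "clear" drops whatever is in memory, so run this in a fresh session
clear
set seed 12345
set obs 1000
generate lifeexp = rnormal(60, 10)

* summarize leaves the mean in r(mean) and the SD in r(sd)
quietly summarize lifeexp

* Flag values more than 3 SD from the mean in either direction; missing stays missing
generate byte far_sd = abs(lifeexp - r(mean)) > 3 * r(sd) if !missing(lifeexp)

tabulate far_sd, missing
```

Unlike a rule of the form value > 3 SD, this flags unusually low values as well as unusually high ones.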

Hang on: we are rediscovering box plot criteria. For those who want tables, I wrote extremes (SSC) but don't use it much. It deliberately (or so I suppose) doesn't offer hooks for dropping outliers, which is almost always bad practice in my view.
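The box plot convention can be sketched as a rule based on quartiles: flag values more than 1.5 IQR below the lower quartile or above the upper quartile. The multiplier 1.5, the variable name value, the scalar names lo_fence and hi_fence, and the simulated data are assumptions for illustration. The flag marks values for a closer look; it is not a reason to drop them.

```stata
* Toy data only: "clear" drops whatever is in memory, so run this in a fresh session
clear
set seed 12345
set obs 1000
generate value = exp(rnormal(4, 1))

* summarize, detail leaves the quartiles in r(p25) and r(p75)
quietly summarize value, detail
scalar lo_fence = r(p25) - 1.5 * (r(p75) - r(p25))
scalar hi_fence = r(p75) + 1.5 * (r(p75) - r(p25))

* 1 if outside the fences, 0 if inside, missing if value is missing
generate byte beyond = (value < scalar(lo_fence) | value > scalar(hi_fence)) if !missing(value)

list value if beyond == 1, noobs
```

For strongly skewed data such as these, the rule flags many values in the long tail, which is another argument for considering a transformed scale first.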

P.S. A side-effect of Carlo's code is that missing values will be flagged as 1. Taking his criterion literally, the code might be rewritten

Code:

foreach var of varlist A-C {
    quietly summarize `var'
    g Z_`var'= (`var' > 3*r(sd)) if `var' < .
    list `var' Z_`var' if Z_`var' == 1
}

A bigger problem is that looking for univariate outliers is only part of the problem. It's entirely possible to have bivariate outliers that aren't univariate outliers, trivariate outliers that aren't bivariate outliers, and so forth. There are naturally ways of finding these.
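One of several possible approaches, sketched here under assumptions, is to fit a regression and inspect the official post-estimation diagnostics. The variable names x, y, rstu and cook and the simulated data are hypothetical; the planted point is unremarkable on x alone and on y alone but lies far from the joint pattern.

```stata
* Toy data only: "clear" drops whatever is in memory, so run this in a fresh session
clear
set seed 12345
set obs 200
generate x = rnormal(0, 1)
generate y = x + rnormal(0, 0.2)

* Plant one bivariate outlier: each coordinate is within the usual range on its own
replace x = 1.5 in 1
replace y = -1.5 in 1

regress y x

* Studentized residuals and Cook's distance for each observation
predict double rstu, rstudent
predict double cook, cooksd

* Show the five observations with the largest Cook's distance
sort cook
list x y rstu cook in -5/l
```

A univariate check on x or on y would not single out the planted point, whereas the regression diagnostics should, since they measure departure from the relationship between the two variables.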
4 likes
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17704
#15

24 Sep 2014, 03:52

Dear Nick,
I "was just plucking a criterion out of the air as an example..."

All your remarks are, as always, sound.

Kind regards,
Carlo

(Stata 19.0)
