Identifying and eliminating outliers in extremely large data sets

Luke Masha

Join Date: Aug 2017

Posts: 13
#1

Identifying and eliminating outliers in extremely large data sets

12 Aug 2017, 11:23

Hello,

I am working with a dataset of more than 200K observations, with potential outliers numbering in the low thousands. I can easily identify outliers by, say, Cook's distance and the like, and I can even generate a list of all these outliers in Stata without much trouble. My question is: what Stata code will quickly and efficiently drop all of these essentially random observations, whose only common link is, say, a high Cook's distance? So far the only way I have come up with is to drop them manually, and the number of outliers is considerable.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

12 Aug 2017, 11:54

-help drop-

That said, you probably shouldn't do this in the first place.
Luke Masha

Join Date: Aug 2017

Posts: 13
#3

12 Aug 2017, 14:35

For the dataset in hand, there is likely considerable input error (e.g., misplaced decimals), so we are considering it.
Regardless, the problem is that I have to drop each observation manually, one at a time. With thousands of them, that may be a problem.
I don't know of a way to simply tell Stata to drop all observations with a high Cook's distance.
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

12 Aug 2017, 15:27

So you calculate Cook's distance and store the result in a variable. Let's call the variable cook_d. Suppose your threshold for an outlier (strictly speaking, an influential observation) is 1. Then the command is

Code:

drop if cook_d > 1 & !missing(cook_d)
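
Before dropping anything, it may help to see how many observations the rule would remove. The following is only a sketch under stated assumptions: the auto dataset and the regression are stand-ins for the real data and model, the indicator name high_cook is made up for illustration, and the 4/N cutoff is a commonly cited rule of thumb rather than anything proposed in this thread.

```stata
* Illustrative data and model only; substitute your own
sysuse auto, clear
regress price mpg weight

* Cook's distance is computed for the estimation sample only
predict cook_d, cooksd

* Assumed cutoff: the 4/N rule of thumb (not from the thread)
local cutoff = 4 / e(N)

* How many observations would the rule remove?
count if cook_d > `cutoff' & !missing(cook_d)

* Flag them instead of dropping, so the decision can be reviewed later
generate byte high_cook = cook_d > `cutoff' & !missing(cook_d)
```

The !missing() guard matters because Stata treats missing values as larger than any number, so a bare cook_d > cutoff condition would also pick up observations for which the distance was never computed.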

It seems that you are brand new to Stata. I suggest that before you attempt to undertake any real work, you invest time in learning the basics. Your Stata installation comes with PDF manuals installed. Select PDF Documentation on the Help menu. Read the Getting Started [GS] and User's Guide [U] volumes of the documentation. These will acquaint you with basic Stata operations, the approach to data management, and elementary data analysis. You'll be exposed to the commands that are used every day when working with Stata. There are many worked examples in the documentation to illustrate how things are done. You won't remember every detail, but you will learn enough that in most situations you will know what commands are likely to be useful for your current task, and then you will be able to refer to the help files and the PDF documentation to fill in the details of syntax and the details of just how the commands are implemented.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#5

12 Aug 2017, 15:37

Dear Luke,

I believe that what you want can be done like this:

Code:

sysuse auto
reg price mpg rep78 trunk
predict cd, cooksd
drop if cd > 0.2 & !missing(cd)

Note, however, that this will drop influential observations, not observations with misplaced decimals. So, just like Clyde Schechter, I wonder whether this is a sensible thing to do; probably it is not!
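
Since a high Cook's distance does not by itself indicate a data-entry error, one option is to look at the most influential observations before deciding what to do with them. A rough sketch, reusing the same illustrative auto regression; the choice of listing the top 10 is arbitrary.

```stata
* Same illustrative model as above
sysuse auto, clear
regress price mpg rep78 trunk
predict cd, cooksd

* Sort from most to least influential; gsort places missing values
* last in a descending sort unless the mfirst option is given
gsort -cd

* Inspect the top 10 by eye for signs of input error
list make price mpg rep78 trunk cd in 1/10
```

If the listed values look plausible, the observations are influential but not necessarily wrong, which is the distinction drawn above.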

Best wishes,

Joao
PS: Oops... too late!
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#6

14 Aug 2017, 14:13

If your data have a natural range, then you can legitimately drop observations outside that range. E.g., if x can only range from 0 to 1, it is legitimate to delete values >1 or <0. All other techniques to handle extreme values have generated substantial debate (which can be seen by looking at discussions of outliers on Statalist).
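
A range check of this kind might look like the sketch below. The variable x, its valid range of 0 to 1, and the toy values are all hypothetical, chosen only to make the example self-contained.

```stata
* Toy data: x is assumed to be valid only between 0 and 1
clear
input x
0.25
1.7
-0.1
0.9
.
end

* Count the out-of-range values first, leaving missing values alone
count if !inrange(x, 0, 1) & !missing(x)

* Drop only the values that are present and outside the valid range
drop if !inrange(x, 0, 1) & !missing(x)
```

inrange() returns 0 for a missing argument, so without the !missing() guard the missing observation would be dropped along with the out-of-range ones.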