Efficient way of removing outliers in many variables?

Senor Massao

Join Date: Dec 2015

Posts: 26
#1


02 Nov 2022, 08:19

Hi all,
I need to remove outliers (using the 1.5 × IQR criterion at both ends) in about 70 variables. I'm looking for software that can help me do this as simply and straightforwardly as possible, i.e., without having to first compute the 1.5 × IQR cut-offs for each variable, then flag outliers with a binary variable, then remove them or generate a new variable, one variable at a time. I want to do it all in one go: remove all outliers on each individual variable in a varlist and generate new variables without outliers, with some new prefix/suffix or something.

I understand that IQR is not the best approach, and I understand the limitations and potential problems of identifying outliers with the IQR approach, but the project I'm working on requires it.

Is there user-written software to make this process efficient?

With thanks,
Massao

Last edited by Senor Massao; 02 Nov 2022, 08:56.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

02 Nov 2022, 09:41

To my understanding, taking literally what you ask for leads to something impossible: The IQR is a property of an individual variable, so one would *have* to work on one variable at a time. Also, I'm not sure what you mean by "remove." I'm going to assume that by "remove" you mean "change the value to missing," and that you want to avoid having to make changes manually, one variable at a time. With that understanding, I'd suggest this loop:

Code:

foreach v of varlist YourVar1 ..... YourVar70 { // put in your own variable list
    quiet summ `v', detail
    quiet replace `v' = . if !inrange(`v', r(p25), r(p75))
}
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

02 Nov 2022, 09:47

I think Mike Lacy means

Code:

quiet replace `v' = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25))

although the code might be speeded up by doing some of those calculations just once, and I think this would be terrible practice for anything close to what I do.
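A minimal sketch of the "calculations just once" idea, which also creates new variables rather than overwriting the originals, as asked in #1. The `auto` dataset, the three variables and the `_trim` suffix are illustrative assumptions, not anything specified in the thread.

```stata
sysuse auto, clear                        // example dataset shipped with Stata (assumption)
foreach v of varlist price mpg weight {   // stand-ins for your own variable list
    quietly summarize `v', detail
    local iqr = r(p75) - r(p25)           // compute the IQR once per variable
    local lo  = r(p25) - 1.5 * `iqr'      // lower fence
    local hi  = r(p75) + 1.5 * `iqr'      // upper fence
    clonevar `v'_trim = `v'               // copy keeps storage type and labels; _trim is a hypothetical suffix
    quietly replace `v'_trim = . if !inrange(`v', `lo', `hi')
}
```

Leaving the originals untouched means the flagged values can still be inspected afterwards, for example with `list price if missing(price_trim) & !missing(price)`.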
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#4

02 Nov 2022, 10:14

Yes, thanks go to Nick Cox for that friendly correction. I got so caught up in other features of the question that I ignored the 1.5. <grin> Nick's suggestion makes me wonder, though, about the point relative to which outliers are to be defined. His code interprets Senor's intention as defining outliers relative to the 25th and 75th percentiles, but perhaps Senor had in mind to define them relative to some central point, e.g. median +/- 1.5 * IQR.
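For comparison, a sketch of that alternative reading, with cut-offs at median +/- 1.5 * IQR. This is only an illustration of the other interpretation, not something the original poster asked for; the `auto` dataset and variable names are assumptions.

```stata
sysuse auto, clear                        // example dataset shipped with Stata (assumption)
foreach v of varlist price mpg weight {   // stand-ins for your own variable list
    quietly summarize `v', detail
    local half = 1.5 * (r(p75) - r(p25))  // 1.5 * IQR, computed once
    // values farther than 1.5 * IQR from the median are set to missing
    quietly replace `v' = . if !inrange(`v', r(p50) - `half', r(p50) + `half')
}
```

Because the median lies between the quartiles, these limits are never wider than the quartile-based fences, so this version flags at least as many values.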
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

02 Nov 2022, 10:21

The criterion mentioned was one that Tukey in the 1970s used to identify on boxplots interesting points, at least modestly extreme, to show individually and think about. Thinking about might include, say, realising that a transformation would help mightily.

Somewhen in the last 50 years someone took that to mean bad data points to discard as quickly as possible. Where is this justified?
Comment
Senor Massao

Join Date: Dec 2015

Posts: 26
#6

02 Nov 2022, 10:36

Originally posted by Mike Lacy View Post

To my understanding, taking literally what you ask for leads to something impossible: The IQR is a property of an individual variable, so one would *have* to work on one variable at a time.

Sorry for being unclear. Yes, I think your code is in the right direction. Many thanks. As IQR is an individual property of each variable, some sort of a loop would be needed to utilize IQR from each variable.

Originally posted by Mike Lacy View Post

Also, I'm not sure what you mean by "remove." I'm going to assume that by "remove" you mean "change the value to missing,"

Yes, that is what I meant, to convert them into missing.

Originally posted by Mike Lacy View Post

and that you want to avoid having to make changes manually, one variable at a time. With that understanding, I'd suggest this loop:

Code:

foreach v of varlist YourVar1 ..... YourVar70 { // put in your own variable list
    quiet summ `v', detail
    quiet replace `v' = . if !inrange(`v', r(p25), r(p75))
}

Many thanks. I tried the code with the change suggested by Nick, as follows:

Code:

foreach v of varlist logsa93 loga772 logh897 {
    quiet summ `v', detail
    quiet replace `v' = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25))
}

Only three variables are included, for simplicity. I'm getting an error:
too few ‘)’ or ’]’ included
Any ideas where I might have gone wrong in transferring the code?
Comment
Senor Massao

Join Date: Dec 2015

Posts: 26
#7

02 Nov 2022, 10:40

Originally posted by Nick Cox View Post

I think Mike Lacy means

Code:

quiet replace `v' = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25))

although the code might be speeded up doing some of those calculations just once

Many thanks :-)

Originally posted by Nick Cox View Post

, and I think this would be terrible practice

I completely agree. I hope this is not used by anyone else without realizing the potential issues with it. I have been 'asked' to do it, and I know this is absolutely bad, to say the least. But a quicker coding solution should reduce my frustration somewhat :-)
Comment
Senor Massao

Join Date: Dec 2015

Posts: 26
#8

02 Nov 2022, 10:45

Originally posted by Nick Cox View Post

The criterion mentioned was one that Tukey in the 1970s used to identify on boxplots interesting points, at least modestly extreme, to show individually and think about. Thinking about might include, say, realising that a transformation would help mightily.

Somewhen in the last 50 years someone took that to mean bad data points to discard as quickly as possible. Where is this justified?

Agree. Transformation does help and yes, it was originally meant as part of making box plots by hand in Tukey's book from 1977. And yes, it is bad practice to remove univariate outliers like this, but as I indicated, it's part of the 'job', so even with 'disclaimers', I still need to apply this approach :-)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#9

02 Nov 2022, 11:32

#6 Needs extra ) at the end. My bad.
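With the missing parenthesis added, the loop from #6 would read roughly as follows. This is a sketch that assumes the three variables named in #6 exist in the dataset in memory.

```stata
* assumes logsa93, loga772 and logh897 (the names used in #6) are in the data in memory
foreach v of varlist logsa93 loga772 logh897 {
    quietly summarize `v', detail
    * the final ")" closes inrange(); its absence caused the error reported in #6
    quietly replace `v' = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25)))
}
```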
Comment
Senor Massao

Join Date: Dec 2015

Posts: 26
#10

02 Nov 2022, 14:05

Originally posted by Nick Cox View Post

#6 Needs extra ) at the end. My bad.

Many thanks. Kind regards, Massao
Comment
