Getting rid of entire variables due to single outlier observations

Carl Fredrik

Join Date: Oct 2020

Posts: 7
#1

11 Nov 2021, 16:29

I have a dataset where values should be between 0.25 and 1.25 in order to make sense. Values outside this range are not useful, and a variable containing them is likely not worth analyzing.
Therefore, I would like to drop any variable that has at least one observation outside this range.

Example data:
date   var1001  var2001  var3001  ...
21937  0.9      0.8      0.7
21938  0.8      0.7      0.6
21939  0.6      1.4      0.9
21940  1.0      0.2      1.0

In this example I would like to drop the entire variable var2001, and I would like to do this in an automated way, as I have 1500 variables and ~900 observations per variable.
I really can't figure out how to do it in either long or wide format.

Really grateful for any help.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

11 Nov 2021, 17:18

Code:

foreach v of varlist var* {
    summ `v', meanonly
    drop `v' if r(min) < 0.25 | r(max) > 1.25
}

will do this.

That said, this strikes me as a very extreme approach to outlying values. Perhaps in your context it makes sense, but would it not be better to investigate these outlying values to find out what the correct values would be and then substitute them in? Or perhaps just remove the offending values and perhaps consider some imputation process to deal with them? With your approach, a variable could have just one error in 900 observations and you will jettison the whole thing. You are discarding information at a breathtaking pace! Throwing out the baby with the bathwater.
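The gentler alternative mentioned above (removing only the offending values) could be sketched roughly as follows. This is a minimal illustration on the example data from #1, not a recommendation for any particular imputation strategy; setting the out-of-range values to missing is an assumption about what "remove" would mean here.

```stata
* Example data from #1
clear
input int date double(var1001 var2001 var3001)
21937 .9  .8 .7
21938 .8  .7 .6
21939 .6 1.4 .9
21940  1  .2  1
end

* Keep every variable, but set out-of-range values to missing.
* Here -if- is a qualifier: it selects the observations to change.
foreach v of varlist var* {
    replace `v' = . if !inrange(`v', 0.25, 1.25)
}

list, clean
```

Observations that are already missing also satisfy !inrange(), but replacing a missing value with missing changes nothing.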

Chen Samulsion

Join Date: Jan 2018

Posts: 923
#3

11 Nov 2021, 17:27

Dear Clyde Schechter, I ran the code in #2, but Stata reports an error. Could you please check it for me? Thank you.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int date double(var1001 var2001 var3001)
21937 .9  .8 .7
21938 .8  .7 .6
21939 .6 1.4 .9
21940  1  .2  1
end

Code:

. set trace on

. do "C:\Users\chen\AppData\Local\Temp\STD27e0_000000.tmp"

. foreach v of varlist var* {
  2.     summ `v', meanonly
  3.     drop `v' if r(min) < 0.25 | r(max) > 1.25
  4. }
- foreach v of varlist var* {
- summ `v', meanonly
= summ var1001, meanonly
- drop `v' if r(min) < 0.25 | r(max) > 1.25
= drop var1001 if r(min) < 0.25 | r(max) > 1.25
invalid syntax
  }
r(198);

end of do-file

r(198);

Another attempt is also reported as invalid:

Code:

. do "C:\Users\chen\AppData\Local\Temp\STD27e0_000000.tmp"

. foreach v of varlist var* {
  2.  drop `v' if inrange(`v',0.25,1.25)==0
  3.  }
- foreach v of varlist var* {
- drop `v' if inrange(`v',0.25,1.25)==0
= drop var1001 if inrange(var1001,0.25,1.25)==0
invalid syntax
  }
r(198);


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

11 Nov 2021, 17:33

Sorry, that was a truly dumb mistake on my part. I wrote an -if-qualifier when it should have been an -if- command:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int date double(var1001 var2001 var3001)
21937 .9  .8 .7
21938 .8  .7 .6
21939 .6 1.4 .9
21940  1  .2  1
end

foreach v of varlist var* {
    summ `v', meanonly
    if r(min) < 0.25 | r(max) > 1.25 {
        drop `v'
    }
}

By the way, your approach with -inrange()- will not work. -inrange()- cannot look at the entire variable: it looks at one observation at a time. The post in #1 calls for dropping the variable if it has any values < 0.25 or > 1.25.
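For what it is worth, -inrange()- can still take part in a whole-variable decision if its observation-level results are first tallied with -count-. The following is only a sketch of that idea on the example data. Excluding missing values from the tally is an assumption about what is wanted. It matters because, if I am reading the missing-value rules correctly, an all-missing variable would leave r(max) missing and so would be dropped by the -summarize- version.

```stata
* Example data from #1
clear
input int date double(var1001 var2001 var3001)
21937 .9  .8 .7
21938 .8  .7 .6
21939 .6 1.4 .9
21940  1  .2  1
end

foreach v of varlist var* {
    * -if- qualifier: count the observations that are out of range
    quietly count if !inrange(`v', 0.25, 1.25) & !missing(`v')
    * -if- command: one yes/no decision about the whole variable
    if r(N) > 0 {
        drop `v'
    }
}

describe
```

On the example data this leaves date, var1001 and var3001, the same result as the -summarize- loop.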

Last edited by Clyde Schechter; 11 Nov 2021, 17:35.
Chen Samulsion

Join Date: Jan 2018

Posts: 923
#5

11 Nov 2021, 17:40

Thank you Clyde, it works! I had thought there was a mistake in my do-file editor or something else. Does this mean that when we need to -drop- variables we cannot use an -if- qualifier in a loop?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

11 Nov 2021, 17:50

Even outside a loop, -drop variable(s)- does not allow an if qualifier.* If you think about it this makes perfect sense. An -if- qualifier designates which observations to apply a command to. But -drop variable- cannot be so restricted: you either drop the whole variable or you keep it. On the other hand, you may want to only drop the variable(s) if it(they) satisfy some overall condition: and the syntax for that is an -if- command. An -if- command does not designate what observations to apply commands to: rather, it designates some condition on the current state of Stata and the data that doesn't necessarily refer to any particular observations, and then the command(s) inside the braces are executed or not, depending on whether the condition is satisfied as a whole.

*Just to keep you confused: there are of course perfectly good commands like -drop if !inrange(var, 0.25, 1.25)-. The key difference is that there is no variable to be dropped here. Here the observations that satisfy the condition are to be dropped--and of course the way to distinguish which observations are the targets of this command is with an -if- qualifier. That's what -if- qualifiers do!
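The distinction can be seen side by side in a small sketch on the example data from #1. The choice of var2001 and var3001 is purely illustrative.

```stata
* Example data from #1
clear
input int date double(var1001 var2001 var3001)
21937 .9  .8 .7
21938 .8  .7 .6
21939 .6 1.4 .9
21940  1  .2  1
end

* -if- qualifier: evaluated for each observation, so this drops
* the rows (observations) where var2001 is out of range
drop if !inrange(var2001, 0.25, 1.25)

* -if- command: evaluated once, so the braces run or do not run
* as a whole; here var3001 is in range and therefore kept
summarize var3001, meanonly
if r(min) < 0.25 | r(max) > 1.25 {
    drop var3001
}

list, clean
```

After the first -drop-, only the observations for dates 21937 and 21938 remain; all variables are still present.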
Chen Samulsion

Join Date: Jan 2018

Posts: 923
#7

11 Nov 2021, 17:58

Thank you Clyde, I should re-learn -help drop-.
Carl Fredrik

Join Date: Oct 2020

Posts: 7
#8

12 Nov 2021, 02:49

Originally posted by Clyde Schechter

That said, this strikes me as a very extreme approach to outlying values. Perhaps in your context it makes sense, but would it not be better to investigate these outlying values to find out what the correct values would be and then substitute them in? Or perhaps just remove the offending values and perhaps consider some imputation process to deal with them? With your approach, a variable could have just one error in 900 observations and you will jettison the whole thing. You are discarding information at a breathtaking pace! Throwing out the baby with the bathwater.

Wow, all of this was tremendously useful for me (and apparently for others as well).
I had been trying to use an if command, but it was either throwing out all my values - or none of them.

And I agree, this is an extreme way of cleaning my data, but I'm still early in the process of looking at it, and I need this mostly to be able to explore the data. I will need to return to the outliers and identify why they are what they are, but that is a later step once I understand the data better. Most likely nearly all of those outlying data points are faulty (or due to errors/variability in the data collection process), making the entire variable uncertain and of little use. As the variables I am interested in are ratios of activity in 2019 versus 2020, they can't theoretically be conceived to vary much beyond that range (or, based on the hypothesis I'm looking at, such variation would be irrelevant for the outcome).

Each variable is connected to a geographic location with varying population, and I could instead exclude only those values that fulfill the combined criteria of low population, high variability (some arbitrary SD cutoff), AND extending beyond the given range (0.25 to ~1.25). However, this would be a later step, once I understand the data better. Also, the data I'm collecting will need to be analyzed using a different method in the coming years, as I will have more baseline data, so that I can do more than just use ratios.

I can perhaps give an example of what the data looks like (very messy), when all of the variables are plotted in a graph - this is 2020 and 2021:

Hope this makes sense, otherwise I'm very open to suggestions!

I will get back to you once I can tell if the analysis works.

EDIT:
It works!

Now I can start looking into the data in more depth (not plotting all variables in one graph). The major point is understanding what differentiates those variables within this main series that trend high from those that trend low.
Graph after cleaning:

Last edited by Carl Fredrik; 12 Nov 2021, 03:27.
