Outliers on Stata

litosbrito started a topic Outliers on Stata

23 Sep 2014, 10:08

Is there any way to identify outliers using Stata?
Nick Cox replied

04 Dec 2019, 08:20
#25 John W. Tukey proposed a rule of thumb to plot points separately on a box plot if greater than p75 + 1.5 IQR or less than p25 - 1.5 IQR.

So far, so good. This wasn't a recipe for identifying points to drop. In most cases the occurrence of outliers was, at least for Tukey, a signal to think about a transformation.
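
To give some flavour of the point about transformations, here is a minimal sketch that counts the values beyond Tukey's fences on the raw scale and again on a logarithmic scale. The choice of the auto dataset, the variable price, and the names log_price, lo and hi are assumptions made for illustration only; nothing is dropped.

```stata
sysuse auto, clear

* Tukey fences on the raw scale, stored in locals before any other command runs
quietly summarize price, detail
local lo = r(p25) - 1.5*(r(p75) - r(p25))
local hi = r(p75) + 1.5*(r(p75) - r(p25))
count if (price < `lo' | price > `hi') & !missing(price)

* the same rule after a log transformation
generate double log_price = ln(price)
quietly summarize log_price, detail
local lo = r(p25) - 1.5*(r(p75) - r(p25))
local hi = r(p75) + 1.5*(r(p75) - r(p25))
count if (log_price < `lo' | log_price > `hi') & !missing(log_price)

* compare the two box plots side by side
graph box price, name(raw, replace)
graph box log_price, name(logged, replace)
```

If far fewer points are plotted separately on the log scale, that is a hint that the "outliers" are mostly a feature of skewness, not of bad data.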
Kithinji Charles replied

04 Dec 2019, 07:46
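* This example shows how to flag outliers using quartiles (Tukey's 1.5 IQR rule)
clear
input x
1
2
12
14
15
14
16
15
14
98
76
end
* show the outliers using a box plot
graph box x
* summarize with detail so that the quartiles are left in r(p25) and r(p75)
summarize x, detail
return list
* flag values beyond the fences as 1, others as 0; missing values stay missing
generate byte x_outlier = (x < r(p25) - 1.5*(r(p75) - r(p25)) | x > r(p75) + 1.5*(r(p75) - r(p25))) if !missing(x)
* keep only the flagged observations (this discards the rest of the data in memory)
keep if x_outlier == 1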
Denila Jinny replied

25 Oct 2018, 08:30
Originally posted by Nick Cox:

I don't know anything you don't about SEM. My advice is to start a new thread with a title like "Non-normality and structural equation models" so that people who know about SEM can see that. Also, I would show some graphs of the distributions of your continuous variables to give us some flavour.

OK. Thank you very much.
Nick Cox replied

25 Oct 2018, 05:42
I don't know anything you don't about SEM. My advice is to start a new thread with a title like "Non-normality and structural equation models" so that people who know about SEM can see that. Also, I would show some graphs of the distributions of your continuous variables to give us some flavour.
Denila Jinny replied

25 Oct 2018, 04:41
Originally posted by Nick Cox:

Denila Jinny: Not at all. A large sample can be highly non-normal too. To give a better answer, we need to know more about your data and your goals, especially on whether or why you think your data "should be" normal.

Thank you very much, sir, for your immediate reply.
I am working with cross-sectional data. My objective is to study the causal relationships between funding, profitability and productivity. The literature suggests bi-directional relationships among these variables. Therefore I intend to fit a non-recursive SEM, one of the assumptions of which is normality. I have 4 continuous variables, 1 interaction variable that interacts 2 continuous variables, 2 interaction variables that each interact one continuous variable with 1 dichotomous variable, and a few other categorical variables. Will this be enough for you to help me with this issue?
Nick Cox replied

25 Oct 2018, 04:33
Denila Jinny: Not at all. A large sample can be highly non-normal too. To give a better answer, we need to know more about your data and your goals, especially on whether or why you think your data "should be" normal.
Denila Jinny replied

25 Oct 2018, 04:22
Originally posted by Nick Cox:

"Think on a logarithmic scale" solves many more problems than eliminating outliers.

Sir,

What should I do if the logged variable is also not normal?

I have a dataset with 9000 observations. Can I assume normality just because the sample is large?

Denila.
Wesley Mokkink replied

20 Apr 2016, 08:34
Steve Samuels: I started a new thread:

http://www.statalist.org/forums/forum/general-stata-discussion/general/1336660-remove-outliers-on-stata
Steve Samuels replied

20 Apr 2016, 08:24
You have now tacked a question on to a thread that was closed over a year ago. Start a new thread.
Wesley Mokkink replied

20 Apr 2016, 08:12
Dear Nick,

I installed the "extremes" code written by you. I would like to use it to remove extreme values in my sample. However, I do not know how to actually remove those extreme values instead of just listing them. Is there any way to do this?

Thanks in advance!

Kind regards,

Wesley
Attaullah Shah replied

24 Sep 2014, 04:08
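Nick is right in his point about multivariate outliers. As a matter of fact, I have seen many papers in finance that winsorize or drop values that are more than 3 SD away from the mean. In that case, we can adopt the following code:

sysuse auto, clear
foreach x of varlist price mpg {
    * mean and SD are recomputed for each variable on the observations still in memory
    summarize `x'
    * drop values more than 3 SD from the mean on either side; missing values are kept
    drop if abs(`x' - r(mean)) > 3*r(sd) & !missing(`x')
}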

Regards
Attaullah Shah

Last edited by Attaullah Shah; 24 Sep 2014, 04:10.
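
For the winsorizing alternative mentioned above, a minimal sketch might look like the following. It pulls values back to the mean plus or minus 3 SD instead of dropping them. The cutoff of 3 SD follows the post; the new variable names with the suffix _w and the locals lo and hi are assumptions for illustration, and this is not taken from any particular paper.

```stata
sysuse auto, clear
foreach x of varlist price mpg {
    quietly summarize `x'
    local lo = r(mean) - 3*r(sd)
    local hi = r(mean) + 3*r(sd)
    * clamp to [lo, hi]; the if qualifier leaves missing values missing
    generate double `x'_w = min(max(`x', `lo'), `hi') if !missing(`x')
}
* compare the original and winsorized versions
summarize price price_w mpg mpg_w
```

The original variables are left untouched, so the effect of the winsorizing can be checked before it is used in any analysis.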
Carlo Lazzaro replied

24 Sep 2014, 03:52
Dear Nick,
I "was just plucking a criterion out of the air as an example..."

All your remarks are, as always, sound.

Kind regards,
Carlo
Nick Cox replied

24 Sep 2014, 03:37
Carlo wrote code for an indicator variable flagging values more than 3 times the standard deviation (SD). But consider a bundle of countries with life expectancy mean 60 years and SD 10 years. Then all values >30 years would be flagged as outliers, but not those with <30 years (which on most other criteria would be staggering outliers).

I guess Carlo was just plucking a criterion out of the air as an example, or he was thinking about some criterion of the form |value - mean| > k SD, but made a slip in his coding. Even then the mean and SD are both likely to be strongly affected by outliers when they exist, so wouldn't we be better off using median and interquartile range (IQR), say, as the basis for any rule of thumb?

Hang on: we are rediscovering box plot criteria. For those who want tables, I wrote extremes (SSC) but don't use it much. It deliberately (or so I suppose) doesn't offer hooks for dropping outliers, which is almost always bad practice in my view.

P.S. A side-effect of Carlo's code is that missing values will be flagged as 1. Taking his criterion literally, the code might be rewritten

Code:

foreach var of varlist A-C {
    quietly summarize `var'
    g Z_`var' = (`var' > 3*r(sd)) if `var' < .
    list `var' Z_`var' if Z_`var' == 1
}
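
If the criterion intended was instead of the form |value - mean| > k SD, as guessed above, a sketch with k = 3 could run as follows. The auto dataset and the variables price, mpg and weight stand in for A-C and are assumptions for illustration; the caveat that the mean and SD are themselves affected by outliers still applies.

```stata
sysuse auto, clear
foreach var of varlist price mpg weight {
    quietly summarize `var'
    * 1 if more than 3 SD from the mean on either side, 0 otherwise, missing if missing
    generate byte Z_`var' = (abs(`var' - r(mean)) > 3*r(sd)) if !missing(`var')
    list make `var' if Z_`var' == 1
}
```

This only flags and lists values; it does not drop anything.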

A bigger problem is that looking for univariate outliers is only part of the problem. It's entirely possible to have bivariate outliers that aren't univariate outliers, trivariate outliers that aren't bivariate outliers, and so forth. There are naturally ways of finding these.
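
One such way, offered only as a sketch and not as the method meant in the post, is the squared Mahalanobis distance of each observation from the centre of the data. The dataset, the pair price and mpg, the new variable name md2, and the chi-squared cutoff (which leans on rough bivariate normality) are all assumptions here. The distance is built from the mean and covariance matrix, so it inherits their sensitivity to outliers.

```stata
sysuse auto, clear

mata:
X  = st_data(., ("price", "mpg"))     // n x 2 data matrix; neither variable has missing values
mu = mean(X)                          // 1 x 2 row vector of column means
S  = variance(X)                      // 2 x 2 covariance matrix
D  = X :- mu                          // deviations from the means
d2 = rowsum((D * invsym(S)) :* D)     // squared Mahalanobis distance for each row
st_store(., st_addvar("double", "md2"), d2)
end

* list observations beyond the 97.5% point of chi-squared with 2 degrees of freedom
list make price mpg md2 if md2 > invchi2(2, 0.975)
```

A scatter plot of mpg against price is still the first thing to look at; the distance is just a way of putting numbers on what the plot shows.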
litosbrito replied

24 Sep 2014, 00:52
Thank you for the suggestion, Carlo Lazzaro! I will try it!

Thank you Maarten Buis for the suggestion of analysis!