winsor2 advice

iliana wouters

Join Date: Apr 2017

Posts: 12
#1

winsor2 advice

18 Apr 2017, 03:40

Hi,

I'm starting to use Stata. I have read about winsorizing data, but I have difficulty choosing between cuts(1 99) and cuts(2.5 97.5) for some variables.
Should I choose whichever gives the smallest variance?
My variable GROWTH, for example, has a variance of 800 and a standard deviation of almost 29.

Kind regards,
Iliana
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

18 Apr 2017, 03:56

The more you Winsorize, the smaller the variance will be. That's not a criterion for choice, but an inevitable side-effect. Your difficulty over choosing is precisely the difficulty that methods of this kind pose: how to choose without arbitrariness.

Why precisely do you want to Winsorize anyway?
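
A minimal sketch of that side-effect, in Python rather than Stata. The data are simulated and only a hypothetical stand-in for GROWTH. The helper names `percentile` and `winsorize` are made up for this illustration, and the linear-interpolation percentile rule is an assumption that may not match what winsor2 (a user-written command) does internally. The point is only that tighter cuts cannot increase the variance, so variance cannot guide the choice of cuts.

```python
import random
import statistics


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in 0..100) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("empty data")
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def winsorize(values, lower_pct, upper_pct):
    """Replace values below/above the given percentiles by those percentiles."""
    s = sorted(values)
    lo = percentile(s, lower_pct)
    hi = percentile(s, upper_pct)
    return [min(max(v, lo), hi) for v in values]


random.seed(12345)
# Hypothetical data: mostly moderate values plus a few extreme ones.
growth = [random.gauss(5, 10) for _ in range(980)]
growth += [random.uniform(-200, 300) for _ in range(20)]

var_raw = statistics.variance(growth)
var_1_99 = statistics.variance(winsorize(growth, 1, 99))
var_25_975 = statistics.variance(winsorize(growth, 2.5, 97.5))

print("variance, raw:            ", round(var_raw, 1))
print("variance, cuts(1 99):     ", round(var_1_99, 1))
print("variance, cuts(2.5 97.5): ", round(var_25_975, 1))
```

Whatever the data, the three variances come out in non-increasing order, because clipping to a narrower interval can only pull values closer together.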
Comment
iliana wouters

Join Date: Apr 2017

Posts: 12
#3

18 Apr 2017, 04:17

Nick,
Thanks for your response. I want to winsorize to diminish the effect of my outliers. I have a lot of outliers, so I can't just delete them all.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

18 Apr 2017, 05:00

What is your criterion for outliers? You do need one to apply what you are asking about.

But you don't have to delete any of them. You can for example use a transformation or a non-identity link. Quite what will work best depends on your data, on which we need further information.

There are perhaps twenty different approaches to outliers (depending naturally on how you count), many of which are listed in
https://stats.stackexchange.com/ques...iers-with-mean

Here is a clear example in which somebody was convinced they had a problem, but all they needed was to think on a logarithmic scale.
https://stats.stackexchange.com/ques...stributed-data

Conversely, we don't have more information than you do on which to make arbitrary decisions for your data and given your goals.
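
A minimal sketch of the transformation idea, in Python rather than Stata. The data are simulated lognormal values, so they are strictly positive; a variable with zeros or negative values, as a growth rate may well have, could not be logged directly. The 1.5 × IQR rule and the helper name `iqr_flags` are assumptions chosen only to have some explicit criterion for "outlier"; they are not a recommendation.

```python
import math
import random
import statistics


def iqr_flags(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


random.seed(2017)
# Hypothetical right-skewed, strictly positive data.
x = [random.lognormvariate(0, 1) for _ in range(2000)]

n_raw = len(iqr_flags(x))                         # flagged on the raw scale
n_log = len(iqr_flags([math.log(v) for v in x]))  # flagged on the log scale

print("flagged on raw scale:", n_raw)
print("flagged on log scale:", n_log)
```

Under these assumptions, far fewer values are flagged once the data are viewed on a logarithmic scale: most of the apparent outliers were a consequence of skewness, not of anything wrong with the observations.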
1 like
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#5

18 Apr 2017, 06:59

You say:

I have a lot of outliers

I look at outliers in the following way: an outlier is a value that is surprising given my model of the data. Note that the "model of the data" is often implicit. If it is, you need to make it at least partly explicit because, with a lot of outliers, there is a good chance that it is your model that is wrong and that you should be using a different one (e.g., you may be expecting your data to be symmetric around the mean when they are actually skewed). For more on this way of looking at outliers, see Barnett, V. and Lewis, T. (1994), Outliers in Statistical Data, 3rd edition, Wiley.
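
One way to start making an implicit symmetry assumption explicit is to check skewness directly. Here is a minimal sketch in Python rather than Stata, using simulated data; the helper name `sample_skewness` is made up for this illustration, and it computes the simple moment-based coefficient.

```python
import random
import statistics


def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with moments taken about the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(42)
symmetric = [random.gauss(0, 1) for _ in range(5000)]      # symmetric model holds
skewed = [random.expovariate(1.0) for _ in range(5000)]    # right-skewed data

skew_sym = sample_skewness(symmetric)
skew_exp = sample_skewness(skewed)

print("skewness, symmetric sample:", round(skew_sym, 2))
print("skewness, skewed sample:   ", round(skew_exp, 2))
```

A skewness far from zero suggests that many of the "outliers" are what a skewed distribution is expected to produce. In Stata, `summarize varname, detail` reports skewness alongside the percentiles.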
1 like
Comment
