outliers: the variable has some outliers but cannot be winsorized

Eddie Pengma

Join Date: May 2018

Posts: 38
#1

31 May 2018, 19:59

Dear all,

Thank you in advance. I have run into a problem: a variable clearly has outliers (as shown by graph box), but winsor does not work for some values of p. For example,

. sysuse auto, clear
(1978 Automobile Data)

. adjacent price

------------------------------------------
    price | lower adjacent  upper adjacent
----------+-------------------------------
        . |           3291            8814
------------------------------------------

. graph box price

. winsor price, gen(price_n) p(0.01)
0 values to be Winsorized
r(198);

When I change p to 0.02 or 0.05, it works. In that case, which value of p should I use? With 0.02 the variable still has some outliers. With 0.05 the number of outliers falls, but some remain.
The number of outliers decreases as I increase p, but I wonder whether I need to pull all the outliers back into the normal range before doing any further analysis. This is just one variable; should I apply the same logic to all of the variables?

Best,

Eddie
Joseph Coveney

Join Date: Apr 2014

Posts: 4396
#2

31 May 2018, 21:26

You seem to be banging your head against the wall with this preliminary winsorization step. Perhaps you might want to stop for a moment, step back and entertain an alternative tack. Take a look at the following simple, three-step approach and see whether it (or something along the lines of it) might yield more progress toward your eventual goal.

Step 1. regress price <whatever>

Step 2. rvfplot

Step 3. Discuss
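
As a minimal sketch of Steps 1 and 2 on the auto data from the question (the predictors mpg, weight and foreign are only placeholders assumed here for illustration; substitute whatever your model calls for):

```stata
sysuse auto, clear
* Step 1: fit the model; the right-hand-side variables are illustrative assumptions
regress price mpg weight foreign
* Step 2: plot residuals against fitted values, with a reference line at zero
rvfplot, yline(0)
```

A fan shape or curvature in that plot would point toward the functional form or the error variance, not necessarily toward individual observations.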
Eddie Pengma

Join Date: May 2018

Posts: 38
#3

31 May 2018, 21:53

Thank you, Joseph. But the existence of outliers has a strong impact on the regression results, so I wonder whether we need to deal with them before running the regression?
Joseph Coveney

Join Date: Apr 2014

Posts: 4396
#4

31 May 2018, 22:16

A couple of points to consider if you haven't already:

1. In most cases, you don't really know whether you have outliers by looking at the outcome variable in isolation. And you don't really know which of them are the outliers from inspection of the outcome variable alone. A datum that appears to you to be an outlier in the outcome variable might have a very small residual.

2. In most cases, you never really know whether you have outliers at all by inspection alone. What appear to be outliers in the residuals might very well be your first hint that your model needs further thought (omitted variables, change in functional form of the regression and so on). If you have values that are physically / biologically / economically impossible realizations of the phenomenon under study, then you have measurement issues that need to be addressed further upstream toward the source. Winsorization is probably doing you a disservice in that case.
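
To illustrate point 1, something along these lines lists the cars that the box plot flags on price next to their studentized residuals. This is only a sketch: the predictors are assumed for illustration, 3291 and 8814 are the adjacent values reported in the first post, and the variable names rstud and flagged are made up here.

```stata
sysuse auto, clear
* illustrative model; the predictors are assumptions, not a recommendation
regress price mpg weight foreign
* studentized residuals from the fitted model
predict double rstud, rstudent
* observations outside the adjacent values reported by -adjacent price-
generate byte flagged = (price < 3291 | price > 8814) & !missing(price)
list make price rstud if flagged, noobs
```

Any flagged car with a small studentized residual is one the model accounts for reasonably well, despite how it looks in the box plot.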

There are diagnostic plots that can help home in on which data have a "strong impact". If you haven't done so already, take a look at the variety of diagnostic plots brought up by

Code:

help regress postestimation plots

But again, Step 3 above is probably what you will want to start anticipating at the outset.
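
For instance, after a fit like the one sketched here (predictors again assumed purely for illustration), lvr2plot and avplots are among the commands listed under that help entry, and predict with the cooksd option is documented under regress postestimation. The 4/N cutoff for Cook's distance is just a common rule of thumb, not a test, and the new variable name cooksd is made up here.

```stata
sysuse auto, clear
* illustrative model; the predictors are assumptions
regress price mpg weight foreign
* leverage versus squared normalized residuals, labelled by car
lvr2plot, mlabel(make)
* added-variable plots, one panel per predictor
avplots
* Cook's distance as a numeric measure of influence
predict double cooksd, cooksd
list make price cooksd if cooksd > 4/e(N) & !missing(cooksd), noobs
```

Observations that stand out here are candidates for a closer look at the data source and the model, which feeds directly into Step 3.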
Eddie Pengma

Join Date: May 2018

Posts: 38
#5

01 Jun 2018, 00:24

Thank you, Joseph, for your detailed explanations.
Amin Sofla

Join Date: May 2018

Posts: 67
#6

01 Jun 2018, 03:10

Joseph explained this clearly.
You might also want to take a look at articles such as Aguinis et al. (2013).
Eddie Pengma

Join Date: May 2018

Posts: 38
#7

01 Jun 2018, 04:21

Thank you Joseph and Amin!
