
  • Trimming Bottom and Top 0.1%

    Hello Statalist,

    I would like to trim my data to ignore outliers: observations of the variable return in the top or bottom 0.1% should be excluded. I use the following command:

    Code:
    gen returntrim=return if return>=r(p 0.1) & return<=r(p 99.9)

    However, I got this error result:
    p0.1 invalid name
    r(198);

    When I try the top or bottom 1% instead, it works:

    Code:
    gen returntrim=return if return>=r(p1) & return<=r(p99)

    Does anyone know why this happens? Is trimming only possible for percentiles from 1% upward?
    I would appreciate any help and comments.
    Thank you.

    Regards,
    Rozita

  • #2
    Hi Rozita,

    Option 1
    1) I would try the following:
    Code:
    ssc install winsor2
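
    After installing, a minimal sketch of a call that would do the trimming asked about in #1 (assuming the variable is named return as there; winsor2's trim option sets the trimmed tail values to missing, and suffix() names the new variable):
    Code:
    winsor2 return, cuts(0.1 99.9) trim suffix(_tr)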

    Option 2
    1) Alternatively, you could make two new columns, one holding the 99.9% value and one holding the 0.1% value.
    2) Then compare your existing column against those two values and generate a new column that excludes observations above the 99.9% value or below the 0.1% value (see the sketch below).
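
    A minimal sketch of that approach (again assuming the variable is named return, as in #1). Unlike summarize, _pctile accepts fractional percentiles and returns them as r(r1), r(r2):
    Code:
    * store the 0.1st and 99.9th percentiles in r(r1) and r(r2)
    _pctile return, percentiles(0.1 99.9)
    gen p001 = r(r1)   // 0.1% value
    gen p999 = r(r2)   // 99.9% value
    gen returntrim = return if return >= p001 & return <= p999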



    • #3
      Rozita:
      the usual foreword is that trimming/deleting so-called "outliers" is in general a bad idea (unless they are apparent mistakes, such as data-entry errors).
      That said, what follows might be helpful:
      Code:
      . sysuse auto.dta
      (1978 Automobile Data)
      
      . centile price, centile(5 50 95)
      
                                                             -- Binom. Interp. --
          Variable |       Obs  Percentile    Centile        [95% Conf. Interval]
      -------------+-------------------------------------------------------------
             price |        74          5     3727.75        3291.232    3914.159
                   |                   50      5006.5        4593.566    5717.898
                   |                   95       13498        11061.53     15865.3
      
      . return list
      
      scalars:
                   r(n_cent) =  3
                        r(N) =  74
                     r(ub_3) =  15865.30130651409
                     r(lb_3) =  11061.53233673295
                      r(c_3) =  13498
                     r(ub_2) =  5717.897793287274
                     r(lb_2) =  4593.566284952721
                      r(c_2) =  5006.5
                     r(ub_1) =  3914.158992888473
                     r(lb_1) =  3291.231571513433
                      r(c_1) =  3727.75
      
      macros:
                 r(centiles) : "5 50 95"
      
      . scalar A = r(c_1)
      
      . scalar B = r(c_2)
      
      . scalar C = r(c_3)
      
      . g flag=1 if price >= A & price <= C  // flag observations inside the 5th-95th centile range
      PS: cyber-crossed with Jimmy's reply.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        summarize does not produce the 0.1% or 99.9% percentiles, as is documented in its help; and even if it did, the returned results could not be called r(p 0.1) or r(p 99.9), as names in Stata cannot include spaces or periods.
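
        To see which percentiles summarize actually returns, and one built-in way to get fractional ones (a sketch using the auto data; centile stores the requested values as r(c_1), r(c_2), ...):
        Code:
        sysuse auto, clear
        summarize price, detail
        return list                       // only r(p1), r(p5), ..., r(p99)
        centile price, centile(0.1 99.9)  // fractional percentiles are accepted here
        display r(c_1) "  " r(c_2)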

        If you wanted trimmed means, then check out

        Code:
        search trimming
        Code:
        SJ-13-3 st0313  . . . . . . . . . . . . . .  Speaking Stata: Trimming to taste
                (help trimmean, trimplot if installed)  . . . . . . . . . .  N. J. Cox
                Q3/13   SJ 13(3):640--666
                tutorial review of trimmed means, emphasizing the scope for
                trimming to varying degrees in describing and exploring data
        If you want points outside cumulative probabilities (0.001, 0.999) to be set to missing, here is one way:

        Code:
        . webuse nlswork, clear
        (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
        
        . ssc inst winsor
        
        . winsor ttl_exp, gen(ttl_exp2) p(0.001)
        
        . su ttl_exp*
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
             ttl_exp |     28,534    6.215316    4.652117          0   28.88461
            ttl_exp2 |     28,534    6.213207    4.643781   .0384615   22.48077
        
        . replace ttl_exp2 = . if ttl_exp2 != ttl_exp
        (53 real changes made, 53 to missing)
        although I don't approve! Note that univariate trimming is not guaranteed to catch bivariate or multivariate outliers, for one.



        • #5
          Thank you Jimmy Chung, Carlo Lazzaro and Nick Cox.

          Jimmy, I have installed winsor2 via ssc install winsor2.

          Carlo, I will take your advice into account. If I'm not mistaken, you are saying that removing the outliers is not a good idea; rather than removing them, we should keep those observations in the analysis.

          Nick, I have tried your example. Please pardon my English; I am confused by your statement that univariate trimming is not guaranteed to catch bivariate or multivariate outliers. Does it mean that the code
          
          Code:
          winsor ttl_exp, gen(ttl_exp2) p(0.001)
          
          only trims at the 0.1% level, and not at both 0.1% and 99.9%?

          I also tried this command, which I found by searching the forum:
          
          Code:
          . winsor2 ttl_exp, replace cuts(0.1 99.9) trim by(year)
          
          . su ttl_exp*
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
               ttl_exp |     28,496    6.208994     4.63594          0   26.84615
              ttl_exp2 |     28,534    6.213207    4.643781   .0384615   22.48077
          
          . replace ttl_exp2 = . if ttl_exp2 != ttl_exp
          (80 real changes made, 80 to missing)
          I do not fully understand the code, though. Do the cuts() option and the trim option mean different things?

          Thank you.


            • #7
              On what winsor does, do please read its help. Also, the example in #4 makes it explicit that it works on the upper tail too.

              On outliers, I mean this kind of thing:

              Code:
              clear
              set scheme s1color
              set seed 2803
              set obs 1000
              matrix C = (1, .9 \ .9, 1)
              corr2data y x, corr(C)
              replace x = 2 in 666
              replace y = -2 in 666
              scatter y x || scatter y x in 666, ms(S) msize(*2) legend(off)
              There's (by construction) an outlier at (2, -2). Will trimming catch it?

              Incidentally, if you have enough data to calculate 0.1% and 99.9% percent points, you are looking for 0.2% outliers. Won't the other 99.8% usually counteract them enough?
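
              One way to check, continuing the code above (a sketch: with 1,000 observations, the 0.1% and 99.9% points of x and y lie well beyond ±2, so the planted outlier should survive both univariate trims):
              Code:
              * does univariate trimming flag observation 666?
              _pctile x, percentiles(0.1 99.9)
              display "x inside trim bounds: " (x[666] >= r(r1) & x[666] <= r(r2))
              _pctile y, percentiles(0.1 99.9)
              display "y inside trim bounds: " (y[666] >= r(r1) & y[666] <= r(r2))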
              [Attached image problem.png: the scatter plot from the code above, with the constructed outlier at (2, -2) highlighted]

              Last edited by Nick Cox; 16 May 2016, 06:25.



              • #8
                Thank you Nick Cox for the explanation. I need to read more in order to deeply understand the mechanism.



                • #9
                  Rozita:
                  yes, that is exactly what I meant.
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Carlo is right that coarse approaches to handling outliers are extremely questionable. There is a wide range of approaches to outliers, including robust regression (various approaches that generally do not weight errors by their squares), winsorizing, Cook's d, leverage, dfbeta, etc. All of these have drawbacks. For various reasons, we have relatively few general econometric findings about outliers. On the other hand, in many large data sets with many estimators, if you don't do something about outliers, your results can be driven by a very few observations. Most of us don't want this.
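
                    As an illustration only (a minimal sketch using Stata's built-in rreg on the auto data, not a recommendation of any particular method): robust regression downweights observations with large residuals instead of deleting them, and genwt() lets you inspect the weights it assigned:
                    Code:
                    sysuse auto, clear
                    rreg price weight foreign, genwt(w)  // w holds the final robust-regression weights
                    list make price w if w < .5          // heavily downweighted observations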

                    I work with return on assets frequently. Average ROA runs around .05, but firms with almost no remaining assets can have extreme values for ROA. One observation of 5 or -5 can have more influence than a great many around .05 (because the squared-error criterion makes .05 into .0025 and 5 into 25).

                    Pragmatically, you probably want to do what people in your area do. You'd need to be a lot more expert on this subject before you'd want to open it up as an issue to debate with referees.

                    Phl
