
  • Trimming Bottom and Top 0.1%

    Hello Statalist,

    I would like to trim my data to ignore outliers: observations of the variable return in the top or bottom 0.1% should be excluded. I use the following command:

    Code:
    gen returntrim=return if return>=r(p 0.1) & return<=r(p 99.9)

    However, I got this error result:
    p0.1 invalid name
    r(198);

    When I try the top or bottom 1% instead, it works:

    Code:
    gen returntrim=return if return>=r(p1) & return<=r(p99)

    Does anyone know why this happens? Is trimming only possible for percentiles from 1% upward?
    I would appreciate any help and comments.
    Thank you.

    Regards,
    Rozita

  • #2
    Hi Rozita,

    Option 1
    1) I would try the following:
    Code:
    ssc install winsor2
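
    After installing, a minimal sketch of a call that would do the trimming asked about in #1 (assuming the variable is named return as there; winsor2's trim option sets the trimmed tail values to missing, and suffix() names the new variable):
    Code:
    winsor2 return, cuts(0.1 99.9) trim suffix(_tr)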

    Option 2
    1) Alternatively, you could make two new columns, one holding the 99.9% value and one holding the 0.1% value.
    2) Then compare your existing column against those two values and generate a new column that excludes observations above the 99.9% value or below the 0.1% value (see the sketch below).
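
    A minimal sketch of that approach (again assuming the variable is named return, as in #1). Unlike summarize, _pctile accepts fractional percentiles and returns them as r(r1), r(r2):
    Code:
    * store the 0.1st and 99.9th percentiles in r(r1) and r(r2)
    _pctile return, percentiles(0.1 99.9)
    gen p001 = r(r1)   // 0.1% value
    gen p999 = r(r2)   // 99.9% value
    gen returntrim = return if return >= p001 & return <= p999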



    • #3
      Rozita:
      the usual foreword is that trimming/deleting so-called "outliers" is in general a bad idea (unless they are apparent mistakes, such as data-entry errors).
      That said, what follows might be helpful:
      Code:
      . sysuse auto.dta
      (1978 Automobile Data)
      
      . centile price, centile(5 50 95)
      
                                                             -- Binom. Interp. --
          Variable |       Obs  Percentile    Centile        [95% Conf. Interval]
      -------------+-------------------------------------------------------------
             price |        74          5     3727.75        3291.232    3914.159
                   |                   50      5006.5        4593.566    5717.898
                   |                   95       13498        11061.53     15865.3
      
      . return list
      
      scalars:
                   r(n_cent) =  3
                        r(N) =  74
                     r(ub_3) =  15865.30130651409
                     r(lb_3) =  11061.53233673295
                      r(c_3) =  13498
                     r(ub_2) =  5717.897793287274
                     r(lb_2) =  4593.566284952721
                      r(c_2) =  5006.5
                     r(ub_1) =  3914.158992888473
                     r(lb_1) =  3291.231571513433
                      r(c_1) =  3727.75
      
      macros:
                 r(centiles) : "5 50 95"
      
      . scalar A = r(c_1)
      
      . scalar B = r(c_2)
      
      . scalar C = r(c_3)
      
      . g flag=1 if price >= A & price <= C  // flag observations inside the 5th-95th centile range
      PS: cyber-crossed with Jimmy's reply.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        summarize does not produce the 0.1% or 99.9% percentiles, as is documented in its help; and even if it did, the returned results could not be called r(p 0.1) or r(p 99.9), as names in Stata cannot include spaces or periods.
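
        To see which percentiles summarize actually returns, and one built-in way to get fractional ones (a sketch using the auto data; centile stores the requested values as r(c_1), r(c_2), ...):
        Code:
        sysuse auto, clear
        summarize price, detail
        return list                       // only r(p1), r(p5), ..., r(p99)
        centile price, centile(0.1 99.9)  // fractional percentiles are accepted here
        display r(c_1) "  " r(c_2)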

        If you wanted trimmed means, then check out

        Code:
        search trimming
        Code:
        SJ-13-3 st0313  . . . . . . . . . . . . . .  Speaking Stata: Trimming to taste
                (help trimmean, trimplot if installed)  . . . . . . . . . .  N. J. Cox
                Q3/13   SJ 13(3):640--666
                tutorial review of trimmed means, emphasizing the scope for
                trimming to varying degrees in describing and exploring data
        If you want points outside cumulative probabilities (0.001, 0.999) to be set to missing, here is one way:

        Code:
        . webuse nlswork, clear
        (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
        
        . ssc inst winsor
        
        . winsor ttl_exp, gen(ttl_exp2) p(0.001)
        
        . su ttl_exp*
        
            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
             ttl_exp |     28,534    6.215316    4.652117          0   28.88461
            ttl_exp2 |     28,534    6.213207    4.643781   .0384615   22.48077
        
        . replace ttl_exp2 = . if ttl_exp2 != ttl_exp
        (53 real changes made, 53 to missing)
        although I don't approve! Note that univariate trimming is not guaranteed to catch bivariate or multivariate outliers, for one.



        • #5
          Thank you Jimmy Chung, Carlo Lazzaro and Nick Cox.

          Jimmy, I have installed winsor2 via ssc install winsor2.

          Carlo, I will take your advice into account. If I'm not mistaken, you are saying that removing the outliers is not a good idea; rather than removing them, we should keep those observations in the analysis.

          Nick, I have tried your example. Please pardon my English; I am confused by your statement that univariate trimming is not guaranteed to catch bivariate or multivariate outliers. Does it mean that the code
          
          Code:
          winsor ttl_exp, gen(ttl_exp2) p(0.001)
          
          only trims at the 0.1% level, and not at both 0.1% and 99.9%?

          I also tried this command, which I found by searching the forum:
          
          Code:
          . winsor2 ttl_exp, replace cuts(0.1 99.9) trim by(year)
          
          . su ttl_exp*
          
              Variable |        Obs        Mean    Std. Dev.       Min        Max
          -------------+---------------------------------------------------------
               ttl_exp |     28,496    6.208994     4.63594          0   26.84615
              ttl_exp2 |     28,534    6.213207    4.643781   .0384615   22.48077
          
          . replace ttl_exp2 = . if ttl_exp2 != ttl_exp
          (80 real changes made, 80 to missing)
          I do not fully understand the code, though. Do the cuts() option and the trim option mean different things?

          Thank you.


            • #7
              On what winsor does, do please read its help. Also, the example in #4 makes it explicit that it works on the upper tail too.

              On outliers, I mean this kind of thing:

              Code:
              clear
              set scheme s1color
              set seed 2803
              set obs 1000
              matrix C = (1, .9 \ .9, 1)
              corr2data y x, corr(C)
              replace x = 2 in 666
              replace y = -2 in 666
              scatter y x || scatter y x in 666, ms(S) msize(*2) legend(off)
              There's (by construction) an outlier at (2, -2). Will trimming catch it?

              Incidentally, if you have enough data to calculate 0.1% and 99.9% percent points, you are looking for 0.2% outliers. Won't the other 99.8% usually counteract them enough?
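
              One way to check, continuing the code above (a sketch: with 1,000 observations, the 0.1% and 99.9% points of x and y lie well beyond ±2, so the planted outlier should survive both univariate trims):
              Code:
              * does univariate trimming flag observation 666?
              _pctile x, percentiles(0.1 99.9)
              display "x inside trim bounds: " (x[666] >= r(r1) & x[666] <= r(r2))
              _pctile y, percentiles(0.1 99.9)
              display "y inside trim bounds: " (y[666] >= r(r1) & y[666] <= r(r2))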
              [Attached image problem.png: the scatter plot from the code above, with the constructed outlier at (2, -2) highlighted]

              Last edited by Nick Cox; 16 May 2016, 06:25.



              • #8
                Thank you Nick Cox for the explanation. I need to read more in order to deeply understand the mechanism.



                • #9
                  Rozita:
                  yes, that is exactly what I meant.
                  Kind regards,
                  Carlo
                  (Stata 19.0)



                  • #10
                    Carlo is right that coarse approaches to handling outliers are extremely questionable. There is a wide range of approaches to outliers, including robust regression (various approaches that generally do not weight errors by their squares), winsorizing, Cook's d, leverage, dfbeta, etc. All of these have drawbacks. For various reasons, we have relatively few general econometric findings about outliers. On the other hand, in many large data sets with many estimators, if you don't do something about outliers, your results can be driven by a very few observations. Most of us don't want this.
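
                    As an illustration only (a minimal sketch using Stata's built-in rreg on the auto data, not a recommendation of any particular method): robust regression downweights observations with large residuals instead of deleting them, and genwt() lets you inspect the weights it assigned:
                    Code:
                    sysuse auto, clear
                    rreg price weight foreign, genwt(w)  // w holds the final robust-regression weights
                    list make price w if w < .5          // heavily downweighted observations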

                    I work with return on assets frequently. Average ROA runs around .05, but firms with almost no remaining assets can have extreme values for ROA. One observation of 5 or -5 can have more influence than a great many around .05 (because the squared-error criterion makes .05 into .0025 and 5 into 25).

                    Pragmatically, you probably want to do what people in your area do. You'd need to be a lot more expert on this subject before you'd want to open it up as an issue to debate with referees.

                    Phl
