
  • Trimming percentiles within each age category

    Hello Stata users,
    I have come across code for trimming data globally, but not what I am specifically looking for. I have two variables - age and a score. I want to remove the 1st and 99th percentile of the score within each age group. For instance, I'd like to remove the 1st and 99th percentile of the score for everyone who is 18 years old. Then I'd do the same for the 19-year-olds in the dataset, and so on up to 110 years old.

    I have thought of creating age categories individually and then trimming within each, but that is a lot of manual work (n = thousands). Are there any shortcuts in Stata that I am unaware of?

    Thank you.

  • #2
    How do you want to compute these percentiles, within the age group, or globally?

    And when you say trim/remove percentiles, do you mean that you want to throw out the observations above the 99th and below the 1st percentile, or that you want to set these "extreme values" to the respective percentiles?

    Use -dataex- and show some data we can work with.
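
    For reference, the two readings would look something like this (a minimal sketch, assuming the variables are called age and score as in #1; run either (a) or (b), not both):

    Code:
    * percentiles computed within each age group
    bysort age: egen p1  = pctile(score), p(1)
    bysort age: egen p99 = pctile(score), p(99)

    * (a) throw out the extreme observations ...
    drop if score < p1 | score > p99

    * ... or (b) winsorize: set the extreme values to the respective percentiles
    generate score_w = min(max(score, p1), p99) if !missing(score)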



    • #3
      Actually, I am not in favor of trimming/removing outliers.

      That being said, I hope this code helps:

      Code:
      . sysuse auto
      (1978 Automobile Data)
      
      . by foreign, sort: sum mpg, detail
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Domestic
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           12             12
       5%           14             12
      10%           14             14       Obs                  52
      25%         16.5             14       Sum of Wgt.          52
      
      50%           19                      Mean           19.82692
                              Largest       Std. Dev.      4.743297
      75%           22             28
      90%           26             29       Variance       22.49887
      95%           29             30       Skewness       .7712432
      99%           34             34       Kurtosis       3.441459
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Foreign
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           14             14
       5%           17             17
      10%           17             17       Obs                  22
      25%           21             18       Sum of Wgt.          22
      
      50%         24.5                      Mean           24.77273
                              Largest       Std. Dev.      6.611187
      75%           28             31
      90%           35             35       Variance       43.70779
      95%           35             35       Skewness        .657329
      99%           41             41       Kurtosis        3.10734
      
      
      . bysort foreign: egen float my1 = pctile(mpg), p(1)
      
      . bysort foreign: egen float my99 = pctile(mpg), p(99)
      
      . by foreign, sort : keep if mpg > my1 & mpg < my99
      (5 observations deleted)
      
      . by foreign, sort: sum mpg, detail
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Domestic
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           14             14
       5%           14             14
      10%           14             14       Obs                  49
      25%           17             14       Sum of Wgt.          49
      
      50%           19                      Mean           19.85714
                              Largest       Std. Dev.      4.143268
      75%           22             28
      90%           26             28       Variance       17.16667
      95%           28             29       Skewness       .6493822
      99%           30             30       Kurtosis       2.815251
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Foreign
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           17             17
       5%           17             17
      10%         17.5             18       Obs                  20
      25%           21             18       Sum of Wgt.          20
      
      50%         24.5                      Mean               24.5
                              Largest       Std. Dev.      5.316311
      75%           27             30
      90%           33             31       Variance       28.26316
      95%           35             35       Skewness        .472225
      99%           35             35       Kurtosis        2.59246
      
      
      .
      The example above concerns the estimation "by" a categorical variable.

      But if you wish to estimate "by" a discrete variable (implicitly, I'm underlining that doing this by a continuous variable would be preposterous), you may do something like:

      Code:
      bysort trunk: egen float my1 = pctile(mpg), p(1)
      bysort trunk: egen float my99 = pctile(mpg), p(99)
      by trunk, sort : keep if mpg > my1 & mpg < my99
      All in all, it turns out that we are fundamentally maiming the data.

      In the first example, we lost 5 of the 74 observations (around 7% of the data). In the second example, which is closer to what you wish to accomplish, we deleted more than 50% of the data.

      What a pity...

      Hopefully that helps, and more hopefully I convinced you to eschew the idea of trimming outliers.
      Last edited by Marcos Almeida; 28 Jan 2019, 09:33.
      Best regards,

      Marcos



      • #4
        Two quite different processes are being muddled together in this thread.

        Trimming is ignoring data in the tails when summarizing: it is common, but not compulsory, to ignore the same fraction in each tail. Anyone who has ever worked with medians has carried out trimming, as the median ignores all the data except the one or two values in the middle of an ordered sample.

        Winsorizing is replacing extreme values beyond certain percentiles with those percentiles and then summarizing that version of the data.

        Consider a toy example with data 1, 2, 3, 4, 555. A 20% trimmed mean is the mean of 2, 3, 4, so 3. A 20% Winsorized mean is the mean of 2, 2, 3, 4, 4, also 3.

        I deliberately chose a toy example where the processes have the same result, because that can happen!
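
        In Stata terms, a minimal sketch of that toy example (the variable name x is purely illustrative) might run:

        Code:
        clear
        input x
        1
        2
        3
        4
        555
        end

        * 20% trimmed mean: ignore one value in each tail, then average
        summarize x
        display (r(sum) - r(min) - r(max)) / (r(N) - 2)   // mean of 2, 3, 4 = 3

        * 20% Winsorized mean: pull each tail value in to its nearest neighbour
        sort x
        generate xw = x
        replace xw = x[2]      in 1
        replace xw = x[_N - 1] in l
        summarize xw, meanonly
        display r(mean)                                   // mean of 2, 2, 3, 4, 4 = 3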

        #1 is further muddled and muddied by conflating percentiles and the bins they define.

        The first percentile, at least historically, is the value than which 1% of values are smaller and 99% are larger. People in some fields then talk about the first percentile as the bin or interval of values less than that. I wish they wouldn't, but the ambiguity doesn't really bite hard.

        I advise strongly against terminology like "removing" values. Nothing is, or should be, removed from the data any more than calculating a median implies that you remove almost all the data.

        It's singularly depressing that no-one seems to have thought of searching, and that no-one seems aware of, the following:

        Code:
        search for trimming (manual: [R] search)
        ----------------------------------------------------------------------------------------------------------------

        Search of official help files, FAQs, Examples, SJs, and STBs

        SJ-13-3 st0313  . . . . . . . . . . . . . . Speaking Stata: Trimming to taste
                (help trimmean, trimplot if installed) . . . . . . . . . . N. J. Cox
                Q3/13   SJ 13(3):640--666
                tutorial review of trimmed means, emphasizing the scope for
                trimming to varying degrees in describing and exploring data
        Last edited by Nick Cox; 28 Jan 2019, 10:04.



        • #5
          Originally posted by Joro Kolev View Post
          How do you want to compute these percentiles, within the age group, or globally?

          And when you say trim/remove percentiles, do you mean that you want to throw out the observations above the 99th and below the 1st percentile, or that you want to set these "extreme values" to the respective percentiles?

          Use -dataex- and show some data we can work with.
          Code:
           * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float makeid int age float score
          13 18 13
          17 18 17
           8 18  8
          29 18 29
          26 18 26
          41 18 41
          11 18 11
          16 18 16
          18 18 18
          13 18 13
           4 18  4
           8 18  8
          18 18 18
           6 18  6
          38 18 38
          65 18 65
          15 18 15
          10 18 10
          10 18 10
          13 18 13
          end

          I'd like to discard the values in the first and 99th percentile for now. The percentiles will be created within the age group. Not globally!
          @Marcos Almeida: I just want to see how it impacts the curve in the scatter plot after p 1 and 99 are removed.
          @Nick: Thank you for the article. I don't have much knowledge on this topic so your article is certainly helpful. Still reading through it!
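
          For that comparison, a minimal sketch (assuming the age and score variables from the -dataex- extract, and flagging the within-group tails rather than dropping them, in the spirit of #4) could be:

          Code:
          bysort age: egen p1  = pctile(score), p(1)
          bysort age: egen p99 = pctile(score), p(99)
          generate byte middle = score > p1 & score < p99 if !missing(score)

          * full data greyed out, middle portion highlighted
          twoway (scatter score age if !middle, mcolor(gs12)) ///
                 (scatter score age if middle),               ///
                 legend(order(1 "outside p1-p99" 2 "within p1-p99"))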



          • #6
            I don't know whether this is wise, or whether you have enough observations in each group for this to make any difference, but something like this should do it:

            Code:
            . egen agegroup = group(age)
            
            . summ agegroup, meanonly
            
            . forvalues i = 1/`r(max)' {
              2. _pctile score, p(1 99)
              3. drop if agegroup==`i' & (score<r(r1) | score> r(r2))
              4. }
            (0 observations deleted)
            
            . _pctile score, p(1 99)
            
            . return list
            
            scalars:
                             r(r1) =  4
                             r(r2) =  65
            
            . sort score
            
            . dis score[1]
            4
            
            . dis score[20]
            65
            
            .
            The code did not do anything because, with only 20 observations, there is nothing below the 1st percentile and nothing above the 99th percentile.



            • #7
              I made an error in the loop above: it computes percentiles relative to the whole population, which you explicitly said you do not want.

              The loop should be like this:

              Code:
              . egen agegroup = group(age)
              
              . summ agegroup, meanonly
              
              . forvalues i = 1/`r(max)' {
                2. _pctile score if agegroup==`i', p(1 99)
                3. drop if agegroup==`i' & (score<r(r1) | score> r(r2))
                4. }
              (0 observations deleted)



              • #8
                Nick Cox. Thank you for the explanation and for mentioning this article.
                Best regards,

                Marcos



                • #9
                  Originally posted by Joro Kolev View Post
                  I made an error in the loop above: it computes percentiles relative to the whole population, which you explicitly said you do not want.

                  The loop should be like this:

                  Code:
                  . egen agegroup = group(age)
                  
                  . summ agegroup, meanonly
                  
                  . forvalues i = 1/`r(max)' {
                  2. _pctile score if agegroup==`i', p(1 99)
                  3. drop if agegroup==`i' & (score<r(r1) | score> r(r2))
                  4. }
                  (0 observations deleted)
                  Thank you for your help. The code worked, but I am afraid I don't understand what you did there. Could you kindly point me to some resources to further my understanding of it?
                  Also, why is it that the second code provided by Marcos Almeida also worked? What is the difference between the one you provided and the one he provided? (Note: his worked, but a lot more values were lost than with yours.)



                  • #10
                    Marcos' code in #3 should be doing the same as my code in #7. That is, you should be getting the same results from the two procedures.

                    My code in #6 is trimming at the percentiles computed over the whole sample (not by age groups). As you said you do not want that, the code in #6 is not doing what you want.

                    If you do not understand my code in #6 and #7, but you understand Marcos' code, just carry on with Marcos' code.

                    If you want to learn more about how to do loops in Stata, have a look at the columns by Nick Cox in the Stata Journal called "Speaking Stata". I cannot give an extensive review of everything he has written on loops, but off the top of my head, search for the keywords "lists" and "repeating oneself without going mad".
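
                    For what it is worth, the same group-wise trimming can also be written with -levelsof-, which loops over the distinct ages directly and skips the -egen group()- step (a sketch in the same spirit as #7, not a substitute for reading those columns):

                    Code:
                    levelsof age, local(ages)
                    foreach a of local ages {
                        _pctile score if age == `a', p(1 99)
                        drop if age == `a' & (score < r(r1) | score > r(r2))
                    }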



                    • #11
                      Look at the inequalities. The difference between drop if value < low | value > high and keep if value > low & value < high lies in those observations with value == low or value == high, which are not dropped in the first case but are dropped (not kept) in the second.

                      As #4 explained, I think it's a bad idea to drop observations in the tails. It's hard, if not impossible, to do that consistently with regard to other aims of management and analysis. If you want to focus on some middle portion of the data, fine, but use an indicator variable for each instance.
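
                      As a sketch of both points, assuming the age and score variables from #5 (the names my1, my99, mid_strict, and mid_loose are just illustrative):

                      Code:
                      bysort age: egen my1  = pctile(score), p(1)
                      bysort age: egen my99 = pctile(score), p(99)

                      * the boundary cases differ depending on the inequalities used
                      generate byte mid_strict = score > my1  & score < my99    // values equal to a percentile are excluded
                      generate byte mid_loose  = score >= my1 & score <= my99   // values equal to a percentile are kept

                      * analyse a middle portion without discarding any observations
                      summarize score if mid_strict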

                      EDIT Thus I don't agree that you should necessarily get the same results. That's possible if the percentiles are all between actual data values, but discrepancy does not surprise me.
                      Last edited by Nick Cox; 31 Jan 2019, 08:31.

