
  • Trimming percentiles within each age category

    Hello Stata users,
    I have come across code for trimming data globally, but not what I am specifically looking for. I have two variables - age and a score. I want to remove the 1st and 99th percentile of the score within each age group. For instance, I'd like to remove the 1st and 99th percentile of the score for everyone who is 18 years old. Then I'd do the same for the 19-year-olds in the dataset, and so on up to 110 years old.

    I have thought of creating age categories individually and then trimming within each, but that is a lot of manual work (n = thousands). Are there any shortcuts in Stata that I am unaware of?

    Thank you.

  • #2
    How do you want to compute these percentiles, within the age group, or globally?

    And when you say trim/remove percentiles, do you mean that you want to throw out the observations above the 99th and below the 1st percentile, or that you want to set these "extreme values" to the respective percentiles?

    Use -dataex- and show some data we can work with.
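
    For reference, the two readings would look something like this (a minimal sketch, assuming the variables are called age and score as in #1; run either (a) or (b), not both):

    Code:
    * percentiles computed within each age group
    bysort age: egen p1  = pctile(score), p(1)
    bysort age: egen p99 = pctile(score), p(99)

    * (a) throw out the extreme observations ...
    drop if score < p1 | score > p99

    * ... or (b) winsorize: set the extreme values to the respective percentiles
    generate score_w = min(max(score, p1), p99) if !missing(score)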



    • #3
      Actually, I am not in favor of trimming/removing outliers.

      That being said, I hope this code helps:

      Code:
      . sysuse auto
      (1978 Automobile Data)
      
      . by foreign, sort: sum mpg, detail
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Domestic
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           12             12
       5%           14             12
      10%           14             14       Obs                  52
      25%         16.5             14       Sum of Wgt.          52
      
      50%           19                      Mean           19.82692
                              Largest       Std. Dev.      4.743297
      75%           22             28
      90%           26             29       Variance       22.49887
      95%           29             30       Skewness       .7712432
      99%           34             34       Kurtosis       3.441459
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Foreign
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           14             14
       5%           17             17
      10%           17             17       Obs                  22
      25%           21             18       Sum of Wgt.          22
      
      50%         24.5                      Mean           24.77273
                              Largest       Std. Dev.      6.611187
      75%           28             31
      90%           35             35       Variance       43.70779
      95%           35             35       Skewness        .657329
      99%           41             41       Kurtosis        3.10734
      
      
      . bysort foreign: egen float my1 = pctile(mpg), p(1)
      
      . bysort foreign: egen float my99 = pctile(mpg), p(99)
      
      . by foreign, sort : keep if mpg > my1 & mpg < my99
      (5 observations deleted)
      
      . by foreign, sort: sum mpg, detail
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Domestic
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           14             14
       5%           14             14
      10%           14             14       Obs                  49
      25%           17             14       Sum of Wgt.          49
      
      50%           19                      Mean           19.85714
                              Largest       Std. Dev.      4.143268
      75%           22             28
      90%           26             28       Variance       17.16667
      95%           28             29       Skewness       .6493822
      99%           30             30       Kurtosis       2.815251
      
      ----------------------------------------------------------------------------------------------------------
      -> foreign = Foreign
      
                              Mileage (mpg)
      -------------------------------------------------------------
            Percentiles      Smallest
       1%           17             17
       5%           17             17
      10%         17.5             18       Obs                  20
      25%           21             18       Sum of Wgt.          20
      
      50%         24.5                      Mean               24.5
                              Largest       Std. Dev.      5.316311
      75%           27             30
      90%           33             31       Variance       28.26316
      95%           35             35       Skewness        .472225
      99%           35             35       Kurtosis        2.59246
      
      
      .
      The example above concerns the estimation "by" a categorical variable.

      But if you wish to estimate "by" a discrete variable (implicitly, I'm underlining that doing this by a continuous variable would be preposterous), you may do something like:

      Code:
      bysort trunk: egen float my1 = pctile(mpg), p(1)
      bysort trunk: egen float my99 = pctile(mpg), p(99)
      by trunk, sort : keep if mpg > my1 & mpg < my99
      All in all, it turns out that we are fundamentally maiming the data.

      In the first example, we lost 5 of the 74 observations (around 7% of the data). In the second example, which is closer to what you wish to accomplish, we deleted more than 50% of the data.

      What a pity...

      Hopefully that helps, and more hopefully I convinced you to eschew the idea of trimming outliers.
      Last edited by Marcos Almeida; 28 Jan 2019, 09:33.
      Best regards,

      Marcos



      • #4
        Two quite different processes are being muddled together in this thread.

        Trimming is ignoring data in the tails when summarizing: it is common, but not compulsory, to ignore the same fraction in each tail. Anyone who has ever worked with medians has carried out trimming, as the median ignores all the data except the one or two values in the middle of an ordered sample.

        Winsorizing is replacing extreme values beyond certain percentiles with those percentiles and then summarizing that version of the data.

        Consider a toy example with data 1, 2, 3, 4, 555. A 20% trimmed mean is the mean of 2, 3, 4, so 3. A 20% Winsorized mean is the mean of 2, 2, 3, 4, 4, also 3.

        I deliberately chose a toy example where the processes have the same result, because that can happen!
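
        In Stata terms, a minimal sketch of that toy example (the variable name x is purely illustrative) might run:

        Code:
        clear
        input x
        1
        2
        3
        4
        555
        end

        * 20% trimmed mean: ignore one value in each tail, then average
        summarize x
        display (r(sum) - r(min) - r(max)) / (r(N) - 2)   // mean of 2, 3, 4 = 3

        * 20% Winsorized mean: pull each tail value in to its nearest neighbour
        sort x
        generate xw = x
        replace xw = x[2]      in 1
        replace xw = x[_N - 1] in l
        summarize xw, meanonly
        display r(mean)                                   // mean of 2, 2, 3, 4, 4 = 3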

        #1 is further muddled and muddied by conflating percentiles and the bins they define.

        The first percentile, at least historically, is the value than which 1% of values are smaller and 99% are larger. People in some fields then talk about the first percentile as the bin or interval of values less than that. I wish they wouldn't, but the ambiguity doesn't really bite hard.

        I advise strongly against terminology like "removing" values. Nothing is, or should be, removed from the data any more than calculating a median implies that you remove almost all the data.

        It's singularly depressing that no-one seems to have thought of searching, and that no-one seems aware of, the following:

        Code:
        search for trimming (manual: [R] search)
        ----------------------------------------------------------------------------------------------------------------

        Search of official help files, FAQs, Examples, SJs, and STBs

        SJ-13-3 st0313  . . . . . . . . . . . . . . Speaking Stata: Trimming to taste
                (help trimmean, trimplot if installed) . . . . . . . . . . N. J. Cox
                Q3/13   SJ 13(3):640--666
                tutorial review of trimmed means, emphasizing the scope for
                trimming to varying degrees in describing and exploring data
        Last edited by Nick Cox; 28 Jan 2019, 10:04.



        • #5
          Originally posted by Joro Kolev View Post
          How do you want to compute these percentiles, within the age group, or globally?

          And when you say trim/remove percentiles, do you mean that you want to throw out the observations above the 99th and below the 1st percentile, or that you want to set these "extreme values" to the respective percentiles?

          Use -dataex- and show some data we can work with.
          Code:
           * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float makeid int age float score
          13 18 13
          17 18 17
           8 18  8
          29 18 29
          26 18 26
          41 18 41
          11 18 11
          16 18 16
          18 18 18
          13 18 13
           4 18  4
           8 18  8
          18 18 18
           6 18  6
          38 18 38
          65 18 65
          15 18 15
          10 18 10
          10 18 10
          13 18 13
          end

          I'd like to discard the values in the first and 99th percentile for now. The percentiles will be created within the age group. Not globally!
          @Marcos Almeida: I just want to see how it impacts the curve in the scatter plot after p 1 and 99 are removed.
          @Nick: Thank you for the article. I don't have much knowledge on this topic so your article is certainly helpful. Still reading through it!
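
          For that comparison, a minimal sketch (assuming the age and score variables from the -dataex- extract, and flagging the within-group tails rather than dropping them, in the spirit of #4) could be:

          Code:
          bysort age: egen p1  = pctile(score), p(1)
          bysort age: egen p99 = pctile(score), p(99)
          generate byte middle = score > p1 & score < p99 if !missing(score)

          * full data greyed out, middle portion highlighted
          twoway (scatter score age if !middle, mcolor(gs12)) ///
                 (scatter score age if middle),               ///
                 legend(order(1 "outside p1-p99" 2 "within p1-p99"))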



          • #6
            I don't know whether this is wise, or whether you have enough observations in each group for this to make any difference, but something like this should do it:

            Code:
            . egen agegroup = group(age)
            
            . summ agegroup, meanonly
            
            . forvalues i = 1/`r(max)' {
              2. _pctile score, p(1 99)
              3. drop if agegroup==`i' & (score<r(r1) | score> r(r2))
              4. }
            (0 observations deleted)
            
            . _pctile score, p(1 99)
            
            . return list
            
            scalars:
                             r(r1) =  4
                             r(r2) =  65
            
            . sort score
            
            . dis score[1]
            4
            
            . dis score[20]
            65
            
            .
            The code did not do anything because, with only 20 observations, there is nothing below the 1st percentile and nothing above the 99th percentile.



            • #7
              I made an error in the loop above: it computes percentiles relative to the whole population, which you explicitly said you do not want.

              The loop should be like this:

              Code:
              . egen agegroup = group(age)
              
              . summ agegroup, meanonly
              
              . forvalues i = 1/`r(max)' {
                2. _pctile score if agegroup==`i', p(1 99)
                3. drop if agegroup==`i' & (score<r(r1) | score> r(r2))
                4. }
              (0 observations deleted)



              • #8
                Nick Cox. Thank you for the explanation and for mentioning this article.
                Best regards,

                Marcos



                • #9
                  Originally posted by Joro Kolev View Post
                  I made an error in the loop above: it computes percentiles relative to the whole population, which you explicitly said you do not want.

                  The loop should be like this:

                  Code:
                  . egen agegroup = group(age)
                  
                  . summ agegroup, meanonly
                  
                  . forvalues i = 1/`r(max)' {
                  2. _pctile score if agegroup==`i', p(1 99)
                  3. drop if agegroup==`i' & (score<r(r1) | score> r(r2))
                  4. }
                  (0 observations deleted)
                  Thank you for your help. The code worked, but I am afraid I don't understand what you did there. Could you kindly point me to some resources to further my understanding of it?
                  Also, why is it that the second code provided by Marcos Almeida also worked? What is the difference between the one you provided and the one he provided? (Note: his worked, but a lot more values were lost than with yours.)



                  • #10
                    Marcos' code in #3 should be doing the same as my code in #7. That is, you should be getting the same results from the two procedures.

                    My code in #6 is trimming at the percentiles computed over the whole sample (not by age groups). As you said you do not want that, the code in #6 is not doing what you want.

                    If you do not understand my code in #6 and #7, but you understand Marcos' code, just carry on with Marcos' code.

                    If you want to learn more about how to do loops in Stata, have a look at the columns by Nick Cox in the Stata Journal called "Speaking Stata". I cannot give an extensive review of everything he has written on loops, but off the top of my head, search for the keywords "lists" and "repeating oneself without going mad".
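
                    For what it is worth, the same group-wise trimming can also be written with -levelsof-, which loops over the distinct ages directly and skips the -egen group()- step (a sketch in the same spirit as #7, not a substitute for reading those columns):

                    Code:
                    levelsof age, local(ages)
                    foreach a of local ages {
                        _pctile score if age == `a', p(1 99)
                        drop if age == `a' & (score < r(r1) | score > r(r2))
                    }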



                    • #11
                      Look at the inequalities. The difference between drop if value < low | value > high and keep if value > low & value < high lies in those observations with value == low or value == high, which are not dropped in the first case but are dropped (not kept) in the second.

                      As #4 explained, I think it's a bad idea to drop observations in the tails. It's hard, if not impossible, to do that consistently with regard to other aims of management and analysis. If you want to focus on some middle portion of the data, fine, but use an indicator variable for each instance.
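
                      As a sketch of both points, assuming the age and score variables from #5 (the names my1, my99, mid_strict, and mid_loose are just illustrative):

                      Code:
                      bysort age: egen my1  = pctile(score), p(1)
                      bysort age: egen my99 = pctile(score), p(99)

                      * the boundary cases differ depending on the inequalities used
                      generate byte mid_strict = score > my1  & score < my99    // values equal to a percentile are excluded
                      generate byte mid_loose  = score >= my1 & score <= my99   // values equal to a percentile are kept

                      * analyse a middle portion without discarding any observations
                      summarize score if mid_strict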

                      EDIT Thus I don't agree that you should necessarily get the same results. That's possible if the percentiles are all between actual data values, but discrepancy does not surprise me.
                      Last edited by Nick Cox; 31 Jan 2019, 08:31.

