Dealing with outliers/extreme values by group

Anne-Claire Jo

Join Date: Feb 2021

Posts: 162
#1


02 Jun 2025, 05:09

Hello,

First of all, I found that, due to a bug, I had posted multiple posts/replies on similar questions in this forum. Sorry for the mess and confusion.

In Stata, I am trying to run robustness checks for outliers (or extreme values) by group, and I am running into problems implementing them.

First, I would like to trim/winsorize by group using this code:

Code:

foreach var in `varlist' {
    bys group: replace `var' = `r(p1)' if `var' < `r(p1)'
    bys group: replace `var' = `r(p99)' if `var' > `r(p99)'
}

However, the error says:

Code:

"if not found"

When I try only this:

Code:

foreach var in `varlist' {
    replace `var' = `r(p1)' if `var' < `r(p1)'
    replace `var' = `r(p99)' if `var' > `r(p99)'
}

it runs perfectly fine, but I want to do it by group. Could someone help me with this?

Second, I am running multiple regressions on panel data (with multiple countries, sectors, and years), and I would like to ensure that, for each file, the regressions for the different LHS variables are based on a common sample (i.e., that results across LHS variables are not driven by differences in sample). For instance, with a list of LHS variables var1 var2 var3, all three should be treated as missing whenever any one of var1, var2, or var3 is missing. This way the sample doesn't depend on the fixed effects and should be consistent. Doing this also lets me check that I am not losing too many observations. I would like to implement this, but I don't have much idea how to write it in Stata code. Could someone give me any insights on the structure/code that I can use, please?

Thanks so much in advance!
Felix Bittmann

Join Date: Aug 2018

Posts: 691
#2

02 Jun 2025, 10:47

There are two problems with your code. First, you want to access the returned results r(p1) and r(p99), but these are never computed in your code, as there is no call to summarize. Second, when replacing values larger than r(p99), you will also replace missing values with the p99 value, because Stata treats missing values as larger than any number. This is a potentially fatal error. You can solve this with a double loop like this:

Code:

sysuse nlsw88, clear
levelsof industry, local(groups)
foreach G of local groups {
    foreach VAR of varlist wage hours {
        sum `VAR' if industry == `G', det
        replace `VAR' = `r(p1)' if industry == `G' & `VAR' < `r(p1)'
        replace `VAR' = `r(p99)' if industry == `G' & `VAR' > `r(p99)' & !missing(`VAR')
    }
}

Here, industry is the group variable and wage and hours are being winsorized.
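As a possible alternative to the double loop, and only as a sketch: the group-specific percentiles can be stored in helper variables with egen's pctile() function under bysort, so that a single loop over the variables is enough. The helper names p1_* and p99_* are made up for this illustration, and it would be worth checking the resulting percentiles against summarize, detail on a few groups before relying on them.

```stata
sysuse nlsw88, clear

foreach v of varlist wage hours {
    // group-specific 1st and 99th percentiles, held in helper variables
    bysort industry: egen double p1_`v'  = pctile(`v'), p(1)
    bysort industry: egen double p99_`v' = pctile(`v'), p(99)

    // winsorize within group; observations with a missing value of
    // `v' or a missing industry are left untouched
    replace `v' = p1_`v'  if `v' < p1_`v'  & !missing(`v', industry)
    replace `v' = p99_`v' if `v' > p99_`v' & !missing(`v', industry)

    drop p1_`v' p99_`v'
}
```

Note that bysort treats observations with missing industry as a group of their own, whereas levelsof skips them. The !missing(`v', industry) condition is there so that those observations are not changed, which matches the behaviour of the double loop.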

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Nick Cox

Join Date: Mar 2014

Posts: 35681
#3

02 Jun 2025, 11:13

Cross-posted (and answered) at https://stackoverflow.com/questions/...alues-by-group

We ask that you tell us about cross-posting.

Felix Bittmann makes an excellent point about missings.
Anne-Claire Jo

Join Date: Feb 2021

Posts: 162
#4

02 Jun 2025, 14:33

Thank you, Felix and Nick, for your replies; they are very helpful.
And thanks for your advice on cross-posting.

Felix Bittmann, I have an additional clarifying question:
if I use

Code:

foreach var of varlist {
    bys group: replace `var' = r(p1) if `var' < r(p1)
    bys group: replace `var' = r(p99) if `var' < r(p99) & !missing(`var')
}

then is this equivalent to your code from #2?
Nick Cox

Join Date: Mar 2014

Posts: 35681
#5

02 Jun 2025, 15:24

Absolutely not equivalent, for the reason Felix explained in this thread and I explained on Stack Overflow. You have fixed one problem only, missing values.
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#6

02 Jun 2025, 15:29

On the assumption that both Nick Cox and Felix Bittmann are asleep at this time of day in Europe, I'll jump in here.

No, the code you show in #4 is definitely not equivalent to the code in #2. The code in #2 is correct for your purpose; that in #4 is not.

The difference is that in #2, there are two nested loops. Consequently, the -summarize- command is executed separately for each group, as well as separately for each variable VAR, and the -replace- commands then compare the values of the current VAR to the current `r(p1)' and `r(p99)'.

By contrast, in #4, you have only one loop, with some -by group- commands, which are like loops, but are not exactly the same thing, nested within it. It contains no -summarize- command at all. (Consequently `r(p1)' and `r(p99)' are undefined and will cause syntax error messages.) Even if you add a -summarize- command on the line after -foreach var of varlist-, for each var it will be executed only once for the entire data set, not by group. Consequently the values of `r(p1)' and `r(p99)' referred to in the -replace `var' = ...- commands will not be group specific.
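To see the difference concretely, here is a small sketch on the nlsw88 data used in #2 (the choice of wage is only for illustration). A single -summarize- over the whole data set leaves one pooled r(p99) behind, whereas restricting -summarize- to one group at a time gives a separate r(p99) for each group.

```stata
sysuse nlsw88, clear

// one -summarize- over the whole data set: a single pooled p99
quietly summarize wage, detail
display "pooled p99 = " r(p99)

// -summarize- restricted to one group at a time: one p99 per group
levelsof industry, local(groups)
foreach g of local groups {
    quietly summarize wage if industry == `g', detail
    display "industry `g': p99 = " r(p99)
}
```

Comparing the displayed values shows how far a pooled cutoff can sit from the cutoff that is appropriate for a particular group.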

Added: Crossed with #5, and apparently Nick is still up!
Anne-Claire Jo

Join Date: Feb 2021

Posts: 162
#7

03 Jun 2025, 02:06

Nick Cox, Clyde Schechter: thank you all for your kind and detailed replies.
They are indeed super helpful and I now get the point. I appreciate it a lot.
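The second question in #1 (a common estimation sample across several LHS variables) is not addressed in the replies above. A minimal sketch of one way to approach it follows, under loud assumptions: the nlsw88 data and the variables wage, hours, tenure, age, and grade are stand-ins for the actual panel and its var1 var2 var3, and the names n_miss and common are made up for this illustration. The idea is to flag the observations that are complete on every model variable and to restrict each regression to that flag.

```stata
sysuse nlsw88, clear

// stand-ins (assumptions): three LHS variables and two regressors
local lhs wage hours tenure
local rhs age grade

// count missing values per observation across all model variables
egen byte n_miss = rowmiss(`lhs' `rhs')
generate byte common = (n_miss == 0)

// see how many observations the common-sample restriction costs
tabulate common

// every regression below uses exactly the same observations
foreach y of local lhs {
    regress `y' `rhs' if common
    display "`y': N = " e(N)
}
```

Including the regressors in rowmiss() matters, because a regressor with missing values would otherwise shrink the sample of every regression beyond the flag. With estimators that drop further observations on their own (for example, singleton groups in some fixed-effects commands), the reported N should still be compared across regressions.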