Have question about winsor2 procedures

Rabab Al hasni

Join Date: May 2019

Posts: 70
#1

Have question about winsor2 procedures

06 May 2020, 02:38

Dear Statalists,

I hope you are well. I would like to ask about using winsor2 to deal with outliers in my dataset. I have tried the following steps on a number of variables, but the variables have not changed, as shown in the examples.

Example (1)

clonevar PO_ST_W = PO_GEN
su PO_GEN_W , d
winsor2 P_GEN_W , replace cuts(1 99)
replace P_GEN_W =r(p99) if PO_GEN_W >=r(p99) & PO_GEN_W <.
replace P_GEN_W =r(p1) if PO_GEN_W >=r(p1) & PO_GEN_W <.

. replace PO_GEN_W =r(p1) if PO_GEN_W >=r(p1) & PO_GEN_W >.
(0 real changes made)

. replace PO_GEN_W =r(p99) if PO_GEN_W >=r(p99) & PO_GEN_W <.
(0 real changes made)

Example (2)

clonevar PO_ST_W = PO_GEN
su R_ST_W , d
winsor2 R_ST_W , replace cuts(1 99)
replace R_ST_W =r(p99) if R_ST_W >=r(p99) & R_ST_W <.
replace R_ST_W =r(p1) if R_ST_W >=r(p1) & R_ST_W <.

. replace R_ST_W =r(p1) if R_ST_W >=r(p1) & R_ST_W >.
(0 real changes made)

. replace R_ST_W =r(p99) if R_ST_W >=r(p99) & R_ST_W <.
(0 real changes made)

su R_ST_W, d

                    Level of satisfaction
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%           .5              0       Obs                 300
25%          1.5              0       Sum of Wgt.         300

50%            2                      Mean               1.65
                        Largest       Std. Dev.      .6549273
75%            2              2
90%            2              2       Variance       .4289298
95%            2              2       Skewness       -1.63945
99%            2              2       Kurtosis       4.263773

I have attached a sample box plot (graph box) that shows an outlier in one of the variables.

probit Sksupprt i.FST_EXP i.FST_B i.FST_GW i.FST_AD i.FST_ADV i.R_LN i.R_ST_W i.PO_GEN i.PO_CIT i.PO_EP i.PO_EC i.FA_SE i.FA_AE i.FA_SI

My variables are dummy and categorical variables: the former are coded 0 and 1, and the latter start with 0, 1, 2, ..., for 300 observations.

Could you please help with how to apply winsor2 to the variables that have outliers, and explain why I am getting "0 real changes made" as a result?

Many thanks for your continued help

Kind Regards,
Rabab
Attached Files

Graph.gph (3.1 KB, 1 view)
Nick Cox

Join Date: Mar 2014

Posts: 35697
#2

06 May 2020, 03:41

winsor2 is from SSC, as you are asked to explain (FAQ Advice #12). The same place explains how you can format code and results readably using CODE delimiters. (winsor2 is not winsor, which I wrote.)

I focus here on R_ST_W for which you give full details. I am confident that the issue is generic. (Incidentally, there is a typo in the variable label which you will want to fix.)

Box plots really don't show your data well. My interpretation is that you have a variable coded 0, 1, 2 and that several points are 0, but less than 25%, so 0 is the minimum but not also reported as the lower quartile. But your lower quartile is 1.5 and your upper quartile 2, so your interquartile range IQR is 0.5. It follows that values of 0 are plotted as distinct points on a box plot by the rule used by Stata, as the zeros are values more than 1.5 IQR away from the nearer quartile.

But that isn't an outlier (singular). It's several points all the same and necessarily shown in the same place on the plot.
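The arithmetic behind that rule can be checked directly. What follows is a minimal sketch, not the original poster's data: it assumes the frequencies 30, 45 and 225 for the values 0, 1 and 2 (the reconstruction given in #6) and uses the placeholder variable name whatever.

```stata
* Assumed data: frequencies as reconstructed in #6; "whatever" is a placeholder name
clear
input float whatever int _freq
0 30
1 45
2 225
end
expand _freq

* Quartiles as reported by summarize, then the lower 1.5 IQR fence
quietly summarize whatever, detail
local iqr = r(p75) - r(p25)
local lower = r(p25) - 1.5 * `iqr'
display "IQR = " `iqr' ", lower fence = " `lower'

* Values below the fence are the ones a box plot draws as separate points
count if whatever < `lower'
```

With a lower quartile of 1.5 and an IQR of 0.5, the lower fence is 1.5 - 1.5 * 0.5 = 0.75, so all 30 zeros fall below it and are drawn at one and the same position.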

Otherwise put, there are ties in both tails of your distribution. This is clearer in the summarize results than on the box plot.

These ties imply that the 1st and 99th percentiles are the same as the sample minimum and maximum in your case. Again, this is explicit in the summarize results.
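A minimal sketch of why the replace commands report no changes, again assuming the frequencies 30, 45 and 225 reconstructed in #6 and the placeholder name whatever. It counts the values lying strictly outside the 1st and 99th percentiles, which are the only values winsorizing at 1 and 99 could alter.

```stata
* Assumed data: frequencies as reconstructed in #6; "whatever" is a placeholder name
clear
input float whatever int _freq
0 30
1 45
2 225
end
expand _freq

quietly summarize whatever, detail
local p1 = r(p1)
local p99 = r(p99)
display "p1 = " `p1' ", min = " r(min) ", p99 = " `p99' ", max = " r(max)

* Nothing lies beyond the cut points, so there is nothing to pull in
count if whatever < `p1' | (whatever > `p99' & whatever < .)
```

The count is zero because the 1st percentile equals the minimum and the 99th percentile equals the maximum.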

In short, there is nothing wrong here. winsor2 is doing what was intended.

More broadly -- although there are differences of opinion on winsorizing -- I doubt that even its most enthusiastic advocates would see either need or value to winsorizing a categorical variable. It's nonsensical if the value is nominal scale but still usually pointless if it is ordinal scale.

Incidentally, Stata's rules imply that winsorizing at 1% and 99% will make no difference unless the sample size is 100 or more, even if ties do not bite.

Here is an experiment you can run.

Code:

clear
set obs 100
set seed 2803
gen foo = rnormal()
distinct

----------------------------
     |    total   distinct
-----+----------------------
 foo |      100        100
----------------------------

forval n = 1/100 {
    qui summarize foo in 1/`n', detail
    di %3.0f `n' " " cond(r(min) == r(p1), "same", "different")
}

  1 same
  2 same
  3 same
(output omitted: every n from 4 to 97 also shows "same")
 98 same
 99 same
100 different
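The pattern in this experiment follows from how summarize defines percentiles. Here is a minimal sketch, assuming the unweighted rule given in the Methods and formulas section of [R] summarize: with P = n * p / 100, the pth percentile is the average of the Pth and (P + 1)th smallest values if P is an integer, and otherwise the next value up. For p = 1 and n < 100, P is below 1, so the 1st percentile is the minimum; at n = 100 it becomes the average of the two smallest values. The variable foo and the seed are those of the experiment above; sorting first is a choice made here so that the two smallest values are easy to pick out, and it means the subsample in 1/99 differs from the one in the experiment.

```stata
clear
set obs 100
set seed 2803
gen foo = rnormal()
sort foo

* n = 99: P = 0.99 is not an integer, so p1 is the smallest value
quietly summarize foo in 1/99, detail
display "n =  99: p1 = " r(p1) ", min = " r(min)

* n = 100: P = 1 is an integer, so p1 averages the two smallest values
quietly summarize foo, detail
display "n = 100: p1 = " r(p1) ", by hand = " (foo[1] + foo[2]) / 2
```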
Rabab Al hasni

Join Date: May 2019

Posts: 70
#3

06 May 2020, 05:59

Many thanks, Nick, for your prompt reply.

So how could I solve the issue of the outliers that I think exist, judging by the stdres results for my dataset (table below)? My data are categorical. The dependent variable takes code 1 for 213 observations, while only 87 observations have code 0. I think the small size of the dataset has caused the problem of outliers. Please advise me, because I am stuck with this issue: I cannot change the dataset for now or even increase the sample size.

sum stdres

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      stdres |        300   -.0055664      1.1132  -2.326778     5.3328

Firm Number   Pearson Residuals (stdres)   Deviance Residuals (dv)   Pregibon leverage (hat)
92            3.9                          2.3                       .092
194           4.2                          2.4                       .075
53            5.1                          2.6                       .026
148           5.3                          2.6                       .031

Thank you for your help
Rabab
Rabab Al hasni

Join Date: May 2019

Posts: 70
#4

06 May 2020, 06:12

Hi,

I forgot to tell you, Nick, that I tried to run the code you suggested but it did not work:

. forval n = 1/100 { qui summarize foo in 1/`n', detail di %3.0f `n' " " cond(r(min) == r(p1), "same", "different")}

I got this error message:

program error: code follows on the same line as open brace

Kind regards,
Rabab
Nick Cox

Join Date: Mar 2014

Posts: 35697
#5

06 May 2020, 06:40

#3 Sorry, but I have only a dim idea of what you are showing there or understanding of what you are asking. You have residuals and leverages from some previous command(s) you do not show. The problem may lie in the data or in a poor model. Hard to say without more context.

#4 Stata is telling you what the problem is.

As shown in #2 this code is four commands, not one. The open brace must not be followed by any code in the same command line.

Code:

forval n = 1/100 {
    qui summarize foo in 1/`n', detail
    di %3.0f `n' " " cond(r(min) == r(p1), "same", "different")
}
Nick Cox

Join Date: Mar 2014

Posts: 35697
#6

08 May 2020, 08:58

I return to the example of #1 for a riff of my own on how to look at distributions that remains of some relevance to an underlying goal here of identifying possible outliers -- and should have some wider interest too.

The data behind the graph in #1 are not given but can be reconstructed with some small detective work as follows.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float whatever int _freq
0 30
1 45
2 225
end

Copying the code makes the example here reproducible in your Stata.

Here's the main idea with a nod to Yudi Pawitan and his insistence that a normal quantile plot can be useful for any kind of numeric variable, even categorical variables.

(See slide 29 in https://www.stata.com/meeting/uk16/slides/cox_uk16.pdf for the reference, but the whole presentation bears upon this thread too.)

Using a normal distribution as a reference distribution no more implies an expectation, or even a hope, that variables will all be normally distributed than using sea level as an origin for altitudes implies that we think that the Earth is flat or that using water's freezing point as an origin for the Celsius scale implies anything about expected temperatures.

Code:

expand _freq
set scheme s1color
qnorm whatever

Using a normal quantile plot (other names: normal probability plot, normal scores plot, probit plot) may well seem a little bizarre here. But the display shows clearly three distinct values and no outliers (or at least nothing I would dream of calling an outlier).

I have no quarrel with anyone who wants to insist that a histogram (a bar chart, if you wish) showing the category frequencies is more direct and easier to think about for this variable. But in general histograms can be hard to optimize: the bin width and even the bin origin can be hard to choose well, let alone choose automatically for variables of different kinds.
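For anyone who prefers that more direct view, here is a minimal sketch using the same reconstructed frequencies (30, 45 and 225 for the values 0, 1 and 2) and the placeholder name whatever. The option choices are only one reasonable set, not a recommendation from the thread.

```stata
* Assumed data: frequencies as reconstructed above; "whatever" is a placeholder name
clear
input float whatever int _freq
0 30
1 45
2 225
end
expand _freq

* Frequency table of the three categories
tabulate whatever

* One bar per distinct value, heights shown as counts
histogram whatever, discrete frequency xlabel(0 1 2)
```

With discrete, each distinct value gets its own bar, so no bin width or bin origin has to be chosen for a variable like this one.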

I do have a mild quarrel with anyone who wants to sell box plots as universal distribution plots. As this thread shows, they often prove puzzling or even misleading. If you want another example see https://stats.stackexchange.com/ques...ormed-suitably

In general, box plots often omit too much or make choices for you that don't suit the data. A salutary example: generate a U-shaped distribution, draw a box plot and ask your colleagues or students to infer the distribution from the box plot. In my experience most people get it quite the wrong way round and infer a short-tailed unimodal distribution.
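One way to set up that exercise is sketched below. The choice of a beta distribution with both shape parameters equal to 0.5 (which is U-shaped on the unit interval), the sample size, the seed and the names u, ubox and uhist are all assumptions made for illustration.

```stata
clear
set obs 1000
set seed 12345

* Beta(0.5, 0.5) piles up near 0 and 1 and is sparse in the middle
gen u = rbeta(0.5, 0.5)

* The box plot alone gives little hint of the two modes at the extremes
graph box u, name(ubox, replace)

* The histogram shows the U shape directly
histogram u, name(uhist, replace)
```

Showing the box plot first and the histogram afterwards makes the contrast between the inferred and the actual shape easy to see.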

For another example, consider the auto data. I use multqplot (Stata Journal), which itself requires qplot (also Stata Journal).

Code:

sysuse auto, clear
multqplot price-foreign, trscale(invnormal(@)) xla(-2/2) yla(#4)

Depending a little on your monitor size, displays of around 3 x 3, 4 x 4 or 5 x 5 variables could be manageable for a first or overview scrutiny of the data, from which it is easy to see features -- some evident, some more subtle -- such as

* no dramatically obvious outliers

* variables such as price which are skewed and for that and other reasons might be better treated on a transformed scale

* evident categorical variables foreign and rep78

* granularity in a variable such as headroom

A strategy of looking at the data and thinking carefully about what their distributions tell you beats a mechanistic Winsorizing of tails in a fear of what outliers might do.

Rabab Al hasni

Join Date: May 2019

Posts: 70
#7

11 May 2020, 14:24

Dear Nick Cox,

Many thanks for your explanation. I will consider it.

Kind regards,
Rabab
Rabab Al hasni

Join Date: May 2019

Posts: 70
#8

20 May 2020, 13:37

Dear Nick

I have tried to search for references that support your point of view (your explanations in #6 on whether outliers exist in categorical variables), because I would like to use it in my analysis and methodology chapters, but with supporting evidence. Unfortunately, I did not find any references. Could you please recommend papers or books on this? I need evidence to support my approach of keeping what others may consider to be outliers.

Greatly appreciate your kind support and efforts

Kind regards,
Rabab
Nick Cox

Join Date: Mar 2014

Posts: 35697
#9

21 May 2020, 02:11

The arguments here on my side are very simple really, so I don't know why you seek references. I imagine that I am much older than you are but I am aware of many things that seem widely known, or even obvious, but for which it is hard to pull out literature references. Not knowing all the literature is an obvious limitation for us all, but it's wider than that. For example, I regard it as widely known that principal component analysis on social science data is usually a pointless waste of time and effort, but (surprise) most of the literature is written by those who think otherwise.

I could add that your inclination to winsorize categorical variables is not one that you substantiate with literature references either.

As I've posted elsewhere on this forum, and often, I am puzzled about winsorizing. But let's divide up the cases.

Binary variables, say those coded 0 and 1. Winsorizing might winsorize 0 to 1 or 1 to 0 if one category is very rare, but that sounds like something a researcher should not want in any circumstance whatsoever.
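A minimal sketch of that binary case with made-up data: 1000 observations of which only 5 are coded 1. The variable names rare and rare_w are hypothetical, and the min()/max() line is a hand-rolled stand-in for winsorizing at the 1st and 99th percentiles, not a call to winsor2.

```stata
clear
set obs 1000

* Made-up indicator: only the last 5 of 1000 observations are 1
gen byte rare = _n > 995

* With 995 zeros, both the 1st and the 99th percentile are 0
quietly summarize rare, detail
local p1 = r(p1)
local p99 = r(p99)
display "p1 = " `p1' ", p99 = " `p99'

* Pull values beyond the cut points in to the cut points
gen byte rare_w = max(min(rare, `p99'), `p1')
tabulate rare rare_w
```

Every 1 is recoded to 0, so the rare category disappears and the variable becomes a constant.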

Nominal variables for which codes are arbitrary, say race. Here percentiles make no sense to me, and winsorizing no sense either.

Ordered categorical variables, say grades 1 to 5. Here percentiles make more sense, but in practice one isn't (or shouldn't be) worried about outliers -- if only because an ordered response will be treated on its own terms and an ordered predictor as a set of indicator variables.

If one extreme category, say 1 or 5, is very rare indeed, winsorizing might suggest pulling it in to 2 or 4, but that sounds unnecessary on various other grounds.

It can be difficult to fit models if one category of a categorical variable is very rare, but winsorizing is far from the only solution -- and not the solution at all if the rare category is not an extreme. A full discussion of what to do best in that circumstance is hard to give at this point.
Rabab Al hasni

Join Date: May 2019

Posts: 70
#10

22 May 2020, 16:27

Dear Nick

I am very grateful for your kind explanations and for being patient with my questions. I agree with your point of view, so I will do my best to explain it carefully in my chapter.
As a beginner in econometric analysis, I do not find it easy to get such issues exactly right or to reach a solution in a very short time, but I am doing my best to ensure the goodness of fit of the models. I have read some papers on outliers in categorical data, but the suggested methods (e.g. log transformation) are meant for continuous variables, and I am wondering how we could apply them to categorical variables; or maybe it is just not clear to me yet.

Thank you very much again for your support and help

Kind regards,
Rabab