  • Have question about winsor2 procedures

    Dear Statalists,

    I hope you are well. I would like to ask about using winsor2 to deal with outliers in my dataset. I have tried the following steps with a number of variables, but the variables have not changed, as shown in the examples.

    Example (1)

    clonevar PO_GEN_W = PO_GEN
    su PO_GEN_W , d
    winsor2 PO_GEN_W , replace cuts(1 99)
    replace PO_GEN_W =r(p99) if PO_GEN_W >=r(p99) & PO_GEN_W <.
    replace PO_GEN_W =r(p1) if PO_GEN_W >=r(p1) & PO_GEN_W <.

    . replace PO_GEN_W =r(p1) if PO_GEN_W >=r(p1) & PO_GEN_W >.
    (0 real changes made)

    . replace PO_GEN_W =r(p99) if PO_GEN_W >=r(p99) & PO_GEN_W <.
    (0 real changes made)

    Example (2)

    clonevar PO_ST_W = PO_GEN
    su R_ST_W , d
    winsor2 R_ST_W , replace cuts(1 99)
    replace R_ST_W =r(p99) if R_ST_W >=r(p99) & R_ST_W <.
    replace R_ST_W =r(p1) if R_ST_W >=r(p1) & R_ST_W <.

    . replace R_ST_W =r(p1) if R_ST_W >=r(p1) & R_ST_W >.
    (0 real changes made)

    . replace R_ST_W =r(p99) if R_ST_W >=r(p99) & R_ST_W <.
    (0 real changes made)


    su R_ST_W, d

                         Level of satisfaction
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%           .5              0       Obs                 300
    25%          1.5              0       Sum of Wgt.         300

    50%            2                      Mean               1.65
                            Largest       Std. Dev.      .6549273
    75%            2              2
    90%            2              2       Variance       .4289298
    95%            2              2       Skewness       -1.63945
    99%            2              2       Kurtosis       4.263773




    I have attached here a sample of a graph box that shows the existence of the outlier in one of the variables.


    probit Sksupprt i.FST_EXP i.FST_B i.FST_GW i.FST_AD i.FST_ADV i.R_LN i.R_ST_W i.PO_GEN i.PO_CIT i.PO_EP i.PO_EC i.FA_SE i.FA_AE i.FA_SI

    My variables are dummy and categorical variables: the former are coded 0/1 and the latter start with 0, 1, 2, ..., for 300 observations.


    Could you please help with how to apply winsor2 to the variables that have outliers, and explain why I am getting "0 real changes made" as a result?


    Many thanks for your continuous help

    Kind Regards,
    Rabab

  • #2
    winsor2 is from SSC, as you are asked to explain (FAQ Advice #12). The same place explains how you can format code and results readably using CODE delimiters. (winsor2 is not winsor, which I wrote.)

    I focus here on R_ST_W for which you give full details. I am confident that the issue is generic. (Incidentally, there is a typo in the variable label which you will want to fix.)

    Box plots really don't show your data well. My interpretation is that you have a variable coded 0, 1, 2 and that several points are 0, but less than 25%, so 0 is the minimum but not also reported as the lower quartile. But your lower quartile is 1.5 and your upper quartile 2, so your interquartile range IQR is 0.5. It follows that values of 0 are plotted as distinct points on a box plot by the rule used by Stata, as the zeros are values more than 1.5 IQR away from the nearer quartile.

    But that isn't an outlier (singular). It's several points all the same and necessarily shown in the same place on the plot.
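The arithmetic behind that rule can be checked directly. This is a minimal sketch, assuming data with frequencies 30, 45 and 225 for the values 0, 1 and 2 (the frequencies reconstructed in #6, which match the summarize output in #1); the names x and freq are illustrative only.

```stata
* Sketch: reproduce the 1.5 IQR rule on data consistent with #1
clear
input float x int freq
0  30
1  45
2 225
end
expand freq

quietly summarize x, detail
local q1  = r(p25)
local q3  = r(p75)
local iqr = `q3' - `q1'
local lower = `q1' - 1.5 * `iqr'
local upper = `q3' + 1.5 * `iqr'
display "IQR = `iqr'   lower limit = `lower'   upper limit = `upper'"

* Values beyond these limits are drawn as separate points by -graph box-
count if (x < `lower' | x > `upper') & !missing(x)
```

With quartiles 1.5 and 2 the limits are 0.75 and 2.75, so all 30 zeros fall below the lower limit and are plotted as points, all on top of one another.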

    Otherwise put, there are ties in both tails of your distribution. This is clearer in the summarize results than on the box plot.

    These ties imply that the 1st and 99th percentiles are the same as the sample minimum and maximum in your case. Again, this is explicit in the summarize results.

    In short, there is nothing wrong here. winsor2 is doing what was intended.
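To see this without relying on winsor2 itself, here is a minimal sketch of winsorizing by hand in official Stata, again assuming the frequencies 30, 45 and 225 reconstructed in #6; the variable names are illustrative. Note that the lower tail needs a "less than" comparison, unlike the replace commands in #1, and that a condition such as "> ." can never be true, because nothing is greater than system missing except extended missing values.

```stata
* Sketch: winsorize at the 1st and 99th percentiles by hand
clear
input float x int freq
0  30
1  45
2 225
end
expand freq

quietly summarize x, detail
local p1  = r(p1)
local p99 = r(p99)
display "min = " r(min) "   p1 = `p1'   p99 = `p99'   max = " r(max)

clonevar x_w = x
replace x_w = `p1'  if x_w < `p1'
replace x_w = `p99' if x_w > `p99' & !missing(x_w)

* No value lies outside [p1, p99], so nothing is changed
count if x_w != x
```

Because the 1st percentile equals the minimum and the 99th percentile equals the maximum, both replace commands report 0 real changes, which is the same outcome as in #1.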

    More broadly -- although there are differences of opinion on winsorizing -- I doubt that even its most enthusiastic advocates would see either need or value to winsorizing a categorical variable. It's nonsensical if the value is nominal scale but still usually pointless if it is ordinal scale.

    Incidentally, Stata's rules imply that winsorizing at 1% and 99% will make no difference unless the sample size is 100 or more, even if ties do not bite.

    Here is an experiment you can run.

    Code:
    clear
    
    set obs 100
    
    set seed 2803
    
    gen foo = rnormal()
    
    distinct
    
    ----------------------------
         |     total   distinct
    -----+----------------------
     foo |       100        100
    ----------------------------
    
    forval n = 1/100 {
        qui summarize foo in 1/`n', detail
        di %3.0f `n' "  " cond(r(min) == r(p1), "same", "different")
    }
    
      1  same
      2  same
      3  same
      4  same
      5  same
      6  same
      7  same
      8  same
      9  same
     10  same
     11  same
     12  same
     13  same
     14  same
     15  same
     16  same
     17  same
     18  same
     19  same
     20  same
     21  same
     22  same
     23  same
     24  same
     25  same
     26  same
     27  same
     28  same
     29  same
     30  same
     31  same
     32  same
     33  same
     34  same
     35  same
     36  same
     37  same
     38  same
     39  same
     40  same
     41  same
     42  same
     43  same
     44  same
     45  same
     46  same
     47  same
     48  same
     49  same
     50  same
     51  same
     52  same
     53  same
     54  same
     55  same
     56  same
     57  same
     58  same
     59  same
     60  same
     61  same
     62  same
     63  same
     64  same
     65  same
     66  same
     67  same
     68  same
     69  same
     70  same
     71  same
     72  same
     73  same
     74  same
     75  same
     76  same
     77  same
     78  same
     79  same
     80  same
     81  same
     82  same
     83  same
     84  same
     85  same
     86  same
     87  same
     88  same
     89  same
     90  same
     91  same
     92  same
     93  same
     94  same
     95  same
     96  same
     97  same
     98  same
     99  same
    100  different
    
    .
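The switch at 100 observations follows from the percentile rule documented for summarize: with unweighted data, if N * p / 100 is an integer, the percentile is the mean of the values at that rank and the next; otherwise it is the value at the next rank up. The sketch below checks both cases on the same simulated data, sorted first so that ranks can be read off by subscripting.

```stata
* Sketch: check the percentile rule at N = 100 and N = 99
clear
set obs 100
set seed 2803
gen foo = rnormal()
sort foo

* N = 100: 100 * 1/100 = 1 is an integer, so p1 is the mean of the two smallest values
quietly summarize foo, detail
display "p1 = " r(p1) "   mean of two smallest = " (foo[1] + foo[2]) / 2 "   min = " r(min)

* N = 99: 99 * 1/100 = 0.99 is not an integer, so p1 is the smallest value
quietly summarize foo in 1/99, detail
display "p1 = " r(p1) "   min = " r(min)
```

The same reasoning applies in the upper tail: below 100 observations the 99th percentile is the maximum, so winsorizing at 1% and 99% cannot change anything.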



    • #3
      Many thanks, Nick for your prompt reply

      So how could I solve the issue of the outliers that I think exist in my dataset, according to the stdres results (table below)? My data are categorical. The dependent variable has 213 observations with code 1, while there are only 87 observations with code 0. I think the small size of the dataset has caused the problem of outliers. Please advise me, because I am stuck with this issue: I cannot change the dataset for now or even increase the sample size.


      sum stdres

          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
            stdres |        300   -.0055664      1.1132  -2.326778     5.3328


      Firm Number   Pearson Residuals (stdres)   Deviance Residuals (dv)   Pregibon leverage (hat)
      92            3.9                          2.3                       .092
      194           4.2                          2.4                       .075
      53            5.1                          2.6                       .026
      148           5.3                          2.6                       .031

      Thank you for your help
      Rabab


      • #4
        Hi,

        I forgot to tell you Nick that I have tried to run the code you suggested but it did not work,

        . forval n = 1/100 { qui summarize foo in 1/`n', detail di %3.0f `n' " " cond(r(min) == r(p1), "same", "different")}

        I got this comment:

        program error: code follows on the same line as open brace


        Kind regards,
        Rabab



        • #5
          #3 Sorry, but I have only a dim idea of what you are showing there or understanding of what you are asking. You have residuals and leverages from some previous command(s) you do not show. The problem may lie in the data or in a poor model. Hard to say without more context.
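For readers wondering where quantities like those in #3 usually come from: the commands were not shown, so the following is only a guess at the kind of workflow involved, using the auto data rather than the poster's data. It assumes a logit model, because as far as I know the rstandard, deviance and hat options of predict are documented for logit and logistic, not for probit.

```stata
* Sketch under stated assumptions: diagnostics after logit on a toy dataset
sysuse auto, clear
logit foreign weight mpg

predict stdres, rstandard     // standardized Pearson residuals
predict dv, deviance          // deviance residuals
predict hat, hat              // Pregibon leverage

* List the observations with the largest standardized residuals
gsort -stdres
list make stdres dv hat in 1/5
```

Large residuals from such a model point to observations that the model predicts poorly, which is a statement about the model as much as about the data.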

          #4 Stata is telling you what the problem is.

          As shown in #2 this code is four commands, not one. The open brace must not be followed by any code in the same command line.

          Code:
          forval n = 1/100 {    
              qui summarize foo in 1/`n', detail    
              di %3.0f `n' "  " cond(r(min) == r(p1), "same", "different")
          }



          • #6
            I return to the example of #1 for a riff of my own on how to look at distributions that remains of some relevance to an underlying goal here of identifying possible outliers -- and should have some wider interest too.

            The data behind the graph in #1 are not given but can be reconstructed with some small detective work as follows.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float whatever int _freq
            0  30
            1  45
            2 225
            end

            Copying the code makes the example here reproducible in your Stata.

            Here's the main idea with a nod to Yudi Pawitan and his insistence that a normal quantile plot can be useful for any kind of numeric variable, even categorical variables.

            (See slide 29 in https://www.stata.com/meeting/uk16/slides/cox_uk16.pdf for the reference, but the whole presentation bears upon this thread too.)

            Using a normal distribution as a reference distribution no more implies an expectation, or even a hope, that variables will all be normally distributed than using sea level as an origin for altitudes implies that we think that the Earth is flat or that using water's freezing point as an origin for the Celsius scale implies anything about expected temperatures.

            Code:
            expand _freq
            set scheme s1color
            qnorm whatever

            [Figure: whatever.png (normal quantile plot of whatever)]

            Using a normal quantile plot (other names: normal probability plot, normal scores plot, probit plot) may well seem a little bizarre here. But the display shows clearly three distinct values and no outliers (or at least nothing I would dream of calling an outlier).

            I have no quarrel with anyone who wants to insist that a histogram (a bar chart, if you wish) showing the category frequencies is more direct and easier to think about for this variable. But in general histograms can be hard to optimize: the bin width and even the bin origin can be hard to choose well, let alone choose automatically for variables of different kinds.
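For completeness, a minimal sketch of that more direct display, assuming the same reconstructed frequencies as in the data example above. The discrete option tells histogram to use one bar per distinct value, which sidesteps the bin width and origin choices for a variable like this.

```stata
* Sketch: frequency table and discrete histogram for the reconstructed data
clear
input float whatever int _freq
0  30
1  45
2 225
end
expand _freq

tabulate whatever
histogram whatever, discrete frequency xlabel(0 1 2)
```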

            I do have a mild quarrel with anyone who wants to sell box plots as universal distribution plots. As this thread shows, they often prove puzzling or even misleading. If you want another example see https://stats.stackexchange.com/ques...ormed-suitably

            In general, box plots often omit too much or make choices for you that don't suit the data. A salutary example: generate a U-shaped distribution, draw a box plot and ask your colleagues or students to infer the distribution from the box plot. In my experience most people get it quite the wrong way round and infer a short-tailed unimodal distribution.
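One way to set up that exercise is sketched below. The choice of a beta distribution with both shape parameters 0.5 is my assumption for illustration; any U-shaped distribution would do, and the seed and sample size are arbitrary.

```stata
* Sketch: a U-shaped sample shown as a box plot and as a histogram
clear
set obs 1000
set seed 2803
gen u = rbeta(0.5, 0.5)

graph box u, name(ubox, replace)
histogram u, name(uhist, replace)
```

The box plot shows a wide box with whiskers reaching to near 0 and 1 and no outside points, while the histogram shows the piling up at both ends that the box plot hides.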

            For another example, consider the auto data. I use multqplot (Stata Journal), which itself requires qplot (also Stata Journal).


            Code:
            sysuse auto, clear
            multqplot price-foreign, trscale(invnormal(@)) xla(-2/2) yla(#4)

            [Figure: multqplot.png (quantile plots of the auto data variables from multqplot)]


            Depending a little on your monitor size, displays of around 3 x 3, 4 x 4 or 5 x 5 variables could be manageable for a first or overview scrutiny of the data, from which it is easy to see features -- some evident, some more subtle -- such as

            * no dramatically obvious outliers

            * variables such as price which are skewed and for that and other reasons might be better treated on a transformed scale

            * evident categorical variables foreign and rep78

            * granularity in a variable such as headroom

            A strategy of looking at the data and thinking carefully about what their distributions tell you beats a mechanistic Winsorizing of tails in a fear of what outliers might do.


            .




            • #7
              Dear Nick Cox,

              Many thanks for your explanation. I will consider it.

              Kind regards,
              Rabab



              • #8
                Dear Nick

                I have tried to search for references that support your point of view (your explanations in #6 on whether outliers exist with categorical variables), because I would like to use it in my analysis and methodology chapters, with supporting evidence. Unfortunately, I did not find any. Could you please recommend papers or books in this regard? I need evidence to support my approach of keeping what others may consider outliers.

                Greatly appreciate your kind support and efforts

                Kind regards,
                Rabab



                • #9
                  The arguments here on my side are very simple really, so I don't know why you seek references. I imagine that I am much older than you are but I am aware of many things that seem widely known, or even obvious, but for which it is hard to pull out literature references. Not knowing all the literature is an obvious limitation for us all, but it's wider than that. For example, I regard it as widely known that principal component analysis on social science data is usually a pointless waste of time and effort, but (surprise) most of the literature is written by those who think otherwise.

                  I could add that your inclination to winsorize categorical variables is not one that you substantiate with literature references either.

                  As I've posted elsewhere on this forum, and often, I am puzzled about winsorizing. But let's divide up the cases.

                  Binary variables, say those coded 0 and 1. Winsorizing might winsorize 0 to 1 or 1 to 0 if one category is very rare, but that sounds like something a researcher should not want in any circumstance whatsoever.

                  Nominal variables for which codes are arbitrary, say race. Here percentiles make no sense to me, and winsorizing no sense either.

                  Ordered categorical variables, say grades 1 to 5. Here percentiles make more sense, but in practice one isn't (or shouldn't be) worried about outliers -- if only because an ordered response will be treated on its own terms and an ordered predictor as a set of indicator variables.

                  If one extreme category, say 1 or 5, is very rare indeed, winsorizing might suggest pulling it in to 2 or 4, but that sounds unnecessary on various other grounds.
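The binary case is easy to demonstrate. This is a minimal sketch with made-up data, assuming 1000 observations of which only 5 are coded 1; the winsorizing is done by hand at the 1st and 99th percentiles, and the variable names are illustrative.

```stata
* Sketch: winsorizing a binary variable with a rare category
clear
set obs 1000
gen byte y = _n <= 5          // 5 ones, 995 zeros

quietly summarize y, detail
local p1  = r(p1)
local p99 = r(p99)

clonevar y_w = y
replace y_w = `p1'  if y_w < `p1'
replace y_w = `p99' if y_w > `p99' & !missing(y_w)

* The rare category has been recoded out of existence
tabulate y y_w
```

Here the 99th percentile is 0, so all five ones are recoded to 0 and the "winsorized" variable is a constant.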

                  It can be difficult to fit models if one category of a categorical variable is very rare, but winsorizing is far from the only solution -- and not the solution at all if the rare category is not an extreme. A full discussion of what to do best in that circumstance is hard to give at this point.



                  • #10
                    Dear Nick

                    I am so grateful for your kind explanations and being patient with my questions. I agree with your point of view. Thus, I will do my best to clarify it wisely in my chapter.
                    As a beginner in econometric analysis, I find it hard to reach perfection on such an issue or to arrive at a solution in a very tight time, but I am doing my best to ensure the goodness of fit of the models. I have read some papers on categorical outliers, but the suggested methods (e.g. log transformation) are for continuous variables, and I am wondering how we could apply them to categorical variables; or maybe it is not clear to me yet.


                    Thank you very much again for your support and help

                    Kind regards,
                    Rabab
