  • Trimming and Winsorizing outliers by group - running regression

    Hello,

    I have some general questions:
    I've read some documents and forum posts about trimming and winsorizing outliers, but I have a bit of trouble understanding the concept.
    I understand that they are used for either ignoring or removing the outliers when computing summary statistics.
    However, are they also applicable when we want to run regressions?
    For instance, suppose we want to run a regression without these outliers (say, below p1 or above p99 within each group). After generating trimmed/winsorized variables with code like
    Code:
     winsor2 varofinterest, suffix(_w) cuts(1 99) by(group)
    for winsorizing, OR something like
    Code:
    summarize var, detail
    gen newvar = var if var >= r(p1) & var <= r(p99)
    for trimming, can we then run a regression with these new variables (something like reg depvar varofinterest_w controlvar ...)? And does it even make sense to do so?
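
    To make the question concrete, here is the full pipeline I have in mind (all names here -- depvar, varofinterest, controlvar, group -- are just placeholders):
    Code:
    * winsorize within each group at the 1st/99th percentiles, then regress
    winsor2 varofinterest, suffix(_w) cuts(1 99) by(group)
    reg depvar varofinterest_w controlvar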

    Maybe I have misunderstood the whole concept of trimming and winsorizing.

    Thank you in advance.

    P.S. I am encountering issues with the winsor2 command, and I'm not sure if I am the only one experiencing them.
    I tried to do
    Code:
    . ssc install winsor2
    connection timed out
    http://fmwww.bc.edu/repec/bocode/w/ either
      1)  is not a valid URL, or
      2)  could not be contacted, or
      3)  is not a Stata download site (has no stata.toc file).
    r(2);
    
    . winsor2 var, suffix(_w) cuts(1 99) by( group)
    command winsor2 is unrecognized
    r(199);
    so I tried another command:
    Code:
    . winsor var, gen(var_w) p(0.01)
    which works well.
    Does anyone have insights on this issue?
    Last edited by Anne-Claire Jo; 30 May 2025, 09:54.

  • #2
    I was able to ssc install winsor2 without any problems just now. I think you (or their server) may have experienced a temporary internet connectivity issue; just try again?

    I am generally against removing or winsorising outliers, unless you have a very strong reason to believe they occur because of mistakes in data entry etc. Indeed, depending on the problem, the tails / "outliers" could be the most interesting part of the relationship: if you're an economist, for instance, you would have noticed a lot of recent discussion about the rapidly increasing income and wealth disparities between the top 1% and the rest, or even the top 0.01% and the top 1% of the distribution, in places like the US and India.

    • #3
      I strongly agree with Hemanshu Kumar's remarks about dealing with outliers.

      And I will add that there are additional problems with it in regression analysis. By removing the outliers, you are decreasing the variance of the variables. This has the consequence that for variables used as explanatory (independent, predictor, right hand side) variables, their regression coefficients will be biased away from zero. And if you also do this to the outcome (dependent, left hand side) variable, that will bias all the regression coefficients towards zero. If you have done it to both the outcome and some or all explanatory variables, it is not predictable which bias will "win out" or whether the two biases will cancel each other.
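
      A quick simulation sketch illustrates the outcome-side attenuation (the setup is purely illustrative):
      Code:
      * simulate y = 2x + noise, winsorize the outcome, and compare slopes
      clear
      set obs 1000
      set seed 12345
      gen x = rnormal()
      gen y = 2*x + rnormal()
      quietly summarize y, detail
      gen y_w = min(max(y, r(p1)), r(p99))   // winsorize y at p1/p99
      regress y x       // slope near the true value of 2
      regress y_w x     // slope attenuated toward zero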

      There is another important consideration in the regression context. If you are doing a regression for the purpose of developing a predictive model of the outcome variable, and you remove outliers from the outcome variable, it becomes impossible to generalize the conclusions of the model. This is because your model is then only applicable to observations whose outcome values are not outliers. But you cannot use it to predict the outcome in the general population, because you cannot know in advance which units have outlier outcomes. So you never know to which analytic units in the population your model can properly be applied.

      Outliers themselves are not necessarily problems. If your sample is large enough, you should expect there to be some. In fact, if you have a large sample with no outliers, that itself may be a red flag about your data! Outliers should be checked to make sure they are not data errors. But if they are correct, they should not be removed. Sometimes it is desirable to decrease the amount of variation in a variable that has many outliers by, for example, a logarithmic or other transform. Income variables and health care cost variables, for example, are often treated this way. Arguably, winsorizing is also a transformation, but it is more destructive of information than bijective functions like the logarithm, square root, cube root, etc.
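
      For example, a minimal sketch of the transformation approach, assuming a strictly positive variable cost and placeholder regressors x1 and x2:
      Code:
      * compress the long right tail instead of deleting observations
      gen ln_cost = ln(cost)
      regress ln_cost x1 x2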

      • #4
        I agree with both Hemanshu Kumar and Clyde Schechter, but would actually go further - unless you know, to at least a close approximation, the process that generated the data, it is not possible to determine what is and what is not an outlier; for example, say the data are actually lognormally distributed but you use a normal distribution-based procedure for determining what is an outlier - you will find lots of "outliers", but most, at least, are nothing of the kind
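
        for example, a quick sketch of that lognormal scenario (purely illustrative):
        Code:
        * run genuinely lognormal data through a normal-based 3-SD rule
        clear
        set obs 10000
        set seed 2025
        gen z = exp(rnormal())                // data are lognormal, no errors
        quietly summarize z
        count if abs(z - r(mean)) > 3*r(sd)   // flags far more than the ~0.3% a normal model implies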

        on the other hand, if you want to use trimming and/or winsorizing as a sensitivity analysis, that is acceptable as long as you are careful about writing up what you did

        • #5
          Thank you all for your replies!
          Actually, I don't really have a choice on this issue, because my supervisor asked me to do it. ;(
          Given that I still have to implement these methods, could someone please go back to my initial questions and explain the underlying concepts? Maybe my question wasn't super clear either -- sorry for my bad English.
          I'm still confused about the whole idea of trimming and winsorizing, and about using these variables in a regression: are they used for running regressions as well?

          • #6
            I'd say that #5 is essentially backwards. It's optimistic to expect that people who've expressed firm scepticism about taking this route at all, and discouraged it, would feel any commitment to explain further what it implies. I'd say that we collectively delegate this task back to your supervisor, who surely has some responsibility either to explain what they want you to do -- and why -- or to give at least minimal advice on how you should find out. It may be that the received model of supervision wherever you are is supervision by sustained neglect.

            Adding a personal twist, given previous work of mine: I would say that trimming has a role in robust summarization -- trimmed means are a family whose limiting cases are the mean and the median -- and that Winsorizing could be used in that way. (I originally wrote winsor from SSC because someone asked how to do it, but I've often regretted that since.)

            Applying either in regression is a different ball game, and I add my own scepticism to that of Hemanshu Kumar, Clyde Schechter, and Rich Goldstein.

            The minimum needed here appears to include

            1. A rationale for knowing that outliers are present in your data and really should be disregarded or downweighted.

            2. Confidence that other methods -- robust regression, quantile regression, transformation, generalized linear models, and so on (see the sketch after this list) -- would not work better.

            3. Confidence that outlier detection can be cast as a univariate problem. The difficulty is that trimming or Winsorizing one variable at a time could zap the wrong observations, as (for example) what appears to be an outlier in a univariate display could make perfect sense in the context of a scatter plot or scatter plot matrix, and conversely that bivariate outliers won't always be univariate extremes, and so on.

            4. Knowing which variables to trim or Winsorize, which itself may not be clear. It would be absurd to trim or Winsorize (0, 1) indicator variables, or even (1, 2, 3, 4, 5) ordered or graded variables, so where do you draw the line?

            5. Being willing to carry out extensive sensitivity analysis of the effects, all the way from no trimming or Winsorizing to very heavy versions of either.

            6. Being confident of how to adjust inferences for what you've done to the data.

            7. That can't be a complete list, so here's everything else to worry about.
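
            As a sketch of the alternatives in point 2 (y and x are placeholder names):
            Code:
            * built-in alternatives to trimming or Winsorizing
            rreg y x                           // robust regression, downweights outliers
            qreg y x                           // median (quantile) regression
            glm y x, family(gamma) link(log)   // GLM for a positive, skewed outcome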

            • #7
              Anne-Claire:
              I do agree with the previous comments against eliminating (so-called) outliers, unless they are apparent data entry mistakes.
              "Weird" results are perfectly reasonable, especially if the data distribution is (positively) skewed (e.g. a gamma distribution for costs).
              Kind regards,
              Carlo
              (Stata 19.0)

              • #8
                It's more about checking the data without outliers (or some extreme values).
                On the related question -- my Stata keeps having issues when I use ssc install winsor or ssc install winsor2; it always says connection timed out.
                Is there another command I could use in Stata without the winsor/winsor2 commands (for the tasks in #1)?
                For instance, to run a regression without these outliers (say, below p1 or above p99 within each group), after generating trimmed/winsorized variables with code like
                Code:
                winsor2 varofinterest, suffix(_w) cuts(1 99) by(group)
                for winsorizing, OR something like
                Code:
                summarize var, detail
                gen newvar = var if var >= r(p1) & var <= r(p99)
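
                In the meantime, one workaround I pieced together that avoids winsor2 entirely, using only built-in egen functions (var and group are placeholders; I am not sure this is the canonical way):
                Code:
                * group-wise percentiles, then winsorize and trim by hand
                bysort group: egen p1 = pctile(var), p(1)
                bysort group: egen p99 = pctile(var), p(99)
                gen var_w = min(max(var, p1), p99)         // winsorized
                gen var_t = var if inrange(var, p1, p99)   // trimmed (missing outside [p1, p99])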
