  • Outliers: the variable has some outliers but cannot be winsorized

    Dear all,

    Thank you in advance. I have run into a problem: a variable clearly has outliers (as shown by graph box), but winsor does not work for some values of p. For example,


    . sysuse auto, clear
    (1978 Automobile Data)

    . adjacent price

    ------------------------------------------
        price | lower adjacent  upper adjacent
    ----------+-------------------------------
            . |           3291            8814
    ------------------------------------------

    . graph box price

    . winsor price, gen(price_n) p(0.01)
    0 values to be Winsorized
    r(198);


    When I change p to 0.02 or 0.05, it works. In this case, which p value should I use? With 0.02, the variable still has some outliers. Increasing p to 0.05 reduces the number of outliers, but some remain.
    The number of outliers falls as I increase p, but I wonder whether I need to replace all the outliers with non-outlying values before doing any further analysis. This is just one variable; should I apply the same logic to all the variables?

    Best,

    Eddie





  • #2
    You seem to be banging your head against the wall with this preliminary winsorization step. Perhaps you might want to stop for a moment, step back and entertain an alternative tack. Take a look at the following simple, three-step approach and see whether it (or something along the lines of it) might yield more progress toward your eventual goal.

    Step 1. regress price <whatever>

    Step 2. rvfplot

    Step 3. Discuss
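
    For concreteness, Steps 1 and 2 might look like the sketch below on the auto data used in the original post. The regressors (weight, mpg, foreign) are placeholders chosen only for illustration and are not a recommended model; substitute whatever your research question calls for.

    ```stata
    sysuse auto, clear

    * Step 1: fit the model (weight, mpg and foreign are placeholder regressors)
    regress price weight mpg i.foreign

    * Step 2: residual-versus-fitted plot, with a reference line at zero
    rvfplot, yline(0)
    ```

    Points that sit far from the zero line, or a pattern in the spread of the residuals, are what Step 3 is meant to talk about.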



    • #3
      Thank you, Joseph. But the existence of outliers has a strong impact on the regression results, so I wonder whether we need to deal with them before the regression.



      • #4
        A couple of points to consider if you haven't already:

        1. In most cases, you don't really know whether you have outliers by looking at the outcome variable in isolation. And you don't really know which of them are the outliers from inspection of the outcome variable alone. A datum that appears to you to be an outlier in the outcome variable might have a very small residual.

        2. In most cases, you never really know whether you have outliers at all by inspection alone. What appear to be outliers in the residuals might very well be your first hint that your model needs further thought (omitted variables, change in functional form of the regression and so on). If you have values that are physically / biologically / economically impossible realizations of the phenomenon under study, then you have measurement issues that need to be addressed further upstream toward the source. Winsorization is probably doing you a disservice in that case.
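
        Point 1 can be checked directly. The sketch below assumes the same placeholder model as an illustration (weight, mpg and foreign are not part of the original question) and lists the studentized residuals for the cars that the box plot flags, i.e. those above the upper adjacent value of 8814 reported in the first post.

        ```stata
        sysuse auto, clear

        * placeholder model, for illustration only
        regress price weight mpg i.foreign

        * studentized residuals from the fitted model
        predict double rstu, rstudent

        * cars flagged by the box plot (price above the upper adjacent value)
        list make price rstu if price > 8814
        ```

        Any flagged car whose studentized residual is modest is well described by the model, despite looking extreme when price is viewed in isolation.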

        There are diagnostic plots that can help home in on which data have a "strong impact". If you haven't done so already, take a look at the variety of diagnostic plots brought up by
        Code:
        help regress postestimation plots
        But again, Step 3 above is probably what you will want to start anticipating at the outset.
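
        As a rough sketch of what that help file leads to, the commands below draw two of the listed plots and then compute Cook's distance. The model is again a placeholder, and the 4/N cutoff is only a common rule of thumb that I am assuming here, not something prescribed by the help file.

        ```stata
        sysuse auto, clear

        * placeholder model, for illustration only
        regress price weight mpg i.foreign

        * leverage versus squared normalized residuals, points labeled by make
        lvr2plot, mlabel(make)

        * added-variable plots, one panel per regressor
        avplots

        * Cook's distance; 4/N is an assumed rule-of-thumb cutoff
        predict double cook, cooksd
        list make price cook if cook > 4/e(N) & !missing(cook)
        ```

        Observations that stand out here are candidates for a closer look at the data source and the model specification rather than automatic winsorization.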



        • #5
          Thank you, Joseph, for your detailed explanations.



          • #6
            Joseph explained this clearly.
            You might want to take a look at some articles such as Aguinis et al. (2013).



            • #7
              Thank you Joseph and Amin!
