  • winsor2 advice

    Hi,

    I'm starting to use Stata. I have read about winsorizing data, but I am having difficulty choosing between cuts(1 99) and cuts(2.5 97.5) for some variables.
    Should I choose whichever gives the smallest variance?
    My variable GROWTH, for example, has a variance of 800 and a standard deviation of almost 29.

    Kind regards,
    Iliana

  • #2
    The more you Winsorize, the smaller the variance will be. That's not a criterion for choice, but an inevitable side-effect. Your difficulty over choosing is precisely the difficulty that methods of this kind pose: how to choose without arbitrariness.

    Why precisely do you want to Winsorize anyway?
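
To make the side-effect concrete, here is a minimal sketch in Python (standard library only, not Stata). The data are simulated and purely hypothetical, and the percentile rule is a simple linear interpolation that may differ slightly from what winsor2 uses. Clamping to a narrower percentile range can never increase the variance, which is why the variance cannot serve as a criterion for choosing the cuts.

```python
import random
import statistics


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in 0..100) of pre-sorted values."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def winsorize(values, lower_pct, upper_pct):
    """Replace values beyond the given percentiles with those percentiles."""
    s = sorted(values)
    lo, hi = percentile(s, lower_pct), percentile(s, upper_pct)
    return [min(max(v, lo), hi) for v in values]


random.seed(1)
# Hypothetical heavy-tailed sample standing in for a variable like GROWTH.
growth = [random.gauss(5, 10) * random.choice([1, 1, 1, 1, 5])
          for _ in range(1000)]

var_raw = statistics.variance(growth)
var_1_99 = statistics.variance(winsorize(growth, 1, 99))
var_25_975 = statistics.variance(winsorize(growth, 2.5, 97.5))

# The variance shrinks as the cuts move inward.
print(var_raw, var_1_99, var_25_975)
```

The ordering of the three variances follows from the clamping itself and holds whatever the data look like, so it says nothing about which cuts are appropriate.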


    • #3
      Nick,
      Thanks for your response. I want to winsorize to diminish the effect of my outliers. I have a lot of them, so I can't just delete them all...


      • #4
        What is your criterion for outliers? You do need one to apply what you are asking about.

        But you don't have to delete any of them. You can for example use a transformation or a non-identity link. Quite what will work best depends on your data, on which we need further information.

        There are perhaps twenty different approaches to outliers (depending naturally on how you count), many of which are listed in
        https://stats.stackexchange.com/ques...iers-with-mean

        Here is a clear example in which somebody was convinced they had a problem, but all they needed was to think on a logarithmic scale.
        https://stats.stackexchange.com/ques...stributed-data

        Conversely, we don't have more information than you do on which to make arbitrary decisions for your data and given your goals.
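
As a rough illustration of the logarithmic-scale point, here is a minimal Python sketch (standard library only, not Stata). The data are simulated from a lognormal distribution and are purely hypothetical. The outlier criterion, values more than 1.5 IQR beyond the quartiles, is an assumption chosen for the example, not a recommendation. Note also that logarithms require strictly positive values, which a growth variable may not have.

```python
import math
import random
import statistics


def tukey_outlier_count(values):
    """Count values more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < lower or v > upper)


random.seed(1)
# Hypothetical right-skewed, strictly positive sample.
raw = [random.lognormvariate(0, 1) for _ in range(1000)]
logged = [math.log(v) for v in raw]

n_raw = tukey_outlier_count(raw)     # many points flagged on the raw scale
n_log = tukey_outlier_count(logged)  # far fewer flagged on the log scale
print(n_raw, n_log)
```

The same observations look like a lot of outliers on one scale and unremarkable on another, so the count depends on the scale and the criterion, not only on the data.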


        • #5
          You say:
          "I have a lot of outliers"
          I look at outliers in the following way: an outlier is a value that is surprising given my model of the data. Often the "model of the data" is implicit; if so, you need to make it at least partly explicit because, with a lot of outliers, there is a good chance that it is your model that is wrong and that you should be using a different one (e.g., you may be expecting your data to be symmetric around the mean when they are actually skewed). For more on this way of looking at outliers, see Barnett, V. and Lewis, T. (1994), Outliers in Statistical Data, 3rd edition, Wiley.
