  • Winsorizing many groups of variables

    Hi

    For each of my many variables, I would winsorize as follows:
    Code:
    replace variable =r(p1) if variable <r(p1)
    replace variable =r(p99) if variable >r(p99) & variable !=.

    However, because I have many variables, I use the following loops instead:
    Code:
    foreach v of varlist I_* C_CG_* C_ED_* {
    replace ‘v’ =r(p1) if ‘v’ <r(p1)
    }
    
    foreach v of varlist I_* C_CG_* C_ED_* {
    replace ‘v’ =r(p99) if ‘v’ >r(p99) & ‘v’ !=.
    }

    However, the two methods give different results, and I do not know why. I would like to get the same results as the first method but using the loops. Thanks in advance.
    Last edited by Mo Hos; 11 Dec 2019, 10:25.

  • #2
    If this is really your code then r(p1) and r(p99) are whatever are left over from your last summarize on one particular variable. Even Winsorizing fans (I am not one) would not advocate using the 1% and 99% percentiles for one variable for everything else in their dataset.
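
    To illustrate the point about leftover results, here is a minimal sketch that assumes Stata's bundled auto dataset rather than the poster's data. The variables price and mpg and the local macro p1_price are stand-ins chosen only for illustration. Each summarize overwrites r(p1) and r(p99), so a replace that runs without a fresh summarize uses percentiles from whichever variable was summarized last.

    ```stata
    * Assumption: the auto dataset shipped with Stata stands in for the real data
    sysuse auto, clear

    * r(p1) now holds the 1st percentile of price
    quietly summarize price, detail
    local p1_price = r(p1)
    display "r(p1) after summarizing price: " r(p1)

    * A second summarize overwrites r(p1) with the 1st percentile of mpg
    quietly summarize mpg, detail
    display "r(p1) after summarizing mpg:   " r(p1)
    display "saved value for price:         " `p1_price'
    ```

    The two displayed percentiles differ, which is why each variable needs its own summarize inside the loop.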

    On a different level I would never overwrite original data. At most I would create a Winsorized version in a different variable.

    In fact it can't be your code unless exceptionally you have defined a local macro v. The loop uses var but never refers to it within the loop, so that won't work properly.
    If you are just paraphrasing or simplifying code on our behalf, then thanks for the thought, but it's immensely better to show real code that is legal, even if it doesn't do what you want.

    I can see no reason to loop over variables twice. A better recipe is more like

    Code:
    foreach v of varlist I_* C_CG* C_ED* {
          su `v', detail
          gen `v'_w = cond(`v' &lt; r(p1), r(p1), cond(`v' &gt; r(p99), r(p99), `v')) if `v' &lt; .
    }
    I think Winsorizing raises more problems than it solves and often say so. Posts here on Winsorizing fall mostly into two groups: people wanting to do it without being very articulate about why it is a good idea, let alone a better solution than others, and people who dislike it explaining ad nauseam why it is a bad idea.

    0. Winsorizing, at least as discussed here, is univariate. Outliers often make more sense in multivariate space. Conversely, Winsorizing univariately will miss possible outliers if that is the goal.

    1. Winsorizing is fragile given ties, quite likely with some kinds of variable. Very odd results are possible with e.g. categorical variables.

    2. How much to Winsorize is a dark art.

    3. At least in my field outliers are either impossible data errors, so remove them, or genuine extremes, so accommodate them in your analysis. Transformations or non-identity link functions can help mightily.

    Somewhere I posted another such list. Goodness knows how much overlap there is between the two.

    The existence of winsor (SSC) is not evidence here.


    • #3

      Thank you for the valuable information. Firstly, I will not have any issues if I overwrite the original data, as I save a copy (.dta) at each step of my data analysis and processing. Therefore, will it work if I just replace "gen" with "replace" in your suggested command? Secondly, when I used your suggested code, I got the following error:
      Code:
      unknown function gt; r()


      • #4
        My comment was that I would never overwrite existing variables, and here's more on why not. If I winsorized, I would surely want to keep track of how much difference it makes. That's a lot easier if original and modified variables are in the same dataset together. As you imply, keeping a copy of the original data is certainly a good idea.
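
        As a rough sketch of why keeping both versions side by side helps, the loop below creates a winsorized copy with a _w suffix and counts how many values were changed. It assumes Stata's bundled auto dataset and the variables price, mpg and weight purely for illustration; none of these come from the original question.

        ```stata
        * Assumption: auto dataset and these three variables are illustrative only
        sysuse auto, clear

        foreach v of varlist price mpg weight {
            * percentiles for this variable only
            quietly summarize `v', detail

            * winsorized copy; the original variable is left untouched
            generate double `v'_w = cond(`v' < r(p1), r(p1), cond(`v' > r(p99), r(p99), `v')) if `v' < .

            * how many nonmissing values did winsorizing change?
            quietly count if `v'_w != `v' & `v' < .
            display "`v': " r(N) " values changed"
        }
        ```

        With original and winsorized variables in the same dataset, comparisons such as scatter price_w price or summarize price price_w are one line each.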

        On the code, sorry about the failure of copy and paste here. The forum software somehow inserted the HTML entity &gt; for > and the HTML entity &lt; for <.

        > and < were in turn just taken from your code.

        It's now too late to edit #2, but my apologies that I didn't spot that.
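
        For readers copying from this thread, here is a sketch of the loop from #2 with the literal < and > operators restored. The toy variables I_a, C_CG_b and C_ED_c are hypothetical and exist only so that the block runs on its own. generate double is a small addition not in #2, used so that percentile values are stored without rounding.

        ```stata
        * Hypothetical toy data so the loop can run on its own
        clear
        set seed 12345
        set obs 200
        generate I_a    = rnormal()
        generate C_CG_b = rnormal()
        generate C_ED_c = runiform()

        * Loop from #2 with < and > written as plain operators
        foreach v of varlist I_* C_CG* C_ED* {
            quietly summarize `v', detail
            generate double `v'_w = cond(`v' < r(p1), r(p1), cond(`v' > r(p99), r(p99), `v')) if `v' < .
        }
        ```

        The varlist is expanded once when the loop starts, so the new _w variables are not picked up on the first run. Running the loop a second time would match them, so drop the _w variables first.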
        Last edited by Nick Cox; 11 Dec 2019, 12:02.


        • #5
          Thanks @Nick Cox. Very helpful tips.
