
  • Quick question: winsor right only & interaction

    Dear members

    I have 2 very quick questions.

    First: I need to winsorize a variable further, but ONLY on the right side. I think this is done in Stata by including the 'highonly' option. However, it does not seem to work. To be clearer: I need to winsorize an already winsorized variable again, on the right side only.

    The formula I use is as follows: -winsor winsorized_variable, gen(win_var_right) p(0.05) highonly
    What am I doing wrong? I get the exact same result with win_var_right as in winsorized_variable.

    Second: I also need to deal with an interaction effect. I know that I just have to multiply 2 variables. One of these variables is a winsorized one. Do I just multiply the winsorized variable and a non-winsorized variable together? It does make sense; however, I'm not exactly sure.

  • #2
    Never ask a quick question! The question may be quick, but the answer usually isn't.

    The allusion here is to winsor (from SSC, as you are asked to explain: FAQ Advice #12).

    Oddly, or otherwise, despite being its author I have no enthusiasm for winsorizing except in pursuit of winsorized means. Even then, I would want to explore the sensitivity of results to degree of winsorizing. Even then, I would prefer trimming, as at https://www.stata-journal.com/articl...article=st0313
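
    A sensitivity check of the kind mentioned above could be sketched as follows. This is only an illustration, assuming winsor is installed from SSC; the auto data, the list of fractions and the variable name price_wtmp are choices made here, not taken from the thread.

```stata
* Sketch: how the winsorized mean of price changes with the fraction winsorized
* (assumes -winsor- from SSC; fractions and names are illustrative)
sysuse auto, clear

quietly summarize price, meanonly
display "p = 0 (raw mean)        mean = " %9.2f r(mean)

foreach p in 0.05 0.10 0.20 0.25 {
    * winsorize both tails at fraction `p'
    winsor price, gen(price_wtmp) p(`p')
    quietly summarize price_wtmp, meanonly
    display "p = `p'  winsorized mean = " %9.2f r(mean)
    drop price_wtmp
}
```

    If the mean moves a lot from one fraction to the next, any single choice such as 0.05 is hard to defend.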

    I just discovered that winsor isn't even installed on the laptop I am currently using. That shows how indifferent I am to one of my own children. (I can be heartless; they don't feel or voice neglect.)

    That said, you give an abstraction of your code but no worked example. I just tested and found no problem. The output here uses extremes (also SSC) to make plain that only the very highest values are affected.

    Code:
    . sysuse auto , clear
    (1978 Automobile Data)
    
    . winsor price , gen(price_w) highonly p(0.05)
    
    . sort price price_w
    
    . extremes price price_w
    
      +------------------------+
      | obs:   price   price_w |
      |------------------------|
      |   1.   3,291     3,291 |
      |   2.   3,299     3,299 |
      |   3.   3,667     3,667 |
      |   4.   3,748     3,748 |
      |   5.   3,798     3,798 |
      +------------------------+
    
      +-------------------------+
      |  70.   12,990    12,990 |
      |  71.   13,466    13,466 |
      |  72.   13,594    13,466 |
      |  73.   14,500    13,466 |
      |  74.   15,906    13,466 |
      +-------------------------+
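
    One possible reading of the original problem, not confirmed in the thread: if the variable was already winsorized at fraction 0.05, its highest values are already tied, so winsorizing the high tail again at the same fraction has nothing left to change. Winsorizing further would then need a larger fraction. A minimal sketch under that assumption, again with winsor from SSC and the auto data (the names price_w2 and price_w3 and the fraction 0.10 are illustrative):

```stata
* Sketch: re-winsorizing at the same fraction versus a larger one
* (assumes -winsor- from SSC; variable names are illustrative)
sysuse auto, clear

* first pass: both tails at 0.05
winsor price, gen(price_w) p(0.05)

* second pass, high tail only, same fraction: values should be unchanged
winsor price_w, gen(price_w2) p(0.05) highonly
assert price_w2 == price_w

* second pass, high tail only, larger fraction: upper values are pulled in further
winsor price_w, gen(price_w3) p(0.10) highonly
count if price_w3 < price_w
list price price_w price_w3 if price_w3 < price_w
```
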
    Your second question, to me, exposes the utter incoherence behind how winsorizing appears to be applied. (Enthusiasm appears to be concentrated in certain fields looking at business data.) In essence, univariate calculations aren't easily made consistent. Positively, I often decide that outliers make perfect sense when I look at variables together. The mishmash of some observations being winsorized on one variable but not on others seems just a mess to me, compounded by the arbitrariness of the fraction used.
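
    For reference only, the mechanics asked about in the second question could look like the sketch below. It shows how a product term is formed, not whether doing so is advisable. The outcome (weight), the second predictor (mpg) and all new variable names are assumptions made for illustration.

```stata
* Sketch: interaction between a winsorized and a non-winsorized variable
* (assumes -winsor- from SSC; model and names are purely illustrative)
sysuse auto, clear
winsor price, gen(price_w) p(0.05)

* product term built by hand
generate double price_w_x_mpg = price_w * mpg
regress weight price_w mpg price_w_x_mpg

* same model using factor-variable notation, which keeps -margins- usable
regress weight c.price_w##c.mpg
```

    Note that only price has been altered here: an observation can be extreme on price and unremarkable on mpg, which is exactly the inconsistency described above.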

    55 years ago I was taught logarithms as a scale that made sense for many variables. That still seems the best idea in this territory.
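
    A minimal sketch of the logarithmic alternative, assuming the variable is strictly positive (as price is in the auto data); the variable name ln_price and the regression are illustrative only:

```stata
* Sketch: work on a log scale instead of winsorizing
* (ln() returns missing for zero or negative values, so check first)
sysuse auto, clear
assert price > 0

generate double ln_price = ln(price)

* compare skewness and the tails on the two scales
summarize price ln_price, detail

* every observation is kept and none is altered
regress ln_price mpg weight
```

    Unlike winsorizing, the transformation is monotone and reversible, so the ordering and identity of the observations are preserved.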

    (Spelling detail: I am usually in favour of Winsorizing, not winsorizing, given the family name involved, but consistency is more important.)
