  • Winsorizing at standardization deviations

    Hello everybody,

    Regarding winsorizing, I have a little problem understanding a procedure used in the literature. I already know and understand the winsor (SSC) command, e.g. to winsorize variables at their 1st and 99th percentiles. However, I have to estimate regressions on industry-year sections while "winsorizing the regression variables at three standardization deviations each year". First, I'm confused about how to winsorize at standardization deviations. Does it perhaps mean that I have to standardize the variables within each industry-year and winsorize at 3 percentiles? I know that's more a problem of statistical understanding, but as a consequence I also don't know how to execute it in Stata. Furthermore, I don't understand how to winsorize within each industry-year in Stata. I think I have to use loops and tried the following:

    Code:
    egen industry_year=group(industry year)
    
    su industry_year
    scalar a=r(min)
    scalar b=r(max)
    
    foreach var of varlist x1 x2 x3 x4 {
    forvalues c=`=scalar(a)'/`=scalar(b)' {
    winsor `var', p(0.03) gen(w_`var') in `c'
    replace `var'=w_`var' in `c' in `c'
    drop w_`var' in `c' in `c'
    }
    However, this code doesn't work, since it is not allowed to use in or if combined with winsor, and it ignores the standardization-deviation problem (sorry if the code is complete nonsense, it's one of my first times using loops in Stata). Maybe someone has already dealt with such a procedure and can help me?

    Thank you!
    TM

  • #2
    I surmise that you're surely and sorely being misled by poor use of terminology somewhere upstream of you -- by reference(s) you do not give, so we can't assess or comment on those sources.

    The quotation is puzzling as well as mysterious as "three standardization deviations" is at best an extraordinary typo, and at worst a signal suggesting that your source is not authoritative if they could possibly write like that.

    In my book pulling in values to mean +/- so many SDs (if that's what this implies)

    1. is a weird idea, as the SDs are not resistant to the outliers this method is presumably supposed to subdue

    2. is not Winsorization in any sense consistent with standard use of this term over 50+ years in literature.
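
    To illustrate point 1 with a toy example (made-up data, not taken from any source): with nine values of 0 and one value of 1000, the outlier inflates the SD so much that it lies less than 3 SDs from the mean, so a mean +/- 3 SD rule would leave it untouched.

    Code:
    * Toy data: nine zeros and one extreme value
    clear
    set obs 10
    generate x = cond(_n == 10, 1000, 0)
    summarize x
    * Distance of the largest value from the mean, in SD units
    display (r(max) - r(mean)) / r(sd)

    The displayed value is about 2.85. More generally, with the usual sample SD no value among n observations can lie more than (n - 1)/sqrt(n) SDs from the mean, which is below 3 whenever n <= 10, so in small industry-year groups such a rule can never pull anything in.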

    But naturally I don't have and don't want the power to stop people being elastic in their use of terms, which we all sometimes claim for ourselves.

    Yet on the narrower matter of winsor (SSC), which bears my name,

    a. It is quite incorrect to say

    "it is not allowed to use in or if combined with winsor"

    as a moment's glance at the help or the code will verify.

    Your problem is quite different: Your syntax puts the in qualifier where it does not belong, among the options. (This is nothing to do with using loops.)
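
    As a minimal sketch of where the qualifier belongs, assuming the variable names from the question (industry_year already created by egen, and x1-x4): the if qualifier goes before the comma and the options after it. Note also that in refers to observation numbers, not to group codes, so picking out one group needs if. The fraction p(0.01) and the names w_x1, ..., and tmp_w are placeholders for illustration only, and very small groups may make winsor stop with an error.

    Code:
    * Percentile-based winsorizing within each industry-year (a sketch)
    * Assumes industry_year was created with egen ..., group() as in the question
    summarize industry_year, meanonly
    local G = r(max)

    foreach var of varlist x1 x2 x3 x4 {
        generate double w_`var' = .
        forvalues g = 1/`G' {
            * qualifier before the comma, options after it
            winsor `var' if industry_year == `g', gen(tmp_w) p(0.01)
            replace w_`var' = tmp_w if industry_year == `g'
            drop tmp_w
        }
    }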

    b. I can't possibly tell exactly what procedure is being suggested in the source you don't cite. But I can't see that it will be using 3% and 97% percentiles.
    I guess wildly that you need to calculate means +/- 3 SDs and then pull in any higher or lower values to those limits. Such an operation is not, and will never be, supported by winsor, so you would need to code that differently.
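
    A rough sketch of that guess, assuming the grouping is industry-year as in the question (use year alone in by() if the source really means "each year") and that the variables are x1-x4. The new names c_x1, c_x2, ... are placeholders. This pulls values in to mean +/- 3 SD within each group and does not use winsor.

    Code:
    foreach var of varlist x1 x2 x3 x4 {
        tempvar mean sd
        egen double `mean' = mean(`var'), by(industry_year)
        egen double `sd' = sd(`var'), by(industry_year)
        * min() and max() ignore missing arguments, so missing values of
        * the variable are excluded explicitly; in a one-observation group
        * the SD is missing and the value is left unchanged
        generate double c_`var' = max(min(`var', `mean' + 3 * `sd'), `mean' - 3 * `sd') if !missing(`var')
        drop `mean' `sd'
    }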
    Last edited by Nick Cox; 06 Mar 2016, 06:22.

    • #3
      Sorry that I didn't give the source more precisely, but since the authors say little more than what I quoted, I thought it wouldn't be necessary.
      The source is Tucker and Zarowin, The Accounting Review, 2006, Vol. 81, No. 1 (https://warrington.ufl.edu/accountin..._smoothing.pdf) and the quotation is on p. 258. It is actually a very reputable journal in financial accounting, but I've never heard of "winsorize at three standardization deviations".

      But thank you for your thoughts and suggestions, Nick!

      TM

      • #4
        Thanks for the reference. As I outlined earlier, I think this is non-standard use of terminology too, but I am certainly not familiar with that literature.

        • #5
          I agree with all of Nick's points, but I see something else in the linked article: the data set appears to consist of unbalanced panels, because firms appear at different time points. The models, however, contain no firm intercepts. This seems unrealistic.

          Steve Samuels
          Statistical Consulting

          Stata 14.2
