  • Winsorize when values for different observations are very different?

    Hi everyone,

    It may sound stupid, but this question has been on my mind lately: at which percentiles should I winsorize my variable? And should I do this before or after taking the log/ln?
    My aim is to keep extreme values from affecting my regression.
    For instance, I have deflated assets for all firms in Compustat (I apply some filters, but none related to financial variables), and the values differ enormously. I cannot persuade myself that winsorizing at just the 1st and 99th percentiles is enough, because the values are so spread out that even the 95th or 90th percentile may be an extreme value that affects my regression.
    I also wonder whether it is better to decide where to cut before or after taking the log. Once I take the log, everything looks smaller, although a one-unit difference then matters much more. So I suppose there is literally no difference, but in practice, which do you think would be more helpful?

    I attach a distribution of my asset variable below.
    [Attachment: asset.png — distribution of the asset variable]
    Thank you so much!
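    [Editorial sketch, not part of the original post.] The mechanics being asked about can be sketched in Stata as follows; asset is the poster's variable, and the 1st/99th percentile cutoffs stand in for whatever (arbitrary) choice is made. Since percentiles are preserved under a monotonic transformation like the log, winsorizing at the p-th percentile before taking logs clips exactly the same observations as doing it after; only the scale of the clipped values differs.

    Code:
    * Sketch: winsorize asset at chosen percentiles by hand (cutoffs are arbitrary)
    quietly summarize asset, detail
    local lo = r(p1)     // 1st percentile
    local hi = r(p99)    // 99th percentile
    generate asset_w = min(max(asset, `lo'), `hi') if !missing(asset)
    
    * Log of the winsorized variable; clipping after the log at the same
    * percentiles would affect the same observations
    generate ln_asset_w = ln(asset_w)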

  • #2
    From your evidence, assets on a log scale are pretty much symmetrically distributed. That is one piece of information, but it says nothing about any relationships. Indeed, you have not told us how you want to use the variable, as outcome or predictor. But it is entirely consistent with a transformation being all that you need to do.

    I don't understand the enthusiasm for Winsorizing unless the goal is to get a robust estimate of the level of an erratic distribution. Anyone who points out that I once wrote a command (now on SSC) called winsor is correct, but that was a programming problem, and doesn't mean that it is a good idea here. Indeed you already are aware of one serious problem with Winsorizing, its utter arbitrariness.

    Posts like this occur intermittently and I typically lay down a friendly challenge, to cite authoritative textbooks or review papers explaining why it is a good thing to do. To date no-one has ever responded. The practice seems localised to some parts of finance and to be a case of people copying papers they have read. I am not an economist but I don't see it mentioned in econometric literature I have sampled.

    once I took log, everything seems smaller
    That's just a side-effect of using a different scale, but meaningless (or to be more positive, no kind of worry) in itself. Choice of base of logarithms is just a convention.



    • #3
      Originally posted by Nick Cox View Post
      [quoting #2 in full]

      Hi Nick,

      Thanks for the reply. Indeed, I understand that the log is just a monotonic transformation and so doesn't really change things. But I have often been told to winsorize, and I had not thought about it carefully before, so these questions came to mind. I will try to find out whether any book gives a more explicit reason.

      Thank you!



      • #4
        log is just a monotonic transformation, thus do[es]n't really change things
        Don't say that. First off, most of the transformations you might ever use are monotonic, but they can make a big difference. Otherwise they would all be futile.

        Second, in this case log will dampen outliers and may well improve linearity, which is beyond what can be demonstrated without knowing (again) what for you is outcome and what predictors and what is going on in your full dataset.

        But this is what lies behind the opening statement in #2. I used stripplot from SSC. Using the same axis labels was deliberate.


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float asset
         .067166
        .0957903
          1.4069
        4.049436
        18.03424
        94.03433
        473.4112
        1912.863
        4515.791
        24160.35
        138610.4
        end
        
        scatter asset asset, ysc(log)
        
        stripplot asset , xla(0.1 1 10 100 1000 10000 100000) name(G1, replace)
        stripplot asset , xsc(log) xla(0.1 1 10 100 1000 10000 100000) name(G2, replace)
        graph combine G1 G2, col(1)
        [Attachment: assets.png — assets plotted on linear and log scales]



        • #5
          Originally posted by Nick Cox View Post
          [quoting #4 in full]
          Hi Nick,

          Thank you for the reply.
          So do you mean that taking the log will improve linearity, since, as the graph shows, the values are distributed less skewly?
          What I mean is this: a one-unit difference on the log scale is not the same as a one-unit difference on the original scale. For instance, 10 and 100 differ a lot in their original form, the latter being ten times the former; if I take log base 10, they become 1 and 2, and the latter is still ten times the former on the original scale, but perhaps the logged values fit a linear model better?

          To add more information to my original question: my aim is to keep outliers from affecting my results. My outcome variable is a dummy indicating whether a firm participates in a merger event, and I use financial variables such as assets, ROA, and DTA in a logit model to predict a firm's probability of participating.



          • #6
            No; I don't mean that improvements will necessarily happen. All I can say is what I said

            may well improve linearity, which is beyond what can be demonstrated without knowing (again) what for you is outcome and what predictors and what is going on in your full dataset
            There is easy scope for you to check here. You can run your model with assets as a predictor or with its logarithm as a predictor. Outliers are likely to bite less in the second case -- that is my guess -- but my speculation is pointless when you can and should check for yourself.
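            [Editorial sketch, not part of the original post.] That check might look like this, where merger, roa, and dta are hypothetical stand-ins for the poster's actual variable names:

            Code:
            * Sketch: run the logit both ways and compare (hypothetical variable names)
            generate ln_asset = ln(asset)
            
            logit merger asset roa dta
            estat classification          // in-sample classification, specification 1
            
            logit merger ln_asset roa dta
            estat classification          // compare with specification 2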

            I can't follow what you are saying about values of logarithms. Logarithms are powers and not on the same scale as the original values. There are many measures for judging goodness of fit for logit models and different authorities emphasise different ways of assessing models.



            • #7
              Originally posted by Nick Cox View Post
              [quoting #6 in full]
              Thank you for the reply. I will think about it.
