  • Winsorize when values for different observations are very different?

    Hi everyone,

    It may sound stupid, but this question has been on my mind lately: at which percentiles should I winsorize my variable? And should I do this before or after taking the log/ln?
    My aim is to keep extreme values from affecting my regression.
    For instance, I have deflated assets for all firms in Compustat (I apply some filters, but none related to financial variables), and the values differ enormously. I cannot persuade myself that winsorizing at just the 1st and 99th percentiles is enough, because the values are so spread out that even the 95th or 90th percentile may be an extreme value that affects my regression.
    I also wonder whether it is better to decide where to cut before or after taking the log. Once I take the log, everything looks smaller, although a one-unit difference then matters much more. So I suppose there is literally no difference, but in practice, which do you think would be more helpful?

    I attach a distribution of my asset variable below.
    [Attachment: asset.png — distribution of the asset variable]
    Thank you so much!
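    [Editorial sketch, not part of the original post.] The mechanics being asked about can be sketched in Stata as follows; asset is the poster's variable, and the 1st/99th percentile cutoffs stand in for whatever (arbitrary) choice is made. Since percentiles are preserved under a monotonic transformation like the log, winsorizing at the p-th percentile before taking logs clips exactly the same observations as doing it after; only the scale of the clipped values differs.

    Code:
    * Sketch: winsorize asset at chosen percentiles by hand (cutoffs are arbitrary)
    quietly summarize asset, detail
    local lo = r(p1)     // 1st percentile
    local hi = r(p99)    // 99th percentile
    generate asset_w = min(max(asset, `lo'), `hi') if !missing(asset)
    
    * Log of the winsorized variable; clipping after the log at the same
    * percentiles would affect the same observations
    generate ln_asset_w = ln(asset_w)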

  • #2
    From your evidence, assets on a log scale are pretty much symmetrically distributed. That is one piece of information, but it says nothing about any relationships. Indeed, you have not told us how you want to use the variable, as outcome or predictor. But it is entirely consistent with a transformation being all that you need to do.

    I don't understand the enthusiasm for Winsorizing unless the goal is to get a robust estimate of the level of an erratic distribution. Anyone who points out that I once wrote a command (now on SSC) called winsor is correct, but that was a programming problem, and doesn't mean that it is a good idea here. Indeed you already are aware of one serious problem with Winsorizing, its utter arbitrariness.

    Posts like this occur intermittently and I typically lay down a friendly challenge, to cite authoritative textbooks or review papers explaining why it is a good thing to do. To date no-one has ever responded. The practice seems localised to some parts of finance and to be a case of people copying papers they have read. I am not an economist but I don't see it mentioned in econometric literature I have sampled.

    once I took log, everything seems smaller
    That's just a side-effect of using a different scale, but meaningless (or to be more positive, no kind of worry) in itself. Choice of base of logarithms is just a convention.



    • #3
      Originally posted by Nick Cox View Post
      [quoting #2 in full]

      Hi Nick,

      Thanks for the reply. Indeed, I understand that the log is just a monotonic transformation and so doesn't really change things. But I have often been told to winsorize, and I had not thought about it carefully before, so these questions came to mind. I will try to find out whether any book gives a more explicit reason.

      Thank you!



      • #4
        log is just a monotonic transformation, thus do[es]n't really change things
        Don't say that. First off, most of the transformations you might ever use are monotonic, but they can make a big difference. Otherwise they would all be futile.

        Second, in this case log will dampen outliers and may well improve linearity, which is beyond what can be demonstrated without knowing (again) what for you is outcome and what predictors and what is going on in your full dataset.

        But this is what lies behind the opening statement in #2. I used stripplot from SSC. Using the same axis labels was deliberate.


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input float asset
         .067166
        .0957903
          1.4069
        4.049436
        18.03424
        94.03433
        473.4112
        1912.863
        4515.791
        24160.35
        138610.4
        end
        
        scatter asset asset, ysc(log)
        
        stripplot asset , xla(0.1 1 10 100 1000 10000 100000) name(G1, replace)
        stripplot asset , xsc(log) xla(0.1 1 10 100 1000 10000 100000) name(G2, replace)
        graph combine G1 G2, col(1)
        [Attachment: assets.png — assets plotted on linear and log scales]



        • #5
          Originally posted by Nick Cox View Post
          [quoting #4 in full]
          Hi Nick,

          Thank you for the reply.
          So do you mean that taking the log will improve linearity, since, as the graph shows, the values are distributed less skewly?
          What I mean is this: a one-unit difference on the log scale is not the same as a one-unit difference on the original scale. For instance, 10 and 100 differ a lot in their original form, the latter being ten times the former; if I take log base 10, they become 1 and 2, and the latter is still ten times the former on the original scale, but perhaps the logged values fit a linear model better?

          To add more information to my original question: my aim is to keep outliers from affecting my results. My outcome variable is a dummy indicating whether a firm participates in a merger event, and I use financial variables such as assets, ROA, and DTA in a logit model to predict a firm's probability of participating.



          • #6
            No; I don't mean that improvements will necessarily happen. All I can say is what I said

            may well improve linearity, which is beyond what can be demonstrated without knowing (again) what for you is outcome and what predictors and what is going on in your full dataset
            There is easy scope for you to check here. You can run your model with assets as a predictor or with its logarithm as a predictor. Outliers are likely to bite less in the second case -- that is my guess -- but my speculation is pointless when you can and should check for yourself.
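            [Editorial sketch, not part of the original post.] That check might look like this, where merger, roa, and dta are hypothetical stand-ins for the poster's actual variable names:

            Code:
            * Sketch: run the logit both ways and compare (hypothetical variable names)
            generate ln_asset = ln(asset)
            
            logit merger asset roa dta
            estat classification          // in-sample classification, specification 1
            
            logit merger ln_asset roa dta
            estat classification          // compare with specification 2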

            I can't follow what you are saying about values of logarithms. Logarithms are powers and not on the same scale as the original values. There are many measures for judging goodness of fit for logit models and different authorities emphasise different ways of assessing models.



            • #7
              Originally posted by Nick Cox View Post
              [quoting #6 in full]
              Thank you for the reply. I will think about it.
