Running the same do-file twice in a row and getting differing results for the same regressions

Nicole Zander

Join Date: Jan 2016

Posts: 38
#1

01 Oct 2016, 03:16

Hi all,

I have a somewhat conceptual question.

Currently, I have a long do-file of 2000 lines. It works as I want it to, except that my regression results differ every time I run the complete do-file.

What could cause this?

Below you can find some of the code that I suspect of being a cause:

Code:

* Standard deviation of the stock returns over the previous 60 months
tsegen double ret_SD = rowsd(L(1/60).ret)

* Winsorize the bottom and top 5% of the standard deviations of the stock returns
ssc install winsor
winsor ret_SD, gen(W_ret_SD) p(0.05)
drop ret_SD
rename W_ret_SD ret_SD

The regressions I run are panel OLS regressions:
1. Plain regression
2. Adding time fixed effects
3. Adding time fixed effects and the option robust
4. Adding time fixed effects, robust, and clustering of the errors

I am curious whether this code is indeed the cause.

Regards,
Nicole

Last edited by Nicole Zander; 01 Oct 2016, 03:19.
Nick Cox

Join Date: Mar 2014

Posts: 35694
#2

01 Oct 2016, 04:21

The problem is likely to be that you expect far too much from winsorizing. You're winsorizing a response without reference to the covariates. With ties on the response, different observations may be winsorized with different calls, as sorting entails some random shuffling. To show the problem in an extreme form consider

Code:

y    x
1   23
1   42
2  666
3  666
3   42

Suppose you winsorize the top and bottom 20% of the response y. (I know you will use a different fraction, but focus on the principle.)

Sometimes the first observation will be winsorized, sometimes the second, as they tie on the response. Similarly with the fourth and fifth. But they have different covariate values (see x) and regressions downstream will differ.
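A minimal sketch with the toy data above, to make the tie structure visible. It only shows that y alone does not identify observations, so which of the tied observations comes first after a sort on y is not determined by the data. The variable id is a hypothetical identifier added for illustration; it is not part of the original example.

```stata
clear
input y x
1 23
1 42
2 666
3 666
3 42
end

* id is a hypothetical identifier recording the original order
generate long id = _n

* isid exits with an error if y does not uniquely identify observations;
* capture lets the do-file continue so the return code can be inspected
capture isid y
display "isid y return code: " _rc

* sorting on y alone leaves the order within ties unspecified;
* adding a unique key (or the stable option of sort) fixes that order
sort y id
list, sepby(y)
```

A nonzero return code from isid is the signal that any later step depending on the order within ties needs a unique sort key.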

Oddly I wrote winsor because someone wanted to do it and it was a straightforward programming problem. But I never, ever use it to produce modified variables for modelling. At most I would use it for summarizing univariate distributions.

We've had better robust regression methods for decades (arguably, centuries).

I know that people in some fields use this a lot. What can I say beyond: Sounds like a very bad idea to me. I don't know how well publicised this rather elementary pitfall is.
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#3

01 Oct 2016, 04:34

"It works except it produces different results every time I run it" = it doesn't work
The whole idea of a do-file is that the results are directly reproducible. The commands you mention as suspicious indeed are, as sort order or tie-breaking can cause this. The first thing I would check is the number of observations used in each regression in each of the runs. If these differ, then there is indeed a problem, probably from winsor, that you should investigate further.
Also, note that regressions 3 and 4 are identical if you're clustering the SEs on the panel variable. More information on that is in the help file for xtreg.
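A rough sketch of that kind of check, assuming the shipped auto dataset as a stand-in for the real panel and arbitrary seed values. The idea is to fix the seeds at the top of the do-file, then record the estimation sample size and a checksum of the data just before the regressions, so that two runs can be compared line by line in the log. set sortseed is available in recent Stata versions; treat it as an assumption if yours is older.

```stata
* auto.dta ships with Stata; it stands in for the real data here
sysuse auto, clear

* fix the random-number seed and the sort seed (values are arbitrary)
set seed 20161001
set sortseed 20161001

* a checksum of the data in memory; if it differs between two runs
* at this point, the data differ before any regression is fitted
datasignature

regress price weight mpg
display "N used: " e(N)
```

If the signature and e(N) agree across runs but the coefficients do not, the difference must arise inside the estimation step; if they disagree, the data preparation is where to look.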
Nicole Zander

Join Date: Jan 2016

Posts: 38
#4

01 Oct 2016, 04:42

Thanks to you both for the clear answers. I will dive deeper into the world of outlier handling and choose a method other than winsor.
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#5

03 Oct 2016, 03:46

Originally posted by Nick Cox:

The problem is likely to be that you expect far too much from winsorizing. You're winsorizing a response without reference to the covariates. With ties on the response, different observations may be winsorized with different calls, as sorting entails some random shuffling. To show the problem in an extreme form consider

Code:

y    x
1   23
1   42
2  666
3  666
3   42

Suppose you winsorize the top and bottom 20% of the response y. (I know you will use a different fraction, but focus on the principle.)

Sometimes the first observation will be winsorized, sometimes the second, as they tie on the response. Similarly with the fourth and fifth. But they have different covariate values (see x) and regressions downstream will differ.

Oddly I wrote winsor because someone wanted to do it and it was a straightforward programming problem. But I never, ever use it to produce modified variables for modelling. At most I would use it for summarizing univariate distributions.

We've had better robust regression methods for decades (arguably, centuries).

I know that people in some fields use this a lot. What can I say beyond: Sounds like a very bad idea to me. I don't know how well publicised this rather elementary pitfall is.

Can you expand a bit? I feel in economics outlier handling is one of the murkiest parts of the discipline, but I wouldn't know a better solution myself either...
Nick Cox

Join Date: Mar 2014

Posts: 35694
#6

03 Oct 2016, 03:53

What I think is a bad idea is winsorizing the response and then using that in models. It is arbitrary and it is not reproducible.

Robust regression, quantile regression, transformation of response, generalized linear models with (e.g.) log link are all better alternatives in my view.
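For concreteness, a sketch of what those alternatives look like as commands, using the shipped auto dataset as a stand-in. The variables and the gamma family in the last model are assumptions made for illustration only; nothing here is tuned to stock-return data.

```stata
sysuse auto, clear

* robust regression, which downweights observations with large residuals
rreg price weight mpg

* median (quantile) regression
qreg price weight mpg

* transformation of the response
generate double ln_price = ln(price)
regress ln_price weight mpg

* generalized linear model with log link (gamma family is an assumption)
glm price weight mpg, family(gamma) link(log)
```

Each of these uses every observation as recorded, so none of them depends on the order in which tied values happen to be sorted.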

I don't claim to speak for economics, which I have not studied since the last millennium in high school. But I am reminded of a statement (which I have not checked) that the Festschrift for Oskar Morgenstern was full of appreciation for his work on game theory with von Neumann but nowhere mentioned that he had written an entire book on how lousy economic data are.

(I have no animus against economics here: you could say how lousy almost all data are.)
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#7

03 Oct 2016, 13:12

Dear Nicole,

As an economist, I would like to support Nick's comment that Winsorizing is generally a bad idea and that there are much better methods to handle so-called outliers.

As for your comment that outlier handling is one of the murkiest parts of economics, I would say that economics does not even address that problem and that most modern econometric textbooks do not mention outliers at all. This creates a vacuum that some researchers try to fill by importing techniques from other disciplines, and this generally does not work well.

Anyway, I think the point to make is that we should always be aware of the potential problems with our data but Winsorizing, trimming, and other such approaches only compound the problem.

Joao
Bruce Weaver

Join Date: May 2014

Posts: 1132
#8

03 Oct 2016, 13:25

At the risk of stirring the pot a bit, here are two comments on some issues that have come up in this thread.

First, from #3:

"It works except it produces different results every time I run it" = it doesn't work

Following that logic, one would have to argue that bootstrapping doesn't work (except when one sets a seed to guarantee the same results every time).

Second, note that computation of a median can be viewed as an extreme form of Winsorizing where all but the middle (or middle 2) scores are discarded.

--
Bruce Weaver
Version: Stata/MP 19.5 (Windows)
Ariel Karlinsky

Join Date: Jun 2015

Posts: 491
#9

03 Oct 2016, 13:33

The comment was specifically about a do-file, of course.
The same would have held true for bootstrapping, though. Had I set the seed at some value, run a fixed number of replications, and got different results on each run, that would have been discouraging as well.
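A small sketch of that point, assuming the shipped auto dataset and an arbitrary seed: the same bootstrap is run twice with the same seed and number of replications, and the two variance matrices are then compared.

```stata
sysuse auto, clear

* first run: seed and number of replications fixed
bootstrap _b, reps(200) seed(2016): regress price weight mpg
matrix V1 = e(V)

* second run: identical seed and replications
bootstrap _b, reps(200) seed(2016): regress price weight mpg
matrix V2 = e(V)

* mreldif() returns the largest relative difference between two matrices
display "largest relative difference: " mreldif(V1, V2)
```

With the seed fixed, any nonzero difference here would point to something other than the resampling itself, such as the data changing between the two calls.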
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#10

03 Oct 2016, 14:27

Bruce, if we want to compute the median we do not need to Winsorize because the result does not change :-)

But the problem here is more complex because we are talking about regressions. If we Winsorize the response and then do regression (even median or quantile regression) we are likely to mess up things because the Winsorizing was not conditional on the regressors. So, what looks like an outlier may not be an outlier at all in the context of a regression.
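A simulated sketch of that point, under assumptions of my own choosing (500 observations, a true slope of 2, normal errors, cutoffs at the 5th and 95th percentiles). The clamping is done by hand with _pctile, which is close to, but not necessarily identical to, what winsor does.

```stata
clear
set seed 42
set obs 500

* simulated data: the response is large mainly where x is large
generate double x = rnormal()
generate double y = 1 + 2*x + rnormal()

* clamp y at its 5th and 95th percentiles, ignoring x entirely
_pctile y, percentiles(5 95)
generate double y_w = min(max(y, r(r1)), r(r2))

* compare the slope on x before and after clamping
regress y x
regress y_w x
```

None of the clamped values is an outlier given x here, since the data were generated from the model itself, so one would expect the second slope to be pulled towards zero.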

Joao
Bruce Weaver

Join Date: May 2014

Posts: 1132
#11

03 Oct 2016, 15:43

Hi Joao. I completely agree with the point (in #10 and earlier) about the Winsorizing not being conditional on the regressors.

My comment about the median being an extreme case of Winsorizing was meant to apply to the univariate context only. I read something like that in a textbook years ago, and found it an interesting way to understand the median.

Cheers,
Bruce

Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#12

03 Oct 2016, 15:48

Thanks for clarifying, Bruce. So, there is actually a good use for Winsorizing: to explain what the median is!

Best wishes,

Joao
Nick Cox

Join Date: Mar 2014

Posts: 35694
#13

03 Oct 2016, 17:04

I've found the idea of trimming interesting and useful in the context of univariate summary, as http://www.stata-journal.com/article...article=st0313 attests. The same point arises there: the family of trimmed means stretches from the standard mean to the median and includes slightly more exotic means such as the midmean, the mean of the middle half of the data (loosely, the mean of values in the box of a box plot). Rather than trying to think of some magically efficacious fraction for trimming, trimming all possible fractions yields informative displays of a distribution.
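A by-hand sketch of that family, assuming the shipped auto dataset and a handful of trimming percentages picked for illustration. It does not use any user-written command from the article; it only sorts the variable and averages the middle block of values.

```stata
sysuse auto, clear

* mpg has no missing values in auto.dta; with missings, drop them first
sort mpg
local n = _N

* trimmed means for several trimming percentages on each tail;
* 0 gives the ordinary mean and 25 gives the midmean
foreach pct of numlist 0 5 10 25 {
    local k = floor(`n' * `pct' / 100)
    local first = `k' + 1
    local last = `n' - `k'
    summarize mpg in `first'/`last', meanonly
    display "trimmed " %2.0f `pct' " percent per tail: mean = " %6.3f r(mean)
}

* the median, the limiting member of the family
quietly summarize mpg, detail
display "median = " r(p50)
```

Because only the sorted values enter the calculation, the order of tied observations has no effect on these summaries, which is what makes trimming safe in the univariate setting.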
