Baffled by some methods used in papers - anyone an idea...

Ines Simac

Join Date: Apr 2014

Posts: 10
#1

Baffled by some methods used in papers - anyone an idea...

24 Oct 2014, 03:14

Dear Stata-users,

As a researcher, I came across a method in some empirical studies that uses a two-step regression:

FIRST: y is regressed on a set of variables x (establishing a "normal" level - and the residual is the "abnormal" level)
SECOND: the residual is then regressed on a second set of variables z to test whether some of these variables can explain the "abnormal" level.

Now, my question is: isn't this residual a constructed variable, thus making inference invalid? You kind of give the first set of variables x the first chance to explain y, so less is left for the variables of interest z...

Can anyone help clarify why someone would choose to adopt this two-step regression?

Thanks in advance,

Ines
Tags: None
Aljar Meesters

Join Date: Apr 2014

Posts: 30
#2

24 Oct 2014, 03:41

People probably do this kind of thing in order to "test" a theory they believe in. That such a "test" violates basic assumptions is in the best case overlooked and in the worst case conveniently ignored.
You are correct that the inference is invalid in this case, but not because they use a constructed variable in the second stage. The problem is that in the first step the authors need to assume that the residuals are independently and identically distributed, while in the second step they assume that the residuals depend on some other characteristics. If this is indeed the case, they have omitted variable bias in their first regression. Only when the second set of variables z is uncorrelated with the first set x is there no bias, but in that case they could have estimated everything in one step. You may want to look at the Frisch–Waugh–Lovell theorem for some more background on this.
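
A small simulation can make the Frisch–Waugh–Lovell point concrete. The sketch below is in Python with NumPy rather than Stata, and everything in it (the data-generating process, the coefficient values, the helper `fit`) is an assumption chosen for illustration, not something taken from the papers in question. It compares the one-step coefficient on z with two residual-based versions: residualizing only y on x (the two-step approach described in the question), and residualizing both y and z on x (what the theorem requires).

```python
import numpy as np

def fit(y, X):
    """OLS of y on X plus an intercept; returns (coefficients, residuals)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef, y - A @ coef

# Hypothetical data: z is correlated with x, and both affect y.
rng = np.random.default_rng(1)
n = 500
x = rng.standard_normal(n)
z = 0.8 * x + 0.6 * rng.standard_normal(n)
y = 2.0 + 1.5 * x + 1.0 * z + rng.standard_normal(n)

# One step: y on x and z together; index 2 is the coefficient on z.
joint, _ = fit(y, np.column_stack([x, z]))
gamma_one_step = joint[2]

# Two step as described in the question: residualize y on x, then regress on raw z.
_, e_y = fit(y, x)
naive, _ = fit(e_y, z)
gamma_naive = naive[1]

# Frisch-Waugh-Lovell: residualize z on x as well, then regress e_y on e_z.
_, e_z = fit(z, x)
fwl, _ = fit(e_y, e_z)
gamma_fwl = fwl[1]

print("one step        :", gamma_one_step)
print("two step (naive):", gamma_naive)
print("FWL             :", gamma_fwl)
```

In this setup the version that residualizes both y and z reproduces the one-step coefficient up to floating-point error, while the version that residualizes only y does not, because x and z are correlated.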
Best,

Aljar
Nick Cox

Join Date: Mar 2014

Posts: 35708
#3

24 Oct 2014, 03:47

If being constructed invalidates inference, then nothing goes. Is "GNP per head" constructed? Or "suspended sediment yield in tonnes per square kilometre per year"? The first corrects, or better adjusts, for population size; the second for catchment area and length of time.

There is a difference in that one process is deterministic, while the other brings in its wake questions of model specification, estimation and so forth. Also, things could be done differently. In my examples, one could try regressing GNP on population size or sediment yield totals on area and length of time. In your case (no details, so no comment possible) perhaps there is a deterministic adjustment possible.

In short, I don't see an objection in principle to this. It could be a sensible way of trying to adjust for controls you don't care about. After all, much of statistics amounts to trying to do what many scientists do through experimental design.
Nick Cox

Join Date: Mar 2014

Posts: 35708
#4

24 Oct 2014, 04:19

I see Aljar's comments. People love to write textbooks saying that this is what you should do: specify a perfect model with the implication that your error term is nothing but pure noise. It's good advice. I've never been able to match it.
Ines Simac

Join Date: Apr 2014

Posts: 10
#5

24 Oct 2014, 04:34

Well, this method is commonly used in the accounting literature, where they try to split "total accruals" into "normal accruals" and "abnormal accruals", with the "abnormal accruals" being the residuals. Afterwards, they test an additional set of variables (which are definitely not uncorrelated with the first-step variables) on those residuals. Sometimes it looks as if it is arbitrarily chosen which variables go into the first step and which into the second, making me suspect that the final t-tests are not really reliable. Maybe my choice of "constructed" is somewhat inappropriate in this context, but I hope you get my point about why I find this residual a bit strange...?

By the way, putting everything in one regression was also the point I was hinting at. Wouldn't it indeed be more "statistically correct" to put all the variables in one regression?

Some quick numerical analyses (e.g. with a y and 5 x-regressors) showed me that the significance and signs of some variables change when doing a two-step rather than a one-step regression, especially when you change the correlations among the variables. Even the split matters: a first step regressing y on the first 2 x's followed by a second step regressing the residuals on the final 3 x's gives results that differ substantially from a first step with the first 3 x's and a second step with the remaining 2 x's. This suggests that even a design choice can significantly affect your results.
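
For anyone who wants to reproduce this kind of check, here is a minimal sketch in Python with NumPy (not Stata). It is not my original analysis: the sample size, the serial correlation of 0.7 between neighbouring regressors, the true coefficients of 1 and the helper names `ols`, `residuals` and `two_step` are all assumptions made up for illustration.

```python
import numpy as np

def ols(y, X):
    """OLS of y on X plus an intercept; returns the slope coefficients only."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

def residuals(y, X):
    """Residuals from OLS of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def two_step(y, X, first, second):
    """Regress y on columns `first`, then the residuals on columns `second`."""
    e = residuals(y, X[:, first])
    return ols(e, X[:, second])

# Hypothetical data: 5 regressors, each correlated with its neighbour.
rng = np.random.default_rng(0)
n, k, rho = 2000, 5, 0.7
X = np.empty((n, k))
X[:, 0] = rng.standard_normal(n)
for j in range(1, k):
    X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y = 1.0 + X.sum(axis=1) + rng.standard_normal(n)  # every true slope is 1

one_step = ols(y, X)                         # x1..x5 together
split_a = two_step(y, X, [0, 1], [2, 3, 4])  # x1,x2 first; then x3,x4,x5
split_b = two_step(y, X, [0, 1, 2], [3, 4])  # x1,x2,x3 first; then x4,x5

print("one step, x1..x5:", np.round(one_step, 2))
print("split A,  x3..x5:", np.round(split_a, 2))
print("split B,  x4,x5 :", np.round(split_b, 2))
```

Under these assumptions the coefficient on the same variable (x4, say) depends on which variables were assigned to the first step, even though the data are identical.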

That's why it surprises me that these two-step regressions are still commonly used in that literature, when simply higher correlations or even a design choice can have big effects on your results...
Nick Cox

Join Date: Mar 2014

Posts: 35708
#6

24 Oct 2014, 04:57

It's not clear what advice you want. At one extreme, this is what people do in your field, and you may be obliged to do it, but watch out. At another extreme, you are evidently in good company in regarding it as a tricky method, so don't do it then.

You are absolutely right that applying a regression to residuals from the first will in practice give a different answer from putting everything together. One, in that circumstance, there is no way the second regression knows about the first. Two, it's only under Utopian conditions that you would expect anything else. Three, degrees of freedom are different anyway, although that may be trivial. There is probably more to add on why here.
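
The second point can be sketched algebraically. The notation here is an assumption for illustration, not taken from any of the papers: X holds the first-step regressors including a constant, Z the second-step regressors, M_X = I - X(X'X)^{-1}X' is the matrix that produces residuals from a regression on X, M_1 is the matrix that subtracts the mean, and the second step is assumed to include an intercept.

```latex
\begin{align*}
% One-step coefficient on Z, by the Frisch--Waugh--Lovell theorem
\hat{\gamma} &= (Z' M_X Z)^{-1} Z' M_X y \\
% Two-step coefficient: regress e = M_X y on Z with an intercept,
% using M_1 M_X = M_X because X contains a constant
\tilde{\gamma} &= (Z' M_1 Z)^{-1} Z' M_1 M_X y
               = (Z' M_1 Z)^{-1} (Z' M_X Z)\, \hat{\gamma} \\
% Special case of a single second-step variable z
\tilde{\gamma} &= \bigl(1 - R^2_{z \mid X}\bigr)\, \hat{\gamma}
\end{align*}
```

Here R^2_{z|X} is the R-squared from regressing z on X. So with a single z the two-step coefficient is the one-step coefficient shrunk by the share of variation in z that X does not explain, and the two agree only when z is exactly uncorrelated with X in the sample.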

Frederick Mosteller and John Tukey wrote a splendidly quirky book Data analysis and regression (Addison-Wesley, Reading, MA) in 1977. It pushes the idea that fitting simple models, and then working with the residuals from those models, is not odd, it's a way of thinking about the whole regression process.

I think the problem here is thinking there's statistically right and wrong. There's more right and more wrong, in the view of anyone confident enough to give a view, but data analysis away from the texts is grey and dirty, not black and white. Also, if you want two different opinions, ask two different statisticians (or statistically-minded researchers, as I'm not a statistician).

Last edited by Nick Cox; 24 Oct 2014, 05:07.
Aljar Meesters

Join Date: Apr 2014

Posts: 30
#7

24 Oct 2014, 05:13

Originally posted by Nick Cox

I see Aljar's comments. People love to write textbooks saying that this is what you should do: specify a perfect model with the implication that your error term is nothing but pure noise. It's good advice. I've never been able to match it.

I agree with you: specifying a correct model is in practice impossible. Yet knowing that certain factors have an influence, deliberately leaving them out, and later using them as explanatory variables for the residuals is at the other end of the spectrum, I would say.
Aljar Meesters

Join Date: Apr 2014

Posts: 30
#8

24 Oct 2014, 05:16

Originally posted by Ines Simac

By the way, putting all in one-regression was also the point I was kind of hinting at. Wouldn't it indeed be more "statistically correct" if you put all the variables in one regression?

Yes, you will indeed get more "statistically correct" estimates. Of course, the story about whether the effects of some variables are "normal" or "abnormal" then becomes less clear, but if you have a theory for each variable, you at least measure the effect in a proper way.
Nick Cox

Join Date: Mar 2014

Posts: 35708
#9

24 Oct 2014, 05:39

Aljar: We're close. I think the difference between (a) prior adjustment by simple arithmetic (dividing by population size, or whatever) and (b) prior adjustment by a previous regression is not that great in many cases. People just get rather theological about this, to the extent of identifying states of sin.

In fact, the first (a) can do quite as much harm, but people don't complain about it statistically because you aren't in the territory of specification, estimation, inference. For example, there can be a division of labour in which (e.g.) applied economists are supposed to measure the right variables and then econometricians lay down the rules on how to model those data, but treat the measurement and definition as somebody else's problem.
