
  • Dropping variables in regression due to collinearity

    Dear Statalisters,

    I've encountered an interesting result when running the following two regressions:

    Code:
    regress y var1 var2, r
    regress y var1* var2*, r
    where var1* is the growth rate of var1 and var2* is the growth rate of var2, defined by var1* = (var1/var1[_n-1]) - 1.
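
    To be explicit about that definition, here is a rough sketch of how such growth variables might be generated; the names gr_var1, gr_var2, id, and year are illustrative stand-ins, not the actual names in my data:

    Code:
    * illustrative sketch only: growth rate as defined above, within panel id
    bysort id (year): generate gr_var1 = var1/var1[_n-1] - 1
    bysort id (year): generate gr_var2 = var2/var2[_n-1] - 1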

    var1 and var2 are related as follows: var1 = var3 - var4 and var2 = var3 - var5.

    When I run the regressions above, Stata drops var1* in the second regression, but it does not drop var1 or var2 in the first. I understand that Stata drops one of the variables that contain the same information because of collinearity, but I don't understand why a variable is dropped only in the second regression.

    Does anyone have any insights into why Stata is behaving in this way? Thank you so much in advance!

    Regards,
    Hee Sung

  • #2
    1. var1* is legal Stata syntax for any (all!) variables whose names begin with var1. But you would not be allowed to use the asterisk * in an individual variable name, precisely because of this syntax.

    2. "Drop" here is better phrased as "omit".

    2. is just a matter of phrasing, but 1. could be part of your problem here. I find it hard to make sense of the question because it's not clear exactly what you typed, nor is there a data-based example.
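
    To illustrate point 1, here is a small sketch with made-up variable names showing how the wildcard expands and why an asterisk cannot appear in a variable name:

    Code:
    * var1* in a varlist matches every variable whose name starts with var1
    ds var1*                 // lists, e.g., var1, var1a, var1growth if they exist
    regress y var1*          // regresses y on all of those variables at once
    generate var1* = 1       // error: * is not allowed in a variable name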



    • #3
      Cross-posted at http://stats.stackexchange.com/quest...ticollinearity

      Please note our cross-posting policy, which is that you are asked to tell us about it:

      http://www.statalist.org/forums/help#crossposting

      8. May I cross-post to other forums?

      People posting on Statalist may also post the same question on other listservers or in web forums. There is absolutely no rule against doing that.

      But if you do post elsewhere, we ask that you provide cross-references in URL form to searchable archives. That way, people interested in your question can quickly check what has been said elsewhere and avoid posting similar comments. Being open about cross-posting saves everyone time.

      If your question was answered well elsewhere, please post a cross-reference to that answer on Statalist.



      • #4
        Dear Nick,

        I'm sorry for the lack of clarity in my question. I changed the variable names in the example above just to keep it general; they are not the actual names of the variables in the data.

        I agree that I should have used the word "omit" instead of "drop". I couldn't find any way to edit my previous post, so if you would like it to be edited, let me know how I can do that.

        I couldn't post a data example here, nor come up with a minimal working example, because I couldn't reproduce the problem in a general setting.

        Let me try to describe the problem in detail here:

        I have two models, as follows:

        Code:
        regress y x1 x2 x3 x4 ... x13 i.year, r
        regress y v1 v2 v3 v4 ... v13 i.year, r
        where y is measured as a growth rate, and v1 is the growth rate of x1.

        There are two independent variables, x1 and x2, that are linearly correlated, and their relationship is as follows:

        Code:
        x1 = w3 - w4
        x2 = w3 - w5
        Interestingly, I would have thought that the first regression would run into a multicollinearity problem, since its variables are measured in levels and x1 and x2 are linearly related in levels, yet only the second regression omits a variable because of multicollinearity. I believe that all of x1 ... x13 are correlated, but none perfectly. The same goes for the relationship between w4 and w5.

        What could be the reason that Stata omits a variable only in the second regression and not in the first?
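
        One way to see which terms Stata regards as collinear in each specification is sketched below; the variable names are just the placeholders used above, and the year indicators are left out for brevity:

        Code:
        * rough sketch: _rmcoll flags the terms that estimation commands would omit
        _rmcoll x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13
        display "`r(varlist)'"
        _rmcoll v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13
        display "`r(varlist)'"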



        • #5
          After looking into this issue, I've found another interesting property of this problem. I checked whether v1 is a linear combination of all the other variables and, as suspected, it is. However, x1 is ALSO a linear combination of all the other variables, yet Stata has not omitted x1. I suspect this depends on how Stata chooses which collinear variables to omit. Any insights as to why Stata behaves this way?
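
          For reference, the check described above can be sketched as follows (placeholder names again): regress each candidate on the remaining regressors and look at the R-squared, where a value of essentially 1 indicates an exact linear combination.

          Code:
          * rough sketch of the linear-combination check mentioned above
          regress v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 i.year
          display e(r2)
          regress x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 i.year
          display e(r2)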

