Assigning the coefficients of a collinear dummy variable kept in the model to the omitted dummy variables

Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#1

16 Jan 2019, 09:45

I am doing rolling regressions with many dummy variables and retrieving the coefficients of those dummy variables at each iteration (using rangerun). Some of the dummies are collinear in some ranges, so no coefficient is produced for them. I would like to assign the coefficient of the variable kept in the model to those omitted for further processing (different models). That requires finding out, at each step, which collinear variable was kept in the model and which variables were omitted.

I tried to use _rmdcoll but did not get anywhere. It produces a message along the lines of "dummy A1 is collinear with A2, A3, A4, A5", but it does not say which one is kept and which ones are omitted. Even stranger, some of the variables identified as collinear by _rmdcoll (e.g. A3 and A5) have coefficients estimated. I would appreciate any guidance on how to even start tackling this issue. Thanks
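One possible starting point for telling kept and omitted variables apart, as a minimal sketch only: after regress, the column names of e(b) mark omitted regressors with an "o." prefix (assuming Stata 11 or later). The toy data and the local macro names kept and omitted are made up for illustration, and the prefix test below handles plain variables only, not factor-variable terms.

```stata
* Toy data: dum3 is an exact copy of dum2, so one of them must be omitted
clear
set obs 60
set seed 1234
generate Y    = runiform()
generate dum1 = (_n <= 30)
generate dum2 = (_n > 30)
generate dum3 = dum2

regress Y dum1 dum2 dum3, noconstant

* Omitted regressors appear in e(b) as o.varname
local names : colnames e(b)
local kept
local omitted
foreach v of local names {
    if substr("`v'", 1, 2) == "o." {
        local name = substr("`v'", 3, .)   // strip the "o." prefix
        local omitted `omitted' `name'
    }
    else {
        local kept `kept' `v'
    }
}
display "kept:    `kept'"
display "omitted: `omitted'"
```

The same loop could be placed inside the program that rangerun calls, so that the two lists are recorded for every range.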
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

17 Jan 2019, 11:07

You didn't get a quick answer.

Please look at the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex. Set up a small data set that illustrates the problem and shows us what you want. Your problem involves a level of programming that we can't attempt without more information.

I'd start by finding one range that has the problem and try to debug your problem using that subset of the data. That is, run the regression for just that range so you can really look at the data and the parameters in detail. Often, when you run piles of regressions or whatever, it is hard to focus in on problems.
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#3

22 Jan 2020, 16:39

I was reading about the "asif" option in predict. That would have solved my problem. Apparently Stata keeps a record of covariance signatures and allows one to produce a prediction even for an observation that was not included in the estimation of a model like logit. I am impressed! However, that option turned out not to be available for the estimation procedure I was using.
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#4

22 Jan 2020, 17:02

I will admit that I don't know how I would go about doing this. But I will say that the whole enterprise sounds misguided.

Let's review what collinearity is about. Suppose variables A1, A2, and A3 are collinear. That means that there is some relationship of the form c1A1 + c2A2 + c3A3 = 0, with c1, c2, and c3 not all zero. When that occurs, the results of a regression including A1, A2, and A3 are indeterminate: you could assign any arbitrary value to the coefficient of A1 in the regression and then adjust the values of A2 and A3's coefficients to make things come out right.* This is often summarized by saying that the regression model is unidentified. To identify the model, and hence ensure a unique set of regression coefficients, it is necessary to disrupt the collinearity relationship in some way. One way of doing this, and the most commonly used, is to omit one or more of the variables until the collinearity disappears. Now, omitting a variable is exactly the same thing as constraining its coefficient in the regression to be zero. So another way of thinking about Stata's omitting variables is that it is imposing a constraint on those variables that their coefficients be zero. Consequently, the coefficients that are obtained for whatever variable(s) remain unomitted are conditional on the assumption that the coefficient(s) of the omitted variable(s) is (are) zero. So imputing some other value to those omitted coefficients inherently invalidates the coefficient(s) of the unomitted variable(s). The set of coefficients you would end up with would not be a proper regression fit to the data at all. If you are going to impute some coefficient to the omitted variable, the only one that makes any sense to impute is zero.

It's important to also understand the implications of this. When a regression model starts out with a collinear set of variables, the results obtained for the non-omitted variables in the collinear set depend on the choice of which variable was omitted, so none of those coefficients has any meaning on its own. Those coefficients can only be interpreted as representing effects relative to the omitted variable(s). They do not represent the effects of those variables in the absolute sense. So if you have the same regression model applied to two data sets, and one contains a collinearity that is absent in the other, you cannot make any direct comparison between the coefficient of one of the collinearity-involved variables in one data set and the corresponding coefficient in the other. They do not represent the same thing: one is an absolute coefficient, the other is relative to an omitted variable. So any attempt to, for example, compare such coefficients (i.e. test them for equality, or calculate their sum or difference) would be nonsense.
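A small sketch of this point, assuming the auto.dta data set that ships with Stata (the variable names xb_base1 and xb_base5 are made up for illustration): the coefficients on the rep78 indicators change when a different level is omitted, while the fitted values should not.

```stata
sysuse auto, clear

* Level 1 of rep78 is the omitted (base) category
regress price ib1.rep78
predict xb_base1, xb

* Same model, but level 5 is the omitted category instead
regress price ib5.rep78
predict xb_base5, xb

* The coefficient tables differ; the fitted values can be compared here
summarize xb_base1 xb_base5
compare xb_base1 xb_base5
```

Each coefficient in either table is a difference from whichever level was left out, which is why the two sets of numbers are not directly comparable.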

It is even stranger that some of the variables identified as collinear by _rmdcoll (e.g. A3 and A5) have coefficients estimated.

No, there's nothing strange about that. Stata only removes from the set of collinear variables the minimum number necessary to break the collinearity. Consider a simple example: if you have a five-level categorical variable (like rep78 in the built-in auto.dta), it is represented in regressions by indicator ("dummy") variables for the five levels the variable takes on. Let's call those variables dum1, dum2, dum3, dum4, and dum5. Since one, and only one, of those variables is 1 in every observation and the others are all zero, we have the relationship: dum1 + dum2 + dum3 + dum4 + dum5 = 1 in every observation. If the regression has a constant term (as nearly all do), then we get the collinearity: 1*dum1 + 1*dum2 + 1*dum3 + 1*dum4 + 1*dum5 - 1*_cons = 0. To break this collinearity, all you have to do is remove one of those variables. This can be done by running the regression without a constant term and retaining all five indicators, or it can be done by removing any one of the dum* variables from the model. In either case, from among the 6 collinear variables, _cons, dum1, dum2, dum3, dum4, and dum5, one will be omitted and five will remain.
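The rep78 example can be tried directly. This is a minimal sketch assuming auto.dta as shipped with Stata; the stub name dum is chosen to match the names used above, and price is picked arbitrarily as the outcome.

```stata
sysuse auto, clear

* Create the five indicators dum1-dum5 from rep78
tabulate rep78, generate(dum)

* With a constant, the six terms are collinear: Stata omits one indicator
regress price dum1 dum2 dum3 dum4 dum5

* Without a constant, all five indicators can stay in the model
regress price dum1 dum2 dum3 dum4 dum5, noconstant
```

In both runs five of the six terms receive coefficients, even though all six were involved in the collinearity.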

*Added: To see this clearly in a simple example, suppose that there are two variables A1 and A2 and they have exactly the same values in every observation, i.e. A1 = A2. Then they are collinear because 1*A1 - 1*A2 = 0. Now, suppose that you ran a regression with A1 but not A2 as a predictor variable. Assuming that no other variables are collinear with A1, you will get some regression coefficient for A1. Let's call it b. It is clear that if you re-run the regression including A2 but not A1, the coefficient of A2 will also be b. But we can go further than this. Pick any arbitrary number C. We can put A1 and A2 in the model together and impose the constraint that the coefficient of A1 must be C. This will work, because the regression will simply assign b-C as the coefficient of A2 and everything works out just as it would in a model with A1 alone and having coefficient b. So the coefficient of A1 is completely arbitrary: it could be anything, and the coefficient of A2 will adjust accordingly. The mathematics is clear and simple in this simple example; with more variables or a more complicated form of collinearity it takes a little more work to figure it out, but the result is the same. Whenever you have a set of collinear variables, the coefficients of those variables are indeterminate, and, in fact, you can stipulate any arbitrary value for one of them by just adjusting the coefficients of the others.
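The A1 = A2 argument can be checked numerically. This is only a sketch on simulated data: the data-generating values, the choice C = 5, and the name y_adj are all arbitrary assumptions. Fixing the coefficient of A1 at C is done here by subtracting C*A1 from the outcome before regressing on A2, which is equivalent to imposing that constraint.

```stata
clear
set obs 100
set seed 2020
generate double x  = rnormal()
generate double A1 = x
generate double A2 = x                 // identical to A1, hence collinear
generate double y  = 1 + 2*A1 + rnormal()

* A1 alone: its coefficient is the b of the text
regress y A1
scalar b = _b[A1]

* Force the coefficient of A1 to be C, then let A2 absorb the rest
local C = 5
generate double y_adj = y - `C'*A1
regress y_adj A2

display "b - C             = " b - `C'
display "coefficient on A2 = " _b[A2]
```

Changing C to any other number shifts the coefficient on A2 by the same amount in the opposite direction, while the fitted values for y stay the same.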

Last edited by Clyde Schechter; 22 Jan 2020, 17:10.
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#5

23 Jan 2020, 15:16

So basically some previously collinear variables might end up having a coefficient because after the omission they were no longer collinear. As for the enterprise being misguided, all I am trying to do is to assign a coefficient to dum1, which is omitted and was collinear with dum2, because their coefficients should be identical.
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#6

23 Jan 2020, 17:10

all I am trying to do is to assign a coefficient to dum1 which is omitted and was collinear with dum2 because their coefficients should be identical

No, no, no, no, no!!! Please re-read what I wrote in #4. The coefficients of dum1 and dum2 should not be identical. There is no reason to think that they should be and every reason to think they should not. The only reasonable assignment of a coefficient to an omitted variable is zero.
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#7

23 Jan 2020, 19:11

I should not have said "identical"; in my case they are about the same. I have a reason to believe the coefficient values are about the same, so I prefer being a little off in prediction to having no prediction at all. Below is an extreme case where dum3 = dum2; what I have is close to that. Sorry about skipping over problem-specific details.

clear
set obs 60
generate Y = runiform()
gen dum1 =1
gen dum2 =1
replace dum1 =0 if _n>30
replace dum2 =0 if _n<31
gen dum3 =dum2

reg Y dum1 dum2 dum3, nocons
reg Y dum1 dum3, nocons
Clyde Schechter

Join Date: Apr 2014

Posts: 30358
#8

23 Jan 2020, 21:37

Below is an extreme case where dum3 =dum2 but what I have is close to that. Sorry about skipping over problem specific details.

No, you are still very, very wrong. In fact your own example is a great illustration of just how wrong it is.

Let's see how the predicted values work out with dum2 omitted, and how they work out when we assign the coefficient of dum3 to dum2:

Code:

clear
set obs 60
set seed 1234
generate Y = runiform()
gen dum1 =1
gen dum2 =1
replace dum1 =0 if _n>30
replace dum2 =0 if _n<31
gen dum3 =dum2

reg Y dum1 dum2 dum3, nocons
reg Y dum1 dum3, nocons

predict actual_xb, xb
label var actual_xb "Based on dum1 dum3"
predict ozfidan_xb, xb
replace ozfidan_xb = ozfidan_xb + _b[dum3] * dum2
label var ozfidan_xb "With dum2 coef = dum3 coef"

summ Y *_xb

And, here's the final output:

Code:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           Y |         60    .5177649    .3379132   .0153918   .9874877
   actual_xb |         60    .5177649    .0122495   .5056179   .5299119
  ozfidan_xb |         60    .7705738    .2426929   .5299119   1.011236

As you can see, the means of actual_xb and Y are exactly the same, as should be the case with a regression. But look at the mean of ozfidan_xb: it's way, way off (.7705738 against .5177649). You have introduced an enormous bias, close to 50%, by doing this. With the coefficients tampered with in the way you are trying to do, the regression is no longer a regression in any meaningful sense of the word.