  • Decide which variables to omit in an OLS regression

    hi,

    is there a way to create something like a ranking of the variables that Stata decides to omit in an OLS regression with perfect multicollinearity?

    thanks,
    Patrick
    Last edited by Patrick Balles; 19 Jan 2015, 10:36.

  • #2
    Patrick:
    unfortunately, this usually is a researcher's task (no white magic to avoid it, as far as I know).
    I don't know whether -help stepwise- may be another answer to your question but, if it were, please consider that, in general, it has more cons than pros (please see http://www.statalist.org/forums/foru...ise-regression).
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      If a variable is perfectly collinear with another variable, statistically it does not make a difference which of the two you exclude from the regression model, as the resulting fit will be identical. You might want to elaborate on your specific problem and what you are trying to achieve.

      I have no answer to the general question. Stata uses different commands internally to omit collinear variables, and they might lead to different answers as to which of the variables is excluded.
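
      To see in advance which variable Stata would flag, a sketch using the programming command -_rmcoll- on the auto data may help (weight2 is a deliberately constructed duplicate, added here only for illustration):

      ```stata
      * build a perfectly collinear pair and ask Stata which variable it would omit
      sysuse auto, clear
      generate weight2 = weight          // exact copy of weight
      _rmcoll price weight weight2
      display "`r(varlist)'"             // r(varlist) shows the list after collinearity handling
      ```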

      Best
      Daniel



      • #4
        The closest I can come to a solution to your problem (apart from what Carlo and Daniel said) is to run regressions of each independent variable upon the others, with the ",be" option set. If there is one variable, or a small set of variables, that creates the collinearity problem for numerous other variables, then ditch the fewest possible. This presupposes that every variable is equally important to you on theoretical grounds, and that including more rather than fewer is preferable.
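
        A minimal sketch of these auxiliary regressions (x1-x3 and y are hypothetical variable names): an R-squared near 1 in an auxiliary regression flags the offending variable, and -estat vif- reports the corresponding variance inflation factors after the main regression.

        ```stata
        * auxiliary regressions: each predictor on the remaining ones
        quietly regress x1 x2 x3
        display "x1: R2 = " e(r2)        // near 1 => x1 is (nearly) collinear
        quietly regress x2 x1 x3
        display "x2: R2 = " e(r2)
        * variance inflation factors, 1/(1-R2), after the main model
        quietly regress y x1 x2 x3
        estat vif
        ```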

        One important thing to note: you don't say whether it's dummy-variable regression or all continuous. If Stata is kicking out one dummy each from sets of dummies, then you don't gain anything from running the regressions I suggested. In this case, pick a contrast that is most meaningful. For example, with race coded as "black, white, other," excluding black or white is usually best, since the meaning of other is fuzzy and the contrast between blacks and whites is generally the contrast of interest.

        In either case, you should keep theoretical interest in mind; my first approach assumes all variables are of equal substantive importance. However, if you have one that is very important substantively, it might be worth sacrificing several that are just in there as "control" variables to keep the substantively important one(s).



        • #5
          Hello, Patrick,

          As Carlo, Daniel and Ben already pointed out, you're talking about selecting the best model, and that involves scientific rationale and plausibility.

          Just to add a few notes on the matter: if I were to choose among variables with multicollinearity,

          - I'd rather drop the one with too many outliers, because OLS regression is at its best when the selected variables are not too skewed.
          - I'd stick to the variable without missing values.
          - In clinical research, for example, I'd choose the variable that gives more information to the clinician, or the one easier to apply in practice, or the one with the highest accuracy and reliability.
          - I'd compare the results of an F-test for the joint effect of two variables with t-tests on each of the two.
          - I'd check whether the collinearity involves the "main" predictor variable. If not, I would consider the variable a potential confounder and..., well... keep it.
          - Checking and comparing changes in the predicted values from different models will surely be helpful.
          - Think about centering a non-transformed variable and afterwards adding a quadratic term, rather than adding both before centering.
          - I'd also think about excluding variables with non-significant p-values, particularly if there is no strong rationale for including them in the model.
          - Keep an eye on changes in the coefficients during the modeling.
          - All in all, I'd use postestimation information criteria (such as AIC and BIC) to check whether a given model really "shows up".
          - Of course, life should sometimes get simple: I'd try to reach a parsimonious model whenever possible.
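
          The AIC/BIC check mentioned above can be sketched as follows (using the auto data purely as a stand-in for your own models):

          ```stata
          sysuse auto, clear
          quietly regress price mpg weight
          estat ic                         // AIC and BIC for the smaller model
          quietly regress price mpg weight foreign
          estat ic                         // compare: lower AIC/BIC is preferred
          ```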

          In short, "no secret", but hard work and reflection.

          Hopefully it helps!

          Best regards,

          Marcos



          • #6
            Thanks Carlo, Daniel, Ben and Marcos for your kind responses! I should have mentioned that my regression includes no continuous variables, only dummies.

            Actually, I'm dealing with 3 sets of dummy variables. The dependent variable is the bilateral trade flow at industry level; the explanatory variables are exporter-importer, importer-industry and exporter-industry dummies for all combinations of the considered countries and industries. The main interest is to identify the coefficients on the exporter-industry fixed effects. Unfortunately, Stata drops some of them before the regression due to collinearity.

            I was thinking about a way to keep as many dummies of interest as possible, telling Stata something like "drop as many as possible from the first 2 sets of dummies" or "keep as many exporter-industry dummies as possible".

            Thanks a lot!
            Patrick



            • #7
              I would start by staring at a lot of cross tabulations until I figured out why these are dropped, and then make an informed decision on which to include and which to omit. It is hard and very frustrating work, but this is about what makes sense, and computers don't know and cannot know what makes sense. On the bright side, this is why we still have jobs... More seriously, if you make those decisions yourself after hard work, you really understand your own model and are much better at communicating the results to your audience.
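
              A starting point for that inspection (the identifier variable names here are hypothetical) is cross-tabulating the identifiers behind the dummy sets, looking for empty cells:

              ```stata
              * hypothetical identifiers behind the dummy sets
              tabulate exporter industry
              tabulate importer industry    // empty cells hint at combinations
                                            // that cannot be separately identified
              ```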
              ---------------------------------
              Maarten L. Buis
              University of Konstanz
              Department of history and sociology
              box 40
              78457 Konstanz
              Germany
              http://www.maartenbuis.nl
              ---------------------------------



              • #8
                Hello, Patrick,

                You said you have only binary variables as predictors. Ok, I see your point. In spite of that, I still believe some of the items I mentioned also apply to the process of dealing with binary variables and modeling.

                Best regards,

                Marcos



                • #9
                  Given the substantive nature of your question, is it possible to "coarsen" your definition of industr(ies)? It sounds like you likely have certain industries with a single importer and/or exporter. You might do this in a simple and fairly defensible way by moving up a level in the SIC/NAICS/whatever codes. Alternatively, you could look at things carefully and manually combine industries.

                  If you're using NAICS, moving up a digit is pretty straightforward, and seems quite defensible. Manually tweaking it could be messy.

                  Manual or not, you'd be losing information in a way, but gaining it in another, since at least you wouldn't have all the small industries lumped together into the excluded category as you do now.



                  • #10
                    Thanks, again.

                    Maarten, I have created a simple example and tried to understand which dummies are dropped by Stata and why. This works very well for the easy case, but I'm not yet able to replicate it for the real case, where I have over 5,000 dummies. What I observed is that the order in which the regressors are entered in Stata's OLS regression command apparently matters for the "omitting procedure". It seems that Stata tries to keep as many of the regressors typed in first as possible. Is this true?
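
                    Here is a small made-up example of what I mean (in my tests Stata seemed to drop the later variable of a perfectly collinear pair, but I don't know whether this is guaranteed behaviour):

                    ```stata
                    clear
                    set obs 100
                    generate d1 = _n <= 50
                    generate d2 = 1 - d1    // d1 + d2 = 1: collinear with the constant
                    generate y  = rnormal()
                    regress y d1 d2         // one dummy is omitted
                    regress y d2 d1         // reordering changes which one
                    ```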

                    Ben, I'm using ISIC classification and I'm already using two digit codes with 12 industries in total. So, for each industry, there is a large number of importers/exporters.



                    • #11
                      Dear Patrick,

                      Sorry to arrive late to this party! I can totally understand your problem because I often struggle with it.

                      Anyway, I want to reiterate what Maarten said: you have to think carefully about which dummies to drop and which to keep. Additionally, you need to keep in mind that the interpretation of the coefficients you estimate will depend critically on the dummies you drop. That is, your work does not finish when you manage to find a model that keeps the variables you want; you then need to see which variables were dropped to be able to interpret your results.

                      Best of luck with it,

                      Joao

