  • How to solve the issue of collinearity among categorical variables?

    Hello,

    I have two categorical variables X1 and X2 in my regression. Stata drops one of the categories of X1 when I include X2 in the regression. I am not referring to the base category that is automatically dropped; an additional category is dropped due to collinearity. How can I get around this issue? Would it be acceptable, in research, to drop some observations from the highly collinear category, or to drop that particular category from my dataset? Would doing so be defensible? Are there any other solutions?

    Thank you
    Last edited by Laiy Kho; 16 Dec 2022, 13:49.

  • #2
    With no details given, it is impossible to provide specific advice. Here are a few generic considerations:

    First, is this even a problem? If X1 and X2 are not explanatory variables of interest, but are merely put into the regression to adjust for their potential confounding effects (so-called "control variables"), then the collinearity between X1 and X2 is irrelevant. Ignore it, do nothing, and move on.

    You need to arrive at an understanding of just how the collinearity between X1 and X2 works. I think the most commonly encountered situation is where X2's levels define mutually exclusive and exhaustive subsets of the levels of X1. A good example is if X1 denotes a series of years, and X2 is a dichotomous variable defining some era of time. (E.g., X1 designates years 2000-2020, and X2 distinguishes pre-2008 from 2008 and after.) If your situation is like that, removing observations with the "highly collinear category" won't work, because there is no single level of year that is highly collinear. They are all equally collinear, and the one Stata chose to omit is just arbitrary. If you remove that year's observations from the data, you will still have collinearity, and Stata will omit yet another of the year indicators.
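
    To make that concrete, here is a minimal sketch of that kind of setup in Stata, using made-up variable names (y, year, era) rather than anything from the original post:

        * hypothetical data: 21 years (2000-2020), 100 observations per year
        clear
        set obs 2100
        set seed 12345
        generate year = 2000 + floor((_n - 1)/100)
        generate era = (year >= 2008)
        generate y = 0.5*(year - 2000) + 2*era + rnormal()

        * era is an exact linear combination of the year indicators, so Stata
        * omits one additional indicator (beyond the base) due to collinearity
        regress y i.year i.era

        * the cross-tabulation shows the structure: every year falls entirely
        * within one era, so the year indicators predict era exactly
        tabulate year era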

    But there are other collinearities that really do arise because one particular level of one variable is readily predicted exactly from the values of the other. In that case, either eliminating those observations or redefining the variable so as to break the collinearity can be workable.
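
    As a hypothetical sketch of the redefinition approach (none of these variable names come from the original post), suppose x2 == 1 occurs only when x1 == 5 and vice versa, so the indicator for x1 == 5 exactly duplicates the indicator for x2 == 1. Merging that level of x1 into a neighbouring level breaks the exact relationship, at the cost of a coarser x1:

        * assumes a dataset with outcome y and categorical x1, x2 as described
        * inspect the cross-tabulation first to confirm the pattern
        tabulate x1 x2

        * merge the offending level of x1 into a neighbouring level; the merged
        * level now spans both values of x2, so the exact collinearity is gone
        generate x1_new = x1
        replace x1_new = 4 if x1 == 5
        regress y i.x1_new i.x2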

    If both variables are key explanatory variables of interest, then you have a dilemma, because it is a mathematical impossibility to estimate their separate effects in a single model. So you may be forced either to leave one of them out or to redefine one so as to make it non-collinear. Either way, it means that you are unable to achieve your original research goals and must modify them. Another alternative, available if the collinearity is not inherent in the definitions of the variables but merely arose for some reason in the data set you are using, is to get different data. This last approach is often impractical, but in the toughest cases it may be the only way to actually achieve the research goals.
