  • "omitted because of collinearity"

    While running regressions in Stata, I often find that a variable is dropped due to collinearity; in the model results, that variable shows a coefficient of 0. When I check the correlations between the variables in the model, the omitted variable does not have a very high correlation with any of the other variables. Any explanation for this phenomenon?
    Emily Albom

  • #2
    Show exactly what you typed in Stata, and exactly what Stata returned to you.

    • #3
      ...and please act on https://www.statalist.org/forums/help#realnames. Thanks.
      Kind regards,
      Carlo
      (Stata 19.0)

      • #4
        tl;dr: en.wikipedia.org/w/index.php?title=Collinearity_(statistics)

        Suppose your independent variables are named x1, x2, ..., x10 and suppose
        Code:
        regress y x1-x10
        results in x10 being dropped from the model for collinearity.

        It doesn't matter that x10 is not highly correlated with any individual variable x1 through x9. If you
        Code:
        regress x10 x1-x9
        you will find that x10 is predicted perfectly by x1 through x9: the definition of collinearity, which the Wikipedia article tells us is also known as multicollinearity.
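
        As a quick numeric check (a sketch; e(r2) is the R-squared that -regress- stores after estimation):
        Code:
        regress x10 x1-x9
        display e(r2)  // 1 (up to rounding) when x10 is exactly collinear with x1-x9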
        Last edited by William Lisowski; 23 Aug 2021, 08:14.

        • #5
          Here is an example of what William described in #4. HTH.

          Code:
          // Several years ago, Jerry Dallal posted the following dataset
          // to one of the sci.stat.* usenet groups to illustrate that
          // one can have complete linear dependence despite the absence
          // of any high pairwise correlations.
          
          clear *
          input x1 x2 x3 y
          18 88 106 13
          72 45 117 43
          36 63 99 50
          75 26 101 77
          22 83 105 23
          99 71 170 68
          69 53 122 6
          6 49 55 51
          86 99 185 37
          85 64 149 10
          87 7 94 32
          93 32 125 69
          44 88 132 4
          34 34 68 13
          84 28 112 18
          end
          pwcorr x1-y
          regress y x1 x2 x3
          regress x3 x2 x1
          
          * x3 = x1 + x2
          generate new = x1 + x2
          generate diff = x3-new
          summarize new x3 diff
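
          If you run this, -summarize- shows that diff is 0 for every observation: x3 is exactly x1 + x2, even though -pwcorr- reports no high pairwise correlations among x1, x2, and x3.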
          --
          Bruce Weaver
          Email: [email protected]
          Version: Stata/MP 18.5 (Windows)

          • #6
            When I check the correlations between the variables in the model, the omitted variable does not have a very high correlation with any of the other variables.
            A correlation matrix among the independent variables can only help when two variables alone are collinear. But collinearity is not limited to two variables; it can involve three or more. Two "newbie mistakes" that often cause this error are:

            1.) Manually creating many binary indicators (also called "dummies") for a single categorical variable, and then putting all of them into the model:

            Code:
            webuse nhanes2f, clear
            tab region, gen(r)
            
            * No perfect correlation
            pwcorr r1 r2 r3 r4
            
            reg weight r1 r2 r3 r4
            Results:
            Code:
            . * No perfect correlation
            . pwcorr r1 r2 r3 r4
            
                         |       r1       r2       r3       r4
            -------------+------------------------------------
                      r1 |   1.0000
                      r2 |  -0.3044   1.0000
                      r3 |  -0.3104  -0.3738   1.0000
                      r4 |  -0.2933  -0.3532  -0.3602   1.0000
            
            .
            . reg weight r1 r2 r3 r4
            note: r1 omitted because of collinearity
            
                  Source |       SS           df       MS      Number of obs   =    10,337
            -------------+----------------------------------   F(3, 10333)     =      0.39
                   Model |  277.312703         3  92.4375678   Prob > F        =    0.7588
                Residual |  2436749.94    10,333  235.822117   R-squared       =    0.0001
            -------------+----------------------------------   Adj R-squared   =   -0.0002
                   Total |  2437027.25    10,336    235.7805   Root MSE        =    15.357
            
            ------------------------------------------------------------------------------
                  weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                      r1 |          0  (omitted)
                      r2 |   .3998743   .4450754     0.90   0.369    -.4725598    1.272308
                      r3 |   .3804926   .4423884     0.86   0.390    -.4866743     1.24766
                      r4 |   .1325042   .4504297     0.29   0.769    -.7504252    1.015434
                   _cons |   71.65495    .336229   213.11   0.000     70.99587    72.31402
            ------------------------------------------------------------------------------
            Instead, using i. to specify the categorical variable would be a better alternative, as sketched below; run -help fvvarlist- to learn more.
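
            For instance, a minimal sketch with the same data (output omitted); Stata creates the indicators on the fly and omits the base level automatically:
            Code:
            webuse nhanes2f, clear
            * factor-variable notation: no manual dummies, no collinearity note
            reg weight i.region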

            2.) Not realizing that the sum of a group of variables equals a constant:

            This is actually just an extension of the above (notice that r1 + r2 + r3 + r4 = 1), but applied to continuous variables. For example, in a time-use study, total hours of sleep, awake active time, and awake sedentary time add up to 24 hours. If we put all three into the model as independent variables, an error like the one above will occur as well; see the sketch below.
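
            A minimal sketch with simulated data (the variable names, ranges, and outcome here are made up purely for illustration):
            Code:
            clear
            set obs 100
            set seed 2021
            generate sleep = runiform(4, 10)          // hours asleep
            generate active = runiform(2, 8)          // hours awake and active
            generate sedentary = 24 - sleep - active  // the rest of the day
            generate y = rnormal()
            * no pairwise correlation is perfect, yet the three sum to 24,
            * so -regress- will omit one of them for collinearity
            pwcorr sleep active sedentary
            regress y sleep active sedentary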
            Last edited by Ken Chui; 23 Aug 2021, 08:37.
