Omitted because of collinearity problem

Jean Jacques

Join Date: Sep 2020

Posts: 97
#1


24 Jan 2022, 04:46

Hi guys,

I'm running an areg regression:

Code:

areg y i.country i.gender, absorb(vd3)

where all the independent variables are categorical. This runs and returns results. However, when I add "language" as another independent variable, I get the "omitted because of collinearity" note. Note that "language" is the official language of the country, so although it does not vary within a country, the same language can be spoken in several countries (when a country has more than one official language, I took the one used by the majority, e.g. Belgium -> Dutch, Canada -> English).

First, I wonder whether this is a problem related to my dataset, which is quite big (24 million observations) but perhaps not big enough to deal with 100 possible values of country, more than 40 different languages and 794,914 possible values for vd3.

Second, even though all the levels of language are omitted because of collinearity, I do observe a (small) change in the coefficients of the other variables. For example, the coefficient for Uruguay is 0.164 without language in my model and changes to 0.167 when language is added (but omitted due to collinearity).

Best!
Jean
Tags: areg, collinearity, fixed effects

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

24 Jan 2022, 07:44

Jean:
I tried to reproduce your problem but I was unsuccessful:

Code:

use "https://www.stata-press.com/data/r17/nlswork.dta"

. areg ln_wage i.race age, abs(idcode)
note: 2.race omitted because of collinearity.
note: 3.race omitted because of collinearity.

Linear regression, absorbing indicators            Number of obs     =  28,510
Absorbed variable: idcode                          No. of categories =   4,710
                                                   F(1, 23799)       = 2720.20
                                                   Prob > F          =  0.0000
                                                   R-squared         =  0.6636
                                                   Adj R-squared     =  0.5970
                                                   Root MSE          =  0.3035

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      Black  |          0  (omitted)
      Other  |          0  (omitted)
             |
         age |   .0181349   .0003477    52.16   0.000     .0174534    .0188164
       _cons |   1.148214   .0102579   111.93   0.000     1.128107     1.16832
------------------------------------------------------------------------------
F test of absorbed indicators: F(4709, 23799) = 8.808         Prob > F = 0.000

. g tender=1

. areg ln_wage i.race age i.tender, abs(idcode)
note: 2.race omitted because of collinearity.
note: 3.race omitted because of collinearity.
note: 1.tender omitted because of collinearity.

Linear regression, absorbing indicators            Number of obs     =  28,510
Absorbed variable: idcode                          No. of categories =   4,710
                                                   F(1, 23799)       = 2720.20
                                                   Prob > F          =  0.0000
                                                   R-squared         =  0.6636
                                                   Adj R-squared     =  0.5970
                                                   Root MSE          =  0.3035

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      Black  |          0  (omitted)
      Other  |          0  (omitted)
             |
         age |   .0181349   .0003477    52.16   0.000     .0174534    .0188164
    1.tender |          0  (omitted)
       _cons |   1.148214   .0102579   111.93   0.000     1.128107     1.16832
------------------------------------------------------------------------------
F test of absorbed indicators: F(4709, 23799) = 8.808         Prob > F = 0.000


.

That said, is the number of observations the same in both of your -areg- runs?
This is why the FAQ asks you to post exactly what you typed and exactly what Stata gave you back: it makes everything smoother and avoids posting back and forth. Thanks.

Kind regards,
Carlo
(Stata 19.0)


Jean Jacques

Join Date: Sep 2020
Posts: 97

24 Jan 2022, 08:54

Ciao Carlo!

Thanks for your answer and for trying to reproduce my problem. Indeed the number of observations is the same.
I'll try to summarize my output below. Instead of using "language" I ran it with "continent" (the problem is the same, but as I have 6 continents instead of 40 languages, the output is shorter). I hope this helps.

Code:

. areg score i.guest_country_code, absorb(hotel_name_x_room)

Linear regression, absorbing indicators         Number of obs     = 26,380,460
Absorbed variable: hotel_name_x_room            No. of categories =    794,914
                                                F( 248,25585298)  =     588.77
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1913
                                                Adj R-squared     =     0.1661
                                                Root MSE          =     1.7042

---------------------------------------------------------------------------------------------------------------
                                        score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------------------------------+----------------------------------------------------------------
                           guest_country_code |
                                     Albania  |   .4192879   .0514439     8.15   0.000     .3184596    .5201162
                                     Algeria  |   .0205716   .0518289     0.40   0.691    -.0810111    .1221543
                              American Samoa  |   .1464801   .1515708     0.97   0.334    -.1505933    .4435535
                                     Andorra  |   .0842967   .0567297     1.49   0.137    -.0268915    .1954849
.
.
.
                                        _cons |   8.018476     .04909   163.34   0.000     7.922261     8.11469
---------------------------------------------------------------------------------------------------------------
F test of absorbed indicators: F(794913, 25585298) = 6.954    Prob > F = 0.000

Code:

. areg score i.guest_country_code i.guest_continent_code, absorb(hotel_name_x_room)
note: 2.guest_continent_code omitted because of collinearity
note: 3.guest_continent_code omitted because of collinearity
note: 4.guest_continent_code omitted because of collinearity
note: 5.guest_continent_code omitted because of collinearity
note: 6.guest_continent_code omitted because of collinearity
note: 7.guest_continent_code omitted because of collinearity

Linear regression, absorbing indicators         Number of obs     = 26,380,460
Absorbed variable: hotel_name_x_room            No. of categories =    794,914
                                                F( 248,25585298)  =     588.77
                                                Prob > F          =     0.0000
                                                R-squared         =     0.1913
                                                Adj R-squared     =     0.1661
                                                Root MSE          =     1.7042

---------------------------------------------------------------------------------------------------------------
                                        score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------------------------------+----------------------------------------------------------------
                           guest_country_code |
                                     Albania  |   .4192879   .0514439     8.15   0.000     .3184596    .5201162
                                     Algeria  |   .0205716   .0518289     0.40   0.691    -.0810111    .1221543
                              American Samoa  |   .1464801   .1515708     0.97   0.334    -.1505933    .4435535
                                     Andorra  |   .0842967   .0567297     1.49   0.137    -.0268915    .1954849
.
.
.
                         guest_continent_code |
                               North America  |          0  (omitted)
                                        Asia  |          0  (omitted)
                                   Antartica  |          0  (omitted)
                                      Europe  |          0  (omitted)
                                     Oceania  |          0  (omitted)
                               South America  |          0  (omitted)
                                              |
                                        _cons |   8.018476     .04909   163.34   0.000     7.922261     8.11469
---------------------------------------------------------------------------------------------------------------
F test of absorbed indicators: F(794913, 25585298) = 6.954    Prob > F = 0.000


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#4

24 Jan 2022, 09:32

Jean:
unless I'm missing out on something, the two outputs seem identical.

Kind regards,
Carlo
(Stata 19.0)
Jean Jacques

Join Date: Sep 2020

Posts: 97
#5

24 Jan 2022, 09:36

Yes, in this case they are. As I said, if there's a difference, it is minor. My main question is why this collinearity problem arises in this case.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#6

24 Jan 2022, 09:45

Jean:
-continent- is time-invariant; therefore -areg- does not return any coefficient, as it is perfectly collinear with -hotel_name_x_room-.
You may want to check what happens if you code:

Code:

xtset hotel_name_x_room
xtreg score i.guest_country_code i.guest_continent_code, fe

Kind regards,
Carlo
(Stata 19.0)
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

24 Jan 2022, 09:49

Once you know guest_country_code you know guest_continent_code, so each continent is exactly equal to the sum of the country indicators for the countries within that continent.

You could instead do

Code:

areg score i.guest_continent_code i.guest_country_code, absorb(hotel_name_x_room)

which would give fixed effects for each continent included in your data, but would omit one country for each continent, and the fixed country effects would be relative to the omitted country on that continent (I think).
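
As a hedged illustration of the point above (not part of the original reply), the sketch below builds a small simulated dataset in which each continent indicator is exactly the sum of its countries' indicators, and then runs both variable orderings. The variable names `country`, `continent`, `y`, and `in_continent1` and the six-country layout are assumptions made up for the example; they are not the poster's data.

```stata
* Toy data: six hypothetical countries, two per continent (assumed layout)
clear
set seed 12345
set obs 600
gen country   = mod(_n, 6) + 1
gen continent = ceil(country / 2)
gen y         = rnormal()

* The indicator for continent 1 equals the sum of the indicators
* for countries 1 and 2, which is the exact linear dependence
gen byte in_continent1 = (continent == 1)
assert in_continent1 == (country == 1) + (country == 2)

* Both orderings contain the same redundant information, so Stata
* has to omit some terms; compare which ones are flagged in each run
regress y i.country i.continent
regress y i.continent i.country
```

Comparing the "omitted because of collinearity" notes from the two `regress` calls shows which terms Stata drops under each ordering.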
Jean Jacques

Join Date: Sep 2020

Posts: 97
#8

24 Jan 2022, 09:59

Thanks Carlo! I was scared of calling it a fixed effect estimation given there's no time dimension in the model, but I see the point. However it is still unclear why this problem appears when I include continent but not when I include gender for example.

When I run the code above, the output is clearly the same as when I run "areg".
William Lisowski

Join Date: Dec 2014

Posts: 10150
#9

24 Jan 2022, 10:08

it is still unclear why this problem appears when I include continent but not when I include gender for example.

Unlike continent, which is the same for every resident in a country, gender differs from resident to resident.
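
A minimal sketch of how one might check this distinction directly, assuming a toy dataset with made-up variables `country`, `continent`, and `gender` (hypothetical names and values, not the poster's data): a variable that is constant within country is determined by the country indicators, while one that varies within country is not.

```stata
* Toy data: three hypothetical countries with two guests each
clear
input country continent gender
1 1 0
1 1 1
2 1 0
2 1 1
3 2 0
3 2 1
end

* continent never varies within a country: first and last sorted
* values agree in every country, so the assert passes
bysort country (continent): assert continent[1] == continent[_N]

* gender does vary within a country in this toy data
bysort country (gender): gen byte gender_varies = gender[1] != gender[_N]
assert gender_varies == 1
```

On real data, the first `assert` would stop with an error if any country were coded with more than one continent.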

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17712

#10

24 Jan 2022, 10:08

Jean Jacques:
the only explanation that springs to my mind is that -i.continent- has at least one missing value, since Stata drops observations with missing values before it checks for perfect collinearity, as you can see from the following toy example, where the time-invariant predictor -tender- has one missing value (in observation 2):

Code:

. g tender=1 in 1/28534

. replace tender=. in 2

. areg ln_wage age i.race, abs(idcode)
note: 2.race omitted because of collinearity.
note: 3.race omitted because of collinearity.

Linear regression, absorbing indicators            Number of obs     =  28,510
Absorbed variable: idcode                          No. of categories =   4,710
                                                   F(1, 23799)       = 2720.20
                                                   Prob > F          =  0.0000
                                                   R-squared         =  0.6636
                                                   Adj R-squared     =  0.5970
                                                   Root MSE          =  0.3035

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0181349   .0003477    52.16   0.000     .0174534    .0188164
             |
        race |
      Black  |          0  (omitted)
      Other  |          0  (omitted)
             |
       _cons |   1.148214   .0102579   111.93   0.000     1.128107     1.16832
------------------------------------------------------------------------------
F test of absorbed indicators: F(4709, 23799) = 8.808         Prob > F = 0.000

. areg ln_wage age i.race i.tender, abs(idcode)
note: 2.race omitted because of collinearity.
note: 3.race omitted because of collinearity.
note: 1.tender omitted because of collinearity.

Linear regression, absorbing indicators            Number of obs     =  28,509
Absorbed variable: idcode                          No. of categories =   4,710
                                                   F(1, 23798)       = 2718.15
                                                   Prob > F          =  0.0000
                                                   R-squared         =  0.6637
                                                   Adj R-squared     =  0.5972
                                                   Root MSE          =  0.3034

------------------------------------------------------------------------------
     ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0181257   .0003477    52.14   0.000     .0174442    .0188071
             |
        race |
      Black  |          0  (omitted)
      Other  |          0  (omitted)
             |
    1.tender |          0  (omitted)
       _cons |   1.148498   .0102567   111.98   0.000     1.128394    1.168602
------------------------------------------------------------------------------
F test of absorbed indicators: F(4709, 23798) = 8.812         Prob > F = 0.000

.

Actually, the coefficients are similar but not identical, because the sample in the second regression is slightly smaller due to the one missing value.
I would double-check whether this is the culprit in your dataset, too.

Kind regards,
Carlo
(Stata 19.0)
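
One possible way to run the check suggested above, sketched on the same nlswork example: count the missing values in the added predictor and compare the estimation sample sizes stored in `e(N)`. The scalar names `n_without` and `n_with` are made up for the example, and loading the data with `webuse` (which needs internet access) rather than the full URL is an assumption.

```stata
* Reload the example data and recreate the predictor with one missing value
webuse nlswork, clear
gen tender = 1
replace tender = . in 2

* How many observations would the extra predictor remove?
count if missing(tender)

* Compare estimation sample sizes with and without the predictor
quietly areg ln_wage age i.race, absorb(idcode)
scalar n_without = e(N)
quietly areg ln_wage age i.race i.tender, absorb(idcode)
scalar n_with = e(N)
display "observations lost: " n_without - n_with
```

On the poster's data the analogous first step would be `count if missing(guest_continent_code)`. A nonzero count would explain small shifts in the other coefficients even though every continent level is omitted.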
