  • Omitted because of collinearity problem

    Hi guys,

    I'm running an areg regression:

    Code:
    areg y i.country i.gender, absorb(vd3)
    where all the explanatory (right-hand-side) variables are categorical. This runs and I get results. However, when I add "language" as another explanatory variable, I get the "omitted because of collinearity" note. Note that "language" is the official language of the country, so although it does not vary within a country, the same language can be spoken in several countries (when a country has more than one official language, I took the one used by the majority, e.g. Belgium -> Dutch, Canada -> English).

    I wonder first whether this is a problem related to my dataset, which is quite big (24 million observations) but perhaps not big enough to deal with 100 possible values of country, more than 40 different languages and 794,914 possible values of vd3.

    Secondly, even though all the levels of language are omitted because of collinearity, I do observe a (small) change in the coefficients of the other variables. E.g., the coefficient of Uruguay is 0.164 without language in my model and it changes to 0.167 when language is added (but omitted due to collinearity).

    Best!
    Jean

  • #2
    Jean:
    I tried to reproduce your problem but I was unsuccessful:
    Code:
    use "https://www.stata-press.com/data/r17/nlswork.dta"
    
    . areg ln_wage i.race age, abs(idcode)
    note: 2.race omitted because of collinearity.
    note: 3.race omitted because of collinearity.
    
    Linear regression, absorbing indicators            Number of obs     =  28,510
    Absorbed variable: idcode                          No. of categories =   4,710
                                                       F(1, 23799)       = 2720.20
                                                       Prob > F          =  0.0000
                                                       R-squared         =  0.6636
                                                       Adj R-squared     =  0.5970
                                                       Root MSE          =  0.3035
    
    ------------------------------------------------------------------------------
         ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            race |
          Black  |          0  (omitted)
          Other  |          0  (omitted)
                 |
             age |   .0181349   .0003477    52.16   0.000     .0174534    .0188164
           _cons |   1.148214   .0102579   111.93   0.000     1.128107     1.16832
    ------------------------------------------------------------------------------
    F test of absorbed indicators: F(4709, 23799) = 8.808         Prob > F = 0.000
    
    . g tender=1
    
    . areg ln_wage i.race age i.tender, abs(idcode)
    note: 2.race omitted because of collinearity.
    note: 3.race omitted because of collinearity.
    note: 1.tender omitted because of collinearity.
    
    Linear regression, absorbing indicators            Number of obs     =  28,510
    Absorbed variable: idcode                          No. of categories =   4,710
                                                       F(1, 23799)       = 2720.20
                                                       Prob > F          =  0.0000
                                                       R-squared         =  0.6636
                                                       Adj R-squared     =  0.5970
                                                       Root MSE          =  0.3035
    
    ------------------------------------------------------------------------------
         ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            race |
          Black  |          0  (omitted)
          Other  |          0  (omitted)
                 |
             age |   .0181349   .0003477    52.16   0.000     .0174534    .0188164
        1.tender |          0  (omitted)
           _cons |   1.148214   .0102579   111.93   0.000     1.128107     1.16832
    ------------------------------------------------------------------------------
    F test of absorbed indicators: F(4709, 23799) = 8.808         Prob > F = 0.000
    
    
    .
    That said, is the number of observations the same in both -areg- runs?
    This ties in with the FAQ's request to post exactly what you typed and what Stata gave you back: it makes everything smoother and avoids posting back and forth. Thanks.
    Kind regards,
    Carlo
    (Stata 18.0 SE)

    • #3
      Ciao Carlo!

      Thanks for your answer and for trying to reproduce my problem. Indeed the number of observations is the same.
      I'll try to summarize my output below. Instead of using "language" I run it with "continent" (the problem is the same but as I have 6 continents instead of 40 languages, the output is shorter). I hope this helps.

      Code:
      . areg score i.guest_country_code, absorb(hotel_name_x_room)
      
      Linear regression, absorbing indicators         Number of obs     = 26,380,460
      Absorbed variable: hotel_name_x_room            No. of categories =    794,914
                                                      F( 248,25585298)  =     588.77
                                                      Prob > F          =     0.0000
                                                      R-squared         =     0.1913
                                                      Adj R-squared     =     0.1661
                                                      Root MSE          =     1.7042
      
      ---------------------------------------------------------------------------------------------------------------
                                              score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      ----------------------------------------------+----------------------------------------------------------------
                                 guest_country_code |
                                           Albania  |   .4192879   .0514439     8.15   0.000     .3184596    .5201162
                                           Algeria  |   .0205716   .0518289     0.40   0.691    -.0810111    .1221543
                                    American Samoa  |   .1464801   .1515708     0.97   0.334    -.1505933    .4435535
                                           Andorra  |   .0842967   .0567297     1.49   0.137    -.0268915    .1954849
      .
      .
      .
                                              _cons |   8.018476     .04909   163.34   0.000     7.922261     8.11469
      ---------------------------------------------------------------------------------------------------------------
      F test of absorbed indicators: F(794913, 25585298) = 6.954    Prob > F = 0.000
      Code:
      . areg score i.guest_country_code i.guest_continent_code, absorb(hotel_name_x_room)
      note: 2.guest_continent_code omitted because of collinearity
      note: 3.guest_continent_code omitted because of collinearity
      note: 4.guest_continent_code omitted because of collinearity
      note: 5.guest_continent_code omitted because of collinearity
      note: 6.guest_continent_code omitted because of collinearity
      note: 7.guest_continent_code omitted because of collinearity
      
      Linear regression, absorbing indicators         Number of obs     = 26,380,460
      Absorbed variable: hotel_name_x_room            No. of categories =    794,914
                                                      F( 248,25585298)  =     588.77
                                                      Prob > F          =     0.0000
                                                      R-squared         =     0.1913
                                                      Adj R-squared     =     0.1661
                                                      Root MSE          =     1.7042
      
      ---------------------------------------------------------------------------------------------------------------
                                              score |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      ----------------------------------------------+----------------------------------------------------------------
                                 guest_country_code |
                                           Albania  |   .4192879   .0514439     8.15   0.000     .3184596    .5201162
                                           Algeria  |   .0205716   .0518289     0.40   0.691    -.0810111    .1221543
                                    American Samoa  |   .1464801   .1515708     0.97   0.334    -.1505933    .4435535
                                           Andorra  |   .0842967   .0567297     1.49   0.137    -.0268915    .1954849
      .
      .
      .
                               guest_continent_code |
                                     North America  |          0  (omitted)
                                              Asia  |          0  (omitted)
                                         Antartica  |          0  (omitted)
                                            Europe  |          0  (omitted)
                                           Oceania  |          0  (omitted)
                                     South America  |          0  (omitted)
                                                    |
                                              _cons |   8.018476     .04909   163.34   0.000     7.922261     8.11469
      ---------------------------------------------------------------------------------------------------------------
      F test of absorbed indicators: F(794913, 25585298) = 6.954    Prob > F = 0.000

      • #4
        Jean:
        unless I'm missing out on something, the two outputs seem identical.
        Kind regards,
        Carlo
        (Stata 18.0 SE)

        • #5
          Yes, in this case they are. As I said, if there is a difference, it is minor. My main question is why this collinearity problem arises in this case.

          • #6
            Jean:
            -continent- is time-invariant; therefore -areg- does not return any coefficient, as it is perfectly collinear with -hotel_name_x_room-.
            You may want to check what happens if you code:
            Code:
            xtset hotel_name_x_room
            xtreg score i.guest_country_code i.guest_continent_code, fe
            Kind regards,
            Carlo
            (Stata 18.0 SE)

            • #7
              Once you know guest_country_code you know guest_continent_code, so each continent is exactly equal to the sum of the country indicators for the countries within that continent.

              You could instead do
              Code:
              areg score i.guest_continent_code i.guest_country_code, absorb(hotel_name_x_room)
              which would give fixed effects for each continent included in your data, but would omit one country for each continent, and the fixed country effects would be relative to the omitted country on that continent (I think).
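
              The nesting argument above can be reproduced on a small dataset. This is only a sketch under stated assumptions: it uses the auto data shipped with Stata, and "region" is a hypothetical grouping built from rep78, so that region is determined by rep78 in the same way continent is determined by country. Which indicators Stata chooses to drop may depend on the order of the variables.

              ```stata
              * Toy illustration; "region" is a hypothetical variable nested in rep78,
              * standing in for continent nested in country.
              sysuse auto, clear
              generate byte region = cond(rep78 <= 2, 1, cond(rep78 <= 4, 2, 3)) if !missing(rep78)

              * Finer variable first: the region indicators add no new information,
              * so some indicators should be flagged as omitted because of collinearity.
              regress price i.rep78 i.region

              * Coarser variable first: the collinearity is still there, but the
              * omitted indicators may now come from rep78 instead of region.
              regress price i.region i.rep78
              ```

              Either way the fitted values are the same; only the set of reported coefficients, and what they are measured relative to, changes.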

              • #8
                Thanks Carlo! I was wary of calling it a fixed-effects estimation given there's no time dimension in the model, but I see the point. However, it is still unclear why this problem appears when I include continent but not when I include gender, for example.

                When I run the code above, the output is clearly the same as when I run "areg".

                • #9
                  it is still unclear why this problem appears when I include continent but not when I include gender for example.
                  Unlike continent, which is the same for every resident in a country, gender differs from resident to resident.
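
                  One way to check this in the data is to count how many distinct values each candidate regressor takes within a country. This is a rough sketch: guest_gender is a hypothetical name (the thread only calls the variable "gender"), and the tag_ and n_ variables are temporary helpers introduced here for illustration.

                  ```stata
                  * Count distinct values of each candidate regressor within country.
                  * guest_gender is an assumed variable name; substitute your own.
                  foreach v of varlist guest_continent_code guest_gender {
                      * tag one observation per distinct (country, value) pair
                      egen byte tag_`v' = tag(guest_country_code `v')
                      * number of distinct values of `v' within each country
                      bysort guest_country_code: egen n_`v' = total(tag_`v')
                      summarize n_`v'
                      drop tag_`v' n_`v'
                  }
                  ```

                  If the maximum reported by summarize is 1, the variable is fully determined by country and its indicators cannot be estimated alongside the country indicators; a maximum above 1 means it varies within at least one country.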

                  • #10
                    Jean Jacques:
                    the only explanation that springs to my mind is that -i.continent- has at least one missing value, since Stata drops observations with missing values before it checks for perfect collinearity, as you can see from the following toy example, where the time-invariant predictor -tender- has one missing value (in observation 2):
                    Code:
                    . g tender=1 in 1/28534
                    
                    . replace tender=. in 2
                    
                    . areg ln_wage age i.race, abs(idcode)
                    note: 2.race omitted because of collinearity.
                    note: 3.race omitted because of collinearity.
                    
                    Linear regression, absorbing indicators            Number of obs     =  28,510
                    Absorbed variable: idcode                          No. of categories =   4,710
                                                                       F(1, 23799)       = 2720.20
                                                                       Prob > F          =  0.0000
                                                                       R-squared         =  0.6636
                                                                       Adj R-squared     =  0.5970
                                                                       Root MSE          =  0.3035
                    
                    ------------------------------------------------------------------------------
                         ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                    -------------+----------------------------------------------------------------
                             age |   .0181349   .0003477    52.16   0.000     .0174534    .0188164
                                 |
                            race |
                          Black  |          0  (omitted)
                          Other  |          0  (omitted)
                                 |
                           _cons |   1.148214   .0102579   111.93   0.000     1.128107     1.16832
                    ------------------------------------------------------------------------------
                    F test of absorbed indicators: F(4709, 23799) = 8.808         Prob > F = 0.000
                    
                    . areg ln_wage age i.race i.tender, abs(idcode)
                    note: 2.race omitted because of collinearity.
                    note: 3.race omitted because of collinearity.
                    note: 1.tender omitted because of collinearity.
                    
                    Linear regression, absorbing indicators            Number of obs     =  28,509
                    Absorbed variable: idcode                          No. of categories =   4,710
                                                                       F(1, 23798)       = 2718.15
                                                                       Prob > F          =  0.0000
                                                                       R-squared         =  0.6637
                                                                       Adj R-squared     =  0.5972
                                                                       Root MSE          =  0.3034
                    
                    ------------------------------------------------------------------------------
                         ln_wage | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                    -------------+----------------------------------------------------------------
                             age |   .0181257   .0003477    52.14   0.000     .0174442    .0188071
                                 |
                            race |
                          Black  |          0  (omitted)
                          Other  |          0  (omitted)
                                 |
                        1.tender |          0  (omitted)
                           _cons |   1.148498   .0102567   111.98   0.000     1.128394    1.168602
                    ------------------------------------------------------------------------------
                    F test of absorbed indicators: F(4709, 23798) = 8.812         Prob > F = 0.000
                    
                    .
                    As you can see, the coefficients are similar but not identical, because the sample in the second regression is slightly smaller owing to the one missing value.
                    I would double check whether this is the culprit in your dataset, too.
                    Kind regards,
                    Carlo
                    (Stata 18.0 SE)
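
                    The missing-value check suggested above could look something like the following. It is a sketch that reuses the variable names from post #3; for the language model, substitute the language variable for guest_continent_code. The names in_model1 and in_model2 are hypothetical helpers.

                    ```stata
                    * Observations usable in the first model but lost once the extra variable is added.
                    count if !missing(score, guest_country_code, hotel_name_x_room) & missing(guest_continent_code)

                    * Equivalent check after estimation: compare the two estimation samples.
                    quietly areg score i.guest_country_code, absorb(hotel_name_x_room)
                    generate byte in_model1 = e(sample)
                    quietly areg score i.guest_country_code i.guest_continent_code, absorb(hotel_name_x_room)
                    generate byte in_model2 = e(sample)
                    tabulate in_model1 in_model2
                    ```

                    A nonzero count, or any off-diagonal cell in the tabulation, would mean the two regressions are fitted on different samples, which by itself can move the other coefficients slightly.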
