  • "omitted because of collinearity"

    While running regressions in Stata, I often find that a variable is dropped due to collinearity; in the model results, that variable shows a coefficient of 0. When I check the correlations between the variables in the model, the omitted variable does not have a very high correlation with any of the other variables. Any explanation for this phenomenon?
    Emily Albom

  • #2
    Show exactly what you typed in Stata, and exactly what Stata returned to you.

    • #3
      ...and please act on https://www.statalist.org/forums/help#realnames. Thanks.
      Kind regards,
      Carlo
      (Stata 19.0)

      • #4
        tl;dr: en.wikipedia.org/w/index.php?title=Collinearity_(statistics)

        Suppose your independent variables are named x1, x2, ..., x10 and suppose
        Code:
        regress y x1-x10
        results in x10 being dropped from the model for collinearity.

        It doesn't matter that x10 is not highly correlated with any individual variable x1 through x9. If you
        Code:
        regress x10 x1-x9
        you will find that x10 is predicted perfectly by x1 through x9: the definition of collinearity, which the Wikipedia article tells us is also known as multicollinearity.
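
        As a quick numeric check (a sketch; e(r2) is the R-squared that -regress- stores after estimation):
        Code:
        regress x10 x1-x9
        display e(r2)  // 1 (up to rounding) when x10 is exactly collinear with x1-x9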
        Last edited by William Lisowski; 23 Aug 2021, 08:14.

        • #5
          Here is an example of what William described in #4. HTH.

          Code:
          // Several years ago, Jerry Dallal posted the following dataset
          // to one of the sci.stat.* usenet groups to illustrate that
          // one can have complete linear dependence despite the absence
          // of any high pairwise correlations.
          
          clear *
          input x1 x2 x3 y
          18 88 106 13
          72 45 117 43
          36 63 99 50
          75 26 101 77
          22 83 105 23
          99 71 170 68
          69 53 122 6
          6 49 55 51
          86 99 185 37
          85 64 149 10
          87 7 94 32
          93 32 125 69
          44 88 132 4
          34 34 68 13
          84 28 112 18
          end
          pwcorr x1-y
          regress y x1 x2 x3
          regress x3 x2 x1
          
          * x3 = x1 + x2
          generate new = x1 + x2
          generate diff = x3-new
          summarize new x3 diff
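
          If you run this, -summarize- shows that diff is 0 for every observation: x3 is exactly x1 + x2, even though -pwcorr- reports no high pairwise correlations among x1, x2, and x3.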
          --
          Bruce Weaver
          Email: [email protected]
          Version: Stata/MP 18.5 (Windows)

          • #6
            When I check the correlations between the variables in the model, the omitted variable does not have a very high correlation with any of the other variables.
            A correlation matrix among the independent variables can only help when two variables alone are collinear. But collinearity is not limited to two variables; it can involve three or more. Two "newbie mistakes" that often cause this error are:

            1.) Manually creating many binary indicators (also called "dummies") for a single categorical variable, and then putting all of them into the model:

            Code:
            webuse nhanes2f, clear
            tab region, gen(r)
            
            * No perfect correlation
            pwcorr r1 r2 r3 r4
            
            reg weight r1 r2 r3 r4
            Results:
            Code:
            . * No perfect correlation
            . pwcorr r1 r2 r3 r4
            
                         |       r1       r2       r3       r4
            -------------+------------------------------------
                      r1 |   1.0000
                      r2 |  -0.3044   1.0000
                      r3 |  -0.3104  -0.3738   1.0000
                      r4 |  -0.2933  -0.3532  -0.3602   1.0000
            
            .
            . reg weight r1 r2 r3 r4
            note: r1 omitted because of collinearity
            
                  Source |       SS           df       MS      Number of obs   =    10,337
            -------------+----------------------------------   F(3, 10333)     =      0.39
                   Model |  277.312703         3  92.4375678   Prob > F        =    0.7588
                Residual |  2436749.94    10,333  235.822117   R-squared       =    0.0001
            -------------+----------------------------------   Adj R-squared   =   -0.0002
                   Total |  2437027.25    10,336    235.7805   Root MSE        =    15.357
            
            ------------------------------------------------------------------------------
                  weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                      r1 |          0  (omitted)
                      r2 |   .3998743   .4450754     0.90   0.369    -.4725598    1.272308
                      r3 |   .3804926   .4423884     0.86   0.390    -.4866743     1.24766
                      r4 |   .1325042   .4504297     0.29   0.769    -.7504252    1.015434
                   _cons |   71.65495    .336229   213.11   0.000     70.99587    72.31402
            ------------------------------------------------------------------------------
            Instead, using i. to specify the categorical variable would be a better alternative, as sketched below; run -help fvvarlist- to learn more.
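
            For instance, a minimal sketch with the same data (output omitted); Stata creates the indicators on the fly and omits the base level automatically:
            Code:
            webuse nhanes2f, clear
            * factor-variable notation: no manual dummies, no collinearity note
            reg weight i.region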

            2.) Not realizing that the sum of a group of variables equals a constant:

            This is actually just an extension of the above (notice that r1 + r2 + r3 + r4 = 1), but applied to continuous variables. For example, in a time-use study, total hours of sleep, awake active time, and awake sedentary time add up to 24 hours. If we put all three into the model as independent variables, an error like the one above will occur as well; see the sketch below.
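
            A minimal sketch with simulated data (the variable names, ranges, and outcome here are made up purely for illustration):
            Code:
            clear
            set obs 100
            set seed 2021
            generate sleep = runiform(4, 10)          // hours asleep
            generate active = runiform(2, 8)          // hours awake and active
            generate sedentary = 24 - sleep - active  // the rest of the day
            generate y = rnormal()
            * no pairwise correlation is perfect, yet the three sum to 24,
            * so -regress- will omit one of them for collinearity
            pwcorr sleep active sedentary
            regress y sleep active sedentary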
            Last edited by Ken Chui; 23 Aug 2021, 08:37.
