  • Collinearity error when including continuous variable in dummy regression

    I am trying to run the following two regressions to compare the coefficients on the 'iso_str' dummies. The only difference between the two is that the second one includes the variable 'shr',

    1.
    Code:
    reg lncost ib6.iso_str i.var_str, eform(exp_coeff) baselevels
    2.
    Code:
    reg lncost ib6.iso_str shr i.var_str, eform(exp_coeff) baselevels
    When I run regression (2) above, Stata omits the 'shr' variable because of collinearity.

    Then, I tried an alternative formulation of the above two regressions to see if this way I could compare their coefficients. Again, the only difference is the inclusion of the variable 'shr' in the second regression.

    3.
    Code:
    reg lncost ib6.iso_str ibn.var_str, noconstant eform(exp_coeff) baselevels
    4.
    Code:
    reg lncost ib6.iso_str shr ibn.var_str, noconstant eform(exp_coeff) baselevels
    Notice that the outputted coefficients for the 'iso_str' dummies in (3) are identical to those in (4). However, I still can't compare 3 vs. 4: this time Stata doesn't omit the 'shr' variable in regression 4, but it omits one of the 'var_str' dummies instead (again, because of collinearity)... even though I used the 'ibn.' operator so that none would be dropped!

    How can I compare the 'iso_str' coefficients outputted by these two regressions, with and without the variable 'shr'? Perhaps there is a way around the collinearity issue I am facing, e.g. rearranging my data differently?

    Thank you. An excerpt of my data is below.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 iso str5 var double(cost shr) long(iso_str var_str) float lncost
    "CIV" "x1105" 11458.3333333333             49.674 1 1  9.346473
    "COD" "x1105" 44083.2888217523              56.12 2 1 10.693836
    "MRT" "x1105"              540             47.176 3 1  6.291569
    "NGA" "x1105" 16842.1052631579             50.481 4 1  9.731637
    "TGO" "x1105" 5590.76923076923             58.838 5 1  8.628872
    "TZA" "x1105"            48000             66.947 6 1 10.778956
    "ZAF" "x1105" 904.655301204819 34.150000000000006 7 1  6.807554
    "CIV" "x1106" 10441.1764705882             49.674 1 2  9.253512
    "COD" "x1106" 39391.0340285401              56.12 2 2 10.581293
    "MRT" "x1106"              520             47.176 3 2  6.253829
    "NGA" "x1106" 11834.3195266272             50.481 4 2  9.378759
    "TGO" "x1106"  4398.8603988604             58.838 5 2  8.389101
    "TZA" "x1106"            45000             66.947 6 2 10.714417
    "ZAF" "x1106"  608.84493902439 34.150000000000006 7 2  6.411563
    "CIV" "x1107" 12032.0855614973             49.674 1 3  9.395332
    "MRT" "x1107" 463.636363636364             47.176 3 3  6.139101
    "NGA" "x1107" 17391.3043478261             50.481 4 3  9.763725
    "TGO" "x1107" 5015.38461538462             58.838 5 3  8.520266
    "TZA" "x1107" 43636.3636363636             66.947 6 3 10.683646
    "ZAF" "x1107"          984.375 34.150000000000006 7 3  6.892007
    end
    label values iso_str iso_str
    label def iso_str 1 "CIV", modify
    label def iso_str 2 "COD", modify
    label def iso_str 3 "MRT", modify
    label def iso_str 4 "NGA", modify
    label def iso_str 5 "TGO", modify
    label def iso_str 6 "TZA", modify
    label def iso_str 7 "ZAF", modify
    label values var_str var_str
    label def var_str 1 "x1105", modify
    label def var_str 2 "x1106", modify
    label def var_str 3 "x1107", modify
    Last edited by Ernestina delPiero; 30 May 2019, 14:51.

  • #2
    One aspect I didn't mention, and that could be the source of the issue, is that the 'shr' values are constant within each level of 'iso_str'. That is, the same 'shr' value appears as many times as there are observations for each country. I'm not sure how to deal with this, though. Could I perhaps have 'shr' appear only once for each 'iso_str'?



    • #3
      Your remark in #2 is exactly the source of the problem. Since shr takes on the same value for all observations of a given iso_str, there is no difference in the information conveyed by the i.iso_str indicators and the values of shr. So it is mathematically impossible to have all of those variables in the regression at the same time. It's linear algebra and there is no way around it.

      I am not able to imagine any sensible way to "have `shr' appear only once for each `iso_str'." If you replace shr with missing values in some of the observations, those observations will simply be omitted from the regression that includes shr. Moreover, even if you reduce the data to a single observation for each iso_str (which would, of course, obliterate much of the information in your data set), that still won't change the fact that shr is collinear with the iso_str indicators and will still be omitted (or one of the iso_str indicators will).

      So, if the constancy of the values of shr within each value of iso_str is not a data error, the problem you are trying to solve is meaningless and has no solution. You should carefully reconsider what the purpose of attempting this analysis was, and what the underlying conceptual question is that you are trying to answer. Think about why you wanted to do this analysis. Perhaps if you do that, you will come up with a very different analytic approach that answers that underlying conceptual question.
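
      For what it's worth, this constancy can be verified directly in the data. The following is only a sketch, using the variables from the -dataex- excerpt in #1:

      ```stata
      * If this assert passes, shr takes a single value within each level of
      * iso_str, so it is an exact linear combination of the iso_str
      * indicators and one term must be dropped from any regression
      * containing both.
      bysort iso_str (shr): assert shr[1] == shr[_N]
      ```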



      • #4
        Thank you. Your inputs are always top notch and clear, so I'm glad to see you respond to my question.

        If I may, I would like your input on a follow-up question. In my initial post I stated that I can't compare regressions 3 vs. 4, not because Stata omits the 'shr' variable in regression 4, but because it omits one of the 'var_str' dummies (again, because of collinearity). However, if I'm only interested in the coefficients on 'iso_str', can I still compare the 'iso_str' coefficients in regression 3 with those from 4, or am I missing something? (As a side note, I must admit that I am somewhat uneasy removing the intercept, which I do in regressions 3 and 4 but not in 1 and 2, even though the estimated coefficients and SEs for 'iso_str' are the same in both 3 and 4...)

        Regression 3 (and its output):
        Code:
        . reg lncost ib6.iso_str ibn.var_str, noconstant eform(exp_coeff) baselevels
        
              Source |       SS           df       MS      Number of obs   =        20
        -------------+----------------------------------   F(9, 11)        =  13058.92
               Model |  1579.31439         9  175.479377   Prob > F        =    0.0000
            Residual |  .147812564        11  .013437506   R-squared       =    0.9999
        -------------+----------------------------------   Adj R-squared   =    0.9998
               Total |  1579.46221        20  78.9731104   Root MSE        =    .11592
        
        ------------------------------------------------------------------------------
              lncost |  exp_coeff   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             iso_str |
                CIV  |   .2481056   .0234828   -14.73   0.000     .2014486    .3055686
                COD  |   .9362869   .1007159    -0.61   0.553     .7389004    1.186402
                MRT  |   .0111367   .0010541   -47.52   0.000     .0090424    .0137161
                NGA  |   .3325496   .0314753   -11.63   0.000     .2700128    .4095705
                TGO  |     .10938   .0103527   -23.38   0.000     .0888108    .1347132
                TZA  |          1  (base)
                ZAF  |   .0179177   .0016959   -42.49   0.000     .0145482    .0220676
                     |
             var_str |
              x1105  |   48825.09   3722.334   141.61   0.000     41282.77    57745.38
              x1106  |    40570.5    3093.02   139.18   0.000     34303.32    47982.68
              x1107  |   47582.67   3677.197   139.37   0.000     40140.11    56405.19
        ------------------------------------------------------------------------------
        Regression 4 (and its output):

        Code:
        . reg lncost ib6.iso_str shr ibn.var_str, noconstant eform(exp_coeff) baselevels
        note: 3.var_str omitted because of collinearity
        
              Source |       SS           df       MS      Number of obs   =        20
        -------------+----------------------------------   F(9, 11)        =  13058.92
               Model |  1579.31439         9  175.479377   Prob > F        =    0.0000
            Residual |  .147812564        11  .013437506   R-squared       =    0.9999
        -------------+----------------------------------   Adj R-squared   =    0.9998
               Total |  1579.46221        20  78.9731104   Root MSE        =    .11592
        
        ------------------------------------------------------------------------------
              lncost |  exp_coeff   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
             iso_str |
                CIV  |   3.994528   .3352683    16.50   0.000     3.320756    4.805006
                COD  |   5.344032   .5481234    16.34   0.000       4.2641     6.69747
                MRT  |   .2679887   .0221535   -15.93   0.000     .2234079    .3214657
                NGA  |   4.702209   .3966688    18.35   0.000     3.905406    5.661581
                TGO  |   .4031782   .0359731   -10.18   0.000     .3312915    .4906636
                TZA  |          1  (base)
                ZAF  |    3.50543   .2715976    16.19   0.000      2.95584    4.157208
                     |
                 shr |    1.17454   .0013558   139.37   0.000      1.17156    1.177528
                     |
             var_str |
              x1105  |   1.026111   .0674366     0.39   0.702     .8879193     1.18581
              x1106  |   .8526318   .0560355    -2.43   0.034     .7378036    .9853312
              x1107  |          1  (omitted)
        ------------------------------------------------------------------------------
        Last edited by Ernestina delPiero; 30 May 2019, 19:10.



        • #5
          As a side note, I must admit that I am somewhat uneasy removing the intercept
          As well you should be. The removal of the constant term is appropriate only if there is either a strong theoretical explanation for why the constant term must be zero or, less convincingly, if the constant terms estimated in regressions 1 and 2 are extremely close to zero. As you don't show those outputs, and as I don't even know what these variables are about (and probably they are about some domain in which I have little knowledge anyway), I can't advise you more specifically than that.

          However, if I'm only interested in the coefficients on 'iso_str', can I still compare the the 'iso_str' coefficients in reg 3 with those from 4, or am I missing something?
          No, they cannot be compared. There is still collinearity among the iso_str effects and the variable shr. It is just that with the constant term removed, that collinearity can be broken with just the omission of one level of var_str--so, if you ignore the absence of the constant terms, everything "looks OK." But everything is not OK. If you were to choose a different base level for iso_str, all of these coefficients would change, perhaps drastically, and the difference between a regression 3 and a regression 4 using that different base level for iso_str would also look different, perhaps drastically. (Try it--you'll see. All of these coefficients are just artifacts of the choice of base level.) Not only is it not possible to compare the iso_str coefficients from regression 3 to regression 4, it isn't even possible to meaningfully interpret any of those coefficients in either regression.
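
          The "try it" suggestion can be sketched as follows, re-running regressions 3 and 4 with an arbitrarily chosen alternative base level (here level 1, "CIV"):

          ```stata
          * Same models as regressions 3 and 4, but with CIV rather than TZA
          * as the base level for iso_str. Compare the iso_str coefficients
          * to the ib6 versions: they change, and differently in each model.
          reg lncost ib1.iso_str ibn.var_str, noconstant eform(exp_coeff) baselevels
          reg lncost ib1.iso_str shr ibn.var_str, noconstant eform(exp_coeff) baselevels
          ```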



          • #6
            Thanks, Clyde. But I am a little confused when you say "But everything is not OK. If you were to choose a different base level for iso_str, all of these coefficients would change, perhaps drastically, and the difference between a regression 3 and a regression 4 using that different base level for iso_str would also look different, perhaps drastically.".

            Wouldn't this always apply, in the sense that choosing a different base level of iso_str will always lead to different coefficients, regardless of whether I include 'shr'? Or is your statement only applicable to the case when I include 'shr' in the regression (as it happens to be collinear with iso_str)?



            • #7
              Yes, coefficients of the indicators always change when you choose a different base level--you are correct. I think I didn't make my point clearly, though. In this case, because shr is collinear with those indicators, adding shr to the model is just like changing the base level.



              • #8
                Originally posted by Clyde Schechter View Post
                Yes, coefficients of the indicators always change when you choose a different base level--you are correct. I think I didn't make my point clearly, though. In this case, because shr is collinear with those indicators, adding shr to the model is just like changing the base level.
                Thanks for clarifying. But just to be clear: if I understand correctly, you are saying that even though I dropped the same iso_str level ("TZA") in both regressions 3 and 4, the iso_str coefficients in these regressions cannot be compared. Why? Because, if I understand correctly, the inclusion of shr in regression 4 has an impact on the iso_str coefficients similar to that of dropping an iso_str indicator (even though I only specified the TZA level of iso_str to be dropped).

                Is this it? If so, it is not clear to me why this is the case. Even if one changes the base of var_str, the coefficients on iso_str would remain the same. Moreover, I thought that collinearity between variables would inflate the variance of the OLS estimates rather than change their estimated values.
                Last edited by Ernestina delPiero; 01 Jun 2019, 21:56.



                • #9
                  I think this is hard to explain but easy to see. Re-run regressions 3 and 4, but instead of using ib6.iso_str, choose a different base value for iso_str (but use the same base value in both). You will see that everything involving iso_str and shr changes, but it does so differently from when you used ib6. Whatever result you see, it is an artifact of the choice of base value for iso_str.

                  (Also, as pointed out before, except under very strong conditions, inferences made from the -noconstant- model are not valid in any case.)
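
                  One way to put such runs side by side is -estimates store- followed by -estimates table-; the stored names here are arbitrary:

                  ```stata
                  * Fit regression 4 under two base levels for iso_str and
                  * tabulate the coefficients next to each other to see how
                  * the base-level choice alters them.
                  reg lncost ib6.iso_str shr ibn.var_str, noconstant
                  estimates store base_tza
                  reg lncost ib1.iso_str shr ibn.var_str, noconstant
                  estimates store base_civ
                  estimates table base_tza base_civ, b se
                  ```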



                  • #10
                    Thanks again, Clyde. I really want to control for the effect of 'shr' in my regression (it's a variable measuring the degree of urban development in each country), and I am still figuring out how to do so.

                    Do you think I could break up this continuous variable into bins? E.g. 5 bins: 0-20pc, 21-40pc, 41-60pc, 61-80pc, 81-100pc. The regression will run this way, but the downside of binning a continuous variable this way isn't entirely clear to me.



                    • #11
                      Even when it is feasible, breaking up a continuous variable into bins is usually a bad idea. It discards information and introduces noise. I challenge your assertion that "the regression will run this way." Since shr itself is constant within iso_str, the binned values will also be constant within iso_str, and you will have the same collinearity problem as before. If you have coded something that appears to accomplish this, I would like to see it--it probably includes some trick such as eliminating the constant term, which, as previously noted, is illegitimate. Or perhaps you are not examining your outputs carefully enough to see that the problem has cropped up elsewhere, with other things being omitted to break the collinearity.

                      I appreciate your strong desire to adjust for the effect of shr in your regression. But all the desire in the world cannot overcome the realities of linear algebra. It is impossible to have shr and iso_str both present in the model. If you prefer to see the effects of shr in your analysis, then you can do that, but you have to omit iso_str. If inclusion of iso_str is crucial, then effects of shr simply cannot be estimated.
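
                      The two feasible alternatives described above might be sketched as:

                      ```stata
                      * (a) Estimate the effect of shr, omitting the country indicators:
                      reg lncost shr i.var_str
                      * (b) Keep the country indicators, in which case the effect of
                      *     shr cannot be separately estimated:
                      reg lncost i.iso_str i.var_str
                      ```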



                      • #12
                        Thanks, as always, Clyde. Can you explain why breaking up a continuous variable discards information and introduces noise? I recall you once posted a very detailed and clear post on this, but I can't find it. I believe you were warning about the even greater danger of dichotomizing a variable, rather than binning it.

                        That said, I tested, for the sake of it, dividing my shr values into two groups (so I dichotomized it): above 40 == 1, and 40 or below == 2. Then I interacted this binary shr variable with each iso_str, and I must say the results--at least initially--are more or less what's expected. But... did I just do something illegitimate and nonsensical?

                        Any help on both points I raise would be extremely valuable. I thank you once again for your help.



                        • #13
                          Binning with a large number of bins is not very problematic. But using a small number of bins (with 2 being the worst case) does discard information and introduce noise. Here's why. Let's say, for the sake of discussion, that variable X ranges from 0 to 100. If you bin it into, say, 0-33, 34-67, and 68-100, then you are saying that an entity with X = 68 is, for the purposes of the analysis, indistinguishable from an entity with X = 100, but is radically different from an entity with X = 67. So unless there really is something about the real-world nature of X that makes X's relationships with other things discontinuous at X = 33 or 67, and the relationships are also flat between 0 and 33, between 34 and 67, and between 68 and 100, the binning leads to a gross mis-specification. All of the information carried by values inside the bins is discarded, and the jumps introduced at the bin boundaries are just noise.

                          Now, evidently, if you take a 0-100 range variable and break it into 20 bins, then these same phenomena are less severe and the resulting variable, though still, in principle, a mis-specification, may well be useful.
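
                          For reference, both kinds of binning described above can be done with -egen, cut()-; the variable X and the bin names here are hypothetical:

                          ```stata
                          * Three coarse bins (0-33, 34-67, 68-100) via explicit
                          * breakpoints; cut() uses left-closed intervals, so 101
                          * is needed to close the last bin.
                          egen xbin3 = cut(X), at(0 34 68 101)
                          * Twenty bins of roughly equal frequency instead.
                          egen xbin20 = cut(X), group(20)
                          ```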

                          That said, I tested, for the sake of it, dividing my shr values into two groups (so I dichotomized it): above 40 == 1, and 40 or below == 2. Then I interacted this binary shr variable with each iso_str, and I must say the results--at least initially--are more or less what's expected. But... did I just do something illegitimate and nonsensical?
                          Well, the dichotomization is probably a bad idea, for reasons just explained. But the worst part is judging it by observing that it produced the results you expected. Never choose a model specification on the basis of whether it produces the results you expected! That's not science: that's confirmation bias.
