  • Multicollinearity or omitted-variable bias: which is to blame?

    Dear Statalist experts,

    I am struggling to understand some of my regression results, and I hope you can help me make sense of them.

    To be specific, I estimate a production function with y as firm performance; x1 x2 x3 x4 as production factors; and city1 (say, the log of the local labor force) and city2 (say, the share of local workers who attended a labor-training course) as city factors, which are my variables of interest.

    The problem is that the p-value and coefficient of city2 change with the inclusion of city1 and with the method I use.
    I suspect the reason is multicollinearity and/or omitted-variable bias. Since these two are my variables of interest, I want to clarify which is the true driver of the change.

    Please take a look at two separate situations, as follows:
    1. The regression method is first-difference, using a two-year panel data set.

    Note that I prefix each variable with -d- to indicate that it is first-differenced.
    HTML Code:
    . quietly: eststo A: reghdfe dy dx1 dx2 dx3 dx4 dCity1 if sample==1, absorb(industry_id) vce(cluster city_id)
    . quietly: eststo B: reghdfe dy dx1 dx2 dx3 dx4 dCity2 if sample==1, absorb(industry_id) vce(cluster city_id)
    . quietly: eststo C: reghdfe dy dx1 dx2 dx3 dx4 dCity1 dCity2 if sample==1, absorb(industry_id) vce(cluster city_id)
    --------------------------------------------------------------------
                                    A               B               C   
                                 b/se            b/se            b/se   
    --------------------------------------------------------------------
    dx1                         0.286***        0.286***        0.286***
                              (0.006)         (0.006)         (0.006)   
    dx2                         0.048***        0.048***        0.048***
                              (0.013)         (0.013)         (0.013)   
    dx3                         0.001           0.001           0.001   
                              (0.008)         (0.008)         (0.008)   
    dx4                         0.310***        0.309***        0.308***
                              (0.042)         (0.041)         (0.041)   
    dCity1                      0.097***                        0.121***
                              (0.026)                         (0.027)   
    dCity2                                      0.318           0.637** 
                                              (0.281)         (0.289)   
    --------------------------------------------------------------------
    R-squared                   0.132           0.132           0.133   
    No. of obs                  67269           67269           67269   
    No. of firms                   78              78              78   
    F-Test                    864.566***      900.341***      771.040***
    --------------------------------------------------------------------
    * p<0.1, ** p<0.05, *** p<0.01
    As you can see, the coefficient of dCity2 becomes significant only if dCity1 is also included.
    More information about the correlation, the confidence intervals of the variables of interest, and the VIF:
    HTML Code:
     corr dCity1 dCity2 if sample==1
    (obs=67,273)
                 |   dCity1   dCity2
    -------------+------------------
          dCity1 |   1.0000
          dCity2 |  -0.3322   1.0000
    HTML Code:
    reg dy dx1 dx2 dx3 dx4 dCity1 dCity2 i.industry_id if sample==1,  vce(cluster city_id)
    Linear regression                               Number of obs     =     67,273
                                                    F(61, 62)         =          .
                                                    Prob > F          =          .
                                                    R-squared         =     0.1326
                                                    Root MSE          =     1.1117
                                   (Std. Err. adjusted for 63 clusters in city_id)
    ------------------------------------------------------------------------------
                 |               Robust
              dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             dx1 |   .2858925   .0062215    45.95   0.000      .273456     .298329
             dx2 |    .047732   .0126253     3.78   0.000     .0224944    .0729697
             dx3 |   .0011598   .0084235     0.14   0.891    -.0156786    .0179982
             dx4 |   .3076466   .0406867     7.56   0.000      .226315    .3889782
          dCity1 |   .1207645   .0268654     4.50   0.000     .0670613    .1744678
          dCity2 |   .6366201   .2886639     2.21   0.031     .0595892    1.213651
                 |
     industry_id |
              2  |   .3489494   .0854239     4.08   0.000     .1781894    .5197094
              3  |   .3463035   .1045211     3.31   0.002     .1373688    .5552383
              4  |  -.0076944   .1247283    -0.06   0.951    -.2570228    .2416339
    ...
            84  |   .0234158   .1504068     0.16   0.877    -.2772431    .3240746
                 |
           _cons |  -.1327411   .0772556    -1.72   0.091    -.2871729    .0216907
    ------------------------------------------------------------------------------
    
    . vif
        Variable |       VIF       1/VIF  
    -------------+----------------------
             dx1 |      1.05    0.952265
             dx2 |      1.08    0.926017
             dx3 |      1.01    0.993091
             dx4 |      1.04    0.961174
          dCity1 |      1.14    0.879511
          dCity2 |      1.17    0.853717
     industry_id |
              2  |      2.49    0.401872
              3  |      1.24    0.805846
              4  |      1.01    0.990048
    ...
    As you can see, the correlation between the two variables is fairly high (-0.33), the confidence interval of dCity2 is quite wide once dCity1 is included, and the p-value of dCity2 changes dramatically between models B and C. Might these reflect the impact of multicollinearity?
    However, the SE of dCity2 is only slightly "inflated" when moving from model B to C, and similarly the VIF of dCity2 (I know, you might complain that the VIF is overhyped, but it may still be helpful to some extent) is only 1.17. Moreover, City2 = number of workers who attended a training course / City1, so without controlling for dCity1, model B clearly suffers from omitted-variable bias. Could I simply attribute the insignificance of dCity2 in model B to omitted-variable bias? That is, model C is better specified, so its results should be more reliable?
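As a quick sanity check (a back-of-the-envelope computation, not part of the thread): when a regressor is roughly uncorrelated with everything except one other variable, its VIF is approximately 1/(1 - r²), which with the reported r = -0.3322 is indeed close to the 1.14-1.17 that Stata prints above:

```python
# Two-variable approximation to the VIF: VIF_j ≈ 1 / (1 - r^2),
# where r is the pairwise correlation with the one collinear regressor.
r = -0.3322                   # corr(dCity1, dCity2) from the output above
vif_approx = 1 / (1 - r**2)
print(round(vif_approx, 2))   # ≈ 1.12
```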

    In other words, could I interpret the results as follows: "model C suggests that the effect of City2 only shows up once City1 is held constant"?
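The omitted-variable-bias story can be checked with a quick simulation (a minimal Python sketch under an assumed data-generating process; the correlation -0.33 and the coefficients 0.12 and 0.64 are borrowed from the output above, everything else is made up). With a negative correlation between the regressors and positive true effects, dropping the first variable biases the coefficient on the second downward, which is the pattern in models B vs. C:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 67_000

# Hypothetical DGP mimicking the post: two regressors correlated at -0.33,
# both with positive true effects (0.12 and 0.64, roughly model C's estimates).
cov = np.array([[1.0, -0.33], [-0.33, 1.0]])
x1, x2 = rng.multivariate_normal([0, 0], cov, size=n).T
y = 0.12 * x1 + 0.64 * x2 + rng.normal(scale=1.1, size=n)

def ols(y, X):
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(y, [x2])       # x1 omitted (like model B)
b_long  = ols(y, [x1, x2])   # both included (like model C)

# Omitted-variable-bias formula: short coef = 0.64 + 0.12 * (-0.33) ≈ 0.60
print(f"coef on x2, x1 omitted : {b_short[1]:.3f}")   # ≈ 0.60
print(f"coef on x2, x1 included: {b_long[2]:.3f}")    # ≈ 0.64
```

With this DGP the shift is modest; how large it is in the real data depends on the relative scales of the two city variables, not only on their correlation.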

    2. The regression method is 2SLS, using a cross-sectional data set.

    Now, treating city1 and city2 as endogenous variables, I instrument city1 with the exogenous z1 and z2, and city2 with z3 and z4. I also display the OLS results for comparison.

    Because I suspect multicollinearity between city1 and city2, an alternative to city1, namely city1b (say, the log of employed workers rather than the labor force), is also used in separate regressions.

    HTML Code:
    quietly: eststo TSLS1: ivreghdfe y x1 x2 x3 x4 (city1 city2= z1 z2 z3 z4) if sample==1, absorb(city_id industry_id) cluster(city_id)
    
    quietly: eststo TSLS2: ivreghdfe y x1 x2 x3 x4 (city1= z1 z2) if sample==1, absorb(city_id industry_id) cluster(city_id)
    
    quietly: eststo TSLS3: ivreghdfe y x1 x2 x3 x4 (city2= z3 z4) if sample==1, absorb(city_id industry_id) cluster(city_id)
    
    quietly: eststo TSLS4: ivreghdfe y x1 x2 x3 x4 (city1b city2= z1 z2 z3 z4) if sample==1, absorb(city_id industry_id) cluster(city_id)
    
    quietly: eststo TSLS5: ivreghdfe y x1 x2 x3 x4 (city1b= z1 z2) if sample==1, absorb(city_id industry_id) cluster(city_id)
    ----------------------------------------------------------------------------------------------------
                                TSLS1           TSLS2           TSLS3           TSLS4           TSLS5   
                                 b/se            b/se            b/se            b/se            b/se   
    ----------------------------------------------------------------------------------------------------
    city1                       0.033**         0.037***                                                
                              (0.015)         (0.012)                                                   
    city2                       0.225                           0.488***        0.474***                
                              (0.211)                         (0.158)         (0.166)                   
    x1                          0.278***        0.277***        0.277***        0.278***        0.277***
                              (0.006)         (0.006)         (0.006)         (0.006)         (0.006)   
    x2                          0.031           0.031           0.031           0.031           0.031   
                              (0.035)         (0.035)         (0.035)         (0.035)         (0.035)   
    x3                          0.072***        0.076***        0.079***        0.077***        0.086***
                              (0.014)         (0.013)         (0.015)         (0.015)         (0.013)   
    x4                          0.690***        0.702***        0.687***        0.690***        0.719***
                              (0.064)         (0.072)         (0.064)         (0.064)         (0.072)   
    city1b                                                                      0.038*          0.049** 
                                                                              (0.020)         (0.022)   
    ----------------------------------------------------------------------------------------------------
    R-squared                   0.134           0.134           0.133           0.133           0.132   
    No. of obs                 164343          164343          164343          164343          164343   
    No. of firms                  217             217             217             217             217   
    F-Test                   1056.476***     1218.133***     1322.428***     1008.251***     1184.922***
    ----------------------------------------------------------------------------------------------------
    * p<0.1, ** p<0.05, *** p<0.01
    Now, when city1 and city2 each stand alone, in TSLS2 and TSLS3, they are significant; when they enter together, city2 becomes insignificant in TSLS1.
    Given that:
    - The correlation between city1b and city2 is weaker than that between city1 and city2.
    HTML Code:
     corr city1 city2 city1b if sample==1
    (obs=164,352)
                 |    city1    city2   city1b
    -------------+---------------------------
           city1 |   1.0000
           city2 |   0.4940   1.0000
          city1b |   0.7240   0.2064   1.0000
    - If city1b is substituted for city1, city2 is (strongly) significant, which might suggest that the multicollinearity problem is not severe when city1b is used.
    - The OLS results for city1 and city2 are always positive and significant, whether the variables enter alone or together (although... city1b is insignificant).
    - The fitted values of city1 and city2 obtained from the first-stage regressions of the 2SLS are even more correlated than their original versions, and also more correlated with the other exogenous variables (the same problem arises for any pair of endogenous variables in 2SLS estimation).
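That last point can be illustrated with a manual 2SLS sketch (all names and coefficients here are hypothetical; the only assumption is that the two endogenous variables share a common exogenous driver, as city-level variables plausibly do). Because both first stages project onto the same set of exogenous variables, the fitted values keep that common component and shed their independent noise, so they end up more correlated than the originals:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

u = rng.normal(size=n)            # unobserved confounder (source of endogeneity)
c = rng.normal(size=n)            # exogenous control shared by both first stages
z1, z2 = rng.normal(size=(2, n))  # one instrument per endogenous variable
city1 = 0.8 * c + 0.5 * z1 + 0.5 * u + rng.normal(size=n)
city2 = 0.8 * c + 0.5 * z2 + 0.5 * u + rng.normal(size=n)
y = 0.5 * city1 + 0.3 * city2 + u + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# First stage: each endogenous variable on ALL exogenous variables (c, z1, z2)
Z = np.column_stack([np.ones(n), c, z1, z2])
city1_hat = Z @ ols(Z, city1)
city2_hat = Z @ ols(Z, city2)

# Second stage: y on the fitted values plus the exogenous control
X2 = np.column_stack([np.ones(n), c, city1_hat, city2_hat])
b = ols(X2, y)

print(np.corrcoef(city1, city2)[0, 1])          # originals:     ≈ 0.42
print(np.corrcoef(city1_hat, city2_hat)[0, 1])  # fitted values: ≈ 0.72
print(b[2], b[3])                                # 2SLS estimates ≈ 0.5, 0.3
```

Note that even with the higher fitted-value correlation, the 2SLS point estimates here recover the true coefficients, so fitted-value correlation by itself does not explain a lost significance.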

    I come to the conclusion that city2 loses significance in TSLS1 because of multicollinearity between city1 and city2, and that, to see more accurate estimates for city2, it is best to look at TSLS4. Do you think these are convincing arguments?

    Thank you!

    Best regards,
    Cuong

  • #2
    No, I do not think these are good arguments. You are overthinking a very simple matter.

    With 63k observations I can generate examples for you where two regressors are correlated at 99.99% and you can still estimate the separate coefficients on the two regressors just fine.
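That claim is easy to verify with a simulation (a minimal Python sketch with made-up data; nothing here comes from the poster's dataset). With n = 63,000 and corr(x1, x2) ≈ 0.9999, OLS still pins down both coefficients with usable standard errors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 63_000

x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.014, size=n)  # corr(x1, x2) ≈ 0.9999
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

print(np.corrcoef(x1, x2)[0, 1])       # ≈ 0.9999
print(f"{beta[1]:.2f} ({se[1]:.2f})")  # ≈ 1.0, SE ≈ 0.29
print(f"{beta[2]:.2f} ({se[2]:.2f})")  # ≈ 2.0, SE ≈ 0.29
```

Both true coefficients (1.0 and 2.0) are recovered and each is significant, despite near-perfect collinearity: the large n compensates for the tiny independent variation in x2.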

    A correlation of -0.33 is moderate association; it is not even strong association.

    Your OLS results where you include both variables are just fine, and yes, significance changes for the reason you explain: to see that the second variable plays a role, you need to control for the first.

    If I were you, I would stop at the OLS regression with both variables included, and start thinking about whether my instruments are good enough, i.e., about why the IV regression does not replicate the OLS.

    Comment


    • #3
      Hi Joro,

      thanks (again) for answering my questions.
      So basically, if I understand you right, for the case of the FD estimation you agree that the change in the coefficient and p-value of dCity2 is normal, meaning that dCity2 in model B suffers from omitted-variable bias?

      For the case of the cross-sectional data, the correlation between city1 and city2 is in fact higher, 0.494, and between their fitted values it is about 0.55. I understand that 164K observations in a 2SLS or OLS estimation is large enough to set the worry about multicollinearity aside, and I intended to stop at OLS and FD. However, although some of the instruments are not perfect (in the sense of answering every question about why they are exogenous), they all pass the testable IV diagnostics, so I have not yet figured out why the IV regression does not replicate the OLS (and FD). In addition, the fact that the results look good for city1b and for the single-endogenous-variable models makes it even harder to understand...
      Last edited by Cuong Hoang; 13 Sep 2021, 14:02.

      Comment
