Dear Statalist's experts,
I am struggling to understand some of my regression results, and I hope you can help me make sense of them.
To be specific, I estimate a production function with y as firm performance, x1 x2 x3 x4 as production factors, and city1 (say, the log of the local labor force) and city2 (say, the share of local workers who attended a labor-training course) as city-level factors, which are my variables of interest.
The problem is that both the p-value and the coefficient of city2 change with the inclusion of city1 and with the method I use.
I suspect the reason is multicollinearity and/or omitted-variable bias. Since these two are my variables of interest, I want to clarify which is the true driver of the change.
Please take a look at the two separate situations below:
1. The regression method is first-difference, using a two-year panel data set.
Note that I prefix each variable with -d- to indicate that it is a first-differenced variable.
Code:

. quietly: eststo A: reghdfe dy dx1 dx2 dx3 dx4 dCity1 if sample==1, absorb(industry_id) vce(cluster city_id)
. quietly: eststo B: reghdfe dy dx1 dx2 dx3 dx4 dCity2 if sample==1, absorb(industry_id) vce(cluster city_id)
. quietly: eststo C: reghdfe dy dx1 dx2 dx3 dx4 dCity1 dCity2 if sample==1, absorb(industry_id) vce(cluster city_id)

--------------------------------------------------------------------
                            A               B               C
                         b/se            b/se            b/se
--------------------------------------------------------------------
dx1                     0.286***        0.286***        0.286***
                       (0.006)         (0.006)         (0.006)
dx2                     0.048***        0.048***        0.048***
                       (0.013)         (0.013)         (0.013)
dx3                     0.001           0.001           0.001
                       (0.008)         (0.008)         (0.008)
dx4                     0.310***        0.309***        0.308***
                       (0.042)         (0.041)         (0.041)
dCity1                  0.097***                        0.121***
                       (0.026)                         (0.027)
dCity2                                  0.318           0.637**
                                       (0.281)         (0.289)
--------------------------------------------------------------------
R-squared               0.132           0.132           0.133
No. of obs              67269           67269           67269
No. of firms            78              78              78
F-Test                864.566***      900.341***      771.040***
--------------------------------------------------------------------
* p<0.1, ** p<0.05, *** p<0.01
More information on the correlation, the confidence intervals of the variables of interest, and the VIFs:
Code:

. corr dCity1 dCity2 if sample==1
(obs=67,273)

             |   dCity1   dCity2
-------------+------------------
      dCity1 |   1.0000
      dCity2 |  -0.3322   1.0000
Code:

. reg dy dx1 dx2 dx3 dx4 dCity1 dCity2 i.industry_id if sample==1, vce(cluster city_id)

Linear regression                               Number of obs     =     67,273
                                                F(61, 62)         =          .
                                                Prob > F          =          .
                                                R-squared         =     0.1326
                                                Root MSE          =     1.1117

                              (Std. Err. adjusted for 63 clusters in city_id)
------------------------------------------------------------------------------
             |               Robust
          dy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         dx1 |   .2858925   .0062215    45.95   0.000      .273456    .298329
         dx2 |    .047732   .0126253     3.78   0.000     .0224944   .0729697
         dx3 |   .0011598   .0084235     0.14   0.891    -.0156786   .0179982
         dx4 |   .3076466   .0406867     7.56   0.000      .226315   .3889782
      dCity1 |   .1207645   .0268654     4.50   0.000     .0670613   .1744678
      dCity2 |   .6366201   .2886639     2.21   0.031     .0595892   1.213651
             |
 industry_id |
          2  |   .3489494   .0854239     4.08   0.000     .1781894   .5197094
          3  |   .3463035   .1045211     3.31   0.002     .1373688   .5552383
          4  |  -.0076944   .1247283    -0.06   0.951    -.2570228   .2416339
        ...
         84  |   .0234158   .1504068     0.16   0.877    -.2772431   .3240746
             |
       _cons |  -.1327411   .0772556    -1.72   0.091    -.2871729   .0216907
------------------------------------------------------------------------------

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
         dx1 |      1.05    0.952265
         dx2 |      1.08    0.926017
         dx3 |      1.01    0.993091
         dx4 |      1.04    0.961174
      dCity1 |      1.14    0.879511
      dCity2 |      1.17    0.853717
 industry_id |
           2 |      2.49    0.401872
           3 |      1.24    0.805846
           4 |      1.01    0.990048
        ...
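As a quick arithmetic check (done in Python rather than Stata, purely for illustration): from the pairwise correlation of -0.3322 alone, the implied VIF would be 1/(1 - r^2), and since standard errors scale with the square root of the VIF, the reported VIF of 1.17 translates into only a modest SE inflation:

```python
# Pairwise correlation between dCity1 and dCity2, taken from the corr
# output above; everything below is simple arithmetic, not estimation.
r = -0.3322

vif_pairwise = 1 / (1 - r**2)     # VIF implied by the pairwise corr alone
vif_reported = 1.17               # VIF of dCity2 from -vif- above

# SEs scale with sqrt(VIF), so the implied inflation of the SE of dCity2
# relative to an orthogonal design is small:
print(f"pairwise VIF : {vif_pairwise:.3f}")          # ~1.124
print(f"SE inflation : {vif_reported ** 0.5:.3f}")   # ~1.082, i.e. about 8%
```

So even taking the VIF at face value, collinearity inflates the SE of dCity2 by only about 8%, which is consistent with the small change in its SE between models B and C.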
However, the SE of dCity2 is only slightly inflated when moving from model B to model C, and likewise its VIF is only 1.17 (I know some of you may object that VIFs are overrated, but they can be helpful to some extent). Moreover, City2 = (number of workers who attended a training course) / City1, so without controlling for dCity1, model B clearly suffers from omitted-variable bias. Could I simply attribute the insignificance of dCity2 in model B to omitted-variable bias? That is, model C is better specified, so its results should be more reliable?
In other words, could I interpret the results as: "model C suggests that the effect of City2 shows up only when City1 is held constant"?
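To convince myself that omitting a negatively correlated control can mask a positive effect in this way, I ran a toy simulation (in Python rather than Stata, with made-up coefficients chosen only to reproduce the B-versus-C pattern, not taken from my data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Draw two regressors with corr = -0.33, matching corr(dCity1, dCity2)
# from the output above; all other numbers are invented for illustration.
cov = [[1.0, -0.33],
       [-0.33, 1.0]]
d_city1, d_city2 = rng.multivariate_normal([0, 0], cov, size=n).T

# "True" model: both variables have positive effects.
dy = 1.0 * d_city1 + 0.64 * d_city2 + rng.normal(0, 1, n)

def ols_slopes(y, cols):
    """OLS slope coefficients (intercept dropped) via least squares."""
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0][1:]

b_short = ols_slopes(dy, [d_city2])[0]            # dCity1 omitted ("model B")
b_long  = ols_slopes(dy, [d_city1, d_city2])[1]   # both included ("model C")

# Omitted-variable bias: b_short converges to 0.64 + 1.0 * (-0.33) = 0.31,
# i.e. the positive effect is pushed toward zero, while b_long stays near 0.64.
print(f"coef on d_city2, d_city1 omitted : {b_short:.3f}")
print(f"coef on d_city2, d_city1 included: {b_long:.3f}")
```

The short regression recovers roughly 0.64 + 1.0 × (-0.33) ≈ 0.31 by the omitted-variable-bias formula, so a pattern like 0.318 → 0.637 can arise purely from leaving out a negatively correlated control; of course this does not prove that this is what happens in my data.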
2. The regression method is 2SLS, using a cross-sectional data set.
Now I treat city1 and city2 as endogenous variables: I instrument city1 with the exogenous z1 and z2, and city2 with z3 and z4. I also display the OLS results for comparison.
Because I suspect multicollinearity between city1 and city2, an alternative to city1, namely city1b (say, the log of employed workers rather than of the labor force), is also used in separate regressions.
Code:

. quietly: eststo TSLS1: ivreghdfe y x1 x2 x3 x4 (city1 city2 = z1 z2 z3 z4) if sample==1, absorb(city_id industry_id) cluster(city_id)
. quietly: eststo TSLS2: ivreghdfe y x1 x2 x3 x4 (city1 = z1 z2) if sample==1, absorb(city_id industry_id) cluster(city_id)
. quietly: eststo TSLS3: ivreghdfe y x1 x2 x3 x4 (city2 = z3 z4) if sample==1, absorb(city_id industry_id) cluster(city_id)
. quietly: eststo TSLS4: ivreghdfe y x1 x2 x3 x4 (city1b city2 = z1 z2 z3 z4) if sample==1, absorb(city_id industry_id) cluster(city_id)
. quietly: eststo TSLS5: ivreghdfe y x1 x2 x3 x4 (city1b = z1 z2) if sample==1, absorb(city_id industry_id) cluster(city_id)

----------------------------------------------------------------------------------------------------
                      TSLS1           TSLS2           TSLS3           TSLS4           TSLS5
                       b/se            b/se            b/se            b/se            b/se
----------------------------------------------------------------------------------------------------
city1                 0.033**         0.037***
                     (0.015)         (0.012)
city2                 0.225                           0.488***        0.474***
                     (0.211)                         (0.158)         (0.166)
x1                    0.278***        0.277***        0.277***        0.278***        0.277***
                     (0.006)         (0.006)         (0.006)         (0.006)         (0.006)
x2                    0.031           0.031           0.031           0.031           0.031
                     (0.035)         (0.035)         (0.035)         (0.035)         (0.035)
x3                    0.072***        0.076***        0.079***        0.077***        0.086***
                     (0.014)         (0.013)         (0.015)         (0.015)         (0.013)
x4                    0.690***        0.702***        0.687***        0.690***        0.719***
                     (0.064)         (0.072)         (0.064)         (0.064)         (0.072)
city1b                                                                0.038*          0.049**
                                                                     (0.020)         (0.022)
----------------------------------------------------------------------------------------------------
R-squared             0.134           0.134           0.133           0.133           0.132
No. of obs           164343          164343          164343          164343          164343
No. of firms            217             217             217             217             217
F-Test             1056.476***     1218.133***     1322.428***     1008.251***     1184.922***
----------------------------------------------------------------------------------------------------
* p<0.1, ** p<0.05, *** p<0.01
Now, when city1 and city2 stand alone (TSLS2 and TSLS3), each is significant; when they enter together (TSLS1), city2 becomes insignificant.
Given that:
- The correlation between city1b and city2 is weaker than that between city1 and city2.
Code:

. corr city1 city2 city1b if sample==1
(obs=164,352)

             |    city1    city2   city1b
-------------+---------------------------
       city1 |   1.0000
       city2 |   0.4940   1.0000
      city1b |   0.7240   0.2064   1.0000
- If city1b is substituted for city1, city2 is (strongly) significant, which suggests that multicollinearity is not severe when city1b is used.
- The OLS estimates of city1 and city2 are always positive and significant, whether the variables enter alone or together (although... city1b is insignificant).
- The fitted values of city1 and city2 obtained from the first-stage regressions of 2SLS are even more correlated than the original variables, and are also more correlated with the other exogenous variables (the same problem arises for any pair of endogenous variables in 2SLS estimation).
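On that last point, here is a toy simulation (Python, invented numbers) of why first-stage fitted values can be more correlated than the underlying variables: in 2SLS both endogenous variables are projected onto the same set of exogenous variables, so when the two first stages share exogenous regressors, projecting out the idiosyncratic noise mechanically raises the correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Invented first stages: w1, w2 play the role of exogenous controls that
# enter BOTH first stages; z1 z2 instrument city1, z3 z4 instrument city2.
w = rng.normal(size=(n, 2))
z = rng.normal(size=(n, 4))
city1 = 2*w[:, 0] + w[:, 1] + z[:, 0] + z[:, 1] + rng.normal(0, 1.5, size=n)
city2 = w[:, 0] + 2*w[:, 1] + z[:, 2] + z[:, 3] + rng.normal(0, 1.5, size=n)

# 2SLS first stage: each endogenous variable is regressed on the SAME full
# set of exogenous variables (shared controls plus all excluded instruments).
X = np.column_stack([np.ones(n), w, z])

def fitted(y):
    """First-stage fitted values: projection of y onto the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ beta

r_raw = np.corrcoef(city1, city2)[0, 1]
r_hat = np.corrcoef(fitted(city1), fitted(city2))[0, 1]
print(f"corr of original variables: {r_raw:.3f}")   # ~0.43 here
print(f"corr of fitted values     : {r_hat:.3f}")   # ~0.57, noticeably higher
```

In this setup the fitted values are necessarily more correlated because the projection strips away the idiosyncratic noise that was diluting the shared component, so I read the higher correlation of the fitted values as a mechanical feature of 2SLS rather than as independent evidence.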
I have come to the conclusion that city2 loses its significance in TSLS1 because of multicollinearity between city1 and city2, and that for a more accurate estimate of city2 it is best to look at TSLS4. Do you think these are convincing arguments?
Thank you!
Best regards,
Cuong