Hello, Statalist.
When I run regress in Stata, it omits every AQI variable after the third because of collinearity, as shown below. (The AQI variables represent characteristics of the Big 4 audit firms: there are 18 of them, and each takes only 4 distinct values within a year. DA_abs is derived from each company's financial figures.)
I suspect the collinearity arises from high correlations among the AQI variables. The Pearson correlation coefficients are shown below.
Please note the very high positive correlation of 0.9998 between SRTH (manager training hours) and TWI_QC (the number of quality control deficiencies). These two variables appear conceptually unrelated, so why are they so strongly correlated?
Furthermore, many pairs of variables have correlation coefficients above 0.7 in absolute value, even though most of them have no obvious conceptual relationship.
I am truly puzzled by this.
An excerpt of my dataset is listed at the end of this post. The total sample size is 1,528, and the Big 4 audit firms make up the sample as follows: D: 37%, P: 27%, K: 24%, E: 11%. Only the two AQIs mentioned above, SRTH and TWI_QC, are shown in the excerpt.
In short, I suspect that my issue arises from high correlations among the explanatory variables rather than from a technical problem with Stata.
However, I'm unsure how to confirm and explain this.
Code:
. global AQI "CPAEXP QCEXP SREXP CPATH SRTH TR SUP_R IH CLT WH EQCR QC QCS TWI_QC TWI USI PUN DEF"
. global CV "CFO GROWTH LEV LOSS SIZE PREACC AGE" //control variables
. reg DA_abs $AQI $CV i.Industry
note: CPATH omitted because of collinearity.
note: SRTH omitted because of collinearity.
note: TR omitted because of collinearity.
note: SUP_R omitted because of collinearity.
note: IH omitted because of collinearity.
note: CLT omitted because of collinearity.
note: WH omitted because of collinearity.
note: EQCR omitted because of collinearity.
note: QC omitted because of collinearity.
note: QCS omitted because of collinearity.
note: TWI_QC omitted because of collinearity.
note: TWI omitted because of collinearity.
note: USI omitted because of collinearity.
note: PUN omitted because of collinearity.
note: DEF omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =     1,528
-------------+----------------------------------   F(26, 1501)     =      5.15
       Model |  .340563159        26  .013098583   Prob > F        =    0.0000
    Residual |  3.82082189     1,501  .002545518   R-squared       =    0.0818
-------------+----------------------------------   Adj R-squared   =    0.0659
       Total |  4.16138505     1,527  .002725203   Root MSE        =    .05045

------------------------------------------------------------------------------
      DA_abs | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      CPAEXP |   .0014317   .0043636     0.33   0.743    -.0071277    .0099912
       QCEXP |  -.0007106   .0016029    -0.44   0.658    -.0038547    .0024336
       SREXP |    .000113   .0018543     0.06   0.951    -.0035242    .0037502
       CPATH |          0  (omitted)
        SRTH |          0  (omitted)
          TR |          0  (omitted)
       SUP_R |          0  (omitted)
          IH |          0  (omitted)
         CLT |          0  (omitted)
          WH |          0  (omitted)
        EQCR |          0  (omitted)
          QC |          0  (omitted)
         QCS |          0  (omitted)
      TWI_QC |          0  (omitted)
         TWI |          0  (omitted)
         USI |          0  (omitted)
         PUN |          0  (omitted)
         DEF |          0  (omitted)
         CFO |  -.0538528   .0132678    -4.06   0.000    -.0798782   -.0278274
      GROWTH |   .0131254   .0034793     3.77   0.000     .0063006    .0199502
         LEV |    .021764   .0081384     2.67   0.008     .0058001    .0377279
        LOSS |   .0034688   .0039552     0.88   0.381    -.0042895    .0112271
        SIZE |    -.00372   .0010153    -3.66   0.000    -.0057115   -.0017284
      PREACC |   .0340575   .0138054     2.47   0.014     .0069775    .0611375
         AGE |   -.006149   .0028621    -2.15   0.032    -.0117631   -.0005349
             |
    Industry |
         12  |   .0183602   .0182958     1.00   0.316    -.0175279    .0542482
         13  |   .0064839   .0176731     0.37   0.714    -.0281826    .0411505
         14  |   .0056012   .0169048     0.33   0.740    -.0275583    .0387608
         15  |   .0118865   .0159979     0.74   0.458    -.0194941     .043267
         16  |   -.003379   .0235514    -0.14   0.886    -.0495761    .0428181
         17  |     .00394    .015913     0.25   0.804    -.0272741    .0351542
         18  |  -.0112597   .0257108    -0.44   0.661    -.0616927    .0391733
         20  |   .0236537   .0169087     1.40   0.162    -.0095134    .0568209
         21  |  -.0028518   .0227249    -0.13   0.900    -.0474277    .0417241
         22  |   .0288204   .0200851     1.43   0.152    -.0105774    .0682182
         23  |   .0226668   .0154395     1.47   0.142    -.0076185    .0529521
         25  |   .0302591   .0165336     1.83   0.067    -.0021724    .0626906
         26  |    .011962   .0180431     0.66   0.507    -.0234304    .0473544
         27  |   .0224664   .0177375     1.27   0.205    -.0123265    .0572593
         29  |   .0124128   .0185279     0.67   0.503    -.0239305     .048756
         99  |   .0175747   .0163523     1.07   0.283     -.014501    .0496505
             |
       _cons |   .1004619   .0570138     1.76   0.078    -.0113733    .2122971
------------------------------------------------------------------------------
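One possible reading of the output above: if every company audited by the same firm carries identical AQI values, the 18 AQIs can form at most four distinct combinations, so the constant plus three AQIs already span everything the remaining fifteen could add. Below is a minimal sketch of how that might be checked. It assumes the $AQI global defined above is still in memory, and it has not been run on the full data.

```stata
* Sketch only: assumes $AQI is defined as above and the estimation sample is in memory.

* 1. Is an omitted AQI an exact linear combination of the three AQIs that were kept?
*    An R-squared of 1 (up to rounding) would point to perfect collinearity.
regress CPATH CPAEXP QCEXP SREXP
display "R-squared = " %9.6f e(r2)

* 2. Run Stata's collinearity check on the AQIs alone and report how many it omits.
_rmcoll $AQI
display "AQIs omitted by _rmcoll: " r(k_omitted)
```

If the first regression fits perfectly and `_rmcoll` omits fifteen AQIs without any controls present, the omission would come from the AQIs themselves rather than from the control variables or the industry dummies.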
Code:
pwcorr $AQI
| CPAEXP QCEXP SREXP CPATH SRTH TR SUP_R
-------------+---------------------------------------------------------------
CPAEXP | 1.0000
QCEXP | 0.8452 1.0000
SREXP | -0.6573 -0.5525 1.0000
CPATH | -0.3554 -0.7896 0.0858 1.0000
SRTH | -0.5103 -0.8831 0.4499 0.9252 1.0000
TR | 0.3813 0.5086 -0.8939 -0.2927 -0.6248 1.0000
SUP_R | -0.1781 -0.3338 -0.5800 0.5518 0.2348 0.6022 1.0000
IH | 0.8675 0.9494 -0.3498 -0.7334 -0.7388 0.2240 -0.5602
CLT | 0.2120 0.5490 0.3847 -0.8393 -0.5838 -0.2689 -0.9116
WH | -0.4382 -0.7723 0.6549 0.7509 0.9372 -0.8509 -0.1182
EQCR | 0.3514 -0.1848 -0.0461 0.6527 0.6228 -0.3963 0.0142
QC | -0.2309 -0.5983 0.6107 0.6562 0.8527 -0.8772 -0.2670
QCS | -0.2155 -0.6922 0.2920 0.9111 0.9494 -0.5919 0.1751
TWI_QC | -0.4956 -0.8761 0.4338 0.9309 0.9998 -0.6155 0.2436
TWI | 0.1047 -0.0396 -0.7975 0.3648 -0.0036 0.7828 0.9538
USI | 0.8647 0.6194 -0.9139 -0.0300 -0.3373 0.6482 0.3390
PUN | -0.5925 -0.3956 -0.1981 0.1781 0.0122 0.5115 0.7583
DEF | -0.0516 -0.5709 -0.0882 0.9507 0.8358 -0.2312 0.4903
| IH CLT WH EQCR QC QCS TWI_QC
-------------+---------------------------------------------------------------
IH | 1.0000
CLT | 0.6636 1.0000
WH | -0.5418 -0.2733 1.0000
EQCR | 0.0131 -0.3753 0.6481 1.0000
QC | -0.3276 -0.1466 0.9710 0.7539 1.0000
QCS | -0.5132 -0.5633 0.9139 0.8369 0.8967 1.0000
TWI_QC | -0.7327 -0.5936 0.9343 0.6350 0.8518 0.9543 1.0000
TWI | -0.2855 -0.8098 -0.3410 0.0095 -0.4366 0.0066 0.0086
USI | 0.5335 -0.2763 -0.4472 0.3767 -0.3232 -0.0818 -0.3186
PUN | -0.6604 -0.5285 -0.2731 -0.5827 -0.4918 -0.2217 0.0082
DEF | -0.4887 -0.8003 0.6883 0.8324 0.6610 0.9207 0.8465
| TWI USI PUN DEF
-------------+------------------------------------
TWI | 1.0000
USI | 0.5815 1.0000
PUN | 0.6447 -0.1965 1.0000
DEF | 0.3803 0.2318 -0.0428 1.0000
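Because pwcorr is computed over all 1,528 company rows, each audit firm's AQI values are repeated once per client, while the underlying information is only a handful of distinct firm-level points. With so few points, one outlying firm can dominate a correlation. In the excerpt at the end of this post, firm D has SRTH of 101.7 against 10.5 to 12.5 for the other three firms. The sketch below lists the distinct points and correlates them directly. The variable names `firm` and `year` are assumptions, since the excerpt only shows the headers "audit firm" and "year".

```stata
* Sketch only: `firm' and `year' are assumed names for the audit-firm and year variables.

* Flag one observation per firm-year so that each distinct point is counted once.
egen byte onepoint = tag(firm year)

* List the distinct (SRTH, TWI_QC) points that drive the correlation.
sort year firm
list firm year SRTH TWI_QC if onepoint, sepby(year) noobs

* Correlation across the distinct points only, for comparison with pwcorr on all rows.
count if onepoint
correlate SRTH TWI_QC if onepoint
```

The count shows how many distinct points the correlation really rests on, which may be more informative than the nominal sample size.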
Here is an excerpt of my dataset:
Code:
coid  year  audit firm  DA_abs    SRTH   TWI_QC
2535  2022  D           0.063792  101.7  4
4551  2022  D           0.003987  101.7  4
8932  2022  D           0.040716  101.7  4
6689  2022  K           0.018856  12.5   1
3138  2022  K           0.026391  12.5   1
6284  2022  K           0.01487   12.5   1
1736  2022  E           0.005425  10.5   2
1445  2022  E           0.010722  10.5   2
3416  2022  E           0.077717  10.5   2
1215  2022  P           0.025043  11.6   0
2597  2022  P           0.023182  11.6   0
4966  2022  P           0.05779   11.6   0
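One way the suspicion might be confirmed from the data itself is to count how many distinct combinations of AQI values occur in the sample. This is a sketch under assumptions, not a tested procedure: it relies on the $AQI global defined earlier, and `firm` is an assumed name for the audit-firm variable.

```stata
* Sketch only: assumes $AQI is defined as above; `firm' is an assumed variable name.

* Number the distinct combinations of the 18 AQI values (1, 2, 3, ...).
egen long aqi_pattern = group($AQI)
summarize aqi_pattern, meanonly
display "distinct AQI combinations: " r(max)

* Cross-tabulate against the audit firm to see whether each firm maps to one combination.
tabulate firm aqi_pattern
```

If this reports k distinct combinations, then at most k - 1 AQIs can be estimated alongside the constant, however many AQIs are listed. With k = 4 that would match the three AQIs that regress retained.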