Correlation and collinearity

Winston Wu

Join Date: May 2024
Posts: 5

17 May 2024, 03:59

Hello, Statalist.
When I run regress in Stata, it omits every AQI variable after the third because of collinearity, as shown below. (AQI denotes characteristics of the Big 4 audit firms: 18 variables in total, each taking only 4 distinct values within a year. DA_abs is derived from each company's financial figures.)

Code:

. global AQI "CPAEXP QCEXP SREXP CPATH SRTH TR SUP_R IH CLT WH EQCR QC QCS TWI_QC TWI USI PUN DEF"

. global CV "CFO GROWTH LEV LOSS SIZE PREACC AGE"  //control variables

. reg DA_abs $AQI $CV i.Industry
note: CPATH omitted because of collinearity.
note: SRTH omitted because of collinearity.
note: TR omitted because of collinearity.
note: SUP_R omitted because of collinearity.
note: IH omitted because of collinearity.
note: CLT omitted because of collinearity.
note: WH omitted because of collinearity.
note: EQCR omitted because of collinearity.
note: QC omitted because of collinearity.
note: QCS omitted because of collinearity.
note: TWI_QC omitted because of collinearity.
note: TWI omitted because of collinearity.
note: USI omitted because of collinearity.
note: PUN omitted because of collinearity.
note: DEF omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =     1,528
-------------+----------------------------------   F(26, 1501)     =      5.15
       Model |  .340563159        26  .013098583   Prob > F        =    0.0000
    Residual |  3.82082189     1,501  .002545518   R-squared       =    0.0818
-------------+----------------------------------   Adj R-squared   =    0.0659
       Total |  4.16138505     1,527  .002725203   Root MSE        =    .05045

------------------------------------------------------------------------------
      DA_abs | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      CPAEXP |   .0014317   .0043636     0.33   0.743    -.0071277    .0099912
       QCEXP |  -.0007106   .0016029    -0.44   0.658    -.0038547    .0024336
       SREXP |    .000113   .0018543     0.06   0.951    -.0035242    .0037502
       CPATH |          0  (omitted)
        SRTH |          0  (omitted)
          TR |          0  (omitted)
       SUP_R |          0  (omitted)
          IH |          0  (omitted)
         CLT |          0  (omitted)
          WH |          0  (omitted)
        EQCR |          0  (omitted)
          QC |          0  (omitted)
         QCS |          0  (omitted)
      TWI_QC |          0  (omitted)
         TWI |          0  (omitted)
         USI |          0  (omitted)
         PUN |          0  (omitted)
         DEF |          0  (omitted)
         CFO |  -.0538528   .0132678    -4.06   0.000    -.0798782   -.0278274
      GROWTH |   .0131254   .0034793     3.77   0.000     .0063006    .0199502
         LEV |    .021764   .0081384     2.67   0.008     .0058001    .0377279
        LOSS |   .0034688   .0039552     0.88   0.381    -.0042895    .0112271
        SIZE |    -.00372   .0010153    -3.66   0.000    -.0057115   -.0017284
      PREACC |   .0340575   .0138054     2.47   0.014     .0069775    .0611375
         AGE |   -.006149   .0028621    -2.15   0.032    -.0117631   -.0005349
             |
    Industry |
         12  |   .0183602   .0182958     1.00   0.316    -.0175279    .0542482
         13  |   .0064839   .0176731     0.37   0.714    -.0281826    .0411505
         14  |   .0056012   .0169048     0.33   0.740    -.0275583    .0387608
         15  |   .0118865   .0159979     0.74   0.458    -.0194941     .043267
         16  |   -.003379   .0235514    -0.14   0.886    -.0495761    .0428181
         17  |     .00394    .015913     0.25   0.804    -.0272741    .0351542
         18  |  -.0112597   .0257108    -0.44   0.661    -.0616927    .0391733
         20  |   .0236537   .0169087     1.40   0.162    -.0095134    .0568209
         21  |  -.0028518   .0227249    -0.13   0.900    -.0474277    .0417241
         22  |   .0288204   .0200851     1.43   0.152    -.0105774    .0682182
         23  |   .0226668   .0154395     1.47   0.142    -.0076185    .0529521
         25  |   .0302591   .0165336     1.83   0.067    -.0021724    .0626906
         26  |    .011962   .0180431     0.66   0.507    -.0234304    .0473544
         27  |   .0224664   .0177375     1.27   0.205    -.0123265    .0572593
         29  |   .0124128   .0185279     0.67   0.503    -.0239305     .048756
         99  |   .0175747   .0163523     1.07   0.283     -.014501    .0496505
             |
       _cons |   .1004619   .0570138     1.76   0.078    -.0113733    .2122971
------------------------------------------------------------------------------

I suspect collinearity issues may arise from high correlations among the AQI variables. Below are the results of Pearson correlation coefficient analysis:

Code:

 pwcorr $AQI

             |   CPAEXP    QCEXP    SREXP    CPATH     SRTH       TR    SUP_R
-------------+---------------------------------------------------------------
      CPAEXP |   1.0000
       QCEXP |   0.8452   1.0000
       SREXP |  -0.6573  -0.5525   1.0000
       CPATH |  -0.3554  -0.7896   0.0858   1.0000
        SRTH |  -0.5103  -0.8831   0.4499   0.9252   1.0000
          TR |   0.3813   0.5086  -0.8939  -0.2927  -0.6248   1.0000
       SUP_R |  -0.1781  -0.3338  -0.5800   0.5518   0.2348   0.6022   1.0000
          IH |   0.8675   0.9494  -0.3498  -0.7334  -0.7388   0.2240  -0.5602
         CLT |   0.2120   0.5490   0.3847  -0.8393  -0.5838  -0.2689  -0.9116
          WH |  -0.4382  -0.7723   0.6549   0.7509   0.9372  -0.8509  -0.1182
        EQCR |   0.3514  -0.1848  -0.0461   0.6527   0.6228  -0.3963   0.0142
          QC |  -0.2309  -0.5983   0.6107   0.6562   0.8527  -0.8772  -0.2670
         QCS |  -0.2155  -0.6922   0.2920   0.9111   0.9494  -0.5919   0.1751
      TWI_QC |  -0.4956  -0.8761   0.4338   0.9309   0.9998  -0.6155   0.2436
         TWI |   0.1047  -0.0396  -0.7975   0.3648  -0.0036   0.7828   0.9538
         USI |   0.8647   0.6194  -0.9139  -0.0300  -0.3373   0.6482   0.3390
         PUN |  -0.5925  -0.3956  -0.1981   0.1781   0.0122   0.5115   0.7583
         DEF |  -0.0516  -0.5709  -0.0882   0.9507   0.8358  -0.2312   0.4903

             |       IH      CLT       WH     EQCR       QC      QCS   TWI_QC
-------------+---------------------------------------------------------------
          IH |   1.0000
         CLT |   0.6636   1.0000
          WH |  -0.5418  -0.2733   1.0000
        EQCR |   0.0131  -0.3753   0.6481   1.0000
          QC |  -0.3276  -0.1466   0.9710   0.7539   1.0000
         QCS |  -0.5132  -0.5633   0.9139   0.8369   0.8967   1.0000
      TWI_QC |  -0.7327  -0.5936   0.9343   0.6350   0.8518   0.9543   1.0000
         TWI |  -0.2855  -0.8098  -0.3410   0.0095  -0.4366   0.0066   0.0086
         USI |   0.5335  -0.2763  -0.4472   0.3767  -0.3232  -0.0818  -0.3186
         PUN |  -0.6604  -0.5285  -0.2731  -0.5827  -0.4918  -0.2217   0.0082
         DEF |  -0.4887  -0.8003   0.6883   0.8324   0.6610   0.9207   0.8465

             |      TWI      USI      PUN      DEF
-------------+------------------------------------
         TWI |   1.0000
         USI |   0.5815   1.0000
         PUN |   0.6447  -0.1965   1.0000
         DEF |   0.3803   0.2318  -0.0428   1.0000

Please note the very high positive correlation of 0.9998 between SRTH, representing manager training hours, and TWI_QC, representing the number of quality control deficiencies. Judging by their definitions, these two variables seem unrelated. Why is this?
Furthermore, many pairs of variables have correlation coefficients above 0.7, even though most of them do not appear strongly related by definition.
I am truly puzzled by this.

Here is my dataset:

Code:

The total sample size is 1,528, with the composition of Big 4 audit firms in the sample as follows: D: 37%, P: 27%, K: 24%, E: 11%. Only the 2 mentioned AQIs are presented here.

In short, I suspect that my issue may arise from high correlations among the explanatory variables rather than being a technical problem with Stata.
However, I'm unsure how to confirm and explain this.

Last edited by Winston Wu; 17 May 2024, 04:01.

Nick Cox

Join Date: Mar 2014

Posts: 35696
#2

17 May 2024, 04:20

I am not sure where to pitch this explanation but nothing here indicates to me a technical problem with Stata.

You're being given a message about your data and the model you're building.

Collinearity can mean more than just that predictor variables are strongly correlated pairwise. It means here that a variable is dropped from the model because it is redundant given what else is on offer.

What does "on a literal basis" mean here? Sometimes, for example, high correlations are artefacts of massive outliers. As in your introductory course, look at a scatter plot (matrix) to see what is going on. Here it seems more likely that you have many duplicate observations as far as essentially categorical variables are concerned, especially given your use of industry as a factor variable.
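
To make the point in #2 concrete, here is a minimal sketch on toy data, not the poster's dataset. The names firm, a1 to a5 and y, and all the values, are hypothetical. Each of the five firm-level variables is constant within a firm, and there are only 4 firms, so together with the constant they can span at most 4 dimensions. regress can therefore keep at most 3 of them, whatever their pairwise correlations.

```stata
* Toy data (hypothetical values): one row per firm, five firm-level variables
clear
input firm a1 a2 a3 a4 a5
1 101.7 4 7 5 0
2  12.5 1 2 9 5
3  10.5 2 8 8 6
4  11.6 0 5 2 2
end

* Three client companies per firm, so the firm-level values are duplicated
expand 3
set seed 2024
generate y = rnormal()          // client-level outcome, pure noise here

* How many distinct patterns do the firm-level variables take?
egen pattern = group(a1 a2 a3 a4 a5)
tabulate pattern                // 4 patterns, 3 observations each

* With a constant and only 4 patterns, at most 3 of a1-a5 are estimable,
* so regress should flag two of them as omitted because of collinearity
regress y a1 a2 a3 a4 a5

* Scatter plot matrix: every panel shows only 4 distinct points
graph matrix a1 a2 a3 a4 a5
```

On the real data, the analogous check would be egen pattern = group($AQI) followed by tabulate pattern. If that shows only four patterns, the omissions in #1 are what one would expect.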
Winston Wu

Join Date: May 2024

Posts: 5
#3

17 May 2024, 06:37

Thanks for your explanation, Nick.

"On a literal basis" refers to the definitions of the variables: judging by what they measure, these variables do not seem as if they should be so strongly correlated.

After examining the scatter plots of the independent variables, I noticed that each plot only had four points, as expected.
However, not all variables with high correlations exhibit a linear pattern. How should this be interpreted?
Nick Cox

Join Date: Mar 2014

Posts: 35696
#4

17 May 2024, 06:57

The essential question is what a variable adds to a model given (a linear combination of) the other variables already in the model. It's not just a story of pairwise correlations.

But what I think you most need is advice on what kind of model makes sense for your kind of data. As I don't work in your field I can't comment helpfully.

I don't know whether an R-square of 8% with that many predictors is just about what can be expected, or if there is scope to do much better.
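
The remark in #4 about linear combinations can be illustrated with a small sketch on toy data. The names a1 to a4 and the values are hypothetical, and this is not the poster's dataset. The pairwise correlations of a4 with the other variables are modest, yet a4 is reproduced exactly by a constant plus a1, a2 and a3, because four coefficients can always fit four distinct data points.

```stata
* Toy data (hypothetical values): one row per firm, four firm-level variables
clear
input firm a1 a2 a3 a4
1 101.7 4 7 5
2  12.5 1 2 9
3  10.5 2 8 8
4  11.6 0 5 2
end
expand 3                        // three client companies per firm

* Pairwise correlations of a4 with a1, a2, a3 are modest in this example ...
pwcorr a1 a2 a3 a4

* ... yet the auxiliary regression fits a4 perfectly
regress a4 a1 a2 a3
display "R-squared = " e(r2)    // should be 1 up to rounding

* _rmcoll reports which variables would be dropped as collinear
_rmcoll a1 a2 a3 a4
display "`r(varlist)'"
```

The same auxiliary regression on the real data, for example regress CPATH CPAEXP QCEXP SREXP, would be one way to confirm that an omitted variable is an exact linear combination of the retained ones.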
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#5

17 May 2024, 07:37

Winston:
as an aside, I would start off with a more parsimonious model and increase the focus on the data generating process you're interested in.
In addition, with such a large sample, it is difficult to believe that you can make it with default standard errors.
Eventually, you can investigate your model specification via -linktest-.

Kind regards,
Carlo
(Stata 19.0)
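
The two suggestions in #5 can be sketched as follows. The example uses the auto dataset shipped with Stata only to show the syntax. The choice of vce(robust) is an assumption here: #5 does not say which non-default standard errors are meant.

```stata
* Shipped example dataset, used only to illustrate the syntax
sysuse auto, clear

* Heteroskedasticity-robust standard errors instead of the default ones
regress price mpg weight, vce(robust)

* Specification check: refit, then test whether the squared prediction
* (_hatsq) has explanatory power; a significant _hatsq hints at misspecification
regress price mpg weight
linktest
```

For the model in #1, the analogous call would be regress DA_abs CPAEXP QCEXP SREXP $CV i.Industry, vce(robust). Clustering on audit firm would be a weaker option here because there are only four firms.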

Data excerpt from #1:

coid  year  audit firm  DA_abs    SRTH   TWI_QC
2535  2022  D           0.063792  101.7  4
4551  2022  D           0.003987  101.7  4
8932  2022  D           0.040716  101.7  4
6689  2022  K           0.018856  12.5   1
3138  2022  K           0.026391  12.5   1
6284  2022  K           0.01487   12.5   1
1736  2022  E           0.005425  10.5   2
1445  2022  E           0.010722  10.5   2
3416  2022  E           0.077717  10.5   2
1215  2022  P           0.025043  11.6   0
2597  2022  P           0.023182  11.6   0
4966  2022  P           0.05779   11.6   0
