  • Correlation and collinearity

    Hello, Statalist.
    When I run regress in Stata, it omits every AQI variable after the third because of collinearity, as shown below. (The AQI variables are characteristics of Big 4 audit firms: 18 variables in total, each taking only 4 distinct values within a year. DA_abs is derived from each company's financial figures.)

    Code:
    . global AQI "CPAEXP QCEXP SREXP CPATH SRTH TR SUP_R IH CLT WH EQCR QC QCS TWI_QC TWI USI PUN DEF"
    
    . global CV "CFO GROWTH LEV LOSS SIZE PREACC AGE"  //control variables
    
    . reg DA_abs $AQI $CV i.Industry
    note: CPATH omitted because of collinearity.
    note: SRTH omitted because of collinearity.
    note: TR omitted because of collinearity.
    note: SUP_R omitted because of collinearity.
    note: IH omitted because of collinearity.
    note: CLT omitted because of collinearity.
    note: WH omitted because of collinearity.
    note: EQCR omitted because of collinearity.
    note: QC omitted because of collinearity.
    note: QCS omitted because of collinearity.
    note: TWI_QC omitted because of collinearity.
    note: TWI omitted because of collinearity.
    note: USI omitted because of collinearity.
    note: PUN omitted because of collinearity.
    note: DEF omitted because of collinearity.
    
          Source |       SS           df       MS      Number of obs   =     1,528
    -------------+----------------------------------   F(26, 1501)     =      5.15
           Model |  .340563159        26  .013098583   Prob > F        =    0.0000
        Residual |  3.82082189     1,501  .002545518   R-squared       =    0.0818
    -------------+----------------------------------   Adj R-squared   =    0.0659
           Total |  4.16138505     1,527  .002725203   Root MSE        =    .05045
    
    ------------------------------------------------------------------------------
          DA_abs | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          CPAEXP |   .0014317   .0043636     0.33   0.743    -.0071277    .0099912
           QCEXP |  -.0007106   .0016029    -0.44   0.658    -.0038547    .0024336
           SREXP |    .000113   .0018543     0.06   0.951    -.0035242    .0037502
           CPATH |          0  (omitted)
            SRTH |          0  (omitted)
              TR |          0  (omitted)
           SUP_R |          0  (omitted)
              IH |          0  (omitted)
             CLT |          0  (omitted)
              WH |          0  (omitted)
            EQCR |          0  (omitted)
              QC |          0  (omitted)
             QCS |          0  (omitted)
          TWI_QC |          0  (omitted)
             TWI |          0  (omitted)
             USI |          0  (omitted)
             PUN |          0  (omitted)
             DEF |          0  (omitted)
             CFO |  -.0538528   .0132678    -4.06   0.000    -.0798782   -.0278274
          GROWTH |   .0131254   .0034793     3.77   0.000     .0063006    .0199502
             LEV |    .021764   .0081384     2.67   0.008     .0058001    .0377279
            LOSS |   .0034688   .0039552     0.88   0.381    -.0042895    .0112271
            SIZE |    -.00372   .0010153    -3.66   0.000    -.0057115   -.0017284
          PREACC |   .0340575   .0138054     2.47   0.014     .0069775    .0611375
             AGE |   -.006149   .0028621    -2.15   0.032    -.0117631   -.0005349
                 |
        Industry |
             12  |   .0183602   .0182958     1.00   0.316    -.0175279    .0542482
             13  |   .0064839   .0176731     0.37   0.714    -.0281826    .0411505
             14  |   .0056012   .0169048     0.33   0.740    -.0275583    .0387608
             15  |   .0118865   .0159979     0.74   0.458    -.0194941     .043267
             16  |   -.003379   .0235514    -0.14   0.886    -.0495761    .0428181
             17  |     .00394    .015913     0.25   0.804    -.0272741    .0351542
             18  |  -.0112597   .0257108    -0.44   0.661    -.0616927    .0391733
             20  |   .0236537   .0169087     1.40   0.162    -.0095134    .0568209
             21  |  -.0028518   .0227249    -0.13   0.900    -.0474277    .0417241
             22  |   .0288204   .0200851     1.43   0.152    -.0105774    .0682182
             23  |   .0226668   .0154395     1.47   0.142    -.0076185    .0529521
             25  |   .0302591   .0165336     1.83   0.067    -.0021724    .0626906
             26  |    .011962   .0180431     0.66   0.507    -.0234304    .0473544
             27  |   .0224664   .0177375     1.27   0.205    -.0123265    .0572593
             29  |   .0124128   .0185279     0.67   0.503    -.0239305     .048756
             99  |   .0175747   .0163523     1.07   0.283     -.014501    .0496505
                 |
           _cons |   .1004619   .0570138     1.76   0.078    -.0113733    .2122971
    ------------------------------------------------------------------------------
    I suspect collinearity issues may arise from high correlations among the AQI variables. Below are the results of Pearson correlation coefficient analysis:
    Code:
     pwcorr $AQI
    
                 |   CPAEXP    QCEXP    SREXP    CPATH     SRTH       TR    SUP_R
    -------------+---------------------------------------------------------------
          CPAEXP |   1.0000
           QCEXP |   0.8452   1.0000
           SREXP |  -0.6573  -0.5525   1.0000
           CPATH |  -0.3554  -0.7896   0.0858   1.0000
            SRTH |  -0.5103  -0.8831   0.4499   0.9252   1.0000
              TR |   0.3813   0.5086  -0.8939  -0.2927  -0.6248   1.0000
           SUP_R |  -0.1781  -0.3338  -0.5800   0.5518   0.2348   0.6022   1.0000
              IH |   0.8675   0.9494  -0.3498  -0.7334  -0.7388   0.2240  -0.5602
             CLT |   0.2120   0.5490   0.3847  -0.8393  -0.5838  -0.2689  -0.9116
              WH |  -0.4382  -0.7723   0.6549   0.7509   0.9372  -0.8509  -0.1182
            EQCR |   0.3514  -0.1848  -0.0461   0.6527   0.6228  -0.3963   0.0142
              QC |  -0.2309  -0.5983   0.6107   0.6562   0.8527  -0.8772  -0.2670
             QCS |  -0.2155  -0.6922   0.2920   0.9111   0.9494  -0.5919   0.1751
          TWI_QC |  -0.4956  -0.8761   0.4338   0.9309   0.9998  -0.6155   0.2436
             TWI |   0.1047  -0.0396  -0.7975   0.3648  -0.0036   0.7828   0.9538
             USI |   0.8647   0.6194  -0.9139  -0.0300  -0.3373   0.6482   0.3390
             PUN |  -0.5925  -0.3956  -0.1981   0.1781   0.0122   0.5115   0.7583
             DEF |  -0.0516  -0.5709  -0.0882   0.9507   0.8358  -0.2312   0.4903
    
                 |       IH      CLT       WH     EQCR       QC      QCS   TWI_QC
    -------------+---------------------------------------------------------------
              IH |   1.0000
             CLT |   0.6636   1.0000
              WH |  -0.5418  -0.2733   1.0000
            EQCR |   0.0131  -0.3753   0.6481   1.0000
              QC |  -0.3276  -0.1466   0.9710   0.7539   1.0000
             QCS |  -0.5132  -0.5633   0.9139   0.8369   0.8967   1.0000
          TWI_QC |  -0.7327  -0.5936   0.9343   0.6350   0.8518   0.9543   1.0000
             TWI |  -0.2855  -0.8098  -0.3410   0.0095  -0.4366   0.0066   0.0086
             USI |   0.5335  -0.2763  -0.4472   0.3767  -0.3232  -0.0818  -0.3186
             PUN |  -0.6604  -0.5285  -0.2731  -0.5827  -0.4918  -0.2217   0.0082
             DEF |  -0.4887  -0.8003   0.6883   0.8324   0.6610   0.9207   0.8465
    
                 |      TWI      USI      PUN      DEF
    -------------+------------------------------------
             TWI |   1.0000
             USI |   0.5815   1.0000
             PUN |   0.6447  -0.1965   1.0000
             DEF |   0.3803   0.2318  -0.0428   1.0000
    Please note the very high positive correlation of 0.9998 between SRTH, representing manager training hours, and TWI_QC, representing the number of quality control deficiencies. However, these two variables seem unrelated on a literal basis. Why is this?
    Furthermore, many pairs of variables show correlation coefficients exceeding 0.7, even though most of them have no apparent strong relationship on a literal basis.
    I am truly puzzled by this.

    Here is my dataset:
    Code:
     
    coid  year  audit firm  DA_abs    SRTH   TWI_QC
    2535  2022  D           0.063792  101.7  4
    4551  2022  D           0.003987  101.7  4
    8932  2022  D           0.040716  101.7  4
    6689  2022  K           0.018856  12.5   1
    3138  2022  K           0.026391  12.5   1
    6284  2022  K           0.01487   12.5   1
    1736  2022  E           0.005425  10.5   2
    1445  2022  E           0.010722  10.5   2
    3416  2022  E           0.077717  10.5   2
    1215  2022  P           0.025043  11.6   0
    2597  2022  P           0.023182  11.6   0
    4966  2022  P           0.05779   11.6   0
    The total sample size is 1,528, with the composition of Big 4 audit firms in the sample as follows: D: 37%, P: 27%, K: 24%, E: 11%. Only two of the AQIs (SRTH and TWI_QC) are shown here.

    In short, I suspect that my issue may arise from high correlations among the explanatory variables rather than being a technical problem with Stata.
    However, I'm unsure how to confirm and explain this.
    Last edited by Winston Wu; 17 May 2024, 04:01.

  • #2
    I am not sure where to pitch this explanation but nothing here indicates to me a technical problem with Stata.

    You're being given a message about your data and the model you're building.

    Collinearity can mean more than just that predictor variables are strongly correlated pairwise. It means here that a variable is dropped from the model because it is redundant given what else is on offer.

    What does "on a literal basis" mean here? Sometimes, for example, high correlations are artefacts of massive outliers. As in your introductory course, look at a scatter plot (matrix) to see what is going on. Here it seems more likely that you have many duplicate observations as far as essentially categorical variables are concerned, especially given your use of industry as a factor variable.
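
    A minimal sketch of these checks, assuming the global AQI from #1 is still defined (aqi_pattern is a hypothetical new variable name, and the four variables in the plot are an arbitrary choice):

    Code:
    * label each distinct combination of AQI values, then count observations per combination
    egen aqi_pattern = group($AQI)
    tabulate aqi_pattern
    
    * scatter plot matrix for a few of the AQIs
    graph matrix CPAEXP SRTH TWI_QC TR, half

    If tabulate shows only four patterns, one per audit firm, then each panel of the plot can contain at most four distinct points.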

    • #3
      Thanks for your explanation, Nick.

      "On a literal basis" refers to the definitions of the variables: on the surface, these variables do not seem to be so strongly correlated.

      After examining the scatter plots of the independent variables, I noticed that each plot only had four points, as expected.
      However, not all variables with high correlations exhibit a linear pattern. How should this be interpreted?

      • #4
        The essential question is what a variable adds to a model given (a linear combination of) the other variables in the model. It's not just a story of pairwise correlations.

        But what I think you most need is advice on what kind of model makes sense for your kind of data. As I don't work in your field I can't comment helpfully.

        I don't know whether an R-squared of 8% with that many predictors is just about what can be expected, or if there is scope to do much better.
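
        The point about linear combinations can be checked directly. As a sketch using the variable names from #1, regress an omitted AQI on the three that Stata retained. If the AQIs take only four distinct combinations of values, as the data excerpt in #1 suggests, a constant plus three non-redundant predictors fits those four points exactly. R-squared will then be 1 even where the pairwise correlation is small (for example, 0.0858 between SREXP and CPATH).

        Code:
        * each omitted AQI should be reproduced exactly by the three retained ones
        regress CPATH CPAEXP QCEXP SREXP
        regress TWI CPAEXP QCEXP SREXP

        This would also be consistent with exactly three AQIs surviving in #1: with four firms, a constant and three AQIs already exhaust the variation between firms.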

        • #5
          Winston:
          as an aside, I would start off with a more parsimonious model and increase the focus on the data generating process you're interested in.
          In addition, with such a large sample, it is difficult to believe that you can get by with default standard errors.
          Eventually, you can investigate your model specification via -linktest-.
          Kind regards,
          Carlo
          (Stata 19.0)
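
          A minimal sketch of these suggestions, keeping only the three AQIs that Stata retained in #1 and assuming the global CV is still defined. The choice of vce(robust) is an assumption about which non-default standard errors are meant; another variance estimator may suit the data better.

          Code:
          * refit without the redundant AQIs, with heteroskedasticity-robust standard errors
          regress DA_abs CPAEXP QCEXP SREXP $CV i.Industry, vce(robust)
          
          * specification check: a significant _hatsq hints at misspecification
          linktest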
