
  • Using Bayes regression to handle collinearity among independent variables?

    Dear Stata users,
    I am analysing the association between exposure to heavy metals in pregnancy and childhood outcomes. The analysis involves one continuous outcome and three exposures (three different toxic metals), and I am interested in whether there are any interactions among these three metals. However, the metal variables are somewhat correlated (0.10-0.50), and when I include interaction terms in the model, some of the terms exhibit correlations above 0.70. PCA is not an option here, since I am interested in the effects of the interactions between the variables.
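    For reference, the pairwise correlations among the exposures can be checked with -pwcorr- (a minimal sketch of this step):
    Code:
    pwcorr logAs logCd logPb, sig
    My model and VIF check: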

    Code:
    regress outcome c.logAs##c.logCd##c.logPb
    estat vif
    which gives:
    Code:
        Variable |       VIF       1/VIF 
    -------------+----------------------
           logAs |    143.41    0.006973
           logCd |     17.56    0.056961
         c.logAs#|
         c.logCd |    122.00    0.008197
           logPb |      4.84    0.206595
         c.logAs#|
         c.logPb |    135.32    0.007390
         c.logCd#|
         c.logPb |     17.51    0.057119
         c.logAs#|
         c.logCd#|
         c.logPb |    114.29    0.008750
    -------------+----------------------
        Mean VIF |     79.27
    I played around with Bayesian regression a bit and ended up mean-centering the exposure variables and using the following code:
    Code:
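    * mean-centering the exposures first (a sketch of this step)
    summarize logAs, meanonly
    generate c_logAs = logAs - r(mean)
    summarize logCd, meanonly
    generate c_logCd = logCd - r(mean)
    summarize logPb, meanonly
    generate c_logPb = logPb - r(mean)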
    bayes, gibbs: regress outcome c.c_logAs##c.c_logCd##c.c_logPb
    bayesstats ess
    bayesgraph diagnostics _all
    bayesgraph matrix _all
    which gave me the following outputs:
    Code:
    Model summary
    -----------------------------------------------------------------------------------------------
    Likelihood:
      outcome ~ normal(xb_outcome,{sigma2})
    
    Priors:
                            {outcome:c_logAs} ~ normal(0,10000)                                 (1)
                            {outcome:c_logCd} ~ normal(0,10000)                                 (1)
                {outcome:c.c_logAs#c.c_logCd} ~ normal(0,10000)                                 (1)
                            {outcome:c_logPb} ~ normal(0,10000)                                 (1)
                {outcome:c.c_logAs#c.c_logPb} ~ normal(0,10000)                                 (1)
                {outcome:c.c_logCd#c.c_logPb} ~ normal(0,10000)                                 (1)
      {outcome:c.c_logAs#c.c_logCd#c.c_logPb} ~ normal(0,10000)                                 (1)
                              {outcome:_cons} ~ normal(0,10000)                                 (1)
                                     {sigma2} ~ igamma(.01,.01)
    -----------------------------------------------------------------------------------------------
    (1) Parameters are elements of the linear form xb_outcome.
    
    Bayesian linear regression                                        MCMC iterations  =     12,500
    Gibbs sampling                                                    Burn-in          =      2,500
                                                                      MCMC sample size =     10,000
                                                                      Number of obs    =        784
                                                                      Acceptance rate  =          1
                                                                      Efficiency:  min =      .9792
                                                                                   avg =      .9977
    Log marginal likelihood = -1166.9975                                           max =          1
     
    -----------------------------------------------------------------------------------------------
                                  |                                                Equal-tailed
                                  |      Mean   Std. Dev.     MCSE     Median  [95% Cred. Interval]
    ------------------------------+----------------------------------------------------------------
    outcome                       |
                          c_logAs | -.0134398   .0435736   .000436   -.013836  -.1000772   .0706683
                          c_logCd |  .0280181   .0536855   .000537   .0278446  -.0759983   .1332477
                                  |
              c.c_logAs#c.c_logCd |  .0042652   .0625478   .000625   .0043268  -.1183597   .1272451
                                  |
                          c_logPb | -.0868145   .0832914   .000818  -.0873019  -.2495841   .0755957
                                  |
              c.c_logAs#c.c_logPb |  .0670613   .1036493   .001036   .0664074  -.1356799   .2720652
                                  |
              c.c_logCd#c.c_logPb | -.1586534   .0944886   .000955  -.1576177  -.3435851     .02337
                                  |
    c.c_logAs#c.c_logCd#c.c_logPb | -.0640544   .1351075   .001351  -.0640332  -.3267847   .2029934
                                  |
                            _cons | -.0330334   .0366403   .000366  -.0332321  -.1041869   .0400286
    ------------------------------+----------------------------------------------------------------
                           sigma2 |   .985651   .0500573   .000501   .9842162   .8922606   1.088576
    -----------------------------------------------------------------------------------------------
    Note: Default priors are used for model parameters.
    
    
    Efficiency summaries                     MCMC sample size =    10,000
     
    ---------------------------------------------------------------------
                                  |        ESS   Corr. time    Efficiency
    ------------------------------+--------------------------------------
    outcome                       |
                          c_logAs |   10000.00         1.00        1.0000
                          c_logCd |   10000.00         1.00        1.0000
                                  |
              c.c_logAs#c.c_logCd |   10000.00         1.00        1.0000
                                  |
                          c_logPb |   10000.00         1.00        1.0000
                                  |
              c.c_logAs#c.c_logPb |   10000.00         1.00        1.0000
                                  |
              c.c_logCd#c.c_logPb |    9792.00         1.02        0.9792
                                  |
    c.c_logAs#c.c_logCd#c.c_logPb |   10000.00         1.00        1.0000
                                  |
                            _cons |   10000.00         1.00        1.0000
    ------------------------------+--------------------------------------
                           sigma2 |   10000.00         1.00        1.0000
    ---------------------------------------------------------------------
    [Figure: bayesgraph diagnostics output (Graph__7.png)]

    (The diagnostic plots all look similar to the ones above)

    [Figure: bayesgraph matrix output (Graph__1.png)]


    Based on these outputs, does it seem that I have (mostly) got rid of the collinearity problem and obtained potentially useful results?


    Best,
    Kjell Weyde

  • #2
    Kjell:
    interactions usually come with a large dose of quasi-extreme multicollinearity, as you can see from this meaningless toy example:
    Code:
    . sysuse auto.dta
    (1978 Automobile Data)
    
    . regress price c.mpg##c.mpg##c.displacement
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(5, 68)        =      8.07
           Model |   236571490         5    47314298   Prob > F        =    0.0000
        Residual |   398493906        68   5860204.5   R-squared       =    0.3725
    -------------+----------------------------------   Adj R-squared   =    0.3264
           Total |   635065396        73  8699525.97   Root MSE        =    2420.8
    
    --------------------------------------------------------------------------------------------
                         price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------------------------+----------------------------------------------------------------
                           mpg |  -190.4126   671.0174    -0.28   0.777    -1529.407    1148.582
                               |
                   c.mpg#c.mpg |  -.6939216   13.61563    -0.05   0.960    -27.86349    26.47564
                               |
                  displacement |     64.969   38.09187     1.71   0.093    -11.04214    140.9801
                               |
          c.mpg#c.displacement |  -5.996425   4.086512    -1.47   0.147    -14.15093    2.158083
                               |
    c.mpg#c.mpg#c.displacement |   .1398279   .1089167     1.28   0.204    -.0775119    .3571678
                               |
                         _cons |   9593.572   8672.604     1.11   0.273    -7712.339    26899.48
    --------------------------------------------------------------------------------------------
    
    . estat vif
    
        Variable |       VIF       1/VIF 
    -------------+----------------------
             mpg |    187.74    0.005326
     c.mpg#c.mpg |    183.73    0.005443
    displacement |    152.44    0.006560
           c.mpg#|
              c. |
    displacement |    338.67    0.002953
     c.mpg#c.mpg#|
              c. |
    displacement |    141.87    0.007049
    Quasi-extreme multicollinearity is also suggested by the very large 95% CIs and the lack of statistical significance of the coefficients, even though the F-test rejects the null that all predictors (but the constant) are jointly zero.
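    For illustration, mean-centering the continuous predictors before forming the interactions typically shrinks these VIFs a great deal (a quick sketch on the same auto data; output omitted):
    Code:
    sysuse auto, clear
    summarize mpg, meanonly
    generate c_mpg = mpg - r(mean)
    summarize displacement, meanonly
    generate c_displacement = displacement - r(mean)
    regress price c.c_mpg##c.c_mpg##c.c_displacement
    estat vif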
    As usual, the main issue is checking whether your model specification gives a fair and true view of the data-generating process (in your case: should the right-hand side of your regression equation include only those three predictors?), regardless of the frequentist or Bayesian flavour.

    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Carlo, thank you for your reply!
      I have now done the following:
      Code:
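      * constructing the centered product terms first (a sketch of this step)
      generate logAslogCd = c_logAs*c_logCd
      generate logAslogPb = c_logAs*c_logPb
      generate logPblogCd = c_logPb*c_logCd
      generate logAslogCdlogPb = c_logAs*c_logCd*c_logPb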
      regress outcome c_logAs c_logCd c_logPb logAslogCd logAslogPb logPblogCd logAslogCdlogPb
      collin c_logAs c_logCd c_logPb logAslogCd logAslogPb logPblogCd logAslogCdlogPb if !missing(outcome)
      The -collin- command gave this output:
      Code:
        Collinearity Diagnostics
      
                                    SQRT                   R-
                 Variable      VIF     VIF    Tolerance    Squared
       ------------------------------------------------------------
                  c_logAs      1.06    1.03    0.9400      0.0600
                  c_logCd      1.09    1.04    0.9163      0.0837
                  c_logPb      1.08    1.04    0.9291      0.0709
               logAslogCd      1.08    1.04    0.9225      0.0775
               logAslogPb      1.06    1.03    0.9409      0.0591
               logPblogCd      1.06    1.03    0.9473      0.0527
          logAslogCdlogPb      1.10    1.05    0.9107      0.0893
       ------------------------------------------------------------
                 Mean VIF      1.08
      
                                 Cond
              Eigenval          Index
      ---------------------------------
          1     1.4189          1.0000
          2     1.3130          1.0396
          3     1.2402          1.0697
          4     1.0191          1.1800
          5     0.8727          1.2751
          6     0.7694          1.3580
          7     0.6914          1.4326
          8     0.6752          1.4496
      ---------------------------------
       Condition Number         1.4496
       Eigenvalues & Cond Index computed from scaled raw sscp (w/ intercept)
       Det(correlation matrix)    0.7730
      Doesn't this low condition number indicate that there are no overall or "joint" collinearity problems in the model?

      The model is going to have a few more covariates, such as age and gender.
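      With factor-variable notation, the extended model might look something like this (a sketch; -age- and -gender- are assumed variable names, with gender treated as categorical):
      Code:
      bayes, gibbs: regress outcome c_logAs c_logCd c_logPb logAslogCd logAslogPb ///
          logPblogCd logAslogCdlogPb c.age i.gender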


      Best regards,
      Kjell

      Comment


      • #4
        Kjell:
        your new regression model actually shows no evidence of quasi-extreme multicollinearity: go that way.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment
