Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Predominant zeros, but not zero-inflated

    Hello Statalisters.

    I hope that you might be able to advise on a problem with many zero values that come from observations of zero response, not missing observations. This is a long post, but contains a lot of analysis thus far.

    The problem concerns a developmental toxicity study, in which groups of animals were administered graded doses of a test substance. The highest doses were associated with an increase in post-implantation loss (lost embryos), but also severe body weight reduction. My hypothesis is that the former observation is related to the latter.

    scatter post_r.png


    Post-Implantation loss can be expressed as a count value (i.e. the number of lost implants), but also, and probably more usefully, as a proportion of the total number of implantation sites. In the dataset below, these are designated post_c and post_r respectively, while bwg is bodyweight gain and is the number of implants:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(group id is) float(bwg post_r post_c)
    0  1  7  194         0  0
    0  2  8   42         0  0
    0  3  4  -63         1  4
    0  4  8  171         0  0
    0  5  8   99         0  0
    0  6  7  126         0  0
    0  7  6  119         0  0
    0  9  7  235         0  0
    0 10  8  147         0  0
    0 11  7  230         0  0
    0 12  8  234      .125  1
    0 13  6  187         0  0
    0 14  8  149         0  0
    0 15  3   64         0  0
    0 16 10  174         0  0
    0 17 12  162 .16666667  2
    0 18  7  112         0  0
    0 19 11  157  .1818182  2
    0 20 10  233        .1  1
    0 21 11   86         0  0
    1 23  5  104         0  0
    1 24  7  124         0  0
    1 25  7  232         0  0
    1 26 10  297         0  0
    1 27  8   63         0  0
    1 28  8  201         0  0
    1 29  6  211         0  0
    1 31  9  164         0  0
    1 32 12  158       .25  3
    1 33  9  123         0  0
    1 34  9  264         0  0
    1 35  8  216        .5  4
    1 36  7  182         0  0
    1 37  8  269         0  0
    1 38  9  241         0  0
    1 39  5   75        .2  1
    1 40  8  194         0  0
    1 41 13  109 .15384616  2
    1 42  8   77         0  0
    1 43 11   67         0  0
    2 45  7  198         0  0
    2 46  2 -101         1  2
    2 47 12  136 .16666667  2
    2 48 10    6         0  0
    2 49  7  157  .2857143  2
    2 50  9  100         0  0
    2 51  8  126         0  0
    2 52 11  320  .1818182  2
    2 53  6  111 .16666667  1
    2 54  9   47  .7777778  7
    2 55  8   51         0  0
    2 56  7  231         0  0
    2 57  7  154 .14285715  1
    2 58  7  112  .5714286  4
    2 59  4   79       .25  1
    2 60  8  150         0  0
    2 62 10  -14        .9  9
    2 63  8  107         0  0
    2 64 11  222  .0909091  1
    2 65  .  208         .  .
    2 66  9  101 .11111111  1
    3 67  6   -8         1  6
    3 68  9   87  .3333333  3
    3 69  7 -135         1  7
    3 70  8   66      .625  5
    3 71  9 -139         1  9
    3 72  2 -105         1  2
    3 73  6  147        .5  3
    3 74  .  -16         .  .
    3 76  6  -89         1  6
    3 77 10  -15         1 10
    3 78  8  -14         1  8
    3 79  7   78  .5714286  4
    3 80  9  141 .11111111  1
    3 81  8 -331         1  8
    3 82  8   95         0  0
    3 83  5  107         0  0
    3 84  8  105      .125  1
    3 85  1 -187         1  1
    3 86  9 -100         1  9
    3 87  7   76  .4285714  3
    end
    The first thing to note is that post_c, but not post_r, is overly dispersed:

    Code:
    . summarize post_c post_r, detail
    
                   post-implantation loss (count)
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                  79
    25%            0              0       Sum of Wgt.          79
    
    50%            0                      Mean           1.759494
                            Largest       Std. Dev.      2.690035
    75%            2              9
    90%            7              9       Variance       7.236287
    95%            9              9       Skewness       1.646021
    99%           10             10       Kurtosis       4.627665
    
                   post-implantation loss (ratio)
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            0              0
     5%            0              0
    10%            0              0       Obs                  79
    25%            0              0       Sum of Wgt.          79
    
    50%            0                      Mean           .2533763
                            Largest       Std. Dev.      .3703084
    75%     .4285714              1
    90%            1              1       Variance       .1371283
    95%            1              1       Skewness       1.237554
    99%            1              1       Kurtosis       2.899826
    The data are also heavily skewed to the left (this is normal for these types of data):

    histogram post_r.png


    Because the variance for post_r was similar to the mean, I applied a Poisson regression (with and without bwg as a covariate):

    Code:
    . poisson post_r i.group
    note: you are responsible for interpretation of noncount dep. variable
    
    Iteration 0:   log likelihood = -37.021363  
    Iteration 1:   log likelihood = -36.948318  
    Iteration 2:   log likelihood = -36.948232  
    Iteration 3:   log likelihood = -36.948232  
    
    Poisson regression                              Number of obs     =         79
                                                    LR chi2(3)        =      16.76
                                                    Prob > chi2       =     0.0008
    Log likelihood = -36.948232                     Pseudo R2         =     0.1849
    
    ------------------------------------------------------------------------------
          post_r |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           group |
              1  |  -.3544922   1.241554    -0.29   0.775    -2.787893    2.078908
              2  |   1.082488   .9223988     1.17   0.241    -.7253808    2.890356
              3  |   2.139165   .8451666     2.53   0.011     .4826688    3.795661
                 |
           _cons |  -2.542439   .7972026    -3.19   0.001    -4.104928    -.979951
    ------------------------------------------------------------------------------
    
    . estat ic
    
    Akaike's information criterion and Bayesian information criterion
    
    -----------------------------------------------------------------------------
           Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
    -------------+---------------------------------------------------------------
               . |         79 -45.32978  -36.94823       4    81.89646   91.37425
    -----------------------------------------------------------------------------
                   Note: N=Obs used in calculating BIC; see [R] BIC note.
    
    . poisson post_r i.group c.bwg
    note: you are responsible for interpretation of noncount dep. variable
    
    Iteration 0:   log likelihood = -34.704699  
    Iteration 1:   log likelihood = -34.236766  
    Iteration 2:   log likelihood = -34.235014  
    Iteration 3:   log likelihood = -34.235014  
    
    Poisson regression                              Number of obs     =         79
                                                    LR chi2(4)        =      22.19
                                                    Prob > chi2       =     0.0002
    Log likelihood = -34.235014                     Pseudo R2         =     0.2448
    
    ------------------------------------------------------------------------------
          post_r |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           group |
              1  |  -.2393556   1.242709    -0.19   0.847    -2.675021     2.19631
              2  |    .940722   .9248465     1.02   0.309    -.8719438    2.753388
              3  |   1.380574    .920795     1.50   0.134    -.4241513    3.185299
                 |
             bwg |   -.004207   .0017531    -2.40   0.016    -.0076431    -.000771
           _cons |  -1.992704   .8232568    -2.42   0.015    -3.606257     -.37915
    ------------------------------------------------------------------------------
    
    . estat ic
    
    Akaike's information criterion and Bayesian information criterion
    
    -----------------------------------------------------------------------------
           Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
    -------------+---------------------------------------------------------------
               . |         79 -45.32978  -34.23501       5    78.47003   90.31727
    -----------------------------------------------------------------------------
                   Note: N=Obs used in calculating BIC; see [R] BIC note.
    Inclusion of bwg as a covariate improved the model fit, and showed that it is an influence on the post-implantation loss outcome.

    But, a statistician who is involved in this project states that Poisson regression can only be used for count (integer) data, despite me pointing out Joseph Coveney's nicely-argued article on Statalist

    All the values, in fact, can even be less than 1!
    Since post_c is overly dispersed, a negative binomial regression with ln(is) was applied:

    Code:
    . gen log_is=ln(is)
    (2 missing values generated)
    
    . nbreg post_c i.group c.bwg, offset(log_is)
    
    Fitting Poisson model:
    
    Iteration 0:   log likelihood = -124.58556  
    Iteration 1:   log likelihood = -121.30488  
    Iteration 2:   log likelihood = -121.29454  
    Iteration 3:   log likelihood = -121.29454  
    
    Fitting constant-only model:
    
    Iteration 0:   log likelihood = -144.25986  
    Iteration 1:   log likelihood = -138.41096  
    Iteration 2:   log likelihood = -137.84527  
    Iteration 3:   log likelihood =  -137.8446  
    Iteration 4:   log likelihood =  -137.8446  
    
    Fitting full model:
    
    Iteration 0:   log likelihood =  -126.4518  
    Iteration 1:   log likelihood = -122.80367  
    Iteration 2:   log likelihood = -115.63584  
    Iteration 3:   log likelihood = -115.10791  
    Iteration 4:   log likelihood = -115.03336  
    Iteration 5:   log likelihood = -115.03299  
    Iteration 6:   log likelihood = -115.03299  
    
    Negative binomial regression                    Number of obs     =         79
                                                    LR chi2(4)        =      45.62
    Dispersion     = mean                           Prob > chi2       =     0.0000
    Log likelihood = -115.03299                     Pseudo R2         =     0.1655
    
    ------------------------------------------------------------------------------
          post_c |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           group |
              1  |   .0519224    .507331     0.10   0.918     -.942428    1.046273
              2  |   .9845104    .431245     2.28   0.022     .1392857    1.829735
              3  |   1.454757   .4485487     3.24   0.001     .5756181    2.333897
                 |
             bwg |  -.0053099   .0013785    -3.85   0.000    -.0080117   -.0026081
           _cons |  -2.036627   .3913169    -5.20   0.000    -2.803594    -1.26966
          log_is |          1  (offset)
    -------------+----------------------------------------------------------------
        /lnalpha |  -.7299079   .5161397                     -1.741523    .2817073
    -------------+----------------------------------------------------------------
           alpha |   .4819534   .2487553                      .1752532    1.325391
    ------------------------------------------------------------------------------
    LR test of alpha=0: chibar2(01) = 12.52                Prob >= chibar2 = 0.000
    
    . estat ic
    
    Akaike's information criterion and Bayesian information criterion
    
    -----------------------------------------------------------------------------
           Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
    -------------+---------------------------------------------------------------
               . |         79 -137.8446   -115.033       6     242.066   256.2827
    -----------------------------------------------------------------------------
                   Note: N=Obs used in calculating BIC; see [R] BIC note.
    When I ran -nbreg- against post_r I got, unsurprisingly, the same result as with -poisson-. But the result above doesn't make sense, either in terms of the biology or the earlier Poisson regression.

    Can anyone explain why the two approaches (one ratio, the other count with offset) differ so much? And which is the better model?

    I did consider zero-inflated models, but since the high proportion of zeros come from observations of zero rather than missing observations, then I don't think that they are applicable. This FAQ article seems to agree.
    Stata 14.2MP
    OS X
Working...
X