Predominant zeros, but not zero-inflated

Nigel Moore

Join Date: Apr 2016
Posts: 79

Predominant zeros, but not zero-inflated

30 Nov 2017, 00:21

Hello Statalisters.

I hope that you might be able to advise on a problem with many zero values that come from observations of zero response, not missing observations. This is a long post, but contains a lot of analysis thus far.

The problem concerns a developmental toxicity study, in which groups of animals were administered graded doses of a test substance. The highest doses were associated with an increase in post-implantation loss (lost embryos), but also severe body weight reduction. My hypothesis is that the former observation is related to the latter.

Post-Implantation loss can be expressed as a count value (i.e. the number of lost implants), but also, and probably more usefully, as a proportion of the total number of implantation sites. In the dataset below, these are designated post_c and post_r respectively, while bwg is bodyweight gain and is the number of implants:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(group id is) float(bwg post_r post_c)
0  1  7  194         0  0
0  2  8   42         0  0
0  3  4  -63         1  4
0  4  8  171         0  0
0  5  8   99         0  0
0  6  7  126         0  0
0  7  6  119         0  0
0  9  7  235         0  0
0 10  8  147         0  0
0 11  7  230         0  0
0 12  8  234      .125  1
0 13  6  187         0  0
0 14  8  149         0  0
0 15  3   64         0  0
0 16 10  174         0  0
0 17 12  162 .16666667  2
0 18  7  112         0  0
0 19 11  157  .1818182  2
0 20 10  233        .1  1
0 21 11   86         0  0
1 23  5  104         0  0
1 24  7  124         0  0
1 25  7  232         0  0
1 26 10  297         0  0
1 27  8   63         0  0
1 28  8  201         0  0
1 29  6  211         0  0
1 31  9  164         0  0
1 32 12  158       .25  3
1 33  9  123         0  0
1 34  9  264         0  0
1 35  8  216        .5  4
1 36  7  182         0  0
1 37  8  269         0  0
1 38  9  241         0  0
1 39  5   75        .2  1
1 40  8  194         0  0
1 41 13  109 .15384616  2
1 42  8   77         0  0
1 43 11   67         0  0
2 45  7  198         0  0
2 46  2 -101         1  2
2 47 12  136 .16666667  2
2 48 10    6         0  0
2 49  7  157  .2857143  2
2 50  9  100         0  0
2 51  8  126         0  0
2 52 11  320  .1818182  2
2 53  6  111 .16666667  1
2 54  9   47  .7777778  7
2 55  8   51         0  0
2 56  7  231         0  0
2 57  7  154 .14285715  1
2 58  7  112  .5714286  4
2 59  4   79       .25  1
2 60  8  150         0  0
2 62 10  -14        .9  9
2 63  8  107         0  0
2 64 11  222  .0909091  1
2 65  .  208         .  .
2 66  9  101 .11111111  1
3 67  6   -8         1  6
3 68  9   87  .3333333  3
3 69  7 -135         1  7
3 70  8   66      .625  5
3 71  9 -139         1  9
3 72  2 -105         1  2
3 73  6  147        .5  3
3 74  .  -16         .  .
3 76  6  -89         1  6
3 77 10  -15         1 10
3 78  8  -14         1  8
3 79  7   78  .5714286  4
3 80  9  141 .11111111  1
3 81  8 -331         1  8
3 82  8   95         0  0
3 83  5  107         0  0
3 84  8  105      .125  1
3 85  1 -187         1  1
3 86  9 -100         1  9
3 87  7   76  .4285714  3
end

The first thing to note is that post_c, but not post_r, is overly dispersed:

Code:

. summarize post_c post_r, detail

               post-implantation loss (count)
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                  79
25%            0              0       Sum of Wgt.          79

50%            0                      Mean           1.759494
                        Largest       Std. Dev.      2.690035
75%            2              9
90%            7              9       Variance       7.236287
95%            9              9       Skewness       1.646021
99%           10             10       Kurtosis       4.627665

               post-implantation loss (ratio)
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                  79
25%            0              0       Sum of Wgt.          79

50%            0                      Mean           .2533763
                        Largest       Std. Dev.      .3703084
75%     .4285714              1
90%            1              1       Variance       .1371283
95%            1              1       Skewness       1.237554
99%            1              1       Kurtosis       2.899826

The data are also heavily skewed to the left (this is normal for these types of data):

Because the variance for post_r was similar to the mean, I applied a Poisson regression (with and without bwg as a covariate):

Code:

. poisson post_r i.group
note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log likelihood = -37.021363  
Iteration 1:   log likelihood = -36.948318  
Iteration 2:   log likelihood = -36.948232  
Iteration 3:   log likelihood = -36.948232  

Poisson regression                              Number of obs     =         79
                                                LR chi2(3)        =      16.76
                                                Prob > chi2       =     0.0008
Log likelihood = -36.948232                     Pseudo R2         =     0.1849

------------------------------------------------------------------------------
      post_r |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |
          1  |  -.3544922   1.241554    -0.29   0.775    -2.787893    2.078908
          2  |   1.082488   .9223988     1.17   0.241    -.7253808    2.890356
          3  |   2.139165   .8451666     2.53   0.011     .4826688    3.795661
             |
       _cons |  -2.542439   .7972026    -3.19   0.001    -4.104928    -.979951
------------------------------------------------------------------------------

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
-------------+---------------------------------------------------------------
           . |         79 -45.32978  -36.94823       4    81.89646   91.37425
-----------------------------------------------------------------------------
               Note: N=Obs used in calculating BIC; see [R] BIC note.

. poisson post_r i.group c.bwg
note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log likelihood = -34.704699  
Iteration 1:   log likelihood = -34.236766  
Iteration 2:   log likelihood = -34.235014  
Iteration 3:   log likelihood = -34.235014  

Poisson regression                              Number of obs     =         79
                                                LR chi2(4)        =      22.19
                                                Prob > chi2       =     0.0002
Log likelihood = -34.235014                     Pseudo R2         =     0.2448

------------------------------------------------------------------------------
      post_r |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |
          1  |  -.2393556   1.242709    -0.19   0.847    -2.675021     2.19631
          2  |    .940722   .9248465     1.02   0.309    -.8719438    2.753388
          3  |   1.380574    .920795     1.50   0.134    -.4241513    3.185299
             |
         bwg |   -.004207   .0017531    -2.40   0.016    -.0076431    -.000771
       _cons |  -1.992704   .8232568    -2.42   0.015    -3.606257     -.37915
------------------------------------------------------------------------------

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
-------------+---------------------------------------------------------------
           . |         79 -45.32978  -34.23501       5    78.47003   90.31727
-----------------------------------------------------------------------------
               Note: N=Obs used in calculating BIC; see [R] BIC note.

Inclusion of bwg as a covariate improved the model fit, and showed that it is an influence on the post-implantation loss outcome.

But, a statistician who is involved in this project states that Poisson regression can only be used for count (integer) data, despite me pointing out Joseph Coveney's nicely-argued article on Statalist

All the values, in fact, can even be less than 1!

Since post_c is overly dispersed, a negative binomial regression with ln(is) was applied:

Code:

. gen log_is=ln(is)
(2 missing values generated)

. nbreg post_c i.group c.bwg, offset(log_is)

Fitting Poisson model:

Iteration 0:   log likelihood = -124.58556  
Iteration 1:   log likelihood = -121.30488  
Iteration 2:   log likelihood = -121.29454  
Iteration 3:   log likelihood = -121.29454  

Fitting constant-only model:

Iteration 0:   log likelihood = -144.25986  
Iteration 1:   log likelihood = -138.41096  
Iteration 2:   log likelihood = -137.84527  
Iteration 3:   log likelihood =  -137.8446  
Iteration 4:   log likelihood =  -137.8446  

Fitting full model:

Iteration 0:   log likelihood =  -126.4518  
Iteration 1:   log likelihood = -122.80367  
Iteration 2:   log likelihood = -115.63584  
Iteration 3:   log likelihood = -115.10791  
Iteration 4:   log likelihood = -115.03336  
Iteration 5:   log likelihood = -115.03299  
Iteration 6:   log likelihood = -115.03299  

Negative binomial regression                    Number of obs     =         79
                                                LR chi2(4)        =      45.62
Dispersion     = mean                           Prob > chi2       =     0.0000
Log likelihood = -115.03299                     Pseudo R2         =     0.1655

------------------------------------------------------------------------------
      post_c |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |
          1  |   .0519224    .507331     0.10   0.918     -.942428    1.046273
          2  |   .9845104    .431245     2.28   0.022     .1392857    1.829735
          3  |   1.454757   .4485487     3.24   0.001     .5756181    2.333897
             |
         bwg |  -.0053099   .0013785    -3.85   0.000    -.0080117   -.0026081
       _cons |  -2.036627   .3913169    -5.20   0.000    -2.803594    -1.26966
      log_is |          1  (offset)
-------------+----------------------------------------------------------------
    /lnalpha |  -.7299079   .5161397                     -1.741523    .2817073
-------------+----------------------------------------------------------------
       alpha |   .4819534   .2487553                      .1752532    1.325391
------------------------------------------------------------------------------
LR test of alpha=0: chibar2(01) = 12.52                Prob >= chibar2 = 0.000

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |        Obs  ll(null)  ll(model)      df         AIC        BIC
-------------+---------------------------------------------------------------
           . |         79 -137.8446   -115.033       6     242.066   256.2827
-----------------------------------------------------------------------------
               Note: N=Obs used in calculating BIC; see [R] BIC note.

When I ran -nbreg- against post_r I got, unsurprisingly, the same result as with -poisson-. But the result above doesn't make sense, either in terms of the biology or the earlier Poisson regression.

Can anyone explain why the two approaches (one ratio, the other count with offset) differ so much? And which is the better model?

I did consider zero-inflated models, but since the high proportion of zeros come from observations of zero rather than missing observations, then I don't think that they are applicable. This FAQ article seems to agree.

Stata 14.2MP
OS X

Tags: None

Announcement

Predominant zeros, but not zero-inflated